Multiple Hypothesis Testing in Microarray Data Analysis


Multiple Hypothesis Testing in Microarray Data Analysis Sandrine Dudoit jointly with Mark van der Laan and Katie Pollard Division of Biostatistics, UC Berkeley www.stat.berkeley.edu/~sandrine Short Course: Practical Statistical Analysis of DNA Microarray Data Instructors: Vincent Carey & Sandrine Dudoit KolleKolle, Denmark October 26-28, 2003 © Copyright 2003, all rights reserved

Outline Motivation: Microarray data analysis. Multiple hypothesis testing framework. Null distribution. Single-step procedures. Control of gFWER via control of FWER. Step-down procedures. Software: R multtest package. Application: Alizadeh et al. (2000) DLBCL microarray data; Golub et al. (1999) ALL/AML microarray data. Ongoing work.

Motivation: Microarray data analysis Cells respond to various treatments and conditions by regulating gene expression, e.g., activating or repressing the transcription of certain genes. DNA microarrays are high-throughput biological assays that can measure DNA or RNA abundance, and hence yield information on gene expression levels, on a genomic scale: transcript (mRNA) levels, DNA copy number (CGH), IBD status (GMS), methylation status, SNP genotypes. E.g. In cancer research, microarrays are used to measure transcript levels and DNA copy number in tumor samples for tens of thousands of genes at a time.

Motivation: Microarray data analysis General question: Relate genome-wide microarray measures to biological and clinical covariates and outcomes. That is, make inference about the joint distribution of these variables. Covariates: treatment, dose, time, demographic variables, occurrence of DNA sequence motifs, SNP genotypes, etc. Outcomes: affectedness/unaffectedness, quantitative trait, tumor class, metastasis indicator, response to treatment, time to recurrence, patient survival, etc.; polychotomous or continuous; censored or uncensored.

Motivation: Microarray data analysis Differential gene expression: Identify genes whose expression levels are associated with a response or covariate of interest. Estimation: Estimate parameter of interest for the joint distribution of the microarray measures and responses/covariates. Estimate variability of the estimators. Testing: Test for each gene a null hypothesis concerning a gene-specific parameter of interest. E.g. Parameters of interest include means, differences in means, interactions, correlations, slopes, survival probabilities.

Multiple hypothesis testing Large multiplicity problem: thousands of hypotheses are tested simultaneously! Increased chance of false positives (i.e., Type I errors). E.g., the chance of at least one p-value < α for m independent tests is 1 − (1 − α)^m, which converges to one as m increases. For m = 1,000 and α = 0.01, this chance is 0.9999568! Individual p-values of 0.01 no longer correspond to significant findings. Use a multiple testing procedure (MTP) that accounts for the simultaneous test of multiple hypotheses by considering the joint distribution of the test statistics.
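The multiplicity arithmetic on this slide is easy to verify numerically. A minimal Python sketch (the course software is the R multtest package; this is illustrative only, and the function name is an assumption):

```python
def prob_at_least_one_false_positive(m, alpha):
    """Chance of at least one p-value below alpha among m independent true nulls:
    1 - (1 - alpha)^m, as on the slide."""
    return 1 - (1 - alpha) ** m

# prob_at_least_one_false_positive(1000, 0.01)  # ≈ 0.9999568, matching the slide
```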

Multiple hypothesis testing Define a suitable set of null hypotheses: Which subset of genes to consider? What is the parameter of interest? Compute an appropriate test statistic for each null hypothesis. Select a Type I error rate that corresponds to a suitable form of control of false positives for the particular application: e.g., FWER, PCER, FDR. Use a multiple testing procedure (MTP) based on a null joint distribution for the test statistics that provides the desired control of the Type I error rate under the true unknown data generating distribution. Use resampling methods to estimate the unknown (null) joint distribution of the test statistics. Report adjusted p values for each hypothesis, which reflect the overall Type I error rate of the MTP.

Model Data. Let X_1, ..., X_n be n independent and identically distributed (i.i.d.) random d-vectors, X ∼ P ∈ M, where M is a model (possibly non-parametric) for the data generating distribution P. E.g. X_i = (W_i, Z_i): gene expression measures W_i and biological and clinical outcomes Z_i for patient i, i = 1, ..., n. The dimension d of the data X is usually much larger than the sample size n: n ≪ d.

Parameters of interest Parameters. Functions of the unknown data generating distribution P: µ = (µ(j) : j = 1, ..., m), where µ(j) = µ_j(P) ∈ IR and typically m is of the order of d or larger. Parameters can refer to linear models, generalized linear models, survival models (e.g., Cox proportional hazards model), time-series models, dose-response models, etc.

Parameters of interest Location parameters: e.g., means, differences in means, medians. E.g. Assess differential expression for m = d genes. µ(j) = mean expression level of gene j in Population 1. µ(j) = µ_2(j) − µ_1(j) = difference in mean expression level for gene j between Populations 2 and 1. Scale parameters: e.g., covariances, correlations. E.g. Pairwise correlations for m = d(d − 1)/2 pairs of genes: γ(j, k), 1 ≤ j < k ≤ d.

Parameters of interest Regression parameters: e.g., slopes, main effects, interactions. E.g. Measure association of gene expression levels with an outcome/covariate for m = d genes. µ(j) = parameter for univariate Cox proportional hazards model or dose-response model for gene j. µ(j) = interaction effect for two drugs on expression level of gene j. µ(j) = linear combination a′β(j), e.g., contrast in ANOVA. Survival probabilities. E.g. µ(j)(x) = Pr(survival | X(j) = x), where X(j) is the expression measure for gene j.

Null hypotheses General submodel null hypotheses. Defined in terms of a collection of m submodels, M_j ⊆ M, for the data generating distribution P: H_0j = I(P ∈ M_j) vs. H_1j = I(P ∉ M_j), j = 1, ..., m. Thus, H_0j is true, i.e., H_0j = 1, if P ∈ M_j, and false otherwise. True null hypotheses: S_0 = S_0(P) ≡ {j : H_0j is true} = {j : P ∈ M_j}, with m_0 ≡ |S_0|. False null hypotheses: S_0^c = S_0^c(P) ≡ {j : H_0j is false} = {j : P ∉ M_j}, with m_1 ≡ |S_0^c| = m − m_0.

Null hypotheses Single parameter null hypotheses. Each null hypothesis refers to a single parameter, µ(j) = µ_j(P), j = 1, ..., m. One-sided tests: H_0j = I(µ(j) ≤ µ_0(j)) vs. H_1j = I(µ(j) > µ_0(j)), j = 1, ..., m. Two-sided tests: H_0j = I(µ(j) = µ_0(j)) vs. H_1j = I(µ(j) ≠ µ_0(j)), j = 1, ..., m. The hypothesized null values, µ_0(j), are frequently 0.

Null hypotheses Multiple parameter null hypotheses. Each null hypothesis refers to a collection of parameters, {µ_1(j), ..., µ_K(j)}, j = 1, ..., m. E.g. Assess differential expression for m = d genes in K > 2 populations, by testing for each gene the equality of mean expression levels in the K populations: H_0j = I(µ_1(j) = µ_2(j) = ... = µ_K(j)) vs. H_1j = I(at least one µ_k(j) is different), j = 1, ..., m.

Test statistics Rejection decisions are based on an m-vector of test statistics, T_n = (T_n(j) : j = 1, ..., m) ∼ Q_n(P) = Q_n, where we assume that large values of T_n(j) provide evidence against the null hypothesis H_0j, j = 1, ..., m. E.g. Reject H_0j if T_n(j) > c_j, where c = (c_j : j = 1, ..., m) denotes an m-vector of possibly random cut-offs. For two-sided tests, one can take the absolute value of the test statistics. Notation. Use lower case letters (e.g., x, t_n, p_n(j)) for realizations of random variables (e.g., X, T_n, P_n(j)).

Test statistics Single parameter null hypotheses. H_0j = I(µ(j) ≤ µ_0(j)) vs. H_1j = I(µ(j) > µ_0(j)), j = 1, ..., m. Difference statistics: D_n(j) ≡ √n (Estimator − Null Value) = √n (µ_n(j) − µ_0(j)). t-statistics (i.e., standardized differences): T_n(j) ≡ (Estimator − Null Value)/Standard Error = √n (µ_n(j) − µ_0(j))/σ_n(j).

Test statistics Estimators: µ_n = (µ_n(j) : j = 1, ..., m) are assumed to be asymptotically linear estimators of the parameter µ = (µ(j) : j = 1, ..., m), with m-dimensional vector influence curve (IC), IC(X | P) = (IC_j(X | P) : j = 1, ..., m), such that µ_n(j) − µ(j) = (1/n) Σ_{i=1}^n IC_j(X_i | P) + o_P(1/√n), where E[IC(X | P)] = 0; Σ(P) and ρ(P) denote, respectively, the covariance and correlation matrices of the vector IC. Standard errors: σ_n²(j) are assumed to be consistent estimators of the IC variances, σ²(j) = E[IC_j(X | P)²].

Test statistics N.B. This general representation includes standard one-sample and two-sample t-statistics, but also test statistics for correlations and regression parameters in linear and non-linear models. E.g. For population mean parameters, µ(j) = E[X(j)], and sample mean estimators, µ_n(j) = X̄_n(j) = Σ_i X_i(j)/n, one has IC_j(X_i | P) = X_i(j) − µ(j).

Test statistics E.g. One-sample t-statistics. Suppose we have n i.i.d. random d-vectors, X_1, ..., X_n ∼ P, from a population with mean parameter vector µ = E[X]. Null hypotheses. H_0j = I(µ(j) = µ_0(j)), j = 1, ..., m. Test statistics. T_n(j) = √n (X̄_n(j) − µ_0(j))/σ_n(j), j = 1, ..., m, where X̄_n(j) = Σ_i X_i(j)/n and σ_n²(j) = Σ_i (X_i(j) − X̄_n(j))²/n denote the sample means and variances, respectively.
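The gene-by-gene one-sample t-statistics above can be sketched in Python (illustrative only, not the multtest implementation; the n × m data-matrix layout is an assumption):

```python
import numpy as np

def one_sample_t(X, mu0=0.0):
    """One-sample t-statistics T_n(j) = sqrt(n) * (xbar_j - mu0_j) / sd_j,
    computed gene-by-gene for an n x m matrix X (rows: samples, cols: genes).
    The slide's sigma_n(j) uses the 1/n variance, hence ddof=0."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    sd = X.std(axis=0, ddof=0)  # sqrt of (1/n) * sum_i (X_i(j) - xbar_j)^2
    return np.sqrt(n) * (xbar - mu0) / sd
```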

Test statistics E.g. Two-sample t-statistics. Suppose we have n_k i.i.d. random d-vectors, X_{k,1}, ..., X_{k,n_k} ∼ P_k, from Population k, with mean parameter vector µ_k = E[X_k], k = 1, 2. Null hypotheses. Equal population means: H_0j = I(µ(j) ≡ µ_2(j) − µ_1(j) = 0), j = 1, ..., m. Test statistics. Two-sample Welch t-statistics: T_n(j) = (X̄_{2,n}(j) − X̄_{1,n}(j)) / √(σ²_{1,n}(j)/n_1 + σ²_{2,n}(j)/n_2), j = 1, ..., m, where X̄_{k,n}(j) and σ²_{k,n}(j) denote the sample means and variances for Population k, respectively. One can also use the equal-variance analogue.
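A corresponding sketch of the Welch t-statistics, vectorized over genes (again illustrative; the per-population n_k × m matrix layout is an assumption):

```python
import numpy as np

def welch_t(X1, X2):
    """Two-sample Welch t-statistics, gene-by-gene, as on the slide:
    (mean2 - mean1) / sqrt(s1^2/n1 + s2^2/n2), with 1/n_k variances (ddof=0).
    X1, X2: n_k x m matrices for Populations 1 and 2."""
    n1, n2 = X1.shape[0], X2.shape[0]
    num = X2.mean(axis=0) - X1.mean(axis=0)
    den = np.sqrt(X1.var(axis=0, ddof=0) / n1 + X2.var(axis=0, ddof=0) / n2)
    return num / den
```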

Test statistics E.g. (K-sample) F-statistics. Suppose we have n_k i.i.d. random d-vectors, X_{k,1}, ..., X_{k,n_k} ∼ P_k, from Population k, with mean parameter vector µ_k = E[X_k], k = 1, ..., K. Null hypotheses. Equal population means: H_0j = I(µ_1(j) = µ_2(j) = ... = µ_K(j)) vs. H_1j = I(at least one µ_k(j) is different). Test statistics. T_n(j) = [1/(K − 1) Σ_{k=1}^K n_k (X̄_{k,n}(j) − X̄_n(j))²] / [1/(n − K) Σ_{k=1}^K Σ_{i=1}^{n_k} (X_{k,i}(j) − X̄_{k,n}(j))²], j = 1, ..., m, where X̄_{k,n}(j) denote the sample means for Population k and X̄_n(j) = Σ_k n_k X̄_{k,n}(j) / Σ_k n_k denotes the overall sample means.
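The K-sample F-statistic on this slide can be sketched gene-by-gene as follows (a sketch under the same assumed matrix layout, not a reference implementation):

```python
import numpy as np

def f_stats(groups):
    """K-sample F-statistics, gene-by-gene, as on the slide:
    [sum_k n_k (mean_k - grand_mean)^2 / (K-1)] /
    [sum_k sum_i (X_ki - mean_k)^2 / (n - K)].
    groups: list of K matrices, each n_k x m (rows: samples, cols: genes)."""
    K = len(groups)
    n = sum(g.shape[0] for g in groups)
    grand = sum(g.sum(axis=0) for g in groups) / n
    between = sum(g.shape[0] * (g.mean(axis=0) - grand) ** 2 for g in groups) / (K - 1)
    within = sum(((g - g.mean(axis=0)) ** 2).sum(axis=0) for g in groups) / (n - K)
    return between / within
```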

Multiple testing procedures A multiple testing procedure (MTP) produces a set S_n of rejected hypotheses that estimates S_0^c, the set of false null hypotheses: S_n = S(T_n, Q_0, α) ≡ {j : H_0j is rejected} ⊆ {1, ..., m}, where the set S_n (or Ŝ_0^c) depends on: the data, X_1, ..., X_n, through the test statistics T_n; the null distribution, Q_0, for the test statistics T_n, used to compute cut-offs for these T_n (or corresponding p-values); the nominal level α of the MTP, i.e., the desired upper bound for the Type I error rate. E.g. Set of genes identified as differentially expressed based on S_n ≡ {j : T_n(j) > c_j}, where c = (c_j : j = 1, ..., m) denotes an m-vector of possibly random cut-offs, computed under Q_0.

Single-step vs. stepwise procedures One distinguishes between two main approaches for choosing a cut-off vector c = (c_j : j = 1, ..., m). Single-step procedures: equivalent adjustments for all H_0j, i.e., cut-offs only depend on the null distribution Q_0 and level α: c = c(Q_0, α) (Bonferroni). Stepwise procedures: adjustments depend on the observed data, i.e., cut-offs also depend on the test statistics T_n: c = c(T_n, Q_0, α). Step-down: start with the most significant hypothesis; as soon as one fails to reject a null hypothesis, no further hypotheses are rejected (Holm, 1979). Step-up: start with the least significant hypothesis; as soon as one rejects a null hypothesis, reject all hypotheses that are more significant (Hochberg, 1988).

Type I and Type II errors Given the sets S_n = S(T_n, Q_0, α) ≡ {j : H_0j is rejected} and S_0 = S_0(P) ≡ {j : H_0j is true}, two types of errors can be committed by an MTP. Type I error or false positive: j ∈ S_n ∩ S_0, i.e., reject (S_n) a true null hypothesis (S_0). Type II error or false negative: j ∈ S_n^c ∩ S_0^c, i.e., fail to reject (S_n^c) a false null hypothesis (S_0^c). Cf. power: S_n ∩ S_0^c. Denote the number of Type I errors by V_n ≡ |S_n ∩ S_0| and the corresponding cumulative distribution function (c.d.f.) by F_Vn. E.g. Type I: Say a gene is differentially expressed when in truth it isn't. Type II: Fail to identify a truly differentially expressed gene.

Type I and Type II errors

Null hypotheses   not rejected (S_n^c)               rejected (S_n)
true (S_0)        |S_n^c ∩ S_0|                      V_n = |S_n ∩ S_0| (Type I)   m_0 = |S_0|
false (S_0^c)     U_n = |S_n^c ∩ S_0^c| (Type II)    |S_n ∩ S_0^c|                m_1 = |S_0^c|
                  m − R_n                            R_n = |S_n|                  m

Adapted from Benjamini & Hochberg (1995).

Type I and Type II errors Let m_0 = |S_0| = # true null hypotheses: unknown parameter; R_n = |S_n| = # rejections: observable random variable; V_n = |S_n ∩ S_0| = # Type I errors: unobservable random variable; U_n = |S_n^c ∩ S_0^c| = # Type II errors: unobservable random variable.

Type I error rates Consider Type I error rates, θ(F_Vn), that are functions, i.e., parameters, of the distribution F_Vn of the number of false positives. 1. Per-comparison error rate (PCER), or expected proportion of Type I errors among the m tests: PCER ≡ E[V_n]/m = ∫ v dF_Vn(v)/m. 2. Per-family error rate (PFER), or expected number of Type I errors: PFER ≡ E[V_n] = ∫ v dF_Vn(v). 3. Median-based per-family error rate (mPFER), or median number of Type I errors: mPFER ≡ Median(V_n) = F_Vn^{-1}(1/2).

Type I error rates 4. Family-wise error rate (FWER), or probability of at least one Type I error: FWER ≡ Pr(V_n > 0) = 1 − F_Vn(0). 5. Generalized family-wise error rate (gFWER), or probability of at least k + 1 Type I errors, k = 0, ..., m_0 − 1: gFWER ≡ Pr(V_n > k) = 1 − F_Vn(k). When k = 0, the gFWER is the usual family-wise error rate, FWER.
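The error rates defined on these two slides are all functionals of the distribution of V_n; their empirical versions can be computed from draws of V_n. A sketch (V_n is unobservable in practice, so the binomial draws below are purely hypothetical):

```python
import numpy as np

# Hypothetical draws of the number of Type I errors V_n among m tests.
rng = np.random.default_rng(0)
m = 100
V = rng.binomial(m, 0.01, size=10000)

pcer = V.mean() / m       # PCER  = E[V_n]/m
pfer = V.mean()           # PFER  = E[V_n]
mpfer = np.median(V)      # mPFER = Median(V_n)
fwer = (V > 0).mean()     # FWER  = Pr(V_n > 0)
k = 2
gfwer = (V > k).mean()    # gFWER = Pr(V_n > k); equals FWER when k = 0
```

Note the ordering this makes visible: gFWER ≤ FWER, and PCER = PFER/m.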

Type I error rates Make the following two assumptions for the mapping θ : F ↦ θ(F), defining a Type I error rate as a parameter corresponding to a c.d.f. F on {0, ..., m}. Monotonicity. If F_1 ≥ F_2, then θ(F_1) ≤ θ(F_2). Uniform continuity. For two sequences of c.d.f.s, {F_n} and {G_n}, if d(F_n, G_n) → 0, then |θ(F_n) − θ(G_n)| → 0, as n → ∞. Here, given two c.d.f.s F_1 and F_2 on {0, ..., m}, the distance measure d is defined by d(F_1, F_2) ≡ max_{x ∈ {0,...,m}} |F_1(x) − F_2(x)|.

Type I error rates N.B. The false discovery rate (FDR) of Benjamini & Hochberg (1995) cannot be represented as a parameter θ(F_Vn), because it also involves the distribution of the number of rejections, R_n. The FDR is the expected proportion of Type I errors among the rejected hypotheses, i.e., FDR ≡ E[V_n/R_n], with the convention that V_n/R_n = 0 if R_n = 0.
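For comparison with the θ(F_Vn)-type rates, the Benjamini–Hochberg step-up FDR adjustment mentioned here can be sketched as follows (the standard adjusted p-value construction, comparable to R's p.adjust(method = "BH"); a sketch, not the authors' code):

```python
import numpy as np

def bh_adjusted(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values: sort p-values, take
    p_(j) * m / j, then enforce monotonicity from the largest p-value down."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # monotone step-up adjustment
    adj = np.minimum(adj, 1.0)
    out = np.empty(m)
    out[order] = adj
    return out

# bh_adjusted([0.01, 0.04, 0.03, 0.005])  # -> [0.02, 0.04, 0.04, 0.02]
```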

Adjusted p-values The unadjusted p-value for the test of a single hypothesis, H_0j, is based only on the test statistic T_n(j) for that hypothesis: P_n(j) = P(t_n(j), Q_0) ≡ inf{α ∈ [0, 1] : reject H_0j at single-test level α} = Pr_{Q_0}(T_n(j) ≥ t_n(j)), j = 1, ..., m. That is, P(t_n(j), Q_0) is the level of the single hypothesis testing procedure at which H_0j would just be rejected, given T_n(j) = t_n(j). If Q_0 has continuous marginal c.d.f.s Q_0j, then P_n(j) = 1 − Q_0j(t_n(j)).

Adjusted p-values In an MTP, the adjusted p-value for null hypothesis H_0j is defined as P̃_n(j) = P̃(j, t_n, Q_0) ≡ inf{α ∈ [0, 1] : j ∈ S(t_n, Q_0, α)}, j = 1, ..., m. That is, P̃(j, t_n, Q_0) is the level of the entire MTP at which H_0j would just be rejected, given T_n = t_n. E.g. Bonferroni for FWER control: P̃_n(j) = min(m P_n(j), 1).
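The Bonferroni example on this slide is a one-liner; a minimal sketch (illustrative, not the multtest implementation):

```python
import numpy as np

def bonferroni_adjusted(pvals):
    """FWER-controlling Bonferroni adjusted p-values: min(m * p_n(j), 1)."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

# bonferroni_adjusted([0.01, 0.4, 0.6])  # -> [0.03, 1.0, 1.0]
```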

Adjusted p-values We now have two representations for an MTP: in terms of cut-offs for the test statistics T_n(j), c_j = c_j(T_n, Q_0, α): S(T_n, Q_0, α) = {j : T_n(j) > c_j}; and in terms of adjusted p-values, P̃_n(j) = P̃(j, T_n, Q_0): S(T_n, Q_0, α) = {j : P̃_n(j) ≤ α}.

Adjusted p-values Flexible: the results of an MTP are provided for all levels α, so one does not need to choose the level ahead of time. Adjusted p-values reflect the strength of the evidence against each null hypothesis in terms of the overall Type I error rate of the MTP. They can be defined for any Type I error rate: FWER, PCER, FDR. Plots of sorted adjusted p-values for different Type I error rates allow scientists to decide on an appropriate combination of number of rejections and tolerable false positive rate for a particular application and available resources.

Null distribution In MTPs, the true unknown distribution Q_n = Q_n(P) of the test statistics T_n is typically estimated by a null distribution Q_0, in order to derive cut-offs for these test statistics (or corresponding adjusted p-values). N.B. The choice of null distribution Q_0 is crucial: one must ensure that (finite sample or asymptotic) control of the Type I error rate under the assumed Q_0 does indeed provide the required control under the true distribution Q_n(P).

Null distribution For proper control of the Type I error rate, the assumed null distribution Q_0 must be such that θ(F_Vn) ≤ θ(F_V0) [finite sample control] or lim sup_n θ(F_Vn) ≤ θ(F_V0) [asymptotic control], where V_n ≡ number of Type I errors under T_n ∼ Q_n(P), and V_0 ≡ number of Type I errors under T_n ∼ Q_0.

Null distribution The required control of the Type I error rate follows from the following null domination conditions: F_Vn(x) ≥ F_V0(x) [finite sample control] or lim inf_n F_Vn(x) ≥ F_V0(x) [asymptotic control], for all x ∈ {0, ..., m}. (1) That is, the S_0-specific distribution of Q_0 equals or dominates the S_0-specific distribution of Q_n(P). More specific (i.e., less stringent) forms of null domination can be derived for given definitions of the Type I error rate, i.e., of the parameter θ(F).

Null distribution We propose a general construction for a null distribution Q_0 for the test statistics T_n. This null distribution satisfies the asymptotic null domination condition lim inf_n F_Vn(x) ≥ F_V0(x), x ∈ {0, ..., m}, and provides asymptotic control of general Type I error rates θ(F_Vn), lim sup_n θ(F_Vn) ≤ α, in single-step and step-down procedures, for testing general null hypotheses H_0j = I(P ∈ M_j), corresponding to submodels M_j ⊆ M for the data generating distribution P.

Null distribution Theorem 0. General construction for the test statistics null distribution Q_0. Suppose there exist m-vectors of null values θ_0 ∈ IR^m and τ_0 ∈ IR^{+m}, such that, for j ∈ S_0, lim sup_n E[T_n(j)] ≤ θ_0(j) and lim sup_n Var[T_n(j)] ≤ τ_0(j). Then, let ν_0n(j) ≡ min(1, √(τ_0(j)/Var[T_n(j)])) and define null-value shifted and scaled statistics Z_n(j) ≡ ν_0n(j) (T_n(j) + θ_0(j) − E[T_n(j)]). Suppose that Z_n ⇒ Z ∼ Q_0(P). Then Q_0 = Q_0(P) satisfies the asymptotic null domination condition in (1).

Null distribution t-statistics. For tests of single parameter null hypotheses using t-statistics, the null values are θ_0(j) = 0 and τ_0(j) = 1, and the null distribution Q_0 = Q_0(P) for T_n corresponds to the asymptotic distribution of the standardized estimators µ_n. That is, Q_0(P) = N(0, ρ(P)), where ρ(P) denotes the correlation matrix of the vector influence curve. E.g. In the special case of tests of means, where µ = E[X], ρ(P) is simply the correlation matrix of X ∼ P.

Null distribution N.B. The following important points distinguish our approach from existing approaches to Type I error rate control in MTPs. We are only concerned with control under the true data generating distribution P; the notions of weak and strong control are therefore irrelevant to our approach. We propose a null distribution for the test statistics (T_n ∼ Q_0), and not a data generating null distribution (X ∼ P_0).

Null distribution It is common to define Q_0 in terms of a data generating null distribution, P_0, that satisfies the complete null hypothesis. E.g. Define Q_0 ≡ lim_n Q_n(P_0), where P_0 ∈ ∩_{j=1}^m M_j. However, this practice does not necessarily provide asymptotic control under the true distribution P, as the null distribution Q_n(P_0) and the true distribution Q_n(P) for the test statistics T_n may have different limits: lim_n Q_n(P_0) ≠ lim_n Q_n(P). In particular, for single parameter null hypotheses, the t-statistics T_n have Gaussian asymptotic distributions, but the covariance matrices may differ under the true P and under the null P_0. That is, Σ(P_0) ≠ Σ(P), and, in particular, Σ_{S_0}(P_0) ≠ Σ_{S_0}(P). As a result, null domination, and hence Type I error control, may fail.

Estimation of null distribution In practice, since P is unknown, then so is the null distribution Q_0 = Q_0(P). One can estimate Q_0 by Q_0n as follows. Bootstrap null distribution: Z_n^# ∼ Q_0n, based on X_1^#, ..., X_n^# ∼ P_n; very general. Test statistics specific null distribution. Single parameter null hypotheses: Q_0n = N(0, ρ_n), where ρ_n is an estimator of the correlation matrix ρ(P) of the vector IC. E.g., for means, ρ(P) can be estimated directly from the data; in other cases, ρ(P) can be estimated with the bootstrap. Multiple parameter null hypotheses: F-specific distribution (quadratic forms of Gaussian random variables). Data generating null distribution: Q_0n = Q_n(P_0n), where P_0n is an estimator of P_0 ∈ ∩_j M_j, e.g., using permutation. Problem. May have lim_n Q_n(P_0n) ≠ lim_n Q_n(P).

Bootstrap estimation of null distribution Procedure 0. Bootstrap estimation of the null distribution Q_0. 1. Generate B (non-parametric or parametric) bootstrap samples. 2. For each bootstrap sample, compute an m-vector of test statistics, T_n^b = (T_n^b(j) : j = 1, ..., m), which can be arranged in an m × B matrix, T = (T_n^b(j)), with rows corresponding to the m hypotheses and columns to the B bootstrap samples. 3. Compute row means and variances of the matrix T, to yield estimates of E[T_n(j)] and Var[T_n(j)], j = 1, ..., m. 4. Obtain an m × B matrix, Z = (Z_n^b(j)), of null-value shifted and scaled statistics, Z_n^b(j), by row shifting and scaling the matrix T using the bootstrap estimates of E[T_n(j)] and Var[T_n(j)] and the user-supplied null values θ_0(j) and τ_0(j). 5. The bootstrap estimate Q_0n of the null distribution Q_0 is the empirical distribution of the columns Z_n^b of the matrix Z.
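Steps 1-5 of Procedure 0 can be sketched in a few lines of Python (a sketch only, assuming t-statistic null values θ_0 = 0, τ_0 = 1 by default, an n × m data matrix, and a user-supplied statistic function; not the multtest implementation):

```python
import numpy as np

def bootstrap_null_statistics(X, stat_fn, B=1000, theta0=0.0, tau0=1.0, seed=0):
    """Bootstrap estimate of the null distribution Q_0n:
    resample rows of X with replacement, recompute the m test statistics,
    then row-shift toward the null values theta_0 and row-scale toward tau_0.
    stat_fn maps an n x m data matrix to an m-vector of test statistics."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Step 1-2: m x B matrix of bootstrap test statistics T = (T_n^b(j)).
    T = np.stack([stat_fn(X[rng.integers(0, n, size=n)]) for _ in range(B)], axis=1)
    # Step 3: row means and variances estimate E[T_n(j)] and Var[T_n(j)].
    mean_T = T.mean(axis=1, keepdims=True)
    var_T = T.var(axis=1, keepdims=True)
    # Step 4: null-value shifted and scaled statistics Z_n^b(j).
    nu = np.minimum(1.0, np.sqrt(tau0 / var_T))
    Z = nu * (T + theta0 - mean_T)
    # Step 5: the columns of Z are draws from the estimated null distribution Q_0n.
    return Z
```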

Resampling-based multiple testing procedures Theorem 1. Asymptotic control for procedures based on a consistent estimator of the null distribution Q_0. Let Q_0n be a consistent estimator of the null distribution Q_0, i.e., Q_0n converges weakly to Q_0 (e.g., from the bootstrap). Consider a single-step or step-down MTP, S(T_n, Q_0, α), based on cut-offs c(T_n, Q_0, α), that provides asymptotic control of the Type I error rate θ(F_Vn) at level α. Let c(T_n, Q_0n, α) and S(T_n, Q_0n, α) denote the corresponding cut-offs and MTP based on the estimator Q_0n. Then, under regularity conditions on the null distribution Q_0, c_j(T_n, Q_0n, α) − c_j(T_n, Q_0, α) →_P 0 as n → ∞, j = 1, ..., m. Consequently, the MTP S(T_n, Q_0n, α) also provides asymptotic control of the Type I error rate θ(F_Vn) at level α.

Single-step multiple testing procedures Given a null distribution Q_0, a mapping θ : F ↦ θ(F) defining the Type I error rate, and a target or nominal level α, consider single-step procedures of the form S(T_n, Q_0, α) = {j : T_n(j) > c_j(Q_0, α)}, based on two definitions for the m-vector of cut-offs c(Q_0, α) = (c_j(Q_0, α) : j = 1, ..., m). Let R(c | Q) ≡ Σ_{j=1}^m I(T_n(j) > c_j) = # rejections for T_n ∼ Q; V(c | Q) ≡ Σ_{j ∈ S_0} I(T_n(j) > c_j) = # Type I errors for T_n ∼ Q; and R_0 = R(c | Q_0), R_n = R(c | Q_n), V_0 = V(c | Q_0), V_n = V(c | Q_n).

Single-step multiple testing procedures

Single-step common cut-off procedure. Choose a common cut-off c_j(Q_0, α) ≡ c_0 as

c_0 ≡ min{c ∈ IR : θ(F_{R((c,...,c) | Q_0)}) ≤ α}.   (2)

Single-step common quantile procedure. Define cut-offs c_j(Q_0, α) ≡ Q_0j^{-1}(1 − δ_0), as the (1 − δ_0)-quantiles of the marginal distributions Q_0j. Choose the common quantile δ_0 as

δ_0 ≡ max{δ ∈ (0, 1) : θ(F_{R((Q_0j^{-1}(1−δ) : j=1,...,m) | Q_0)}) ≤ α}.   (3)

Thus, the common cut-off c_0 or common quantile δ_0 is chosen so that θ(F_{R_0}) ≤ α. E.g., for FWER, Pr(R_0 > 0) ≤ α; for PCER, E[R_0]/m ≤ α.
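For FWER control, θ(F_{R_0}) = Pr(R_0 > 0), both cut-off rules can be sketched directly from an (m, B) null matrix Z such as the one produced by Procedure 0. The empirical-quantile and rank-based estimates below are illustrative assumptions, not the multtest implementation.

```python
import numpy as np

def single_step_cutoffs_fwer(Z, alpha=0.05):
    """Single-step cut-offs for FWER control, given (m, B) null draws Z ~ Q_0.

    Common cut-off (2): smallest c with Pr_{Q_0}(max_j Z(j) > c) <= alpha,
    i.e. the (1 - alpha)-quantile of the column maxima (maxT).
    Common quantile (3): largest delta with Pr_{Q_0}(min_j P_0(j) <= delta)
    <= alpha, where P_0(j) = 1 - Q_0j(Z(j)) are marginal null p-values (minP).
    """
    m, B = Z.shape
    c0 = np.quantile(Z.max(axis=0), 1.0 - alpha)
    # Empirical marginal null p-values via within-row ranks of the null draws.
    ranks = Z.argsort(axis=1).argsort(axis=1)        # 0 = smallest in its row
    P0 = 1.0 - ranks / (B - 1.0)
    delta0 = np.quantile(P0.min(axis=0), alpha)
    cq = np.quantile(Z, 1.0 - delta0, axis=1)        # c_j = Q_0j^{-1}(1 - delta0)
    return c0, cq
```

The common cut-off c_0 is one number for all hypotheses, while the common quantile rule yields an m-vector of cut-offs that adapt to each marginal Q_0j.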

Single-step multiple testing procedures

Theorem 2. Asymptotic control of Type I error θ(F_{V_n}) for single-step procedures. Consider a null distribution Q_0 as in Theorem 0 and a Type I error rate θ(F_{V_n}) defined in terms of a monotone and uniformly continuous mapping θ: F ↦ θ(F). Define a single-step procedure S_n = S(T_n, Q_0, α) = {j : T_n(j) > c_j(Q_0, α)} based on common cut-offs c_j(Q_0, α) as in (2) or common quantile cut-offs c_j(Q_0, α) as in (3). Then, S_n provides asymptotic control of the Type I error rate θ(F_{V_n}) at level α:

lim sup_{n→∞} θ(F_{V_n}) ≤ α.

Single-step multiple testing procedures

Sketch of proof.
1. Control the parameter θ(F_{R_0}) ≤ α, corresponding to the observed number of rejections R_0 under the null Q_0.
2. Since θ(·) is monotone and the number of Type I errors V_0 is never larger than the number of rejections R_0, i.e., V_0 ≤ R_0, then θ(F_{V_0}) ≤ θ(F_{R_0}).
3. Since θ(·) is uniformly continuous and Q_0 has the asymptotic null domination property, lim inf_n F_{V_n}(x) ≥ F_{V_0}(x), ∀x, then lim sup_{n→∞} θ(F_{V_n}) ≤ α.

Single-step multiple testing procedures: Adjusted p-values

Single-step common cut-off procedure. Adjusted p-values are given by P̃_n(j) = θ(F_{R(C_n(j) | Q_0)}), for j-specific common cut-off vectors C_n(j) = (T_n(j), ..., T_n(j)).

Single-step common quantile procedure. Adjusted p-values are given by P̃_n(j) = θ(F_{R(C_n(j) | Q_0)}), for j-specific common quantile cut-off vectors C_n(j) = (Q_0k^{-1}(1 − P_n(j)) : k = 1, ..., m) and unadjusted p-values P_n(j) = 1 − Q_0j(T_n(j)).

Single-step multiple testing procedures: gFWER

Single-step common cut-off procedure for gFWER. Adjusted p-values are

p̃_n(j) = Pr_{Q_0}(T_n°(k + 1) ≥ t_n(j)),

where T_n°(k + 1) denotes the (k + 1)-st most significant test statistic, that is, T_n°(1) ≥ T_n°(2) ≥ ... ≥ T_n°(m).

Single-step common quantile procedure for gFWER. Adjusted p-values are

p̃_n(j) = Pr_{Q_0}(P_n°(k + 1) ≤ p_n(j)),

where P_n°(k + 1) denotes the (k + 1)-st most significant unadjusted p-value, that is, P_n°(1) ≤ P_n°(2) ≤ ... ≤ P_n°(m).

For FWER control (k = 0), our general procedure reduces to the single-step maxT and single-step minP procedures, based on the maximum test statistic T_n°(1) and the minimum unadjusted p-value P_n°(1), respectively.
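The direct common cut-off rule translates to a short sketch, with the Q_0 probability estimated over the B columns of a null matrix Z (using empirical frequencies here is an assumption of this illustration):

```python
import numpy as np

def gfwer_single_step_adjp(T, Z, k=0):
    """Adjusted p-values p(j) = Pr_{Q_0}(T°(k+1) >= T(j)), with T°(k+1) the
    (k+1)-st largest null statistic in each of the B columns of Z.
    k = 0 recovers the single-step maxT procedure."""
    m, B = Z.shape
    kth_largest = np.sort(Z, axis=0)[m - 1 - k, :]   # (k+1)-st largest per column
    return np.array([(kth_largest >= t).mean() for t in np.asarray(T)])
```

Since the (k+1)-st largest null statistic can only decrease as k grows, allowing k false positives can only shrink the adjusted p-values.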

Control of gFWER via control of FWER

One can derive a gFWER-controlling procedure via a simple modification of an FWER-controlling procedure. The main idea is to follow the FWER-controlling procedure exactly until the first non-rejection, then reject this hypothesis and also the next k − 1 most significant hypotheses.

Advantages. (i) Only requires working with the FWER. (ii) Guarantees at least k rejected hypotheses.

Control of gFWER based on control of FWER

[Diagram]

Control of gFWER via control of FWER

Proof. Let V_n^0 and V_n^k denote the number of Type I errors for the FWER- and gFWER-controlling procedures, respectively. Then, V_n^k ≤ V_n^0 + k, so that

Pr(V_n^k > k) ≤ Pr(V_n^0 + k > k) = Pr(V_n^0 > 0).

Control of gFWER via control of FWER

Let O_n(j) denote the indices for the ordered test statistics (or unadjusted p-values), so that O_n(1) corresponds to the most significant hypothesis and O_n(m) to the least significant hypothesis. That is, T_n(O_n(1)) ≥ ... ≥ T_n(O_n(m)). The adjusted p-values, P̃_n^k, for the gFWER procedure are obtained by simply shifting by k the adjusted p-values, P̃_n^0, for the FWER procedure:

P̃_n^k(O_n(j)) = 0, j = 1, ..., k,
P̃_n^k(O_n(j)) = P̃_n^0(O_n(j − k)), j = k + 1, ..., m.
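The shift can be applied directly to a vector of FWER adjusted p-values; the function name and the tie-breaking via argsort are illustrative choices, not from the deck.

```python
import numpy as np

def augment_to_gfwer(adjp_fwer, k):
    """gFWER adjusted p-values from FWER adjusted p-values:
    the k most significant hypotheses get adjusted p-value 0, and hypothesis
    O_n(j), j > k, inherits the FWER adjusted p-value of O_n(j - k)."""
    adjp_fwer = np.asarray(adjp_fwer, dtype=float)
    m = len(adjp_fwer)
    order = np.argsort(adjp_fwer)            # O_n(1), ..., O_n(m)
    out = np.empty(m)
    out[order[:k]] = 0.0                     # first k: automatic rejections
    out[order[k:]] = adjp_fwer[order[:m - k]]  # shift the rest by k
    return out
```

For example, with FWER adjusted p-values (0.01, 0.2, 0.04, 0.5) and k = 1, the most significant hypothesis gets 0 and the others inherit the next-smaller adjusted p-value: (0, 0.04, 0.01, 0.2).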

Step-down multiple testing procedures

Step-down maxT procedure for FWER control. Let T_n°(j) be the ordered test statistics, T_n°(1) ≥ ... ≥ T_n°(m), and O_n(j) the indices for these order statistics, so that T_n°(j) = T_n(O_n(j)). For Z ~ Q_0 and subsets Ō_n(j) ≡ {O_n(j), ..., O_n(m)}, define quantiles

C̄_n(j) ≡ (1 − α)-quantile of max_{k ∈ Ō_n(j)} Z(k)

and step-down cut-offs

C_n*(1) ≡ C̄_n(1),
C_n*(j) ≡ C̄_n(j), if T_n°(j − 1) > C_n*(j − 1); ∞, otherwise, j = 2, ..., m.

Reject H_{0,O_n(j)} if T_n°(j) > C_n*(j), j = 1, ..., m.
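Using an (m, B) null matrix Z in place of exact Q_0 quantiles (an assumption of this sketch), the step-down maxT rule reads:

```python
import numpy as np

def step_down_maxt(T, Z, alpha=0.05):
    """Step-down maxT: at step j the cut-off is the (1 - alpha)-quantile of
    max over {O_n(j), ..., O_n(m)} of the null statistics; stop at the first
    non-rejection. Returns a boolean mask over the m hypotheses."""
    T = np.asarray(T, dtype=float)
    order = np.argsort(T)[::-1]              # O_n(1), ..., O_n(m)
    rejected = np.zeros(len(T), dtype=bool)
    for j, idx in enumerate(order):
        # (1 - alpha)-quantile of the max over the not-yet-rejected subset
        c = np.quantile(Z[order[j:]].max(axis=0), 1.0 - alpha)
        if T[idx] > c:
            rejected[idx] = True             # reject H_0,O_n(j) and continue
        else:
            break                            # first non-rejection: stop
    return rejected
```

Because the max is taken over ever-smaller subsets, the cut-offs shrink as j grows, which is what makes the step-down procedure less conservative than its single-step counterpart.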

Step-down multiple testing procedures

The adjusted p-value for hypothesis H_{0,O_n(j)} is given by

P̃_n(O_n(j)) = max_{k=1,...,j} { Pr_{Q_0}( max_{l ∈ {O_n(k),...,O_n(m)}} Z(l) ≥ T_n(O_n(k)) ) }.

That is, the MTP is based on the distribution of successive maxima of the test statistics T_n(j), over subsets determined by their observed ordering.
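The successive-maxima formula translates directly: a running maximum over k = 1, ..., j enforces monotonicity of the adjusted p-values in the significance ordering. As above, estimating Pr_{Q_0} by empirical frequencies over an (m, B) null matrix is an assumption of this sketch.

```python
import numpy as np

def step_down_maxt_adjp(T, Z):
    """Step-down maxT adjusted p-values: P(O_n(j)) = max_{k<=j} Pr_{Q_0}(
    max over {O_n(k), ..., O_n(m)} of Z >= T(O_n(k)) ), estimated over the
    B columns of the null matrix Z."""
    T = np.asarray(T, dtype=float)
    order = np.argsort(T)[::-1]              # O_n(1), ..., O_n(m)
    m = len(T)
    p = np.empty(m)
    for j in range(m):
        # tail probability of the max over the remaining subset
        p[j] = (Z[order[j:]].max(axis=0) >= T[order[j]]).mean()
    adjp = np.empty(m)
    adjp[order] = np.maximum.accumulate(p)   # successive maxima over k <= j
    return adjp
```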

Step-down multiple testing procedures

An analogue of the step-down maxT procedure can be defined based on the distribution of successive minima of the unadjusted p-values P_n(j).

Step-down minP procedure for FWER control. The adjusted p-value for hypothesis H_{0,O_n(j)} is given by

P̃_n(O_n(j)) = max_{k=1,...,j} { Pr_{Q_0}( min_{l ∈ {O_n(k),...,O_n(m)}} Q̄_0l(Z(l)) ≤ P_n(O_n(k)) ) },

where Q̄_0l(z) ≡ 1 − Q_0l(z) and O_n(j) denote the indices of the ordered unadjusted p-values, i.e., P_n(O_n(1)) ≤ ... ≤ P_n(O_n(m)).

Step-down multiple testing procedures

Theorem 3. Asymptotic control of FWER (and gFWER) for step-down procedures. Consider a null distribution Q_0 as in Theorem 0. Assume that, with probability one in the limit, the most significant test statistics correspond to the false null hypotheses, i.e.,

lim_{n→∞} Pr( {O_n(1), ..., O_n(m_1)} = S_0^c ) = 1,

where m_1 = |S_0^c| is the number of false null hypotheses. Then, the step-down maxT and step-down minP procedures provide asymptotic control of the FWER at level α:

lim sup_{n→∞} Pr(V_n > 0) ≤ α.

Alizadeh et al. (2000) DLBCL microarray dataset

Target mRNA samples from n = 40 Diffuse Large B-Cell Lymphoma (DLBCL) patients. Expression levels of m = 13,412 clones measured with cDNA microarrays (relative to a pooled control). Patients belong to two molecularly distinct disease groups: n_1 = 21 Activated B-like (Population 1); n_2 = 19 Germinal center (GC) B-like (Population 2). Survival time T measured on each patient.

Pre-processing: log_2(R/G); replace missing values with the average log_2(R/G) for that gene; truncate ratios exceeding 20-fold to ± log_2(20).

Alizadeh et al. (2000) DLBCL microarray dataset

Model: n_k i.i.d. random m-vectors, X_{k,1}, ..., X_{k,n_k} ~ P_k, of expression measures from patient Population k, with mean parameter vector µ_k = E[X_k], k = 1, 2.

Null hypotheses: Equal mean expression levels in GC B-like (Popn. 2) and activated B-like (Popn. 1) patients,

H_0j = I( µ(j) ≡ µ_2(j) − µ_1(j) = 0 ), j = 1, ..., m.

Test statistics: Welch's two-sample t-statistics (unequal variances).
Type I error rate: FWER = Pr(V_n > 0), α = 0.05.
Multiple testing procedures:
1. Single-step common quantile (minP), with non-parametric bootstrap null distribution, Q_0n.
2. Single-step common quantile (minP), with permutation null distribution, Q_n(P_0n).
3. Bonferroni, with nominal t-distribution.

Alizadeh et al. (2000) DLBCL microarray dataset

Table 1: Test of the difference in mean expression levels for GC B-like and activated B-like patients. Number of rejected null hypotheses (out of m = 13,412) for three MTPs for FWER control.

  MTP   Cut-offs            Null distribution   # rejections (R_n)
  1     Single-step minP    Bootstrap           186
  2     Single-step minP    Permutation         287
  3     Bonferroni          Nominal t           32

All 32 genes identified as differentially expressed by MTP 3 were also identified by MTP 1 and MTP 2. MTP 1 and MTP 2 identified 156 genes in common.

Alizadeh et al. (2000) DLBCL microarray dataset

Model: Logistic regression model for each gene,

Pr(GC B-like | X(j)) = exp(β_0(j) + β_1(j) X(j)) / (1 + exp(β_0(j) + β_1(j) X(j))), j = 1, ..., m.

Null hypotheses: No association between expression measure X(j) of gene j and patient group: H_0j = I( β_1(j) = 0 ), j = 1, ..., m.

Test statistics: T_n(j) = √n β_1n(j).
Type I error rate: gFWER = Pr(V_n > k), k = 1, ..., 200, α = 0.05.
Multiple testing procedures:
1. Direct gFWER single-step common quantile procedure, i.e., single-step based on P_n°(k + 1), with non-parametric bootstrap null distribution, Q_0n.
2. Via FWER single-step common quantile procedure, i.e., single-step minP based on P_n°(1), with non-parametric bootstrap null distribution, Q_0n.

Alizadeh et al. (2000) DLBCL microarray dataset

Table 2: Test for slope in logistic regression model for DLBCL class and expression measures. Number of rejected null hypotheses (out of m = 13,412) for two MTPs for gFWER control.

  k                         0     10    50    100   200
  Direct gFWER, R_n^k       303   303   303   471   553
  Via FWER, R_n^0 + k       303   313   353   403   503

Here, R_n^k denotes the number of rejected hypotheses for a procedure directly controlling the gFWER. For FWER, k = 0.

Golub et al. (1999) ALL/AML microarray dataset

Target mRNA samples from n = 38 leukemia patients (learning set). Expression levels of d = 6,817 human genes measured with the Affymetrix hu6800 chip. Patients belong to two molecularly distinct disease groups: n_1 = 27 acute lymphoblastic leukemia (ALL) (Population 1); n_2 = 11 acute myeloid leukemia (AML) (Population 2).

Pre-processing: thresholding: floor of 100 and ceiling of 16,000; filtering: exclusion of genes with max/min ≤ 5 or (max − min) ≤ 500; base-10 logarithmic transformation.

Golub et al. (1999) ALL/AML microarray dataset

Model: n_k i.i.d. random m-vectors, X_{k,1}, ..., X_{k,n_k} ~ P_k, of expression measures from patient Population k, with mean parameter vector µ_k = E[X_k], k = 1, 2.

Null hypotheses: Equal mean expression levels in AML (Popn. 2) and ALL (Popn. 1) patients,

H_0j = I( µ(j) ≡ µ_2(j) − µ_1(j) = 0 ), j = 1, ..., m.

Test statistics: Welch's two-sample t-statistics.
Multiple testing procedures: Permutation null distribution.
FWER: Bonferroni, Holm (1979), Hochberg (1986), step-down maxT.
FDR: Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001).
PCER: SAM, unadjusted p-values.
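Among the FDR procedures compared here, the Benjamini & Hochberg (1995) step-up adjusted p-values have a short closed form (a standard construction, not code from the deck):

```python
import numpy as np

def bh_adjusted(p):
    """Benjamini-Hochberg step-up FDR adjusted p-values:
    adj(j) = min over k >= j of min(m * p°(k) / k, 1), p° sorted ascending."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adj = p[order] * m / np.arange(1, m + 1)         # m * p°(k) / k
    adj = np.minimum(np.minimum.accumulate(adj[::-1])[::-1], 1.0)
    out = np.empty(m)
    out[order] = adj                                  # back to input order
    return out
```

Rejecting the hypotheses with adjusted p-value at most α is then equivalent to the usual BH step-up rule at FDR level α.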

Golub et al. (1999) ALL/AML microarray dataset

[Figure: sorted adjusted p-values (0 to 1) vs. number of rejected hypotheses (0 to 3,000), for Bonferroni, Holm, Hochberg, maxT, BH, BY, rawp, SAM (Efron), and SAM (Tusher).]

Golub et al. (1999) ALL/AML microarray dataset

[Figure: sorted adjusted p-values (0 to 0.7) vs. number of rejected hypotheses, zoomed in on the first 100 rejections, for Bonferroni, Holm, Hochberg, maxT, BH, BY, rawp, SAM (Efron), and SAM (Tusher).]

Software: multtest  www.bioconductor.org

Bioconductor R package: multtest. Authors: Yongchao Ge, Katie Pollard, Sandrine Dudoit.
Test statistics: t-statistics and F-statistics (one- and two-factor) for linear models and the Cox PH model. Includes non-parametric and weighted statistics.
Null distributions: Permutation and bootstrap.
Multiple testing procedures:
FWER control: Bonferroni, Holm (1979), Hochberg (1986), single-step and step-down maxT and minP.
FDR control: Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001).

Software: multtest  www.bioconductor.org

Output: Parameter estimates, test statistics, unadjusted and adjusted p-values, cut-off vector, confidence intervals, null distribution.
Plots: Type I error rate vs. # rejections; # rejections vs. adjusted p-values; adjusted p-values vs. test statistics ("volcano" plot).
Software design: Closures, so new tests can easily be added; class/method OOP.

Ongoing work

Applications to more complex testing problems: e.g., parameters for censored survival time models, contrasts in designed microarray experiments.
Comparison of different null distributions: general proposal from Theorem 0 vs. test-statistic-specific.
Comparison of single-step vs. step-down procedures.
Test optimality in terms of power: extensions of univariate concepts?

References

www.bepress.com/ucbbiostat
www.stat.berkeley.edu/~sandrine
www.stat.berkeley.edu/~laan

S. Dudoit, M. J. van der Laan, and K. S. Pollard (2003). Bootstrap-based multiple testing: Part I. Single-step procedures. In preparation.
M. J. van der Laan, S. Dudoit, and K. S. Pollard (2003). Bootstrap-based multiple testing: Part II. Step-down procedures. In preparation.
K. S. Pollard and M. J. van der Laan (2003). Resampling-based multiple testing: Asymptotic control of Type I error and applications to gene expression data. Division of Biostatistics, UC Berkeley, Technical Report #121.
S. Dudoit, J. P. Shaffer, and J. C. Boldrick (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, Vol. 18, No. 1, p. 71-103.
Y. Ge, S. Dudoit, and T. P. Speed (2003). Resampling-based multiple testing for microarray data analysis. TEST, Vol. 12, No. 1, p. 1-44.