Multiple Hypothesis Testing in Microarray Data Analysis

Size: px
Start display at page:

Download "Multiple Hypothesis Testing in Microarray Data Analysis"

Transcription

1 Multiple Hypothesis Testing in Microarray Data Analysis Sandrine Dudoit jointly with Mark van der Laan and Katie Pollard Division of Biostatistics, UC Berkeley Short Course: Practical Statistical Analysis of DNA Microarray Data Instructors: Vincent Carey & Sandrine Dudoit KolleKolle, Denmark October 26-28, 2003 c Copyright 2003, all rights reserved

2 Outline Motivation: Microarray data analysis. Multiple hypothesis testing framework. Null distribution. Single-step procedures. Control of gfwer via control of FWER. Step-down procedures. Software: R multtest package. Application: Alizadeh et al. (2000) DLBCL microarray data; Golub et al. (1999) ALL/AML microarray data. Ongoing work. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 2

3 Motivation: Microarray data analysis Cells respond to various treatments and conditions by regulating gene expression, e.g., activating or repressing the transcription of certain genes. DNA microarrays are high-throughput biological assays that can measure DNA or RNA abundance, and hence yield information on gene expression levels, on a genomic scale: transcript (mrna) levels, DNA copy number (CGH), IBD status (GMS), methylation status, SNP genotypes. E.g. In cancer research, microarrays are used to measure transcript levels and DNA copy number in tumor samples for tens of thousands of genes at a time. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 3

4 Motivation: Microarray data analysis General question: Relate genome-wide microarray measures to biological and clinical covariates and outcomes. That is, make inference about the joint distribution of these variables. Covariates: treatment, dose, time, demographic variables, occurrence of DNA sequence motifs, SNP genotypes, etc. Outcomes: affectedness/unaffectedness, quantitative trait, tumor class, metastasis indicator, response to treatment, time to recurrence, patient survival, etc. polychotomous or continuous; censored or uncensored. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 4

5 Motivation: Microarray data analysis Differential gene expression: Identify genes whose expression levels are associated with a response or covariate of interest. Estimation: Estimate parameter of interest for the joint distribution of the microarray measures and responses/covariates. Estimate variability of the estimators. Testing: Test for each gene a null hypothesis concerning a gene-specific parameter of interest. E.g. Parameters of interest include means, differences in means, interactions, correlations, slopes, survival probabilities. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 5

6 Multiple hypothesis testing Large multiplicity problem: thousands of hypotheses are tested simultaneously! Increased chance of false positives (i.e., Type I errors). E.g., chance of at least one p value < α for m independent tests is 1 (1 α) m and converges to one as m increases. For m = 1, 000 and α = 0.01, this chance is ! Individual p values of 0.01 no longer correspond to significant findings. Use a multiple testing procedure (MTP) that accounts for the simultaneous test of multiple hypotheses by considering the joint distribution of the test statistics for each hypothesis. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 6

7 Multiple hypothesis testing Define a suitable set of null hypotheses: Which subset of genes to consider? What is the parameter of interest? Compute an appropriate test statistic for each null hypothesis. Select a Type I error rate that corresponds to a suitable form of control of false positives for the particular application: e.g., FWER, PCER, FDR. Use a multiple testing procedure (MTP) based on a null joint distribution for the test statistics that provides the desired control of the Type I error rate under the true unknown data generating distribution. Use resampling methods to estimate the unknown (null) joint distribution of the test statistics. Report adjusted p values for each hypothesis, which reflect the overall Type I error rate of the MTP. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 7

8 Model Data. Let X 1,..., X n be n independent and identically distributed (i.i.d.) random d-vectors, X P M, where M is a model (possibly non-parametric) for the data generating distribution P. E.g. X i = (W i, Z i ): gene expression measures W i and biological and clinical outcomes Z i for patient i, i = 1,..., n. The dimension of the data X (d) is usually much larger than the sample size (n): n d. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 8

9 Parameters of interest Parameters. Functions of the unknown data generating distribution P : µ = (µ(j) : j = 1,..., m), where µ(j) = µ j (P ) IR and typically m d. Parameters can refer to linear models, generalized linear models, survival models (e.g., Cox proportional hazards model), time-series models, dose-response models, etc. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 9

10 Parameters of interest Location parameters: e.g., means, differences in means, medians. E.g. Assess differential expression for m = d genes. µ(j) = mean expression level of gene j in Population 1. µ(j) µ 2 (j) µ 1 (j) = difference in mean expression level for gene j in Populations 1 and 2. Scale parameters: e.g., covariances, correlations. E.g. Pairwise correlations for m = d (d 1)/2 pairs of genes: γ(j, k), j k = 1,..., d. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 10

11 Parameters of interest Regression parameters: e.g., slopes, main effects, interactions. E.g. Measure association of gene expression levels with an outcome/covariate for m = d genes. µ(j) = parameter for univariate Cox proportional hazards model or dose-response model for gene j. µ(j) = interaction effect for two drugs on expression level of gene j. µ(j) = linear combination a β(j), e.g., contrast in ANOVA. Survival probabilities. E.g. µ(j)(x) = P r(survival X(j) = x), where X(j) is expression measure for gene j. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 11

12 Null hypotheses General submodel null hypotheses. Defined in terms of a collection of m submodels, M j M, for the data generating distribution P. H 0j I ( P M j ) vs. H 1j I ( P / M j ), j = 1,..., m. Thus, H 0j is true, i.e., H 0j = 1, if P M j and false otherwise. True null hypotheses S 0 = S 0 (P ) {j : H 0j is true} = {j : P M j }, False null hypotheses m 0 S 0. S c 0 = S c 0(P ) {j : H 0j is false} = {j : P / M j }, m 1 S c 0 = m m 1. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 12

13 Null hypotheses Single parameter null hypotheses. Each null hypothesis refers to a single parameter, µ(j) = µ j (P ), j = 1,..., m. One-sided tests H 0j = I ( µ(j) µ 0 (j) ) vs. H 1j = I ( µ(j) > µ 0 (j) ), j = 1,..., m. Two-sided tests H 0j = I ( µ(j) = µ 0 (j) ) vs. H 1j = I ( µ(j) µ 0 (j) ), j = 1,..., m. The hypothesized null values, µ 0 (j), are frequently 0. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 13

14 Null hypotheses Multiple parameter null hypotheses. Each null hypothesis refers to a collection of parameters, {µ 1 (j),..., µ K (j)}, j = 1,..., m. E.g. Assess differential expression for m = d genes in K > 2 populations, by testing for each gene the equality of mean expression levels in the K populations. H 0j = I ( µ 1 (j) = µ 2 (j) =... = µ K (j) ) vs. H 1j = I ( At least one µ k (j) is different ), j = 1,..., m. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 14

15 Test statistics Rejection decisions are based on an m-vector of test statistics, T n = (T n (j) : j = 1,..., m) Q n (P ) = Q n, where we assume that large values of T n (j) provide evidence against null hypothesis H 0j, j = 1,..., m. E.g. Reject H 0j if T n (j) > c j, where c = (c j : j = 1,..., m) denotes an m-vector of possibly random cut-offs. For two-sided tests, can take absolute value of test statistics. Notation. Use lower case letters (e.g., x, t n, p n (j)) for realizations of random variables (e.g., X, T n, Pn (j)). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 15

16 Test statistics Single parameter null hypotheses. H 0j = I ( µ(j) µ 0 (j) ) vs. H 1j = I ( µ(j) > µ 0 (j) ), j = 1,..., m. Difference statistics D n (j) ( Estimator Null Value ) = n(µ n (j) µ 0 (j)). t-statistics (i.e., standardized difference) T n (j) Estimator Null Value Standard Error = n µ n(j) µ 0 (j). σ n (j) Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 16

17 Test statistics Estimators: µ n = (µ n (j) : j = 1,..., m) are assumed to be asymptotically linear estimators of the parameter µ = (µ(j) : j = 1,..., m), with m-dimensional vector influence curve (IC), IC(X P ) = (IC j (X P ) : j = 1,..., m), such that µ n (j) µ(j) = 1 n n IC j (X i P ) + o P (1/ n), i=1 where E[IC(X P )] = 0 and Σ(P ) and ρ(p ), denote, respectively, the covariance and correlation matrices of the vector IC. Standard errors: σ 2 n(j) are assumed to be consistent estimators of the IC variances, σ 2 (j) = E[IC j (X P ) 2 ]. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 17

18 Test statistics N.B. This general representation includes standard one-sample and two-sample t-statistics, but also test statistics for correlations and regression parameters in linear and non-linear models. E.g. For population mean parameters, µ(j) = E[X(j)], and sample mean estimators, µ n (j) = X n (j) = i X i(j)/n, then IC j (X i P ) = X i (j) µ(j). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 18

19 Test statistics E.g. One-sample t-statistics. Suppose we have n i.i.d. random d-vectors, X 1,..., X n P, from a population with mean parameter vector µ = E[X]. Null hypotheses. H 0j = I ( µ(j) = µ 0 (j) ), j = 1,..., m. Test statistics. T n (j) = n X n (j) µ 0 (j), j = 1,..., m, σ n (j) where X n (j) = i X i(j)/n and σ 2 n(j) = i (X i(j) X n (j)) 2 /n denote the sample means and variances, respectively. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 19

20 Test statistics E.g. Two-sample t-statistics. Suppose we have n k i.i.d. random d-vectors, X k,1,..., X k,nk P k, from Population k, with mean parameter vector µ k = E[X k ], k = 1, 2. Null hypotheses. Equal population means H 0j = I ( µ(j) µ 2 (j) µ 1 (j) = 0 ), j = 1,..., m. Test statistics. Two-sample Welch s t-statistics T n (j) = X 2,n (j) X 1,n (j), j = 1,..., m, σ1,n 2 (j)/n 1 + σ2,n 2 (j)/n 2 where X k,n (j) and σk,n 2 (j) denote sample means and variances for Population k, respectively. Can also use equal variance analogue. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 20

21 Test statistics E.g. (K-sample) F -statistics. Suppose we have n k i.i.d. random d-vectors, X k,1,..., X k,nk P k, from Population k, with mean parameter vector µ k = E[X k ], k = 1,..., K. Null hypotheses. Equal population means Test statistics. T n (j) = H 0j = I ( µ 1 (j) = µ 2 (j) =... = µ K (j) ) vs. H 1j = I ( At least one µ k (j) is different ). 1/(K 1) K k=1 n k( X k,n (j) X n (j)) 2 1/(n K) K nk k=1 i=1 (X k,i(j) X, j = 1,..., m, k,n (j)) 2 where X k,n (j) denote the sample means for Population k and X n (j) = k n k X k,n (j)/ k n k denotes the overall sample means. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 21

22 Multiple testing procedures A multiple testing procedure (MTP) produces a set S n of rejected hypotheses that estimates S c 0, the set of false null hypotheses S n = S(T n, Q 0, α) {j : H 0j is rejected} {1,..., m}, where the set S n (or Ŝc 0) depends on the data, X 1,..., X n, through the test statistics T n ; the null distribution, Q 0, for the test statistics T n, used to compute cut-offs for these T n (or corresponding p-values); the nominal level α of the MTP, i.e., the desired upper bound for the Type I error rate. E.g. Set of genes identified as differentially expressed based on S n {j : T n (j) > c j }, where c = (c j : j = 1,..., m) denotes an m-vector of possibly random cut-offs, computed under Q 0. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 22

23 Single-step vs. stepwise procedures One distinguishes between two main approaches for choosing a cut-off vector c = (c j : j = 1,..., m). Single-step procedures: equivalent adjustments for all H 0j, i.e., cut-offs only depend on null distribution Q 0 and level α: c = c(q 0, α). (Bonferroni) Stepwise procedures: adjustments depend on the observed data, i.e., cut-offs also depend on test statistics T n : c = c(t n, Q 0, α). Step-down: start with most significant hypothesis, as soon as one fails to reject a null hypothesis, no further hypotheses are rejected. (Holm, 1979). Step-up: start with least significant hypothesis, as soon as one rejects a null hypothesis, reject all hypotheses that are more significant. (Hochberg, 1986). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 23

24 Type I and Type II errors Given sets S n = S(T n, Q 0, α) {j : H 0j is rejected}, S 0 = S 0 (P ) {j : H 0j is true}, two types of errors can be committed by an MTP. Type I error or false positive: S n S 0, i.e., reject (S n ) a true null hypothesis (S 0 ). Type II error or false negative: S c n S c 0, i.e., fail to reject (S c n) a false null hypothesis (S c 0). Cf. power: S n S c 0. Denote the number of Type I errors by V n S n S 0 and the corresponding cumulative distribution function (c.d.f.) by F Vn. E.g. Type I: Say gene is differentially expressed, when in truth it isn t. Type II: Fail to identify a truly differentially expressed gene. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 24

25 Type I and Type II errors not rejected Null hypotheses rejected Null hypotheses true Sn c S 0 V n = S n S 0 m 0 = S 0 (Type I) false U n = Sn c S0 c S n S0 c m 1 = S0 c (Type II) S c n R n = S n m Adapted from Benjamini & Hochberg (1995). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 25

26 Type I and Type II errors Let m 0 = S 0 = # true null hypotheses unknown parameter R n = S n = # rejections observable random variable V n = S n S 0 = # Type I errors unobservable random variable U n = Sn c S0 c = # Type II errors unobservable random variable. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 26

27 Type I error rates Consider Type I error rates, θ(f Vn ), that are functions, i.e., parameters, of the distribution F Vn of the number of false positives. 1. Per-comparison error rate (PCER), or expected proportion of Type I errors among the m tests, P CER E[V n ]/m = vdf Vn (v)/m. 2. Per-family error rate (PFER), or expected number of Type I errors, P F ER E[V n ] = vdf Vn (v). 3. Median-based per-family error rate (mpfer), or median number of Type I errors, mp F ER Median(V n ) = F 1 V n (1/2). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 27

28 Type I error rates 4. Family-wise error rate (FWER), or probability of at least one Type I error, F W ER P r(v n > 0) = 1 F Vn (0). 5. Generalized family-wise error rate (gfwer), or probability of at least k + 1 Type I errors, k = 0,..., m 0 1, gf W ER P r(v n > k) = 1 F Vn (k). When k = 0, the gfwer is the usual family-wise error rate, FWER. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 28

29 Type I error rates Make the following two assumptions for the mapping θ : F θ(f ), defining a Type I error rate as a parameter corresponding to a c.d.f F on {0,..., m}. Monotonicity. If F 1 F 2, then θ(f 1 ) θ(f 2 ). Uniform continuity. For two sequences of c.d.f s, {F n } and {G n }, if d(f n, G n ) 0, then θ(f n ) θ(g n ) 0, as n. Here, given two c.d.f s, F 1 and F 2 on {0,..., m}, the distance measure d is defined by d(f 1, F 2 ) max x {0,...,m} F 1 (x) F 2 (x). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 29

30 Type I error rates N.B. The false discovery rate (FDR) of Benjamini & Hochberg (1995) cannot be represented as a parameter θ(f Vn ), because it also involves the distribution of the number of rejections, R n. The FDR is the expected proportion of Type I errors among the rejected hypotheses, i.e., F DR E[V n /R n ], with the convention that V n /R n = 0, if R n = 0. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 30

31 Adjusted p-values The unadjusted p-value for the test of a single hypothesis, H 0j, is based only on the test statistic T n (j) for that hypothesis p n (j) = P (t n (j), Q 0 ) inf {α [0, 1] : Reject H 0j at single test level α}, = P r Q0 (T n (j) t n (j)), j = 1,..., m. That is, P (t n (j), Q 0 ) is the level of the single hypothesis testing procedure at which H 0j would just be rejected, given T n (j) = t n (j). If Q 0 has continuous marginal c.d.f. Q 0j, p n (j) = 1 Q 0j (t n (j)). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 31

32 Adjusted p-values In an MTP, the adjusted p-value for null hypothesis H 0j is defined as p n (j) = P (j, t n, Q 0 ) inf {α [0, 1] : j S(t n, Q 0, α)}, j = 1,..., m. That is, P (j, tn, Q 0 ) is the level of the entire MTP at which H 0j would just be rejected, given T n = t n. E.g. Bonferroni for FWER control: p n (j) = min(mp n (j), 1). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 32

33 Adjusted p-values We now have two representations for an MTP, in terms of cut-offs for the test statistics T n (j), c j = c j (T n, Q 0, α): S(T n, Q 0, α) = {j : T n (j) > c j }; adjusted p-values, Pn (j) = P (j, T n, Q 0 ): S(T n, Q 0, α) = {j : P n (j) α}. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 33

34 Adjusted p-values Flexible: The results of an MTP are provided for all levels α = do not need to choose the level ahead of time. Reflect the strength of the evidence against each null hypothesis in terms of the overall Type I error rate of the MTP. Can be defined for any Type I error rate: FWER, PCER, FDR. Plots of sorted adjusted p-values for different Type I error rates allow scientists to decide on an appropriate combination of number of rejections and tolerable false positive rate for a particular application and available resources. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 34

35 Null distribution In MTPs, the true unknown distribution Q n = Q n (P ), for the test statistics T n, is typically estimated by a null distribution Q 0, in order to derive cut-offs for these test statistics T n (or corresponding adjusted p-values). N.B. The choice of null distribution Q 0 is crucial, in order to ensure that (finite sample or asymptotic) control of the Type I error under the assumed Q 0 does indeed provide the required control under the true distribution Q n (P ). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 35

36 Null distribution For proper control of the Type I error rate, the assumed null distribution Q 0 must be such that where θ(f Vn ) θ(f V0 ) [finite sample control] lim sup θ(f Vn ) θ(f V0 ) [asymptotic control], n V n Number of Type I errors under T n Q n (P ), V 0 Number of Type I errors under T n Q 0. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 36

37 Null distribution The required control of the Type I error rate follows from the following null domination conditions F Vn (x) F V0 (x) [finite sample control] lim inf n F V n (x) F V0 (x) [asymptotic control] x {0,..., m}. (1) That is, the S 0 -specific distribution of Q 0 equals or dominates the S 0 -specific distribution of Q n (P ). More specific (i.e., less stringent) forms of null domination can be derived for given definitions of the Type I error rate, i.e., of the parameter θ(f ). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 37

38 Null distribution We propose a general construction for a null distribution Q 0 for the test statistics T n. This null distribution satisfies the asymptotic null domination condition lim inf n F V n (x) F V0 (x), x {0,..., m}, and provides asymptotic control of general Type I error rates θ(f Vn ), lim sup θ(f Vn ) α, n in single-step and step-down procedures, for testing general null hypotheses H 0j = I ( P M j ), corresponding to submodels M j M for the data generating distribution P. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 38

39 Null distribution Theorem 0. General construction for test statistics null distribution Q 0. Suppose there exists m-vectors of null values θ 0 IR m and τ 0 IR +m, such that, for j S 0, lim sup ET n (j) θ 0 (j) and lim sup V ar[t n (j)] τ 0 (j). n n Then, let ν 0n (j) min ( 1, shifted and scaled statistics Z n (j) ) τ 0 (j) V ar[t n (j)] and define null-value Z n (j) ν 0n (j)(t n (j) + θ 0 (j) ET n (j)). Suppose that Z n D Z Q0 (P ). Then, Q 0 = Q 0 (P ) satisfies the asymptotic null domination condition in (1). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 39

40 Null distribution t-statistics. For tests of single parameter null hypotheses using t-statistics, the null values are θ 0 (j) = 0 and τ 0 (j) = 1, and the null distribution Q 0 = Q 0 (P ) for T n corresponds to the asymptotic distribution of the standardized estimators µ n. That is, Q 0 (P ) N(0, ρ(p )), where ρ(p ) denotes the correlation matrix of the vector influence curve. E.g. In the special case of tests of means, where µ = E[X], ρ(p ) is simply the correlation matrix for X P. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 40

41 Null distribution N.B. The following important points distinguish our approach from existing approaches to Type I error rate control in MTP. We are only concerned with control under the true data generating distribution P. The notions of weak and strong control are therefore irrelevant to our approach. We propose a null distribution for the test statistics (T n Q 0 ), and not a data generating null distribution (X P 0 ). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 41

42 Null distribution It is common to define Q 0 in terms of a data generating null distribution, P 0, that satisfies the complete null hypothesis. E.g. Define Q 0 lim n Q n (P 0 ), where P 0 m j=0 M j. However, this practice does not necessarily provide asymptotic control under the true distribution P, as the null distribution Q n (P 0 ) and the true distribution Q n (P ) for the test statistics T n may have different limits: lim n Q n (P 0 ) lim n Q n (P ). In particular, for single parameter null hypotheses, the t-statistics T n have Gaussian asymptotic distributions, but the covariance matrices may differ under the true P and under the null P 0. That is, Σ(P 0 ) Σ(P ), and, in particular, Σ S0 (P 0 ) Σ S0 (P ). As a result null domination, and hence Type I error control, may fail. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 42

43 Estimation of null distribution In practice, since P is unknown, then so is the null distribution Q 0 = Q 0 (P ). One can estimate Q 0 by Q 0n as follows. Bootstrap null distribution: Z # n Q 0n, based on X # 1,..., X# n P n, very general. Test statistics specific null distribution Single parameter null hypotheses: Q 0n = N(0, ρ n ), where ρ n is an estimator of the correlation matrix ρ(p ) of the vector IC. E.g., for means, ρ(p ) can be estimated directly from the data, in other cases, ρ(p ) can be estimated with the bootstrap. Multiple parameter null hypotheses: F -specific distribution (quadratic forms of Gaussian random variables). Data generating null distribution: Q 0n = Q n (P 0n ), where P 0n is an estimator of P 0 m j=0m j, e.g., using permutation. Problem. May have lim n Q n (P 0n ) lim n Q n (P ). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 43

44 Bootstrap estimation of null distribution Procedure 0. Bootstrap estimation of null distribution Q Generate B (non-parametric or parametric) bootstrap samples. 2. For each bootstrap sample, compute an m-vector of test statistics, T b n = (T b n(j) : j = 1,..., m), which can be arranged in an m B matrix, T = `T b n(j), with rows corresponding to the m hypotheses and columns to the B bootstrap samples. 3. Compute row means and variances of the matrix T, to yield estimates of ET n (j) and V ar[t n (j)], j = 1,..., m. 4. Obtain an m B matrix, Z = `Z b n(j), of null-value shifted and scaled statistics, Z b n(j), by row shifting and scaling the matrix T using the bootstrap estimates of ET n (j) and V ar[t n (j)] and the user-supplied null-values θ 0 (j) and τ 0 (j). 5. The bootstrap estimate Q 0n of the null distribution Q 0 is the empirical distribution of the columns Z b n of matrix Z. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 44

45 Resampling-based multiple testing procedures Theorem 1. Asymptotic control for procedures based on consistent estimator of null distribution Q 0. Let Q 0n be a consistent estimator of the null distribution Q 0, i.e., Q 0n converges weakly to Q 0 (e.g., from bootstrap). Consider a single-step or step-down MTP, S(T n, Q 0, α), based on cut-offs c(t n, Q 0, α), that provides asymptotic control of the Type I error rate θ(f Vn ) at level α. Let c(t n, Q 0n, α) and S(T n, Q 0n, α) denote the corresponding cut-offs and MTP based on the estimator Q 0n. Then, under regularity conditions for the null distribution Q 0, c j (T n, Q 0n, α) c j (T n, Q 0, α) P 0 as n, j = 1,..., m. Consequently, the MTP S(T n, Q 0n, α) also provides asymptotic control of the Type I error rate θ(f Vn ) at level α. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 45

46 Single-step multiple testing procedures Given a null distribution Q 0, a mapping θ : F θ(f ) defining the Type I error rate, and a target or nominal level α, consider single-step procedures of the form S(T n, Q 0, α) = {j : T n (j) > c j (Q 0, α)}, based on two definitions for the m-vector of cut-offs c(q 0, α) = (c j (Q 0, α) : j = 1,..., m). Let R(c Q) V (c Q) m I(T n (j) > c j ) = # rejections for T n Q, j=1 j S 0 I(T n (j) > c j ) = # Type I errors for T n Q, R 0 = R(c Q 0 ), R n = R(c Q n ), V 0 = V (c Q 0 ), V n = V (c Q n ). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 46

47 Single-step multiple testing procedures Single-step common cut-off procedure. Choose common cut-off c j (Q 0, α) c 0 as c 0 min{c IR : θ(f R((c,...,c) Q0 )) α}. (2) Single-step common quantile procedure. Define cut-offs c j (Q 0, α) Q 1 0j (1 δ 0), as the (1 δ 0 ) quantiles of the marginal distributions Q 0j. Choose common quantile δ 0 as δ 0 max{δ (0, 1) : θ(f R((Q 1 0j (1 δ):j=1,...,m) Q ) α}. (3) 0) Thus, the common cut-off c 0 or common quantile δ 0 are chosen so that θ(f R0 ) α. E.g. For FWER, P r(r 0 > 0) α, for PCER, E[R 0 ]/m α Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 47

48 Single-step multiple testing procedures Theorem 2. Asymptotic control of Type I error θ(f Vn ) for single-step procedures. Consider a null distribution Q 0 as in Theorem 0 and a Type I error rate θ(f Vn ) defined in terms of a monotone and uniformly continuous mapping θ : F θ(f ). Define a single-step procedure S n = S(T n, Q 0, α) = {j : T n (j) > c j (Q 0, α)} based on common cut-offs c j (Q 0, α) as in (2) or common quantile cut-offs c j (Q 0, α) as in (3). Then, S n provides asymptotic control of the Type I error rate θ(f Vn ) at level α lim sup θ(f Vn ) α. n Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 48

49 Single-step multiple testing procedures Sketch of proof. 1. Control the parameter θ(f R0 ) α, corresponding with the observed number of rejections R 0 under the null Q Since θ( ) is monotone and the number of Type I errors V 0 is never larger than the number of rejections R 0, V 0 R 0, then θ(f V0 ) θ(f R0 ). 3. Since θ( ) is uniformly continuous and Q 0 has the asymptotic null domination property, lim inf n F Vn (x) F V0 (x) x, then lim sup θ(f Vn ) α. n Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 49

50 Single-step multiple testing procedures: Adjusted p-values Single-step common cut-off procedure. Adjusted p-values are given by P n (j) = θ(f R(Cn (j) Q 0 )), for j-specific common cut-off vectors C n (j) = (T n (j),..., T n (j)). Single-step common quantile procedure. Adjusted p-values are given by P n (j) = θ(f R(Cn (j) Q 0 )), for j-specific common quantile cut-off vectors C n (j) = (Q 1 0k (1 P n(j)) : k = 1,..., m) and unadjusted p-values P n (j) = 1 Q 0j (T n (j)). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 50

51 Single-step multiple testing procedures: gfwer Single-step common cut-off procedure for gfwer. p n (j) = P r Q0 (T n(k + 1) t n (j)), where T n(k + 1) denotes the kth most significant test statistic, that is, T n(1) T n(2)... T n(m). Single-step common quantile procedure for gfwer. p n (j) = P r Q0 (P n(k + 1) p n (j)), where P n(k + 1) denotes the kth most significant unadjusted p-value, that is, P n(1) P n(2)... P n(m). For FWER control (k = 0), our general procedure reduces to the single-step maxt and single-step minp procedures, based on the maximum test statistic T n(1) and the minimum unadjusted p-value P n(1), respectively. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 51

52 Control of gfwer via control of FWER One can derive a gfwer-controlling procedure using a simple modification of an FWER-controlling procedure. The main idea is to follow the FWER-controlling procedure exactly until the first non-rejection and then reject this hypothesis and also the next k 1 most significant hypotheses. Advantages. (i) Only requires working with the FWER. (ii) Guarantees at least k rejected hypotheses. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 52

53 Control of gfwer based on control of FWER Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 53

54 Control of gfwer via control of FWER Proof. Let Vn 0 and Vn k denote the number of Type I errors for the FWER- and gfwer-controlling procedures, respectively. Then, Vn k Vn 0 + k, so that P r(vn k > k) P r(vn 0 + k > k) = P r(vn 0 > 0). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 54

55 Control of gfwer via control of FWER Let O n (j) denote the indices for the ordered test statistics (or unadjusted p-values), so that O n (1) corresponds to the most significant hypothesis, and O n (m) to the least significant hypothesis. That is, T n (O n (1))... T n (O n (m)). k The adjusted p-values, P n, for the gfwer procedure are obtained 0 by simply shifting by k the adjusted p-values, P n, for the FWER procedure P n k 0, j = 1,..., k, (O n (j)) = P n(o 0 n (j k)), j = k + 1,..., m. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 55

56 Step-down multiple testing procedures Step-down maxt procedure for FWER control. Let T n(j) be the ordered test statistics, T n(1)... T n(m), and O n (j) the indices for these order statistics, so that T n(j) = T n (O n (j)). For Z Q 0 and subsets O n (j) {O n (j),..., O n (m)}, define quantiles C n (j) (1 α) quantile of max k On (j) Z(k) and step-down cut-offs C n(1) C n (1), C n(j) 8 < C n (j), if Tn(j 1) > Cn(j 1) :, otherwise, j = 2,..., m. Reject H 0,On (j) if T n(j) > C n(j), j = 1,..., m. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 56

57 Step-down multiple testing procedures The adjusted p-value for hypothesis H 0,On (j) is given by P n (O n (j)) = max k=1,...,j «ff jp r Q0 max Z(l) T n(o n (k)). l {O n (k),...,o n (m)} That is, the MTP is based on the distribution of successive maxima of the test statistics T n (j), over subsets determined by their observed ordering. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 57

58 Step-down multiple testing procedures An analogue of the step-down maxt procedure can be defined based on the distribution of successive minima of the unadjusted p-values P n (j). Step-down minp procedure for FWER control. The adjusted p-value for hypothesis H 0,On (j) is given by P n (O n (j)) = max k=1,...,j jp r Q0 min l {O n (k),...,o n (m)} «ff Q 0l (Z(l)) P n (O n (k)), where O n (j) denote the indices of the ordered unadjusted p-values, i.e., P n (O n (1))... P n (O n (m)). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 58

59 Step-down multiple testing procedures Theorem 3. Asymptotic control of FWER (and gfwer) for step-down procedures. Consider a null distribution Q 0 as in Theorem 0. Assume that, with probability one in the limit, the most significant test statistics correspond to the false null hypotheses, i.e., lim n P r ( {O n (1),..., O n (m 1 )} = S c 0) = 1. Then, the step-down maxt and step-down minp procedures provide asymptotic control of the FWER at level α lim sup P r(v n > 0) α. n Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 59

60 Alizadeh et al. (2000) DLBCL microarray dataset Target mrna samples from n = 40 Diffuse Large B-Cell Lymphoma (DLBCL) patients. Expression levels of m = 13, 412 clones measured with cdna microarrays (relative to a pooled control). Patients belong to two molecularly distinct disease groups: n 1 = 21 Activated B-like (Population 1); n 2 = 19 Germinal center (GC) B-like (Population 2). Survival time T measured on each patient. Pre-processing: log 2 (R/G); replace missing values with average log 2 (R/G) for that gene; truncate ratios exceeding 20-fold to ± log 2 (20). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 60

61 Alizadeh et al. (2000) DLBCL microarray dataset Model: n k i.i.d. random m-vectors, X k,1,..., X k,nk P k, of expression measures from patient Population k, with mean parameter vector µ k = E[X k ], k = 1, 2. Null hypotheses: Equal mean expression levels in GC B-like (Popn. 2) and activated B-like (Popn. 1) patients. H 0j = I`µ(j) µ 2 (j) µ 1 (j) = 0, j = 1,..., m. Test statistics: Welch s two-sample t-statistics (unequal variances). Type I error rate: F W ER = P r(v n > 0), α = Multiple testing procedures: 1. Single-step common quantile (minp), with non-parametric bootstrap null distribution, Q 0n. 2. Single-step common quantile (minp), with permutation null distribution, Q n (P 0n ). 3. Bonferroni, with nominal t-distribution. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 61

62 Alizadeh et al. (2000) DLBCL microarray dataset Table 1: Test difference in mean expression levels for GC B-like and activated B-like patients. Number of rejected null hypotheses (out of m = 13, 412) for three MTPs for FWER control. MTP Cut-offs Null distribution # rejections (R n ) 1 Single-step minp Bootstrap Single-step minp Permutation Bonferroni Nominal t 32 All 32 genes identified as differentially expressed by MTP3 were also identified by MTP 1 and MTP 2. MTP1 and MTP2 identified 156 genes in common. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 62

63 Alizadeh et al. (2000) DLBCL microarray dataset Model: Logistic regression model for each gene P r(gc B-like X(j)) = eβ 0(j)+β 1 (j) X(j), j = 1,..., m. 1 + e β 0(j)+β 1 (j) X(j) Null hypotheses: No association between expression measure X(j) of gene j and patient group: H 0j = I`β 1 (j) = 0, j = 1,..., m. Test statistics: T n (j) = nβ 1n (j). Type I error rate: gf W ER = P r(v n > k), k = 1,..., 200, α = Multiple testing procedures: 1. Direct gfwer single-step common quantile procedure, i.e., single-step based on P n(k + 1), with non-parametric bootstrap null distribution, Q 0n. 2. Via FWER single-step common quantile procedure, i.e., single-step minp based on P n(1), with non-parametric bootstrap null distribution, Q 0n. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 63

64 Alizadeh et al. (2000) DLBCL microarray dataset Table 2: Test for slope in logistic regression model for DLBCL class and expression measures. Number of rejected null hypotheses (out of m = 13, 412) for two MTPs for gfwer control. k Direct gfwer R k n Via FWER R 0 n + k Here, R k n denotes the number of rejected hypotheses for a procedure directly controlling the gfwer. For FWER, k = 0. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 64

65 Golub et al. (1999) ALL/AML microarray dataset Target mrna samples from n = 38 leukemia patients (learning set). Expression levels of d = 6, 817 human genes measured with Affymetrix hu6800 chip. Patients belong to two molecularly distinct disease groups: n 1 = 27 acute lymphoblastic leukemia (ALL) (Population 1); n 2 = 11 acute myeloid leukemia (AML) (Population 2). Pre-processing: thresholding: floor of 100 and ceiling of 16,000; filtering: exclusion of genes with max / min 5 or (max min) 500; base 10 logarithmic transformation. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 65

66 Golub et al. (1999) ALL/AML microarray dataset Model: n k i.i.d. random m-vectors, X k,1,..., X k,nk P k, of expression measures from patient Population k, with mean parameter vector µ k = E[X k ], k = 1, 2. Null hypotheses: Equal mean expression levels in AML (Popn. 2) and ALL (Popn. 1) patients. H 0j = I`µ(j) µ 2 (j) µ 1 (j) = 0, j = 1,..., m. Test statistics: Welch s two-sample t-statistics. Multiple testing procedures: Permutation null distribution. FWER: Bonferroni, Holm (1979), Hochberg (1986), step-down maxt. FDR: Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001). PCER: SAM, unadjusted p-values. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 66

67 Golub et al. (1999) ALL/AML microarray dataset Sorted adjusted p values Bonferroni Holm Hochberg maxt BH BY rawp SAM Efron SAM Tusher Number of rejected hypotheses Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 67

68 Golub et al. (1999) ALL/AML microarray dataset Sorted adjusted p values Bonferroni Holm Hochberg maxt BH BY rawp SAM Efron SAM Tusher Number of rejected hypotheses Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 68

69 Software: multtest Bioconductor R package: multtest. Authors: Yongchao Ge, Katie Pollard, Sandrine Dudoit. Test statistics. t-statistics and F -statistics (one- and two-factor) for linear models and Cox PH model. Includes non-parametric and weighted statistics. Null distributions. Permutation and bootstrap. Multiple testing procedures. FWER control: Bonferroni, Holm (1979), Hochberg (1986), single-step and step-down maxt and minp. FDR control: Benjamini & Hochberg (1995), Benjamini & Yekutieli (2001). Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 69

70 Software: multtest Output: Parameter estimates, test statistics, unadjusted and adjusted p-values, cut-off vector, confidence intervals, null distribution. Plots: Type I error rate vs. # rejections, # rejections vs. adjusted p-values, adjusted p-values vs. test statistics ( volcano ). Software design: Closures: new tests can easily be added. Class/method OOP. Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 70

71 Ongoing work Applications to more complex testing problems: e.g., parameters for censored survival time models, contrasts in designed microarray experiments. Comparison of different null distributions: general proposal from Theorem 0 vs. test statistic specific. Comparison of single-step vs. step-down procedures. Test optimality in terms of power: extensions of univariate concepts? Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 71

72 References S. Dudoit, M. J. van der Laan, and K. S. Pollard (2003). Bootstrap-based multiple testing: Part I. Single-step procedures (In preparation). M. J. van der Laan, S. Dudoit, and K. S. Pollard (2003). Bootstrap-based multiple testing: Part II. Step-down procedures (In preparation). K. S. Pollard and M. J. van der Laan (2003). Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data. Division of Biostatistics, UC Berkeley, Technical Report #121. S. Dudoit, J. P. Shaffer, and J. C. Boldrick (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, Vol. 18, No. 1, p Y. Ge, S. Dudoit, and T. P. Speed (2003). Resampling-based multiple testing for microarray data analysis. TEST, Vol. 12, No. 1, p Sandrine Dudoit KolleKolle, October 26-28, 2003 Page 72

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 3, Issue 1 2004 Article 13 Multiple Testing. Part I. Single-Step Procedures for Control of General Type I Error Rates Sandrine Dudoit Mark

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 164 Multiple Testing Procedures: R multtest Package and Applications to Genomics Katherine

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 3, Issue 1 2004 Article 14 Multiple Testing. Part II. Step-Down Procedures for Control of the Family-Wise Error Rate Mark J. van der Laan

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2005 Paper 168 Multiple Testing Procedures and Applications to Genomics Merrill D. Birkner Katherine

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn

More information

Step-down FDR Procedures for Large Numbers of Hypotheses

Step-down FDR Procedures for Large Numbers of Hypotheses Step-down FDR Procedures for Large Numbers of Hypotheses Paul N. Somerville University of Central Florida Abstract. Somerville (2004b) developed FDR step-down procedures which were particularly appropriate

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2005 Paper 198 Quantile-Function Based Null Distribution in Resampling Based Multiple Testing Mark J.

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

Single gene analysis of differential expression

Single gene analysis of differential expression Single gene analysis of differential expression Giorgio Valentini DSI Dipartimento di Scienze dell Informazione Università degli Studi di Milano valentini@dsi.unimi.it Comparing two conditions Each condition

More information

The miss rate for the analysis of gene expression data

The miss rate for the analysis of gene expression data Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,

More information

PB HLTH 240A: Advanced Categorical Data Analysis Fall 2007

PB HLTH 240A: Advanced Categorical Data Analysis Fall 2007 Cohort study s formulations PB HLTH 240A: Advanced Categorical Data Analysis Fall 2007 Srine Dudoit Division of Biostatistics Department of Statistics University of California, Berkeley www.stat.berkeley.edu/~srine

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2009 Paper 249 Resampling-Based Multiple Hypothesis Testing with Applications to Genomics: New Developments

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Looking at the Other Side of Bonferroni

Looking at the Other Side of Bonferroni Department of Biostatistics University of Washington 24 May 2012 Multiple Testing: Control the Type I Error Rate When analyzing genetic data, one will commonly perform over 1 million (and growing) hypothesis

More information

ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE. By Wenge Guo and M. Bhaskara Rao

ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE. By Wenge Guo and M. Bhaskara Rao ON STEPWISE CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE By Wenge Guo and M. Bhaskara Rao National Institute of Environmental Health Sciences and University of Cincinnati A classical approach for dealing

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Research Article Sample Size Calculation for Controlling False Discovery Proportion

Research Article Sample Size Calculation for Controlling False Discovery Proportion Probability and Statistics Volume 2012, Article ID 817948, 13 pages doi:10.1155/2012/817948 Research Article Sample Size Calculation for Controlling False Discovery Proportion Shulian Shang, 1 Qianhe Zhou,

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2009 Paper 245 Joint Multiple Testing Procedures for Graphical Model Selection with Applications to

More information

High-throughput Testing

High-throughput Testing High-throughput Testing Noah Simon and Richard Simon July 2016 1 / 29 Testing vs Prediction On each of n patients measure y i - single binary outcome (eg. progression after a year, PCR) x i - p-vector

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1

Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Controlling Bayes Directional False Discovery Rate in Random Effects Model 1 Sanat K. Sarkar a, Tianhui Zhou b a Temple University, Philadelphia, PA 19122, USA b Wyeth Pharmaceuticals, Collegeville, PA

More information

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES By Wenge Guo Gavin Lynch Joseph P. Romano Technical Report No. 2018-06 September 2018

More information

Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method

Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method Christopher R. Genovese Department of Statistics Carnegie Mellon University joint work with Larry Wasserman

More information

False Discovery Rate

False Discovery Rate False Discovery Rate Peng Zhao Department of Statistics Florida State University December 3, 2018 Peng Zhao False Discovery Rate 1/30 Outline 1 Multiple Comparison and FWER 2 False Discovery Rate 3 FDR

More information

Resampling-based Multiple Testing with Applications to Microarray Data Analysis

Resampling-based Multiple Testing with Applications to Microarray Data Analysis Resampling-based Multiple Testing with Applications to Microarray Data Analysis DISSERTATION Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a

More information

Control of Generalized Error Rates in Multiple Testing

Control of Generalized Error Rates in Multiple Testing Institute for Empirical Research in Economics University of Zurich Working Paper Series ISSN 1424-0459 Working Paper No. 245 Control of Generalized Error Rates in Multiple Testing Joseph P. Romano and

More information

Biochip informatics-(i)

Biochip informatics-(i) Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

More information

Stat 206: Estimation and testing for a mean vector,

Stat 206: Estimation and testing for a mean vector, Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

Control of Directional Errors in Fixed Sequence Multiple Testing

Control of Directional Errors in Fixed Sequence Multiple Testing Control of Directional Errors in Fixed Sequence Multiple Testing Anjana Grandhi Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102-1982 Wenge Guo Department of Mathematical

More information

Single gene analysis of differential expression. Giorgio Valentini

Single gene analysis of differential expression. Giorgio Valentini Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples

More information

SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE

SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE Statistica Sinica 18(2008), 881-904 SOME STEP-DOWN PROCEDURES CONTROLLING THE FALSE DISCOVERY RATE UNDER DEPENDENCE Yongchao Ge 1, Stuart C. Sealfon 1 and Terence P. Speed 2,3 1 Mount Sinai School of Medicine,

More information

A Large-Sample Approach to Controlling the False Discovery Rate

A Large-Sample Approach to Controlling the False Discovery Rate A Large-Sample Approach to Controlling the False Discovery Rate Christopher R. Genovese Department of Statistics Carnegie Mellon University Larry Wasserman Department of Statistics Carnegie Mellon University

More information

STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE. National Institute of Environmental Health Sciences and Temple University

STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE. National Institute of Environmental Health Sciences and Temple University STEPDOWN PROCEDURES CONTROLLING A GENERALIZED FALSE DISCOVERY RATE Wenge Guo 1 and Sanat K. Sarkar 2 National Institute of Environmental Health Sciences and Temple University Abstract: Often in practice

More information

Christopher J. Bennett

Christopher J. Bennett P- VALUE ADJUSTMENTS FOR ASYMPTOTIC CONTROL OF THE GENERALIZED FAMILYWISE ERROR RATE by Christopher J. Bennett Working Paper No. 09-W05 April 2009 DEPARTMENT OF ECONOMICS VANDERBILT UNIVERSITY NASHVILLE,

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Introductory Econometrics

Introductory Econometrics Session 4 - Testing hypotheses Roland Sciences Po July 2011 Motivation After estimation, delivering information involves testing hypotheses Did this drug had any effect on the survival rate? Is this drug

More information

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses

On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses On Procedures Controlling the FDR for Testing Hierarchically Ordered Hypotheses Gavin Lynch Catchpoint Systems, Inc., 228 Park Ave S 28080 New York, NY 10003, U.S.A. Wenge Guo Department of Mathematical

More information

SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES

SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES A thesis submitted to the faculty of San Francisco State University In partial fulfillment of The requirements for The

More information

Topic 3: Hypothesis Testing

Topic 3: Hypothesis Testing CS 8850: Advanced Machine Learning Fall 07 Topic 3: Hypothesis Testing Instructor: Daniel L. Pimentel-Alarcón c Copyright 07 3. Introduction One of the simplest inference problems is that of deciding between

More information

Resampling-Based Control of the FDR

Resampling-Based Control of the FDR Resampling-Based Control of the FDR Joseph P. Romano 1 Azeem S. Shaikh 2 and Michael Wolf 3 1 Departments of Economics and Statistics Stanford University 2 Department of Economics University of Chicago

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume, Issue 006 Article 9 A Method to Increase the Power of Multiple Testing Procedures Through Sample Splitting Daniel Rubin Sandrine Dudoit

More information

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

More information

Estimation of the False Discovery Rate

Estimation of the False Discovery Rate Estimation of the False Discovery Rate Coffee Talk, Bioinformatics Research Center, Sept, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS

EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS Statistica Sinica 19 (2009), 125-143 EMPIRICAL BAYES METHODS FOR ESTIMATION AND CONFIDENCE INTERVALS IN HIGH-DIMENSIONAL PROBLEMS Debashis Ghosh Penn State University Abstract: There is much recent interest

More information

Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses

Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses Adaptive Filtering Multiple Testing Procedures for Partial Conjunction Hypotheses arxiv:1610.03330v1 [stat.me] 11 Oct 2016 Jingshu Wang, Chiara Sabatti, Art B. Owen Department of Statistics, Stanford University

More information

Modified Simes Critical Values Under Positive Dependence

Modified Simes Critical Values Under Positive Dependence Modified Simes Critical Values Under Positive Dependence Gengqian Cai, Sanat K. Sarkar Clinical Pharmacology Statistics & Programming, BDS, GlaxoSmithKline Statistics Department, Temple University, Philadelphia

More information

Chapter 1. Stepdown Procedures Controlling A Generalized False Discovery Rate

Chapter 1. Stepdown Procedures Controlling A Generalized False Discovery Rate Chapter Stepdown Procedures Controlling A Generalized False Discovery Rate Wenge Guo and Sanat K. Sarkar Biostatistics Branch, National Institute of Environmental Health Sciences, Research Triangle Park,

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

Lecture 8 Inequality Testing and Moment Inequality Models

Lecture 8 Inequality Testing and Moment Inequality Models Lecture 8 Inequality Testing and Moment Inequality Models Inequality Testing In the previous lecture, we discussed how to test the nonlinear hypothesis H 0 : h(θ 0 ) 0 when the sample information comes

More information

MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY

MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY MULTIPLE TESTING PROCEDURES AND SIMULTANEOUS INTERVAL ESTIMATES WITH THE INTERVAL PROPERTY BY YINGQIU MA A dissertation submitted to the Graduate School New Brunswick Rutgers, The State University of New

More information

Doing Cosmology with Balls and Envelopes

Doing Cosmology with Balls and Envelopes Doing Cosmology with Balls and Envelopes Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Larry Wasserman Department of Statistics Carnegie

More information

Statistics Applied to Bioinformatics. Tests of homogeneity

Statistics Applied to Bioinformatics. Tests of homogeneity Statistics Applied to Bioinformatics Tests of homogeneity Two-tailed test of homogeneity Two-tailed test H 0 :m = m Principle of the test Estimate the difference between m and m Compare this estimation

More information

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials Research Article Received 29 January 2010, Accepted 26 May 2010 Published online 18 April 2011 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.4008 Mixtures of multiple testing procedures

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004

Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004 Exceedance Control of the False Discovery Proportion Christopher Genovese 1 and Larry Wasserman 2 Carnegie Mellon University July 10, 2004 Multiple testing methods to control the False Discovery Rate (FDR),

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

FDR and ROC: Similarities, Assumptions, and Decisions

FDR and ROC: Similarities, Assumptions, and Decisions EDITORIALS 8 FDR and ROC: Similarities, Assumptions, and Decisions. Why FDR and ROC? It is a privilege to have been asked to introduce this collection of papers appearing in Statistica Sinica. The papers

More information

Applying the Benjamini Hochberg procedure to a set of generalized p-values

Applying the Benjamini Hochberg procedure to a set of generalized p-values U.U.D.M. Report 20:22 Applying the Benjamini Hochberg procedure to a set of generalized p-values Fredrik Jonsson Department of Mathematics Uppsala University Applying the Benjamini Hochberg procedure

More information

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data Faming Liang, Chuanhai Liu, and Naisyin Wang Texas A&M University Multiple Hypothesis Testing Introduction

More information

Empirical Bayes Moderation of Asymptotically Linear Parameters

Empirical Bayes Moderation of Asymptotically Linear Parameters Empirical Bayes Moderation of Asymptotically Linear Parameters Nima Hejazi Division of Biostatistics University of California, Berkeley stat.berkeley.edu/~nhejazi nimahejazi.org twitter/@nshejazi github/nhejazi

More information

A STATISTICAL FRAMEWORK FOR TESTING FUNCTIONAL CATEGORIES IN MICROARRAY DATA

A STATISTICAL FRAMEWORK FOR TESTING FUNCTIONAL CATEGORIES IN MICROARRAY DATA The Annals of Applied Statistics 2008, Vol. 2, No. 1, 286 315 DOI: 10.1214/07-AOAS146 Institute of Mathematical Statistics, 2008 A STATISTICAL FRAMEWORK FOR TESTING FUNCTIONAL CATEGORIES IN MICROARRAY

More information

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs

Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs Family-wise Error Rate Control in QTL Mapping and Gene Ontology Graphs with Remarks on Family Selection Dissertation Defense April 5, 204 Contents Dissertation Defense Introduction 2 FWER Control within

More information

Hunting for significance with multiple testing

Hunting for significance with multiple testing Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance

More information

Two-stage stepup procedures controlling FDR

Two-stage stepup procedures controlling FDR Journal of Statistical Planning and Inference 38 (2008) 072 084 www.elsevier.com/locate/jspi Two-stage stepup procedures controlling FDR Sanat K. Sarar Department of Statistics, Temple University, Philadelphia,

More information

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics

Multiple Endpoints: A Review and New. Developments. Ajit C. Tamhane. (Joint work with Brent R. Logan) Department of IE/MS and Statistics 1 Multiple Endpoints: A Review and New Developments Ajit C. Tamhane (Joint work with Brent R. Logan) Department of IE/MS and Statistics Northwestern University Evanston, IL 60208 ajit@iems.northwestern.edu

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS

REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS REPRODUCIBLE ANALYSIS OF HIGH-THROUGHPUT EXPERIMENTS Ying Liu Department of Biostatistics, Columbia University Summer Intern at Research and CMC Biostats, Sanofi, Boston August 26, 2015 OUTLINE 1 Introduction

More information

Multiple testing: Intro & FWER 1

Multiple testing: Intro & FWER 1 Multiple testing: Intro & FWER 1 Mark van de Wiel mark.vdwiel@vumc.nl Dep of Epidemiology & Biostatistics,VUmc, Amsterdam Dep of Mathematics, VU 1 Some slides courtesy of Jelle Goeman 1 Practical notes

More information

Familywise Error Rate Controlling Procedures for Discrete Data

Familywise Error Rate Controlling Procedures for Discrete Data Familywise Error Rate Controlling Procedures for Discrete Data arxiv:1711.08147v1 [stat.me] 22 Nov 2017 Yalin Zhu Center for Mathematical Sciences, Merck & Co., Inc., West Point, PA, U.S.A. Wenge Guo Department

More information

Tools and topics for microarray analysis

Tools and topics for microarray analysis Tools and topics for microarray analysis USSES Conference, Blowing Rock, North Carolina, June, 2005 Jason A. Osborne, osborne@stat.ncsu.edu Department of Statistics, North Carolina State University 1 Outline

More information

Stat 710: Mathematical Statistics Lecture 31

Stat 710: Mathematical Statistics Lecture 31 Stat 710: Mathematical Statistics Lecture 31 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 31 April 13, 2009 1 / 13 Lecture 31:

More information

Linear Combinations. Comparison of treatment means. Bruce A Craig. Department of Statistics Purdue University. STAT 514 Topic 6 1

Linear Combinations. Comparison of treatment means. Bruce A Craig. Department of Statistics Purdue University. STAT 514 Topic 6 1 Linear Combinations Comparison of treatment means Bruce A Craig Department of Statistics Purdue University STAT 514 Topic 6 1 Linear Combinations of Means y ij = µ + τ i + ǫ ij = µ i + ǫ ij Often study

More information

Confidence Thresholds and False Discovery Control

Confidence Thresholds and False Discovery Control Confidence Thresholds and False Discovery Control Christopher R. Genovese Department of Statistics Carnegie Mellon University http://www.stat.cmu.edu/ ~ genovese/ Larry Wasserman Department of Statistics

More information

Large-Scale Multiple Testing of Correlations

Large-Scale Multiple Testing of Correlations Large-Scale Multiple Testing of Correlations T. Tony Cai and Weidong Liu Abstract Multiple testing of correlations arises in many applications including gene coexpression network analysis and brain connectivity

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Peak Detection for Images

Peak Detection for Images Peak Detection for Images Armin Schwartzman Division of Biostatistics, UC San Diego June 016 Overview How can we improve detection power? Use a less conservative error criterion Take advantage of prior

More information

Political Science 236 Hypothesis Testing: Review and Bootstrapping

Political Science 236 Hypothesis Testing: Review and Bootstrapping Political Science 236 Hypothesis Testing: Review and Bootstrapping Rocío Titiunik Fall 2007 1 Hypothesis Testing Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter The

More information

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing

Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Exact and Approximate Stepdown Methods For Multiple Hypothesis Testing Joseph P. Romano Department of Statistics Stanford University Michael Wolf Department of Economics and Business Universitat Pompeu

More information

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference:

More information

Gene expression analysis with the parametric bootstrap

Gene expression analysis with the parametric bootstrap Biostatistics (2001), 2, 4,pp. 445 461 Printed in Great Britain Gene expression analysis with the parametric bootstrap MARK J. VAN DER LAAN, JENNY BRYAN Division of Biostatistics, University of California,

More information

Outline. Topic 19 - Inference. The Cell Means Model. Estimates. Inference for Means Differences in cell means Contrasts. STAT Fall 2013

Outline. Topic 19 - Inference. The Cell Means Model. Estimates. Inference for Means Differences in cell means Contrasts. STAT Fall 2013 Topic 19 - Inference - Fall 2013 Outline Inference for Means Differences in cell means Contrasts Multiplicity Topic 19 2 The Cell Means Model Expressed numerically Y ij = µ i + ε ij where µ i is the theoretical

More information

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Holger Schwender, Andreas Krause, and Katja Ickstadt Abstract Microarrays enable to measure the expression levels of tens

More information

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES

FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES FDR-CONTROLLING STEPWISE PROCEDURES AND THEIR FALSE NEGATIVES RATES Sanat K. Sarkar a a Department of Statistics, Temple University, Speakman Hall (006-00), Philadelphia, PA 19122, USA Abstract The concept

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

STAT 5200 Handout #7a Contrasts & Post hoc Means Comparisons (Ch. 4-5)

STAT 5200 Handout #7a Contrasts & Post hoc Means Comparisons (Ch. 4-5) STAT 5200 Handout #7a Contrasts & Post hoc Means Comparisons Ch. 4-5) Recall CRD means and effects models: Y ij = µ i + ϵ ij = µ + α i + ϵ ij i = 1,..., g ; j = 1,..., n ; ϵ ij s iid N0, σ 2 ) If we reject

More information

Bumpbars: Inference for region detection. Yuval Benjamini, Hebrew University

Bumpbars: Inference for region detection. Yuval Benjamini, Hebrew University Bumpbars: Inference for region detection Yuval Benjamini, Hebrew University yuvalbenj@gmail.com WHOA-PSI-2017 Collaborators Jonathan Taylor Stanford Rafael Irizarry Dana Farber, Harvard Amit Meir U of

More information