Resampling-based Multiple Testing with Applications to Microarray Data Analysis


Resampling-based Multiple Testing with Applications to Microarray Data Analysis

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Dongmei Li, B.A., M.S.

The Ohio State University
2009

Dissertation Committee:
Dr. Jason C. Hsu, Adviser
Dr. Elizabeth Stasny
Dr. William Notz
Dr. Steve MacEachern

Approved by: Adviser, Graduate Program in Biostatistics, The Ohio State University

© Copyright by Dongmei Li 2009

ABSTRACT

In microarray data analysis, resampling methods are widely used to discover significantly differentially expressed genes under different biological conditions when the distributions of test statistics are unknown. When the sample size is small, however, the simultaneous testing of thousands, or even millions, of null hypotheses in microarray data analysis poses challenges for the multiple hypothesis testing field. We study the small-sample behavior of three commonly used resampling methods in multiple hypothesis testing: permutation tests, post-pivot resampling methods, and pre-pivot resampling methods. We show that the model-based pre-pivot resampling methods have the largest maximum number of unique resampled test statistic values, and therefore tend to produce more reliable P-values than the other two resampling methods. To avoid problems with the application of the three resampling methods in practice, we propose new conditions, based on the Partitioning Principle, for controlling the multiple testing error rates in fixed-effects general linear models. Using both theoretical results and simulation studies, we demonstrate discrepancies between the true expected values of order statistics and the expected values of order statistics estimated by permutation in the Significance Analysis of Microarrays (SAM) procedure. Moreover, we derive conditions for the permutation-based SAM procedure to control the expected number of false rejections. We also propose a more powerful adaptive two-step procedure that controls the expected number of false rejections with larger critical values than the Bonferroni procedure.

This is dedicated to my dear husband Zidian Xie, my cute daughter Catherine Xie, my cute son Matthew Xie, and my dear parents.

ACKNOWLEDGMENTS

I would like to express my heartfelt gratitude to my advisor, Professor Jason C. Hsu, for his encouragement, constant guidance, and extreme patience. Without his advice, it would have been impossible for me to finish this dissertation. A special thanks goes to Professor Elizabeth Stasny, Graduate Studies Chair in Statistics, who carefully proofread my papers and gave me a great deal of help during my Ph.D. study. I would also like to thank my other committee members, Professor William Notz and Professor Steve MacEachern, for their thoughtful questions and advice. I am enormously grateful to my parents, my husband, and my kids for their support and love, especially my husband Zidian Xie, who always supports me whenever I need him.

VITA

B.A. Pomology, Laiyang Agriculture College, China
M.S. Biophysics, China Agriculture University, China
M.S. Statistics, The Ohio State University, U.S.A.
–present: Graduate Teaching and Research Associate, The Ohio State University

PUBLICATIONS

Research Publications

Violeta Calian, Dongmei Li, and Jason C. Hsu. Partitioning to Uncover Conditions for Permutation Tests to Control Multiple Testing Error Rates. Biometrical Journal, 50(5), DOI: /bimj

FIELDS OF STUDY

Major Field: Biostatistics

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Multiple hypotheses testing and resampling methods
   1.1 Multiple hypotheses testing: Introduction; Two definitions of Type I error rate; Familywise Error Rate (FWER); False Discovery Rate (FDR); Multiple testing principles
   1.2 Resampling methods: Permutation tests; Bootstrap methods

2. Small sample behavior of resampling methods
   2.1 Tomato microarray example
   2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method: with a sample size of two; with a sample size of three
   2.3 Conditions for getting adjusted P-values of zero using the pre-pivot resampling method
   2.4 Discreteness of resampled test statistic distributions: Paired samples; Two independent samples; Multiple independent samples; General linear mixed-effects models

3. Conditions for resampling methods to control multiple testing error rates
   3.1 Two-group comparison: Permutation tests; Post-pivot resampling method; Pre-pivot resampling method
   3.2 Fixed-effects general linear model: Estimating the test statistic's null distribution (permutation tests; pre-pivot resampling method; post-pivot resampling method); Estimating critical values for strong control of FWER (permutation tests; pre-pivot resampling method; post-pivot resampling method); Shortcuts of partitioning tests using resampling methods (permutation tests; pre-pivot resampling method; post-pivot resampling method)

4. Conditions for Significance Analysis of Microarrays (SAM) to control the empirical FDR
   4.1 Introduction to the Significance Analysis of Microarrays (SAM) method
   4.2 Discrepancies between true expected values of order statistics and expected values estimated by permutation: Effect of unequal variance-covariance matrices and sample sizes; Effect of higher order cumulants with equal sample sizes
   4.3 Conditions for controlling the expected number of false rejections in SAM
   4.4 An adaptive two-step procedure controlling the expected number of false rejections
   4.5 Discussion

5. Concluding remarks

References

LIST OF TABLES

1.1 Summary of possible outcomes from testing k null hypotheses
2.1 Adjusted P-values calculated from formula (2.1) for the permutation test, post-pivot resampling method, and pre-pivot resampling method
2.2 Maximum number of unique resampled test statistic values for the permutation test, post-pivot resampling method, and pre-pivot resampling method

LIST OF FIGURES

2.1 Null distribution of max_{i=1,2,3} |T_i| for k = 3 and n = 3. Observed test statistics and resampled test statistics from the permutation test, post-pivot resampling, and pre-pivot resampling methods.
4.1 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal variances and sample sizes. The dashed line in the Q-Q plot is the 45-degree diagonal line.
4.2 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal correlations and sample sizes. The dashed line in the Q-Q plot is the 45-degree diagonal line.
4.3 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal skewness. The dashed line in the Q-Q plot is the 45-degree diagonal line.
4.4 Q-Q plot of the true expected values of order statistics against the expected values estimated by permutation for unequal third-order cross cumulants. The dashed line in the Q-Q plot is the 45-degree diagonal line.

CHAPTER 1

MULTIPLE HYPOTHESES TESTING AND RESAMPLING METHODS

1.1 Multiple hypotheses testing

1.1.1 Introduction

With the rapid development of biotechnology, microarray technology has become widely used in biomedical and biological fields to identify differentially expressed genes and transcription factor binding sites, and to map complex traits using single nucleotide polymorphisms (SNPs) (Kulesh et al. (1987), Schena et al. (1995), Lashkari et al. (1997), Pollack et al. (1999), Buck and Lieb (2004), Mei et al. (2000), Hehir-Kwa et al. (2007)). Having thousands, or even millions, of genes on a small array makes multiple comparisons a central topic in today's statistics field, because thousands, or even millions, of hypotheses need to be tested simultaneously.

Without multiplicity adjustment, if each hypothesis is tested at level α, the probability of rejecting at least one true null hypothesis increases dramatically as the number of hypotheses grows. If, for example, 20 hypotheses are tested simultaneously and each hypothesis is tested at 5%, the probability of rejecting at least one true null hypothesis is 1 − (1 − 0.05)^20 ≈ 64%, assuming all the test statistics are independent. Therefore, in order to make the multiplicity adjustment, a multiple hypotheses testing procedure needs to control a certain type of error rate at a level α. A popular multiple testing error rate controlled by many multiple hypotheses testing procedures is the familywise error rate (FWER) (Hochberg and Tamhane (1987), Shaffer (1995)), which is defined as the probability of at least one false rejection. Another, less stringent, multiple testing error rate commonly used is the false discovery rate (FDR) (Benjamini and Hochberg (1995)), which is defined as the expected proportion of falsely rejected null hypotheses among all rejections.

1.1.2 Two definitions of Type I error rate

Suppose k genes are probed to compare expression levels between high risk and low risk patients. Let μ_Hi, μ_Li, i = 1,..., k, denote the expected (logarithms of) expression levels of the ith gene of a randomly sampled patient from the high risk and low risk groups, respectively. Let θ_i = μ_Hi − μ_Li denote the difference of expected (logarithm of) expression levels of the ith gene between the high risk group and the low risk group. To determine which of the genes are differentially expressed in expectation between the high risk and low risk patients, we need to test the following null hypotheses:

H_0i: θ_i = 0, i = 1,..., k. (1.1)

There are two different ways to define the Type I error rate when testing a single null hypothesis. Let θ = (θ_1, θ_2,..., θ_k), and let Σ denote generically all nuisance parameters that the observed expression levels depend on, such as the covariance of the expression levels within each of the high risk and low risk groups. Let θ^0 = (θ^0_1,..., θ^0_k) and Σ^0 be the collection of all (unknown) true parameter values. A traditional definition of the Type I error rate, given by Casella and Berger (1990) or Berger (1993), is

sup_{θ, Σ: θ_i = 0} P_{θ, Σ}{Reject H_0i},

where the supremum is taken over all possible θ and Σ subject to θ_i = 0. Another definition of the Type I error rate, given by Pollard and van der Laan (2005), is

P_{θ^0, Σ^0}{Reject H_0i}, where θ^0_i = 0,

and θ^0 = (θ^0_1,..., θ^0_k) and Σ^0 are the (unknown) true parameter values. The first definition of the Type I error rate is more widely used than the second. The second definition can only be controlled asymptotically, since the true parameter values are unknown in microarray data analysis.

1.1.3 Familywise Error Rate (FWER)

When we test k null hypotheses simultaneously, the possible outcomes are summarized in Table 1.1.

Table 1.1: Summary of possible outcomes from testing k null hypotheses

                           Number not rejected   Number rejected   Total
True null hypotheses       U                     V                 k_0
Non-true null hypotheses   T                     S                 k − k_0
Total                      k − R                 R                 k

In Table 1.1, V denotes the number of incorrectly rejected true null hypotheses when testing k null hypotheses; R denotes the number of hypotheses rejected among those k null hypotheses; k_0 denotes the number of true null hypotheses; and k − k_0 denotes the number of false null hypotheses.

FWER is defined as the probability of rejecting at least one true null hypothesis (at least one false rejection):

FWER = P{V ≥ 1}. (1.2)

There are two kinds of control of FWER. One is strong control of FWER, which controls the probability of at least one false rejection under any combination of true and false null hypotheses (it controls the supremum). The other is weak control of FWER, which controls the probability of at least one false rejection only under the complete null hypothesis H^C_0: ∩_{i=1}^{k} H_0i with k_0 = k (Westfall and Young (1993), Lehmann and Romano (2005)). In microarray experiments, since it is rare that no gene is differentially expressed, controlling FWER strongly is more appropriate than controlling it weakly. Strong control of FWER is desired to minimize the number of false rejections in some cases, such as selecting genes to build diagnostic or prognostic chips for diseases. An example is the MammaPrint chip developed by Agendia, which is based on the well-known Amsterdam 70-gene breast cancer gene signature (van 't Veer et al. (2002), van de Vijver et al. (2002), Buyse et al. (2006), Glas et al. (2006)). MammaPrint is used to predict whether existing breast cancer will metastasize (spread to other parts of a patient's body). The multiple testing procedure proposed by Pollard and van der Laan (2005) has strong asymptotic control of FWER: it controls an error rate α_n for a sample of size n, with limsup_{n→∞} α_n ≤ α under the true data generating distribution as the sample size n goes to infinity.
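As a quick numerical check on the motivating example in Section 1.1.1 (this sketch is illustrative and not part of the dissertation's methods), the 64% figure follows from the closed form 1 − (1 − α)^k for k independent level-α tests, and a small Monte Carlo simulation under the complete null reproduces it; the Bonferroni adjustment α/k brings the FWER back below α:

```python
import random

alpha, k = 0.05, 20

# Closed form: with k independent level-alpha tests,
# P(at least one false rejection) = 1 - (1 - alpha)^k
fwer_exact = 1 - (1 - alpha) ** k
print(round(fwer_exact, 3))  # 0.642 -- the "64%" quoted in the text

# Monte Carlo check: k independent uniform p-values under the complete null,
# counting trials where at least one p-value falls below alpha
random.seed(0)
trials = 20_000
hits = sum(
    any(random.random() < alpha for _ in range(k))
    for _ in range(trials)
)
fwer_mc = hits / trials

# Bonferroni: testing each hypothesis at alpha/k keeps FWER at or below alpha
fwer_bonf = 1 - (1 - alpha / k) ** k
```

The Monte Carlo estimate agrees with the closed form to within simulation error, which is the uniformity argument behind the multiplicity adjustments discussed in the rest of this chapter.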

1.1.4 False Discovery Rate (FDR)

The concept of the false discovery rate (FDR) was first proposed by Benjamini and Hochberg (1995) to reduce the stringency of strong FWER control. FDR is more widely used than FWER in bioinformatics studies because investigators are more interested in finding all potentially differentially expressed genes, even if some genes could be falsely identified (Benjamini and Yekutieli (2001), Storey (2002), Storey and Tibshirani (2003b), Storey and Tibshirani (2003a), Benjamini et al. (2006), Strimmer (2008)). FDR is defined as the expected proportion of erroneously rejected null hypotheses among all rejected null hypotheses:

FDR = E(V/R | R > 0) Pr(R > 0).

Benjamini and Hochberg (1995) also presented four alternative formulations of FDR:

(1) Positive FDR: pFDR = E(V/R | R > 0). The pFDR is recommended by Storey (2002), who argued that pFDR is a more appropriate error measure to use than FDR.

(2) Conditional FDR: cFDR = E(V/R | R = r), where r is the observed number of rejected null hypotheses.

(3) Marginal FDR: mFDR = E(V)/E(R).

(4) Empirical FDR: Fdr = E(V)/r.

Benjamini and Hochberg (1995) argued that none of these four FDRs can be controlled when all null hypotheses are true (k_0 = k): if k_0 = k and even a single null hypothesis is rejected, then V/R = 1, so FDR cannot be controlled. Controlling pFDR, cFDR, mFDR, and Fdr runs into the same problem; they are identically 1 when k_0 = k. Tsai et al. (2003) showed that pFDR, cFDR, and mFDR are equivalent under the Bayesian framework, in which the number of true null hypotheses is modeled as a random variable. The Significance Analysis of Microarrays (SAM) method, which will be discussed in Chapter 4, estimates the empirical FDR.

1.1.5 Multiple testing principles

A general principle of multiple testing is the Partitioning Principle, proposed by Stefansson et al. (1988) and further refined by Finner and Strassburger (2002). Both Holm (1979)'s step-down method and Hochberg (1988)'s step-up method are special cases of partition testing (Huang and Hsu (2007)). The principle of partition testing is to partition the parameter space into disjoint subspaces, test each partitioning null hypothesis at level α, and collate the results across the subspaces, as follows. Let P = {1,..., k}, and consider testing H_0i: θ_i = 0, i = 1,..., k. To control FWER strongly, the Partitioning Principle states:

P1: For each I ⊆ {1,..., k}, I ≠ ∅, form H*_0I: θ_i = 0 for all i ∈ I and θ_j ≠ 0 for all j ∉ I. In total, there are 2^k parameter subspaces and 2^k − 1 null hypotheses to be tested.

P2: Test each H*_0I at level α. Since all the null hypotheses are disjoint, at most one null hypothesis is true. Therefore, no multiplicity adjustment is required for the H*_0I.

P3: For each i, infer θ_i ≠ 0 if and only if all H*_0I with i ∈ I are rejected, since H_0i is the union of the H*_0I with i ∈ I.

Taking k = 3 as an example, the parameter space Θ = {θ_1, θ_2, θ_3} is partitioned into eight disjoint subspaces:

Θ_1 = {θ_1 = 0 and θ_2 = 0 and θ_3 = 0}
Θ_2 = {θ_1 = 0 and θ_2 = 0 and θ_3 ≠ 0}
Θ_3 = {θ_1 = 0 and θ_2 ≠ 0 and θ_3 = 0}
...
Θ_7 = {θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 = 0}
Θ_8 = {θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 ≠ 0}

Next, we test each of the following H*_0I at level α:

H*_0{123}: θ_1 = 0 and θ_2 = 0 and θ_3 = 0
H*_0{12}: θ_1 = 0 and θ_2 = 0 and θ_3 ≠ 0
H*_0{13}: θ_1 = 0 and θ_2 ≠ 0 and θ_3 = 0
...
H*_0{2}: θ_1 ≠ 0 and θ_2 = 0 and θ_3 ≠ 0
H*_0{3}: θ_1 ≠ 0 and θ_2 ≠ 0 and θ_3 = 0

Finally, infer θ_i ≠ 0 if and only if all H*_0I involving θ_i = 0 are rejected.

Another multiple testing principle, similar to the Partitioning Principle, is the closed testing principle (Marcus et al. (1976)). The closed testing principle states:

C1: For each I ⊆ {1,..., k}, form the intersection null hypothesis H_0I: θ_i = 0 for all i ∈ I.

C2: Test each H_0I at level α.

C3: For each i, infer θ_i ≠ 0 if and only if all H_0I with i ∈ I are rejected.

Compared to the partition testing procedure, the closed testing procedure tests less restrictive hypotheses. However, the closed testing procedure still controls FWER strongly, because a level-α test for H_0I is also a level-α test for H*_0I.

To test H_0i: θ_i = 0 (i = 1,..., k) using the test statistics T_i = |θ̂_i| (i = 1,..., k), we would test 2^k − 1 null hypotheses in accordance with the Partitioning Principle. A typical partitioning null hypothesis is

H*_0{12...t}: θ_1 = 0 and ... and θ_t = 0 and θ_{t+1} ≠ 0 and ... and θ_k ≠ 0 (1 ≤ t ≤ k).

The above null hypothesis can be simplified to

H_0{12...t}: θ_1 = 0 and θ_2 = 0 and ... and θ_t = 0 (1 ≤ t ≤ k)

according to the closed testing principle. The simplified test still controls FWER strongly, because a level-α test for H_0{12...t} is also a level-α test for H*_0{12...t}. The test statistic for testing H_0{12...t} is max_{i=1,...,t} T_i = max_{i=1,...,t} |θ̂_i|, because H_0{12...t} is tested with a union-intersection test (Casella and Berger (1990)), and the rejection region for a union-intersection test is ∪_{i∈{1,...,t}} {T_i > c} = {max_{i=1,...,t} T_i > c}, where c is the critical value for testing H_0{12...t}.

1.2 Resampling methods

Resampling methods can be used to estimate the precision of sample statistics (means, medians, percentiles), perform significance tests, and validate models (Westfall and Young (1993), Efron and Tibshirani (1994), Davison and Hinkley (1997), Good (2005)). The commonly used resampling techniques include permutation tests and bootstrap methods. Two different bootstrap methods, the post-pivot resampling method and the pre-pivot resampling method, will be introduced in this section. Westfall and Young (1993) introduced resampling-based procedures for adjusting P-values in multiple testing to control multiple testing error rates.

1.2.1 Permutation tests

A permutation test is a type of non-parametric statistical significance test in which a reference distribution is constructed by calculating all possible values of the test statistic from permuted observations under a null hypothesis. The theory of permutation tests is based on the work of Fisher and Pitman in the 1930s (Good (2005)). Compared to parametric testing procedures, the weaker distributional assumptions and simpler procedures make permutation tests attractive to many researchers and statisticians. For example, when comparing the means of two populations, a two-sample t-test assumes that the sampling distribution of the difference between sample averages is normal, which is often not the case; the t-test is only valid when both populations have independent or jointly normal distributions. In contrast, the permutation test is distribution-free, so it can give exact P-values even when the sample size is small. The permutation test permutes the labels of observations between the two groups, and obtains the P-value as the proportion of test statistic values from the resamples that are as extreme as or more extreme than the observed test statistic value. In microarray data analysis, when the correlations between genes are included in the joint distribution of the test statistics, the parametric form of a multivariate t distribution becomes very complex and difficult to calculate. In contrast, the permutation test is easy to conduct and avoids complex calculations.

To carry out a permutation test based on a test statistic that measures the size of an effect of interest, we proceed as follows:

1. Compute the test statistic for the observed data set.

2. Permute the original data in a way that matches the null hypothesis to get permuted resamples, and construct the reference distribution from the test statistics calculated from the permuted resamples.

3. Calculate the critical value of a level-α test as the upper α percentile of the reference distribution, or obtain the P-value as the proportion of permutation test statistics that are as extreme as or more extreme than the observed test statistic.

Permutation tests can be used in a wide variety of settings. For example, Fisher's exact test (a permutation test) is used to detect the association between a row variable and a column variable for small, sparse, or unbalanced data sets. Ein-Dor et al. (2005) used a permutation test for selecting genes whose expression profiles are significantly correlated with breast cancer survival status. Based on random permutations of time points, Ptitsyn et al. (2006) applied the permutation test to identifying periodic patterns in relatively short time series obtained with microarray technology; such periodic processes are important for modulating and coordinating the transcription of genes governing key metabolic pathways. Churchill and Doerge (1994) used a permutation test based on permutation of the observed quantitative traits to determine quantitative trait loci. To identify significant changes in gene expression in microarray experiments, Tusher et al. (2001) used permutations of the repeated measurements in the Significance Analysis of Microarrays (SAM) procedure.

For two-group comparisons, permuting the labels of observations between the two groups requires the assumption that the two populations are identical when the null hypothesis is true; that is, not only are their means the same, but so are their spreads and shapes. Pollard and van der Laan (2005) demonstrated that, if both the correlation structures and the sample sizes differ between the two populations, then a permutation test does not control the Type I error rate at its nominal significance level for detecting differentially expressed genes between the two groups. The conditions for permutation tests to control multiple testing error rates, both when comparing two groups and when finding significant predictor variables in fixed-effects general linear models, will be discussed further in Chapter 3.

For testing hypotheses about a single population, comparing populations that differ even under the null hypothesis, or testing general relationships, permutation tests cannot be used, because we do not know how to resample in a way that matches the null hypothesis in these settings. Hence, bootstrap methods should be used instead.

1.2.2 Bootstrap methods

The bootstrap method was first introduced by Efron (1979) and further discussed by Efron and Tibshirani (1994).

The bootstrap method is a way of approximating a sampling distribution from just one sample. Instead of taking many simple random samples from the population to find the sampling distribution of a sample statistic, the bootstrap method repeatedly resamples with replacement from one random sample. The bootstrap distribution of a statistic collects the values of the statistic from many resamples, and gives information about the sampling distribution of the statistic. For example, the bootstrap distribution of a sample mean is obtained from the resampled means calculated from hundreds of resamples drawn with replacement from a single original sample. The bootstrap distribution of a sample mean has the following mean and standard error:

mean_boot = X̄_boot = (1/B) Σ X̄*,

SE_boot = [ (1/(B − 1)) Σ (X̄* − mean_boot)² ]^{1/2},

where X̄* is the sample mean of each bootstrap resample and B is the number of resamples.

Since the bootstrap distribution of a statistic is generated from a single original sample, it is centered at the value of the sample statistic rather than at the parameter value. Bootstrap distributions include two sources of random variation: one comes from choosing an original sample at random from the population, and the other comes from choosing bootstrap resamples at random from the original sample, which introduces little additional variation.

Bootstrap methods are asymptotically valid (as the original sample size goes to infinity). Efron (1979) showed that the bootstrap method can (asymptotically) correctly estimate the variance of a sample median, and the error rates in a linear discrimination problem (outperforming cross-validation). Freedman (1981) showed that the bootstrap approximation to the distribution of least squares estimates is valid. Hall (1986) showed that the bootstrap reduces the coverage error probability from O(n^{−1/2}) to O(n^{−1}), which makes the bootstrap method one order more accurate than the delta method.

Bootstrap methods are widely used in all kinds of data analysis. Davison and Hinkley (1997) illustrated the application of bootstrap methods to stratified data; finite populations; censored and missing data; linear, nonlinear, and smooth regression models; classification; and time series and spatial problems. For example, by using Efron's bootstrap resampling method, Liu et al. (2004) analyzed the performance of artificial neural networks (ANNs) for feature classification in the analysis of mammographic masses to achieve more accurate results; feature classification in mammography is used to discover the salient information that can discriminate benign from malignant masses.

In microarray data analysis, there are two commonly used bootstrap methods: the post-pivot resampling method and the pre-pivot resampling method. Both methods can control FWER asymptotically, and they give similar results in a fixed-effects general linear model with i.i.d. errors. In two-group comparisons, the null distribution estimated by the pre-pivot resampling method has more resampled test statistic values than that estimated by the post-pivot resampling method under a reasonable assumption for microarray data (that the distributions of the errors are exchangeable).
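The bootstrap mean and standard error defined above can be computed directly. Here is a minimal Python sketch (the sample values are made up for illustration):

```python
import random
from statistics import mean, stdev

random.seed(1)
sample = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]  # one observed sample (hypothetical)

B = 2000
# Resample with replacement from the single original sample,
# recording the mean of each resample
boot_means = [mean(random.choices(sample, k=len(sample))) for _ in range(B)]

mean_boot = mean(boot_means)  # centers near the sample mean, not the population mean
se_boot = stdev(boot_means)   # bootstrap SE: sd of the B resampled means
```

As the text notes, `mean_boot` sits close to the observed sample mean rather than the unknown population mean, and `se_boot` is the sample standard deviation of the resampled means, matching the SE_boot formula above.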

Post-pivot resampling method

The post-pivot resampling method was introduced by Pollard and van der Laan (2005) to estimate the null distribution of test statistics in multiple hypotheses testing and achieve asymptotic control of multiple testing error rates. The post-pivot resampling method obtains the asymptotically correct null distribution of the test statistic (based on the true data generating distribution) from centered and/or scaled resampled test statistics. In microarray data analysis with two or more treatment groups, the post-pivot resampling method resamples the observed data within each group, calculates the resampled test statistics from each resample, centers and/or scales the resampled test statistics (subtracts the average of the resampled test statistics and/or divides by their standard deviation), and estimates the test statistic's null distribution from the centered and/or scaled resampled test statistics.

To carry out a hypothesis test based on a test statistic that measures the location difference between two populations, the post-pivot resampling method proceeds as follows:

1. Compute the test statistic for the observed data set.

2. Resample the data with replacement within each group to obtain bootstrap resamples, compute the test statistic for each resampled data set, and construct the reference distribution from the centered and/or scaled resampled test statistics.

3. Calculate the critical value of a level-α test as the upper α percentile of the reference distribution, or obtain the P-value as the proportion of bootstrapped test statistics that are as extreme as or more extreme than the observed test statistic.
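The three steps above can be sketched for a single gene, using the difference in group means as the test statistic (the data are hypothetical, and only centering, not scaling, is applied in this sketch):

```python
import random
from statistics import mean

random.seed(2)
x = [7.1, 6.4, 6.9, 7.5, 6.8]   # group 1 (hypothetical expression values)
y = [5.9, 6.2, 5.5, 6.4, 5.8]   # group 2

t_obs = mean(x) - mean(y)        # step 1: observed test statistic

# Step 2: resample with replacement *within* each group, recompute the
# statistic, then center the resampled statistics at their average
B = 4000
t_star = []
for _ in range(B):
    xb = random.choices(x, k=len(x))
    yb = random.choices(y, k=len(y))
    t_star.append(mean(xb) - mean(yb))
t_bar = mean(t_star)
t_null = [t - t_bar for t in t_star]   # centered: estimated null distribution

# Step 3: P-value = proportion of null statistics as extreme as the observed one
p = sum(abs(t) >= abs(t_obs) for t in t_null) / B
```

The uncentered bootstrap statistics cluster around `t_obs`; it is the centering step that turns them into an estimated null distribution, which is the defining feature of the post-pivot approach.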

Pre-pivot resampling method

The pre-pivot resampling method first fits a model to the observed data and then estimates the test statistic's null distribution by bootstrapping the centered residuals (residuals with their sample mean subtracted) (Freedman (1981)). Under the assumption that the model fits the data well, the pre-pivot resampling method provides asymptotically valid results, i.e., it controls multiple testing error rates asymptotically when testing multiple null hypotheses. In microarray data analysis, the pre-pivot resampling method estimates the null distributions of test statistics by bootstrapping residuals from a probe-level or gene-level model with treatment effects. The way the residuals are resampled with replacement (bootstrapped) depends on the assumptions made about the residuals. The residuals can be resampled across treatments under the assumption that their distributions are the same across treatments, but not across genes. If the distributions are also the same across genes, then residuals across treatments and genes can be pooled together for resampling with replacement.

To carry out a hypothesis test based on a test statistic that measures the location difference between two populations, the pre-pivot resampling method proceeds as follows:

1. Compute the test statistic for the observed data set.

2. Fit a one-way model to the observed data, and compute the residuals from the one-way model (subtract the sample mean from each observation within each group).

3. Combine the residuals of the two groups under the assumption that the distributions of the residuals are the same for these two groups.

4. Resample the pooled residuals with replacement to get bootstrapped residuals, and center the bootstrapped residuals at their average (subtract the average of the bootstrapped residuals) if that average is not zero.

5. Add the centered bootstrapped residuals from each resample back to the one-way model, and recompute the test statistic for each resample. The test statistics from all resamples form the reference distribution.

6. Calculate the critical value of a level-α test as the upper α percentile of the reference distribution, or obtain the P-value as the proportion of bootstrapped test statistics that are as extreme as or more extreme than the observed test statistic.
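Steps 1-6 above can be sketched for a single gene with hypothetical data, where the one-way model is just the pair of group means. One interpretive choice is made in this sketch: because the fitted model retains the observed group effect, the resampled statistics center at the observed statistic, so they are recentered by it before being read as a null reference:

```python
import random
from statistics import mean

random.seed(3)
x = [7.1, 6.4, 6.9, 7.5, 6.8]   # group 1 (hypothetical expression values)
y = [5.9, 6.2, 5.5, 6.4, 5.8]   # group 2

t_obs = mean(x) - mean(y)        # step 1: observed test statistic

# Step 2: residuals from the one-way model (subtract each group's mean),
# step 3: pooled under the equal-distribution assumption
res = [v - mean(x) for v in x] + [v - mean(y) for v in y]

B = 4000
n = len(x)
t_star = []
for _ in range(B):
    rb = random.choices(res, k=2 * n)      # step 4: resample pooled residuals...
    rb = [r - mean(rb) for r in rb]        # ...and center them at zero
    xb = [mean(x) + r for r in rb[:n]]     # step 5: add residuals back to the
    yb = [mean(y) + r for r in rb[n:]]     #         fitted one-way model
    t_star.append(mean(xb) - mean(yb))

# Step 6: the resampled statistics center at t_obs, so recentering by t_obs
# gives a null reference (an interpretive choice in this sketch)
p = sum(abs(t - t_obs) >= abs(t_obs) for t in t_star) / B
```

Note that, unlike the post-pivot method, the resampling here mixes residuals across the two groups, which is why the exchangeability assumption on the error distributions matters.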

29 CHAPTER 2 SMALL SAMPLE BEHAVIOR OF RESAMPLING METHODS Resampling techniques are popular in microarray data analysis. In this chapter, we will discuss the small sample behavior of three popular resampling techniques for multiple testing: the permutation test, the post-pivot resampling method, and the pre-pivot resampling method. We will show that when the sample size is small, for matched pairs, a permutation test is unlikely to give small P-values, while both post-pivot and pre-pivot resampling methods might give P-values of zero for the same data, even adjusting for multiplicity. The discreteness of the test statistics null distributions estimated by the above three resampling methods will be compared based on the maximum number of unique test statistic values. 2.1 Tomato microarray example A biology professor in the Department of Horticulture and Crop Science at the Ohio State University wishes to identify differentially expressed genes between control tomato plants and mutant tomato plants at different tomato fruit developmental stages (flower bud, flower, and fruit). Lee et al. (2000) recommended that at least three replicates should be used in designing experiments by using cdna microarrays, 17

30 particularly when gene expression data from single specimens will be analyzed. In the tomato microarray experiment, there are three paired samples at each stage (three plants in the control group and three plants in the mutant group). Suppose we only have three genes at the fruit stage and wish to learn which genes are differently expressed between the mutant group and the control group using the single step maxt method, a method based on resampling techniques, for the multiplicity adjustment. Let X ij (i =1, 2, 3, and j =1, 2, 3) denote the gene expression levels for the ith gene, jth sample in the control group, and Y ij (i =1, 2, 3, and j =1, 2, 3) denote the gene expression levels for the ith gene, jth sample in the treatment group. For the ith gene, X ij i.i.d. F Xi, and Y ij i.i.d. F Yi. Let d ij = x ij y ij denote the observed paired difference for the ith gene, jth paired sample, θ i denote the true paired difference between the paired samples. To identify the differentially expressed genes among these three genes, we will test the null hypotheses H 0 : θ i = 0 (i =1, 2, 3) using the test statistics T i = d i (i =1, 2, 3). The raw P-values are calculated according to the following formula using resampling methods: Raw P i = {b : T i,b T i }, for i = 1,...,k. B The single step maxt method based on resampling techniques will be used to calculate the adjusted P-values for adjusting multiplicity when we are testing three null hypotheses simultaneously. The formula for calculating maxt adjusted P-values with monotonicity enforced is (cf Westfall and Young (1993)): Adjusted P i = {b : max i=1,2,3 T i,b T i }, for i = 1, 2, 3, (2.1) B 18

where T_{i,b} denotes the resampled test statistic for the ith gene, bth resampling, and B is the total number of resamplings (b = 1, ..., B).

Figure 2.1 shows the absolute values of the observed test statistics |T_i| and the maxima of the absolute values of the resampled test statistics, max_{i=1,2,3} |T_{i,b}|, from the three resampling methods. The dots denote the observed test statistics; the rectangles denote the maxima of the resampled test statistics from the permutation test; the diamonds denote the maxima of the resampled test statistics from the post-pivot resampling method; and the triangles denote the maxima of the resampled test statistics from the pre-pivot resampling method. As shown in Figure 2.1, the permutation test always produces permuted test statistics that are greater than or equal to the observed test statistic. Thus, it is unlikely that the permutation test gives zero adjusted P-values. In contrast, for either the pre-pivot or the post-pivot resampling method, there is a high probability that the observed test statistic is far from the resampled test statistics. Therefore, we might get zero adjusted P-values using these two resampling methods.

Based on the formula of the single-step maxT method for calculating adjusted P-values, we can obtain the adjusted P-values for all three genes. Table 2.1 summarizes the adjusted P-values obtained from the permutation test, the post-pivot resampling method, and the pre-pivot resampling method for the three tomato fruit genes, based on Figure 2.1. Based on the null distribution of max|T| estimated from the permutation test (the rectangles), we can observe that the adjusted P-value for gene 1 is 0.75, since 6 out of 8 max|T| values (rectangles in Figure 2.1) are greater than or equal to |T_1| (dot in Figure 2.1). Similarly, the adjusted P-values for gene 2 and gene 3 are both 0.25 based on the permutation test. Using the post-pivot resampling method, the adjusted

Figure 2.1: Null distribution of max_{i=1,2,3} |T_i| for k = 3 and n = 3: observed test statistics and resampled test statistics from the permutation test, the post-pivot resampling method, and the pre-pivot resampling method.

P-value for gene 1 is 0.30, since 3 out of 10 max|T| values (diamonds in Figure 2.1) are greater than or equal to |T_1| (dot in Figure 2.1). For gene 2 and gene 3, however, there is no resampled max|T| value from the post-pivot resampling method that is greater than or equal to either |T_2| or |T_3| (dots in Figure 2.1). Thus, the adjusted P-values for gene 2 and gene 3 are both zero using the post-pivot resampling method. We obtain the same adjusted P-values from the pre-pivot resampling method as from the post-pivot resampling method for all three fruit genes.

Table 2.1: Adjusted P-values calculated from formula (2.1) for the permutation test, the post-pivot resampling method, and the pre-pivot resampling method

         Permutation   Post-pivot resampling   Pre-pivot resampling
gene 1   6/8 = 0.75    3/10 = 0.30             3/10 = 0.30
gene 2   2/8 = 0.25    0/10 = 0                0/10 = 0
gene 3   2/8 = 0.25    0/10 = 0                0/10 = 0
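Formula (2.1) and the raw P-value formula can be made concrete with a short sketch. The code below is not part of the dissertation: it computes raw and single-step maxT adjusted P-values by complete enumeration of the 2^n sign flips, which play the role of the permutation distribution for matched pairs; the function name and the toy difference matrix are illustrative assumptions.

```python
from itertools import product

import numpy as np

def maxt_adjusted_pvalues(d):
    """Raw and single-step maxT adjusted P-values for matched pairs.

    d : (k, n) array of paired differences d_ij; the observed statistic
        for gene i is T_i = mean(d_i1, ..., d_in).  The B = 2^n sign
        flips enumerate all relabelings of the paired samples.
    """
    d = np.asarray(d, dtype=float)
    k, n = d.shape
    T_obs = np.abs(d.mean(axis=1))                      # |T_i|
    signs = np.array(list(product([1, -1], repeat=n)))  # (2^n, n) sign flips
    T_perm = np.abs(signs @ d.T) / n                    # |T_{i,b}|, shape (2^n, k)
    raw = (T_perm >= T_obs).mean(axis=0)                # #{b: |T_{i,b}| >= |T_i|} / B
    max_T = T_perm.max(axis=1)                          # max_i |T_{i,b}| for each b
    adjusted = np.array([(max_T >= t).mean() for t in T_obs])  # formula (2.1)
    return raw, adjusted
```

Because the identity relabeling is always one of the B = 2^n permutations, each raw and adjusted P-value is at least 1/2^n, which is why the permutation column of Table 2.1 contains no zeros. For example, with two genes and made-up differences [[5, 4, 6], [1, -1, 0]], the adjusted P-values are 0.25 and 1.0.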

Strikingly, for matched pairs, the permuted test statistics (unstandardized or standardized) under complete enumeration always have a mean of zero. The reason is that each sample in a pair can be assigned either zero or one as its group label. When the labels are switched, the sign of the test statistic is also switched. Thus, the positive signs and negative signs cancel each other out, so the mean of all permuted test statistics is equal to zero. For standardized test statistics, since the MSEs are always the same for paired permuted samples when the labels are switched, the mean of all permuted test statistics is also zero.

2.2 Conditions for getting adjusted P-values of zero using the post-pivot resampling method

The tomato microarray example suggests that P-values of zero may occur often, even after multiplicity adjustment. Therefore, we need to explore the conditions for getting an adjusted P-value of zero using the post-pivot and pre-pivot resampling methods for paired samples with small sample sizes (two or three per group).

2.2.1 Conditions for getting adjusted P-values of zero with a sample size of two

To expand the three genes in our tomato microarray example to k genes, let X_{ij} (i = 1, 2, ..., k and j = 1, 2, ..., n) denote the gene expression level for the ith gene, jth sample in the control group, and Y_{ij} (i = 1, 2, ..., k and j = 1, 2, ..., n) denote the gene expression level for the ith gene, jth sample in the mutant group. For the ith gene, X_{ij} are i.i.d. F_{X_i} and Y_{ij} are i.i.d. F_{Y_i}. Assume d_{ij} = x_{ij} - y_{ij} are the observed paired differences for the ith gene in the jth paired sample. We wish to determine which genes are differentially expressed

among those k genes by testing the k null hypotheses H_0: θ_i = 0 (i = 1, ..., k) using the test statistics T_i = d̄_i.

When the sample size n is two, the observed differences are d_{ij} = x_{ij} - y_{ij} (i = 1, 2, ..., k and j = 1, 2). For the first two genes, we have the following observation matrix:

( d_{11}  d_{12} )
( d_{21}  d_{22} )

The observed test statistics are T_1 = (d_{11} + d_{12})/2 and T_2 = (d_{21} + d_{22})/2. Using the post-pivot resampling method, the matrix of unique resampled test statistics is:

( d_{11}  (d_{11} + d_{12})/2  d_{12} )
( d_{21}  (d_{21} + d_{22})/2  d_{22} )

We can get the following matrix after subtracting the average in each row:

( (d_{11} - d_{12})/2  0  (d_{12} - d_{11})/2 )
( (d_{21} - d_{22})/2  0  (d_{22} - d_{21})/2 )

To get a raw P-value of zero for the first gene, we need to have

|(d_{11} + d_{12})/2| > |(d_{11} - d_{12})/2|   and   |(d_{11} + d_{12})/2| > 0.

Similarly, to have a raw P-value of zero for the second gene, we need

|(d_{21} + d_{22})/2| > |(d_{21} - d_{22})/2|   and   |(d_{21} + d_{22})/2| > 0.

Therefore, the necessary and sufficient conditions for getting a raw P-value of zero for the ith gene are:

either  { d_{i1} > 0 and d_{i2} > 0 }

or      { d_{i1} < 0 and d_{i2} < 0 },

for i = 1, 2.

Using the single-step maxT method, the necessary and sufficient conditions for getting an adjusted P-value of zero for the first gene are:

either
  d_{11} > 0,  d_{12} > 0,  d_{11} + d_{12} > |d_{21} - d_{22}|
or
  d_{11} < 0,  d_{12} < 0,  d_{11} + d_{12} < -|d_{21} - d_{22}|.

Similarly, the necessary and sufficient conditions for getting an adjusted P-value of zero for the second gene are:

either
  d_{21} > 0,  d_{22} > 0,  d_{21} + d_{22} > |d_{11} - d_{12}|
or
  d_{21} < 0,  d_{22} < 0,  d_{21} + d_{22} < -|d_{11} - d_{12}|.

In other words, to have both raw P-values of zero and adjusted P-values of zero with a sample size of two for two genes, the conditions are:

1. To have a raw P-value of zero, the necessary and sufficient condition is that both observations are in the same direction (either both are bigger than zero or both are smaller than zero).

2. To have an adjusted P-value of zero, the necessary and sufficient conditions that need to be satisfied are:

(a) Both observations for the same gene are in the same direction.

(b) The sum of the two observations for one gene is either bigger than the absolute difference of the two observations of the other gene (in the positive direction) or smaller than the negative of that absolute difference (in the negative direction).

If k genes are considered, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero with a sample size of two are:

either  { d_{i1} > 0 and d_{i2} > 0 }
or      { d_{i1} < 0 and d_{i2} < 0 },

for i = 1, 2, ..., k. For getting an adjusted P-value of zero for the ith gene with a sample size of two, the necessary and sufficient conditions are:

either
  d_{i1} > 0,  d_{i2} > 0,  d_{i1} + d_{i2} > max_{j ≠ i, j = 1, 2, ..., k} |d_{j1} - d_{j2}|
or
  d_{i1} < 0,  d_{i2} < 0,  d_{i1} + d_{i2} < -max_{j ≠ i, j = 1, 2, ..., k} |d_{j1} - d_{j2}|,

for i = 1, 2, ..., k.

2.2.2 Conditions for getting adjusted P-values of zero with a sample size of three

When the sample size increases from two to three in each group, the observed differences are d_{ij} = x_{ij} - y_{ij} (i = 1, 2, ..., k and j = 1, 2, 3). The observed

difference matrix for the first two genes is:

( d_{11}  d_{12}  d_{13} )
( d_{21}  d_{22}  d_{23} )

T_1 = (d_{11} + d_{12} + d_{13})/3 and T_2 = (d_{21} + d_{22} + d_{23})/3 will be our observed test statistics for the first two genes when the sample size is three, and there will be 3^3 = 27 complete bootstrap resampled test statistics. The ten bootstrap resamples that will give ten unique test statistic values are:

111, 112, 113, 122, 123, 133, 222, 223, 233, 333,

where 1 is the label for the first paired difference, 2 is the label for the second paired difference, and 3 is the label for the third paired difference. If the bootstrap resamplings all come from the first paired difference, then we will have the following resampled difference matrix for the first two genes:

( d_{11}  d_{11}  d_{11} )
( d_{21}  d_{21}  d_{21} )

The resampled test statistics computed from the above difference matrix are T_{1,b=1} = d_{11} and T_{2,b=1} = d_{21}. If the bootstrap resamplings include the first paired difference twice and the second paired difference once, then the resampled difference matrix is:

( d_{11}  d_{11}  d_{12} )
( d_{21}  d_{21}  d_{22} )

The resampled test statistics computed from the above difference matrix are T_{1,b=2} = (2d_{11} + d_{12})/3 and T_{2,b=2} = (2d_{21} + d_{22})/3. In the post-pivot resampling method, we subtract the average of all resampled test statistics, which is T_1 = (d_{11} + d_{12} + d_{13})/3 for the first gene and T_2 = (d_{21} + d_{22} + d_{23})/3 for the second gene respectively, from each resampled test statistic to get the reference distribution Z_b for both genes. For the first gene, the ten values of Z_{1,b}, in the order of the resample labels above, are:

(2d_{11} - d_{12} - d_{13})/3,  (d_{11} - d_{13})/3,  (d_{11} - d_{12})/3,  (d_{12} - d_{13})/3,  0,
(d_{13} - d_{12})/3,  (2d_{12} - d_{11} - d_{13})/3,  (d_{12} - d_{11})/3,  (d_{13} - d_{11})/3,  (2d_{13} - d_{11} - d_{12})/3,

and the values of Z_{2,b} for the second gene are obtained by replacing d_{1j} with d_{2j}. According to the formula for calculating raw P-values, if |Z_{1,b}| < |T_1| for all b, the raw P-value of the first gene is equal to zero. To have |Z_{1,b}| < |T_1| for all b, the following relationships need to be satisfied:

|d_{11} - d_{13}|/3 < |d_{11} + d_{12} + d_{13}|/3
|d_{11} - d_{12}|/3 < |d_{11} + d_{12} + d_{13}|/3
|d_{12} - d_{13}|/3 < |d_{11} + d_{12} + d_{13}|/3
|2d_{11} - d_{12} - d_{13}|/3 < |d_{11} + d_{12} + d_{13}|/3
|2d_{12} - d_{11} - d_{13}|/3 < |d_{11} + d_{12} + d_{13}|/3
|2d_{13} - d_{11} - d_{12}|/3 < |d_{11} + d_{12} + d_{13}|/3
0 < |d_{11} + d_{12} + d_{13}|/3

From the above inequalities, we derive the following necessary and sufficient conditions for the first gene to have a raw P-value of zero:

either
  d_{11} > 0,  d_{12} > 0,  d_{13} > 0,
  d_{11} + d_{12} > d_{13}/2,  d_{11} + d_{13} > d_{12}/2,  d_{12} + d_{13} > d_{11}/2
or
  d_{11} < 0,  d_{12} < 0,  d_{13} < 0,
  d_{11} + d_{12} < d_{13}/2,  d_{11} + d_{13} < d_{12}/2,  d_{12} + d_{13} < d_{11}/2.

For the second gene, to have |Z_{2,b}| < |T_2| for all b, the following relationships need to be satisfied:

|d_{21} - d_{23}|/3 < |d_{21} + d_{22} + d_{23}|/3
|d_{21} - d_{22}|/3 < |d_{21} + d_{22} + d_{23}|/3
|d_{22} - d_{23}|/3 < |d_{21} + d_{22} + d_{23}|/3
|2d_{21} - d_{22} - d_{23}|/3 < |d_{21} + d_{22} + d_{23}|/3
|2d_{22} - d_{21} - d_{23}|/3 < |d_{21} + d_{22} + d_{23}|/3
|2d_{23} - d_{21} - d_{22}|/3 < |d_{21} + d_{22} + d_{23}|/3
0 < |d_{21} + d_{22} + d_{23}|/3

From the above inequalities, the necessary and sufficient conditions for the second gene to have a raw P-value of zero are:

either
  d_{21} > 0,  d_{22} > 0,  d_{23} > 0,
  d_{21} + d_{22} > d_{23}/2,  d_{21} + d_{23} > d_{22}/2,  d_{22} + d_{23} > d_{21}/2
or
  d_{21} < 0,  d_{22} < 0,  d_{23} < 0,
  d_{21} + d_{22} < d_{23}/2,  d_{21} + d_{23} < d_{22}/2,  d_{22} + d_{23} < d_{21}/2.

If we expand the two-gene case to the k-gene case, the necessary and sufficient conditions for the ith gene to have a raw P-value of zero using the post-pivot resampling method are:

either
  d_{i1} > 0,  d_{i2} > 0,  d_{i3} > 0,
  d_{i1} + d_{i2} > d_{i3}/2,  d_{i1} + d_{i3} > d_{i2}/2,  d_{i2} + d_{i3} > d_{i1}/2
or
  d_{i1} < 0,  d_{i2} < 0,  d_{i3} < 0,
  d_{i1} + d_{i2} < d_{i3}/2,  d_{i1} + d_{i3} < d_{i2}/2,  d_{i2} + d_{i3} < d_{i1}/2,

for i = 1, 2, ..., k.
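The characterization above can be checked mechanically: enumerate every unique bootstrap resample of one gene's paired differences, center at the observed mean, and test whether |T| strictly dominates all centered values. This is a minimal sketch, not dissertation code; the function name and the numeric rows are our own illustrations.

```python
from itertools import combinations_with_replacement

import numpy as np

def raw_p_is_zero(d_row):
    """True iff the post-pivot bootstrap raw P-value for one gene is zero,
    i.e. |T| = |mean(d)| strictly exceeds |Z_b| for every centered
    bootstrap mean Z_b = mean(resampled d) - mean(d)."""
    d = np.asarray(d_row, dtype=float)
    n = len(d)
    T = d.mean()
    Z = [d[list(idx)].mean() - T
         for idx in combinations_with_replacement(range(n), n)]
    return bool(all(abs(z) < abs(T) for z in Z))

# n = 2: a zero raw P-value iff both differences share a sign
assert raw_p_is_zero([3.0, 2.0]) and raw_p_is_zero([-1.0, -4.0])
assert not raw_p_is_zero([3.0, -2.0])
# n = 3: same signs alone are not enough; [10, 1, 1] fails d_2 + d_3 > d_1 / 2
assert raw_p_is_zero([3.0, 2.0, 4.0])
assert not raw_p_is_zero([10.0, 1.0, 1.0])
```

The last pair of checks illustrates why the pairwise-sum inequalities appear alongside the sign conditions when the sample size is three.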

To have an adjusted P-value of zero for the first gene when we only have two genes, the following relationships need to be satisfied:

max(|d_{11} - d_{13}|/3, |d_{21} - d_{23}|/3) < |d_{11} + d_{12} + d_{13}|/3
max(|d_{11} - d_{12}|/3, |d_{21} - d_{22}|/3) < |d_{11} + d_{12} + d_{13}|/3
max(|d_{12} - d_{13}|/3, |d_{22} - d_{23}|/3) < |d_{11} + d_{12} + d_{13}|/3
max(|2d_{11} - d_{12} - d_{13}|/3, |2d_{21} - d_{22} - d_{23}|/3) < |d_{11} + d_{12} + d_{13}|/3
max(|2d_{12} - d_{11} - d_{13}|/3, |2d_{22} - d_{21} - d_{23}|/3) < |d_{11} + d_{12} + d_{13}|/3
max(|2d_{13} - d_{11} - d_{12}|/3, |2d_{23} - d_{21} - d_{22}|/3) < |d_{11} + d_{12} + d_{13}|/3
0 < |d_{11} + d_{12} + d_{13}|/3

The above inequalities give us the following necessary and sufficient conditions for getting an adjusted P-value of zero for the first gene:

either
  d_{11} > 0,  d_{12} > 0,  d_{13} > 0,
  d_{11} + d_{12} > d_{13}/2,  d_{11} + d_{13} > d_{12}/2,  d_{12} + d_{13} > d_{11}/2,
  d_{11} + d_{12} + d_{13} > max(|d_{21} - d_{23}| + |d_{21} - d_{22}|, |d_{21} - d_{22}| + |d_{22} - d_{23}|, |d_{21} - d_{23}| + |d_{22} - d_{23}|)
or
  d_{11} < 0,  d_{12} < 0,  d_{13} < 0,
  d_{11} + d_{12} < d_{13}/2,  d_{11} + d_{13} < d_{12}/2,  d_{12} + d_{13} < d_{11}/2,
  d_{11} + d_{12} + d_{13} < -max(|d_{21} - d_{23}| + |d_{21} - d_{22}|, |d_{21} - d_{22}| + |d_{22} - d_{23}|, |d_{21} - d_{23}| + |d_{22} - d_{23}|).

If we have k genes instead of two genes, we need to solve the following inequalities to get an adjusted P-value of zero for the ith gene (i = 1, ..., k):

max_{l=1,...,k} (|d_{l1} - d_{l3}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
max_{l=1,...,k} (|d_{l1} - d_{l2}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
max_{l=1,...,k} (|d_{l2} - d_{l3}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
max_{l=1,...,k} (|2d_{l1} - d_{l2} - d_{l3}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
max_{l=1,...,k} (|2d_{l2} - d_{l1} - d_{l3}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
max_{l=1,...,k} (|2d_{l3} - d_{l1} - d_{l2}|/3) < |d_{i1} + d_{i2} + d_{i3}|/3
0 < |d_{i1} + d_{i2} + d_{i3}|/3

The following necessary and sufficient conditions are derived for getting an adjusted P-value of zero for the ith gene when the sample size is three in each group:

either
  d_{i1} > 0,  d_{i2} > 0,  d_{i3} > 0,
  d_{i1} + d_{i2} > d_{i3}/2,  d_{i1} + d_{i3} > d_{i2}/2,  d_{i2} + d_{i3} > d_{i1}/2,
  d_{i1} + d_{i2} + d_{i3} > max_{l ≠ i, l = 1, 2, ..., k} (|d_{l1} - d_{l3}| + |d_{l1} - d_{l2}|, |d_{l1} - d_{l2}| + |d_{l2} - d_{l3}|, |d_{l1} - d_{l3}| + |d_{l2} - d_{l3}|)
or
  d_{i1} < 0,  d_{i2} < 0,  d_{i3} < 0,
  d_{i1} + d_{i2} < d_{i3}/2,  d_{i1} + d_{i3} < d_{i2}/2,  d_{i2} + d_{i3} < d_{i1}/2,
  d_{i1} + d_{i2} + d_{i3} < -max_{l ≠ i, l = 1, 2, ..., k} (|d_{l1} - d_{l3}| + |d_{l1} - d_{l2}|, |d_{l1} - d_{l2}| + |d_{l2} - d_{l3}|, |d_{l1} - d_{l3}| + |d_{l2} - d_{l3}|),

for i = 1, 2, ..., k.

2.3 Conditions for getting adjusted P-values of zero using the pre-pivot resampling method

For paired data, the comparison of two groups is equivalent to a one-sample problem. The pre-pivot resampling method first subtracts the difference of the two group means and then resamples the residuals with replacement for paired data. Since (x_i - x̄) - (y_i - ȳ) = (x_i - y_i) - (x̄ - ȳ), the null distribution of the test statistic estimated by the pre-pivot resampling method is the same as that estimated by the post-pivot resampling method for paired data, as shown below.

With a sample size of n, the observed test statistic is d̄_i = (d_{i1} + d_{i2} + ... + d_{in})/n for the ith gene. For the post-pivot resampling method, there are ten unique bootstrap test statistics calculated from the resamples for each gene when the sample size is

three (n = 3). The bootstrap resampled test statistics matrix T_b has one row per gene; for the ith gene (i = 1, 2, ..., k), the ten unique values, in the order of the resample labels above, are:

d_{i1},  (2d_{i1} + d_{i2})/3,  (2d_{i1} + d_{i3})/3,  (d_{i1} + 2d_{i2})/3,  (d_{i1} + d_{i2} + d_{i3})/3,
(d_{i1} + 2d_{i3})/3,  d_{i2},  (2d_{i2} + d_{i3})/3,  (d_{i2} + 2d_{i3})/3,  d_{i3}.

The estimated mean vector Ê(T_b) is:

( (d_{11} + d_{12} + d_{13})/3,  (d_{21} + d_{22} + d_{23})/3,  ...,  (d_{k1} + d_{k2} + d_{k3})/3 )',

and the estimated null distribution matrix Z_b has ith row:

(2d_{i1} - d_{i2} - d_{i3})/3,  (d_{i1} - d_{i3})/3,  (d_{i1} - d_{i2})/3,  (d_{i2} - d_{i3})/3,  0,
(d_{i3} - d_{i2})/3,  (2d_{i2} - d_{i1} - d_{i3})/3,  (d_{i2} - d_{i1})/3,  (d_{i3} - d_{i1})/3,  (2d_{i3} - d_{i1} - d_{i2})/3.

The residuals for paired data sets using the pre-pivot resampling method are:

d_{i1} - (x̄_i - ȳ_i),  d_{i2} - (x̄_i - ȳ_i),  d_{i3} - (x̄_i - ȳ_i),  for i = 1, 2, ..., k,

where x̄_i - ȳ_i = (x_{i1} + x_{i2} + x_{i3})/3 - (y_{i1} + y_{i2} + y_{i3})/3 = (d_{i1} + d_{i2} + d_{i3})/3. The number of unique bootstrap resampled test statistics from the pre-pivot resampling method is C(n + n - 1, n) = (2n - 1)!/(n!(n - 1)!), which is the same as that from the post-pivot resampling method, since n data points are resampled from the n paired differences with replacement in both methods. Therefore, there are ten unique resampled test statistic values for each gene when the sample size n is three. The calculated bootstrap test statistics matrix, which is the estimated null distribution of the test statistic, is the same as the matrix Z_b shown above.
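Both the C(2n - 1, n) count of unique resamples and the pre-pivot/post-pivot equivalence claimed above can be verified numerically. This is a minimal sketch with a made-up difference vector; the variable names are ours.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np

d = np.array([1.5, -0.5, 2.0])   # made-up paired differences for one gene, n = 3
n = len(d)
resamples = list(combinations_with_replacement(range(n), n))

# C(n + n - 1, n) = (2n - 1)!/(n!(n - 1)!) unique resamples with replacement
assert len(resamples) == comb(2 * n - 1, n) == 10

# Post-pivot: bootstrap the differences, then center each mean at d-bar
post = np.array([d[list(idx)].mean() for idx in resamples]) - d.mean()
# Pre-pivot: form residuals d_j - d-bar first, then bootstrap their means
pre = np.array([(d - d.mean())[list(idx)].mean() for idx in resamples])

# Identical estimated null distributions, as claimed for paired data
assert np.allclose(post, pre)
```

The ten entries of `post` reproduce one row of the Z_b matrix above, including the zero entry contributed by the resample that picks each paired difference exactly once.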


More information

A Simple, Graphical Procedure for Comparing Multiple Treatment Effects

A Simple, Graphical Procedure for Comparing Multiple Treatment Effects A Simple, Graphical Procedure for Comparing Multiple Treatment Effects Brennan S. Thompson and Matthew D. Webb May 15, 2015 > Abstract In this paper, we utilize a new graphical

More information

Biochip informatics-(i)

Biochip informatics-(i) Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview

Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Introduction to Empirical Processes and Semiparametric Inference Lecture 01: Introduction and Overview Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations

More information

c 2011 Kuo-mei Chen ALL RIGHTS RESERVED

c 2011 Kuo-mei Chen ALL RIGHTS RESERVED c 2011 Kuo-mei Chen ALL RIGHTS RESERVED ADMISSIBILITY AND CONSISTENCY FOR MULTIPLE COMPARISON PROBLEMS WITH DEPENDENT VARIABLES BY KUO-MEI CHEN A dissertation submitted to the Graduate School New Brunswick

More information

Large-Scale Multiple Testing of Correlations

Large-Scale Multiple Testing of Correlations Large-Scale Multiple Testing of Correlations T. Tony Cai and Weidong Liu Abstract Multiple testing of correlations arises in many applications including gene coexpression network analysis and brain connectivity

More information

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Test (2008) 17: 461 471 DOI 10.1007/s11749-008-0134-6 DISCUSSION Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Joseph P. Romano Azeem M. Shaikh

More information

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials

Mixtures of multiple testing procedures for gatekeeping applications in clinical trials Research Article Received 29 January 2010, Accepted 26 May 2010 Published online 18 April 2011 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.4008 Mixtures of multiple testing procedures

More information

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Holger Schwender, Andreas Krause, and Katja Ickstadt Abstract Microarrays enable to measure the expression levels of tens

More information

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES

A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES A NEW APPROACH FOR LARGE SCALE MULTIPLE TESTING WITH APPLICATION TO FDR CONTROL FOR GRAPHICALLY STRUCTURED HYPOTHESES By Wenge Guo Gavin Lynch Joseph P. Romano Technical Report No. 2018-06 September 2018

More information

Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments

Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments Jie Chen 1 Merck Research Laboratories, P. O. Box 4, BL3-2, West Point, PA 19486, U.S.A. Telephone:

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2005 Paper 168 Multiple Testing Procedures and Applications to Genomics Merrill D. Birkner Katherine

More information

Finite Population Correction Methods

Finite Population Correction Methods Finite Population Correction Methods Moses Obiri May 5, 2017 Contents 1 Introduction 1 2 Normal-based Confidence Interval 2 3 Bootstrap Confidence Interval 3 4 Finite Population Bootstrap Sampling 5 4.1

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Control of the False Discovery Rate under Dependence using the Bootstrap and Subsampling

Control of the False Discovery Rate under Dependence using the Bootstrap and Subsampling Institute for Empirical Research in Economics University of Zurich Working Paper Series ISSN 1424-0459 Working Paper No. 337 Control of the False Discovery Rate under Dependence using the Bootstrap and

More information

Familywise Error Rate Controlling Procedures for Discrete Data

Familywise Error Rate Controlling Procedures for Discrete Data Familywise Error Rate Controlling Procedures for Discrete Data arxiv:1711.08147v1 [stat.me] 22 Nov 2017 Yalin Zhu Center for Mathematical Sciences, Merck & Co., Inc., West Point, PA, U.S.A. Wenge Guo Department

More information

A better way to bootstrap pairs

A better way to bootstrap pairs A better way to bootstrap pairs Emmanuel Flachaire GREQAM - Université de la Méditerranée CORE - Université Catholique de Louvain April 999 Abstract In this paper we are interested in heteroskedastic regression

More information

Probabilistic Inference for Multiple Testing

Probabilistic Inference for Multiple Testing This is the title page! This is the title page! Probabilistic Inference for Multiple Testing Chuanhai Liu and Jun Xie Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: chuanhai,

More information

Hunting for significance with multiple testing

Hunting for significance with multiple testing Hunting for significance with multiple testing Etienne Roquain 1 1 Laboratory LPMA, Université Pierre et Marie Curie (Paris 6), France Séminaire MODAL X, 19 mai 216 Etienne Roquain Hunting for significance

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński

Multiple comparisons of slopes of regression lines. Jolanta Wojnar, Wojciech Zieliński Multiple comparisons of slopes of regression lines Jolanta Wojnar, Wojciech Zieliński Institute of Statistics and Econometrics University of Rzeszów ul Ćwiklińskiej 2, 35-61 Rzeszów e-mail: jwojnar@univrzeszowpl

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Looking at the Other Side of Bonferroni

Looking at the Other Side of Bonferroni Department of Biostatistics University of Washington 24 May 2012 Multiple Testing: Control the Type I Error Rate When analyzing genetic data, one will commonly perform over 1 million (and growing) hypothesis

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

A Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments

A Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments A Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments Jie Chen 1 Merck Research Laboratories, P. O. Box 4, BL3-2, West Point, PA 19486, U.S.A. Telephone:

More information

False discovery rate control for non-positively regression dependent test statistics

False discovery rate control for non-positively regression dependent test statistics Journal of Statistical Planning and Inference ( ) www.elsevier.com/locate/jspi False discovery rate control for non-positively regression dependent test statistics Daniel Yekutieli Department of Statistics

More information

FALSE DISCOVERY AND FALSE NONDISCOVERY RATES IN SINGLE-STEP MULTIPLE TESTING PROCEDURES 1. BY SANAT K. SARKAR Temple University

FALSE DISCOVERY AND FALSE NONDISCOVERY RATES IN SINGLE-STEP MULTIPLE TESTING PROCEDURES 1. BY SANAT K. SARKAR Temple University The Annals of Statistics 2006, Vol. 34, No. 1, 394 415 DOI: 10.1214/009053605000000778 Institute of Mathematical Statistics, 2006 FALSE DISCOVERY AND FALSE NONDISCOVERY RATES IN SINGLE-STEP MULTIPLE TESTING

More information

Procedures controlling generalized false discovery rate

Procedures controlling generalized false discovery rate rocedures controlling generalized false discovery rate By SANAT K. SARKAR Department of Statistics, Temple University, hiladelphia, A 922, U.S.A. sanat@temple.edu AND WENGE GUO Department of Environmental

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data Faming Liang, Chuanhai Liu, and Naisyin Wang Texas A&M University Multiple Hypothesis Testing Introduction

More information

A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES

A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES A TUTORIAL ON THE INHERITANCE PROCEDURE FOR MULTIPLE TESTING OF TREE-STRUCTURED HYPOTHESES by Dilinuer Kuerban B.Sc. (Statistics), Southwestern University of Finance & Economics, 2011 a Project submitted

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

This paper has been submitted for consideration for publication in Biometrics

This paper has been submitted for consideration for publication in Biometrics BIOMETRICS, 1 10 Supplementary material for Control with Pseudo-Gatekeeping Based on a Possibly Data Driven er of the Hypotheses A. Farcomeni Department of Public Health and Infectious Diseases Sapienza

More information

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap

Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and the Bootstrap University of Zurich Department of Economics Working Paper Series ISSN 1664-7041 (print) ISSN 1664-705X (online) Working Paper No. 254 Multiple Testing of One-Sided Hypotheses: Combining Bonferroni and

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Confidence Estimation Methods for Neural Networks: A Practical Comparison

Confidence Estimation Methods for Neural Networks: A Practical Comparison , 6-8 000, Confidence Estimation Methods for : A Practical Comparison G. Papadopoulos, P.J. Edwards, A.F. Murray Department of Electronics and Electrical Engineering, University of Edinburgh Abstract.

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Permutation Tests and Multiple Testing

Permutation Tests and Multiple Testing Master Thesis Permutation Tests and Multiple Testing Jesse Hemerik Leiden University Mathematical Institute Track: Applied Mathematics December 2013 Thesis advisor: Prof. dr. J.J. Goeman Leiden University

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2005 Paper 198 Quantile-Function Based Null Distribution in Resampling Based Multiple Testing Mark J.

More information

Inferences about Parameters of Trivariate Normal Distribution with Missing Data

Inferences about Parameters of Trivariate Normal Distribution with Missing Data Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 7-5-3 Inferences about Parameters of Trivariate Normal Distribution with Missing

More information

Supplementary Materials for Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach

Supplementary Materials for Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach Supplementary Materials for Residuals and Diagnostics for Ordinal Regression Models: A Surrogate Approach Part A: Figures and tables Figure 2: An illustration of the sampling procedure to generate a surrogate

More information

Aliaksandr Hubin University of Oslo Aliaksandr Hubin (UIO) Bayesian FDR / 25

Aliaksandr Hubin University of Oslo Aliaksandr Hubin (UIO) Bayesian FDR / 25 Presentation of The Paper: The Positive False Discovery Rate: A Bayesian Interpretation and the q-value, J.D. Storey, The Annals of Statistics, Vol. 31 No.6 (Dec. 2003), pp 2013-2035 Aliaksandr Hubin University

More information

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook BIOMETRY THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH THIRD E D I T I O N Robert R. SOKAL and F. James ROHLF State University of New York at Stony Brook W. H. FREEMAN AND COMPANY New

More information

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations and Theses in Statistics Statistics, Department of 2009 DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING

More information

Sanat Sarkar Department of Statistics, Temple University Philadelphia, PA 19122, U.S.A. September 11, Abstract

Sanat Sarkar Department of Statistics, Temple University Philadelphia, PA 19122, U.S.A. September 11, Abstract Adaptive Controls of FWER and FDR Under Block Dependence arxiv:1611.03155v1 [stat.me] 10 Nov 2016 Wenge Guo Department of Mathematical Sciences New Jersey Institute of Technology Newark, NJ 07102, U.S.A.

More information

The Pennsylvania State University The Graduate School A BAYESIAN APPROACH TO FALSE DISCOVERY RATE FOR LARGE SCALE SIMULTANEOUS INFERENCE

The Pennsylvania State University The Graduate School A BAYESIAN APPROACH TO FALSE DISCOVERY RATE FOR LARGE SCALE SIMULTANEOUS INFERENCE The Pennsylvania State University The Graduate School A BAYESIAN APPROACH TO FALSE DISCOVERY RATE FOR LARGE SCALE SIMULTANEOUS INFERENCE A Thesis in Statistics by Bing Han c 2007 Bing Han Submitted in

More information

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons:

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons: STAT 263/363: Experimental Design Winter 206/7 Lecture January 9 Lecturer: Minyong Lee Scribe: Zachary del Rosario. Design of Experiments Why perform Design of Experiments (DOE)? There are at least two

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38 BIO5312 Biostatistics Lecture 11: Multisample Hypothesis Testing II Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/8/2016 1/38 Outline In this lecture, we will continue to

More information