SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES


1 SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES A thesis submitted to the faculty of San Francisco State University In partial fulfillment of The requirements for The degree Master of Arts in Mathematics by Vera Klimkovsky San Francisco, California May, 2006

2 Copyright by Vera Klimkovsky 2006

3 CERTIFICATION OF APPROVAL I certify that I have read Simulation studies and implementation of bootstrap-based multiple testing procedures by Vera Klimkovsky, and that in my opinion this work meets the criteria for approving a thesis in partial fulfillment of the requirements for the degree: Master of Arts in Mathematics at San Francisco State University. Mohammad R. Kafai Professor of Mathematics Eric Hayashi Professor of Mathematics Sergei Ovchinnikov Professor of Mathematics

4 SIMULATION STUDIES AND IMPLEMENTATION OF BOOTSTRAP-BASED MULTIPLE TESTING PROCEDURES Vera Klimkovsky San Francisco State University 2006 In this project, I use Statistical Analysis System (SAS) software to implement and perform simulation studies of new multiple testing procedures proposed recently. The key feature of these procedures is to use bootstrap re-sampling techniques (parametric or nonparametric) to obtain a consistent estimator of the test statistics null distribution to derive the cut-offs. Theoretically, it has been shown that the test statistics null distribution will be asymptotically multivariate normal for the asymptotically linear estimators of the parameters under consideration, and the bootstrap estimated null distribution will provide asymptotic control of type I error rate. I certify that the Abstract is a correct representation of the content of this thesis. Chair, Thesis Committee Date

5 v ACKNOWLEDGEMENTS First of all, I would like to acknowledge my thesis advisor Dr. Mohammad Kafai whose passion for the Theory of Probability and Statistics inspired my interest in the subject matter. I like to thank Dr. Mohammad Kafai for his continuous support and encouragement. My special thanks go to all my thesis committee, Mohammad Kafai, Eric Hayashi, and Sergei Ovchinnikov, for their time to carefully review earlier versions of this thesis, and valuable suggestions which led to a substantially improved final version. Also, many thanks to Dr. David Meredith and my thesis advisors for being so patient while I took my time to work on this project. Thank you for believing in me. In addition, I would like to thank all my teachers of mathematics at San Francisco State University for inspiration, motivation, encouragement and support in studying mathematics throughout the years of my graduate work.

Contents

List of Tables ix
List of Figures x

1 Introduction
   Why multiple hypothesis testing?
   Example: HLA-disease associations
   Thesis overview

2 Methods of multiple hypothesis testing
   Background: Analysis of Variance
   Multiple Hypothesis Testing
      Basic Concepts Defined
      Classification of methods
   Single-Step Procedures
      LSD Test
      Bonferroni procedure
      Šidák's Method
      Tukey's Method
      Tukey-Kramer method
      Scheffé Method
   Stepwise Procedures
      Holm's procedure
      Shaffer method
   Choosing the right method

3 Bootstrap-based resampling techniques
   Introduction
   Basic definitions
      Nonparametric bootstrap
      Parametric bootstrap
   Bootstrap estimate of standard error
   Bootstrap estimate of bias
   Bootstrap confidence intervals
      The bootstrap-t interval
   Bootstrap-based hypothesis testing
   Robustness and failure of bootstrap
   Jackknife as an approximation to Bootstrap
      Jackknife samples and estimates
      Jackknife or Bootstrap?

4 Proposed bootstrap multiple testing procedures
   Microarrays as a Motivating Factor
      Background
      Microarray Experiment
   Multiple Hypothesis Testing Model
      Null hypotheses
      Hypothesis testing
   Multiple Testing Procedures
      Single-step common-quantile procedure
      Single-step common-cut-off procedure
      Proposed test statistics null distribution
   Bootstrap-based single step procedures
      Bootstrap estimation of the null distribution
   Advantages of the Proposed Procedures

5 Simulation Studies
   Formulation of the Problem
   Objectives of Experiments
   Tests about the Mean
      Normal Distribution Models
      Poisson Distribution Models

6 Implementation
   Software
   Program Design and Implementation Details

7 Conclusion
   Summary of the Proposed Methods
   What has been done in the study
   Areas Left for Future investigation

A SAS code 81
   A.1 Supplydata SAS code
   A.2 Procedure 3 SAS code
   A.3 Procedure 1 SAS code

List of Tables

3.1 Evidence against H_0
4.1 Expression data from two groups of subjects: cancer patients and healthy controls. The data are already normalized [10]
4.2 n realizations of a random g-vector
4.3 Type I and Type II errors
Summary of postulated and simulated models
Summary of test statistic null distribution, theoretical and bootstrap estimated
Summary of test statistic null distribution, theoretical and bootstrap estimated
The results of the twenty simultaneous tests about the mean vector of the multivariate normal distribution
Summary of postulated and simulated models
Summary of test statistic null distribution (from Poisson), theoretical and bootstrap estimated
The results of the twenty simultaneous tests about the mean vector of the multivariate Poisson distribution

List of Figures

5.1 Empirical Normal distribution
Estimated bootstrap distribution
Empirical Poisson Distribution (λ = 2.5)
Estimated Bootstrap Distribution
The main flowchart diagram of the project
The flowchart diagram of supplydata.sas
The flow chart diagram for procedure
The flow chart diagram for procedure

Chapter 1 Introduction

But it is not always so; it may happen that small differences in the initial conditions produce very great ones in the final phenomena. Jules H. Poincaré

1.1 Why multiple hypothesis testing?

Multiple Hypothesis Testing is the testing of more than one hypothesis at the same time. It represents a rich field of scientific research within the branch of inferential statistics that addresses multiple comparisons. For over half a century, beginning with world-famous statisticians such as Fisher, Tukey, Bonferroni, and Duncan, to name just a few, statisticians have worked on the development and improvement of multiple comparison procedures based on the parameters under consideration and various assumptions about the underlying distributions, still leaving room for improvement or alternative implementation. Even the very need for multiple hypothesis testing

remains controversial: in which situations should multiple testing methods be applied, and whether they should be applied at all. What considerations must we have when addressing the question of statistical significance? To keep it simple, let us suppose that we would like to study how vitamins affect people's strength. In our experiment, we randomly divide, say, 100 people into 5 groups of 20 and ask each person to take a daily pill. One group is assigned to be the control group and takes a placebo (a pill that contains no vitamins at all). The remaining four groups are treatment groups and take, respectively, a low dose of vitamin brand A, a high dose of vitamin brand A, a low dose of vitamin brand B, and a high dose of vitamin brand B. The response variable is a certain characteristic of people's strength. Is there a significant difference in responses between the control group and each treatment group? Is there a significant difference in responses between groups taking different dosages, different vitamin brands, or different dosages and different brands? The attempt to answer all these questions leads to (5 choose 2) = 10 pairwise comparisons. If we test each null hypothesis of no significant difference between two groups at the 5% level of significance and all tests are independent, then the probability that we falsely reject at least one true null hypothesis is P(at least one false positive) = 1 − (0.95)^10 ≈ 0.401. In other words, with only 10 hypothesis tests performed simultaneously, there is a slightly over 40% chance that the researchers will report significant findings when in reality no effect exists. This probability increases rapidly with the number of tests required. With as many as 20 tests, the probability of at least one false rejection

13 3 already reaches 64%. The primary concern of the theory of the multiple hypothesis testing is to develop methods that would adjust or account for multiplicity effect. 1.2 Example: HLA-disease associations Questions raised in various research fields such as medicine, economics, engineering sciences, and social sciences often call for Multiple Hypothesis Testing. Human genetics White blood cells are components of blood and are part of the immune system. Another name for white blood cells is leukocytes or immune cells. White blood cells carry a group of genes called the human leukocyte antigen (HLA) system. It consists of several closely linked genetic loci on chromosome number 6. The loci within the HLA system are highly polymorphic showing numerous alleles (i.e. alternative forms of a single gene). It has been demonstrated that some human leukocyte antigens (genetic markers) are linked to particular diseases. For instance, it s been shown that HLA-B5 marker is associated with Hodgkin s disease, and HLA-B27 marker is associated with ankylosing spondylitis. Can there be associations of particular markers with other diseases?... literally scores of studies of HLA-disease associations have been published. Each of these

14 4 studies has attempted to find an HLA association with some particular disease, and a suprising number have succeeded. A few of these associations, for example that between HLA-B27 and ankylosing spondylitis, have been repeatedly confirmed and are very striking and obviously real. Most of the reported associations, however, have not withstood closer examination.... The problem lies in the large number of intercorrelated hypothesis tests, one for each antigen, a procedure that has a high probability of yielding a significant result for at least one of the tests, even when no real association exists. [9] 1.3 Thesis overview This thesis focuses on the implementation and simulation studies of multiple testing procedures that have been recently developed [1]. The objective of experiments performed in these studies is to demonstrate that these procedures based on bootstrapresampling techniques provide asymptotic control for Type I Error rate for a wide range of multiple hypothesis testing problems. Chapter 2, Methods of Multiple Hypothesis Testing, outlines some of the wellknown multiple testing procedures and their features. Chapter 3, Bootstrap-Based Resampling Techinques, introduces the reader to the resampling techniques and discusses their effectiveness and robustness. Chapter 4, Proposed Bootstrap Multiple Testing Procedures, discusses the new procedures. Chapter 5, Simulation Studies, outlines the main objectives of simulation studies and presents the results of a se-

15 5 ries of experiments. Chapter 6, Implementation, is devoted to implementation of these new procedures with the Statistical Analysis System (SAS) programming language. Chapter 7, Conclusion, summarizes the results and discusses the advantages and disadvantages of new procedures in the light of existing procedures and classical methods.

16 Chapter 2 Methods of multiple hypothesis testing Euclid taught me that without assumptions there is no proof. Therefore, in any argument, examine the assumptions. E. T. Bell 2.1 Background: Analysis of Variance Historically, many multiple comparison procedures originate from the test addressing equality of the group means. The framework of the test lies within the Analysis of Variance Procedure. Suppose we want to compare the average effects of k treatments. One question we might ask is, Do all treatments produce the same effect? Then, the null and 6

alternative hypotheses will be stated as follows:

H_0: µ_1 = µ_2 = ... = µ_k
H_1: not all the µ_j's are equal

An extreme value of the F statistic at a given level α only indicates that one or more of the µ_j's differ significantly from the others. The rejection of H_0 gives no information about which µ_j's are different and which are the same. To answer these kinds of questions, many comparisons of the means might be needed. These problems give rise to Multiple Comparison Methods or, more generally, Multiple Hypothesis Testing.

2.2 Multiple Hypothesis Testing

Multiple Hypothesis Testing is a branch of simultaneous inference and refers to methods designed to test two or more hypotheses simultaneously. The desirable property of such methods or procedures is to control the number of falsely rejected hypotheses in a probabilistic sense. Other considerations are also important: the ability of the method (procedure) to correctly recognize the set of true hypotheses (known as the power of the test), and the ability of the procedure to account for the dependence structure and/or logical constraints among the set of hypotheses to be tested. To help build the theoretical foundation for the ideas presented in this and the following chapters, let us introduce important notions and terminology that will be used throughout the entire paper.

Basic Concepts Defined

Let H_0 and H_1 represent the null and alternative hypotheses, respectively. In Multiple Hypothesis Testing problems we have a collection of null and corresponding alternative hypotheses {(H_0j, H_1j) : j = 1, ..., m}, where m ≥ 2 is the number of hypotheses to be tested. A Multiple Hypotheses Test is a procedure or an algorithm that leads to a decision whether or not each null hypothesis should be rejected in favor of its alternative. A family of (k choose 2) hypotheses, for example, can be stated as follows, introducing the problem of pairwise comparisons among k group means:

H_0: µ_i = µ_j versus H_1: µ_i ≠ µ_j, for all i < j, i, j ∈ {1, 2, ..., k}.

To compare the means among k = 5 groups, for instance, (5 choose 2) = 5!/(2! 3!) = 10 pairwise comparisons are required. Other examples of families of hypotheses are comparisons of group means to a control or tests of general contrasts. Generally speaking, the parameters of interest in hypothesis testing are means, variances, covariances, correlations, or parameters of a regression model.
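To make the multiplicity effect of Chapter 1 concrete, the short SAS data step below (a sketch; the data set and variable names are illustrative and not part of the thesis programs) computes the probability of at least one false positive, 1 − (1 − α)^m, when all m = (k choose 2) pairwise tests are independent and each is carried out at level α.

data fwer_inflation;
   alpha = 0.05;
   do k = 2 to 10;                        /* number of group means            */
      m = comb(k, 2);                     /* number of pairwise comparisons   */
      p_false_pos = 1 - (1 - alpha)**m;   /* P(at least one false positive)   */
      output;
   end;
run;

proc print data=fwer_inflation noobs;
   var k m p_false_pos;
run;

With k = 5 groups this gives m = 10 comparisons and a probability of about 0.401, matching the 40% figure quoted in Section 1.1.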

19 9 Overall, the procedures for hypotheses testing can be distinguished by the type of inference or comparisons they make and the strength of the inference they provide. The rejection of a true null hypothesis is classified as Type I error (false positive is another term frequently used in the theory of hypothesis testing). If the procedure fails to reject a false null hypothesis, we say a Type II error (false negative) is committed. In multiple hypothesis testing, therefore, the procedure may result in more than one false positive. In this respect, we can define the rate of false positives or Type I error rate. Commonly-used Type I error rate can be defined as follows [1]: Per-comparison error rate (PCER) is the expected proportion of Type I errors among the m tests. Per-family error rate (PFER) is the expected number of Type I errors. Median-based per-family error rate (mpfer) is the median number of Type I errors. Family-wise error rate (FWER) is the probability of at least one Type I error. Generalized Family-wise error rate (gfwer) is the probability of at least (k+1) Type I errors, where (k + 1) cannot exceed the number of true null hypotheses. Strong and weak control When appropriate, procedures can be compared or classified by the type of control they provide for Type I error rate.

Definition 1. For any given Multiple Testing Procedure, if

Pr(reject at least one H_i, i = j_1, ..., j_t | H_{j_1}, ..., H_{j_t} are true) ≤ α

for any configuration of true nulls H_{j_1}, ..., H_{j_t}, then the MTP controls the FWER in the strong sense.

Definition 2. A Multiple Testing Procedure is said to control the FWER in the weak sense if

Pr(reject at least one H_i | all H_i are true) ≤ α.

Another criterion that can be applied when characterizing a certain procedure or comparing methods is whether the test is conservative.

Definition 3. A hypothesis test is conservative if the actual significance level of the test is smaller than the stated significance level of the test. A conservative test may incorrectly fail to reject the null hypothesis, and thus is less powerful than expected.

Subset pivotality condition

Let m be the number of hypotheses being tested and {P_1, ..., P_m} a vector of unadjusted p-values. (Note that the P_i's denote the random p-values, as opposed to the p_i's, the experimentally observed p-values.) [2]

Definition 4. The distribution of the vector {P_1, ..., P_m} is said to have the subset pivotality property if the joint distribution of the subvector {P_i : i ∈ K} is identical

under the restrictions ∩_{i∈K} H_0i and H_0^C, for all subsets K = {i_1, ..., i_j} of true null hypotheses. Here, H_0^C denotes the complete null specification (H_0^C = ∩_{i=1}^m {H_0i is true}). As stated in [2]: "The subset pivotality condition is important for two reasons. First, resampling is particularly convenient under this condition: resampling is done under the complete null hypothesis H_0^C, rather than under partial hypotheses H_0^K. Second, when subset pivotality holds, it will be shown that such resampling-based methods control the Family Wise Error rate in the strong sense (at least approximately, under asymptotic subset pivotality). Without this condition, resampling under H_0^C can be assumed to control only the FWEC." In [2], FWEC denotes the FWER calculated under the complete null hypothesis.

Classification of methods

Methods of Multiple Hypothesis Testing can be classified as follows: single-step methods, and stepwise methods that can be further subdivided into step-up and step-down methods.

Definition 5. Single-step methods are simultaneous test procedures that perform an equivalent multiplicity adjustment for all tests, regardless of the ordering of the observed p-values p_1, ..., p_k, and without considering any predetermined sequence of hypotheses.

Definition 6. Stepwise simultaneous test procedures allow different adjustment techniques for different hypotheses, depending upon how the hypotheses are ordered. Hypotheses may be ordered according to the size of the p-values, or by experimental or logical constraints. One may achieve an improvement in power and control of the Type I error rate by the use of stepwise procedures.

"In step-down procedures, the hypotheses corresponding to the most significant test statistics (e.g. smallest unadjusted p-values) are considered successively, with further tests depending on the outcome of earlier ones. As soon as one fails to reject a null hypothesis, no further hypotheses are rejected." [1]

"In step-up procedures, the hypotheses corresponding to the least significant test statistics are considered successively, again with further tests depending on the outcome of earlier ones. As soon as one hypothesis is rejected, all remaining more significant hypotheses are rejected." [1]

In the following subsections, we'll examine procedures representing the class of single-step procedures and the class of stepwise procedures. The Bonferroni procedure (Fisher's second procedure), for example, can be considered a typical single-step procedure, while the Holm procedure is an example of a step-down procedure. The newly proposed procedures introduced in Chapter 4 belong to the class of single-step procedures. Therefore, the focus of the discussion will be on single-step procedures.

2.3 Single-Step Procedures

LSD Test

Fisher's first procedure to account for multiplicity is called the protected Least Significant Difference (LSD) test. The procedure is done in two steps:
Step 1. Apply the ANOVA F-test to check whether the means are significantly different.
Step 2. If the result is significant, perform multiple t-tests, each at level α. If the result is not significant, no additional tests are required and the procedure terminates.
Features: The procedure doesn't control the FWER for all configurations of group means. Control of the FWER is only provided under the null hypothesis that there is no difference in means. Thus, the test controls the FWER in the weak sense.

Bonferroni procedure

The Bonferroni procedure performs the multiple t-tests, each at level α′ = α/(k choose 2) in the pairwise-comparison setting. The decision rule can be stated in terms of p-values: if p_j < α/k, where k is the number of tests, H_0j should be rejected. For the Bonferroni procedure, adjusted p-values p̃_j can be defined by p̃_j = min(k p_j, 1) and used equivalently in the decision rule: reject H_0j if p̃_j ≤ α.

Features: The procedure controls the FWER in the strong sense. The method is conservative (it fails to account for dependencies among tests). The Bonferroni procedure can be used to obtain confidence intervals for all pairwise differences among the group means.

Šidák's Method

Šidák's Method rejects H_0j when p_j < 1 − (1 − α)^{1/k}, where p_j is the corresponding p-value. The adjusted p-value is computed as follows: p̃_j = 1 − (1 − p_j)^k.
Features: Šidák's Method is conservative when the p-values are not independently distributed.

Tukey's Method

Tukey's method is designed for the (k choose 2) pairwise comparisons of k individual means, that is, H_0: µ_i = µ_j versus H_1: µ_i ≠ µ_j for all i < j, where i, j ∈ {1, 2, ..., k}. The test is performed using confidence intervals for µ_i − µ_j. The construction of the confidence intervals involves the studentized range, Q_{k,ν} = R/S, where R is the range

25 15 of a set of normally distributed random variables and S is their estimated standard deviation. Features: Tukey s method works for one-way balanced ANOVA. The method accounts for dependencies and thus, achieves more power Tukey-Kramer method The Tukey-Kramer method is a generalization of Tukey s method designed to work in the case of unbalanced design. Features: As the differences between the group size increase, the method becomes more conservative. The method controls FWER for means comparisons in a strong sense. The method is more powerful than the Bonferroni, Sidak, or Scheffe methods for pairwise comparisons Scheffé Method Scheffé Method is built within ANOVA framework: If ANOVA gives insignificant results, there will be no significant contrasts declared by the Scheffé Method. Let k

be the number of means to be compared; then two means, µ_i and µ_j, are considered to be significantly different if t_ij² ≥ (k − 1) F(α; k − 1, ν). Observe that the critical value depends on the number of means and not on the number of tests.
Features: The method is appropriate for all possible comparisons. Thus, the method is appropriate for pairwise comparisons, general contrasts, or orthogonal contrasts. The method controls the FWER for all possible contrasts. The method is known to be conservative in the case of pairwise comparisons. The power of the test increases when the number of comparisons is large compared to the number of means. For pairwise comparisons, Šidák's method gives more power and thus is preferable.

2.4 Stepwise Procedures

Holm's procedure

Holm's method is a step-down procedure based on the Bonferroni inequality. Let p_(1), p_(2), ..., p_(k) be the ordered p-values such that p_(1) ≤ p_(2) ≤ ... ≤ p_(k), corresponding to the null hypotheses H_0(1), H_0(2), ..., H_0(k). The decision rule to reject null hypotheses can be conveniently stated in terms of adjusted p-values: reject H_0(j) if p̃_(j) ≤ α.

Here is how one would calculate the adjusted p-values for the Holm procedure [2]:

p̃_(1) = k p_(1)
p̃_(2) = max(p̃_(1), (k − 1) p_(2))
...
p̃_(j) = max(p̃_(j−1), (k − j + 1) p_(j))
...
p̃_(k) = max(p̃_(k−1), p_(k))

Features: The test can be applied to any family of pairwise comparisons (no assumptions required about the model or distribution). The procedure provides strong control of the FWER. The test is conservative.

Shaffer method

Shaffer's method is an improvement over the Holm method. The method incorporates the logical constraints among the hypotheses.
Features:

The method controls the FWER in the strong sense. The method is more powerful than the Holm method (as stated in Westfall).

2.5 Choosing the right method

No method is universal enough to be applied to all possible situations where multiple hypothesis testing is required. Since methods can generally be classified by the type and strength of inference they provide, the choice of method depends on the particular question under consideration. Other factors also play an important role, such as the sample sizes, the assumptions we can make about models and distributions, known or unknown logical constraints, and the dependence structure among the test hypotheses. In addition, single-step procedures are typically based on, or may lead to, the construction of simultaneous confidence intervals, while stepwise procedures are generally more powerful but in most cases do not produce simultaneous confidence intervals.
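As a brief practical illustration of the classical adjustments discussed in this chapter, the following SAS sketch applies the Bonferroni, Šidák, and Holm (step-down Bonferroni) adjustments to a set of raw p-values with PROC MULTTEST. The data set and the p-values are hypothetical; only the variable name raw_p is required by the procedure.

data rawp;                     /* hypothetical unadjusted p-values */
   input raw_p @@;
   datalines;
0.0002 0.011 0.020 0.041 0.13 0.47 0.62 0.81 0.90 0.95
;

proc multtest pdata=rawp bonferroni sidak stepbon out=adjusted;
run;

proc print data=adjusted;      /* raw and adjusted p-values side by side */
run;

Holm's procedure corresponds to the STEPBON (step-down Bonferroni) option; each adjusted p-value in the OUT= data set can be compared directly with the nominal level α.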

Chapter 3 Bootstrap-based resampling techniques

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. Charles Babbage

3.1 Introduction

The theory of statistical inference essentially rests on the construction of sampling distributions and the computation of an accuracy measure for a given statistic. Bias and standard error are typically used to assess the accuracy of an estimator. The standard error of the mean can be estimated analytically from its empirical distribution.

30 20 This is an example of a traditional way of dealing with a situation. However, the traditional approach may fail or have certain disadvantages. Theoretical formulas require assumptions about the model. If the postulated model is different from the true model or the model changes because of added or removed conditions, the results may be invalid and thus, theoretical formulas will have to be derived again for each new problem. Resampling is a method or rather a class of methods used to obtain samples from a given data set and thus estimate or approximate the accuracy and the sampling distribution of a statistic under consideration. Most popular methods are Bootstrap, Jackknife and Permutation methods. In this chapter we focus on the Bootstrap method since the proposed multiple testing procedures are based on a bootstrap resampling technique. The bootstrap method was introduced in 1979 by Efron as a computer-based resampling method to estimate the standard error of a parameter estimate. (For instance, the bootstrap method can be used to estimate the standard error of a sample mean.) Nowadays, the bootstrap can be applied to a variety of statistical procedures. Here are some usages of the bootstrap methods [3]: The bootstrap estimate of a standard error of a statistic from a single sample. The bootstrap estimate of bias. The bootstrap construction of confidence intervals.

The bootstrap applied to hypothesis testing problems.

3.2 Basic definitions

Let X_1, X_2, ..., X_n be independent and identically distributed (iid) random variables from an unknown distribution F. Let θ be a parameter of the distribution F and let θ̂ be a statistic that estimates the parameter of interest θ. The empirical distribution F̂, also denoted by F_n, is defined by

F_n(x) = (1/n) Σ_{i=1}^n I{X_i ≤ x},

where I{X_i ≤ x} is the indicator function: I{X_i ≤ x} = 1 if X_i ≤ x, and 0 otherwise. If x_i is a realization of the random variable X_i, then x = (x_1, x_2, ..., x_n) is a random sample drawn from the unknown distribution F. In the construction of the empirical distribution F̂, each x_i has probability 1/n of occurring. We can also think of a parameter θ as a function of the probability distribution F and of a statistic as a function of the sample x. Let θ = τ(F) and θ̂ = t(x).

Definition 7. The estimator θ̂ is called a plug-in estimate of a parameter θ = τ(F) if θ̂ = τ(F̂).

For instance, the sample mean x̄ = n^{-1} Σ_{i=1}^n x_i is a plug-in estimate of the population mean µ, but the sample variance s² = (n − 1)^{-1} Σ_{i=1}^n (x_i − x̄)² is not a plug-in estimate of the population variance σ². Bootstrap resampling methods can be classified as parametric or nonparametric. The definition below refers to the nonparametric bootstrap. Nonparametric bootstrap methods are based on the empirical distribution F̂, while parametric bootstrap methods are based on F̂_θ̂, an estimate of a parametric model F_θ.

Nonparametric bootstrap

Definition 8 (Bootstrap sample). A bootstrap sample x^# = (x^#_1, ..., x^#_n) is a random sample of size n where each x^#_i is obtained with probability 1/n by drawing with replacement from the original sample x = (x_1, ..., x_n). For example, say we have a random sample x of size n = 7 drawn from a distribution F, and x^# is one of the possible bootstrap samples; then

x = (x_1, x_2, x_3, x_4, x_5, x_6, x_7) is the actual data set,
x^# = (x_3, x_7, x_1, x_4, x_3, x_1, x_6) is a bootstrap sample.

And thus x^#_1 = x_3, x^#_2 = x_7, x^#_3 = x_1, x^#_4 = x_4, x^#_5 = x_3, x^#_6 = x_1, x^#_7 = x_6. A statistic θ̂^# obtained from a bootstrap sample is called a bootstrap replication of

33 23 ˆθ. For instance, if ˆθ = x = n i=1 x i/n is a sample mean, then ˆθ # = x # = n i=1 x# i /n is a bootstrap replication of a sample mean Parametric bootstrap Again, let x = (x 1, x 2,..., x n ) be n independent realizations of random variable X F. If θ is a parameter or a vector of parameters of distribution F, then ˆFˆθ is a parametric estimate of the probability distribution F. A bootstrap sample x # = (x # 1, x # 2,..., x # n ) of size n is drawn from a parametric estimate ˆFˆθ of the true unknown distribution F θ. 3.3 Bootstrap estimate of standard error As defined previously, θ is a parameter of interest from a population described with unknown distribution F. Draw a random sample x from F and calculate an estimate of θ. x = (x 1, x 2,..., x n ) ˆθ. We want to assess the accuracy of ˆθ. Here we outline the algorithm that uses bootstrap resampling method to estimate the standard error of the ˆθ. Step 1 Draw B independent bootstrap samples of same size n. Bootstrap technique can be parametric or nonparametric. When parametric bootstrap is used, replace

empirical distribution F̂ with the parametric estimate of the population, F̂_θ̂.

F̂ → x^{#1} = (x^{#1}_1, x^{#1}_2, ..., x^{#1}_n)
F̂ → x^{#2} = (x^{#2}_1, x^{#2}_2, ..., x^{#2}_n)
...
F̂ → x^{#B} = (x^{#B}_1, x^{#B}_2, ..., x^{#B}_n)

Step 2 Obtain a bootstrap replication of θ̂ from each bootstrap sample.

x^{#1} = (x^{#1}_1, ..., x^{#1}_n) → θ̂^{#1} = t(x^{#1})
x^{#2} = (x^{#2}_1, ..., x^{#2}_n) → θ̂^{#2} = t(x^{#2})
...
x^{#B} = (x^{#B}_1, ..., x^{#B}_n) → θ̂^{#B} = t(x^{#B})

Step 3 Calculate the sample standard deviation of the B bootstrap replications. This sample standard deviation is the bootstrap estimate of the standard error of θ̂:

ŝe_B = [ Σ_{b=1}^B (θ̂^{#b} − θ̂^{#}(·))² / (B − 1) ]^{1/2},

where θ̂^{#}(·) = Σ_{b=1}^B θ̂^{#b}/B.

Number of bootstrap replications needed

As stated in [3],
1. Even a small number of bootstrap replications (B = 25) is usually informative.

B = 50 is often enough to give a good estimate of se_F(θ̂).
2. Very seldom are more than B = 200 replications needed for estimating a standard error. Much bigger values of B are required for bootstrap confidence intervals.

3.4 Bootstrap estimate of bias

Let θ̂ be an estimator of the parameter θ. The bias of θ̂, denoted by bias(θ̂), is defined as the difference between the expected value of θ̂ and the parameter θ being estimated,

bias_F(θ̂) = E_F(θ̂) − θ.

The estimator θ̂ is called an unbiased estimator of θ if E_F(θ̂) = θ. Unbiasedness is a desirable property of an estimator. Using bootstrap samples, we'll obtain the bootstrap estimate of bias, defined as follows:

bias_F̂ = E_F̂[t(x^#)] − τ(F̂).

Having utilized the algorithm for approximating the standard error of θ̂, generate B bootstrap samples x^{#1}, ..., x^{#B} and compute the bootstrap replications θ̂^{#b} = t(x^{#b}), b = 1, ..., B. Then,

bias_B = θ̂^{#}(·) − τ(F̂),

where θ̂^{#}(·) = Σ_{b=1}^B θ̂^{#b}/B. Also observe that the formula for the bootstrap estimate of bias uses the plug-in estimate τ(F̂).

3.5 Bootstrap confidence intervals

The bootstrap-t interval

Let θ be a parameter of interest and θ̂ = τ(F̂) be a plug-in estimate of θ. In addition to the point estimate θ̂, we may also be interested in constructing an interval to estimate θ with a desired degree of confidence. If α is a real number between 0 and 1, typically taking small values such as 0.01, 0.05, or 0.10, a (1 − α)·100% confidence interval can be derived as follows:

[θ̂ − q^{(1−α/2)}·ŝe, θ̂ − q^{(α/2)}·ŝe],

where ŝe can be either a bootstrap estimate or any other reasonable estimate of the standard error of θ̂, and q^{(α/2)} and q^{(1−α/2)} are the 100·(α/2) and 100·(1−α/2) percentiles, respectively, of the distribution of the random variable Z = (θ̂ − θ)/ŝe. Note that the random variable Z used here does not necessarily have a standard normal distribution. Whenever normality holds (at least in an asymptotic sense), the q^{(α/2)} and q^{(1−α/2)} values can be replaced by the standard scores from the standard normal table. For instance, q^{(0.025)} = z_{0.025} = −1.96 and q^{(0.975)} = z_{0.975} = 1.96, and thus the 95% confidence interval for θ will be constructed as [θ̂ − 1.96·ŝe, θ̂ + 1.96·ŝe].

When Z cannot be assumed to follow a standard normal or a t-distribution, the bootstrap can be used to obtain an accurate interval. Here is the procedure:

Step 1 Generate B bootstrap samples x^{#1}, x^{#2}, ..., x^{#B}. Typically, B = 1000 is required for quantile estimation.

Step 2 For each bootstrap sample b, compute θ̂^{#b} = t(x^{#b}) and the estimated standard error of θ̂^{#b}, denoted by ŝe^{#b}, and form

Z^{#b} = (θ̂^{#b} − θ̂) / ŝe^{#b}.

Note that when θ̂ is not a sample mean but a more complicated statistic, bootstrap resampling may be used to estimate ŝe^{#b} for each bootstrap sample b. This results in nested bootstrap sampling.

Step 3 Let Q denote the cumulative distribution of the Z^{#b} values. Then the α/2 quantile of Z^{#b} is estimated by the value t̂_{α/2} such that

t̂_{α/2} = inf{z : Q(z) ≥ α/2}.

Step 4 Construct the bootstrap-t (1 − α)·100% confidence interval:

(θ̂ − t̂_{1−α/2}·ŝe, θ̂ − t̂_{α/2}·ŝe).
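The SAS sketch below illustrates the bootstrap-t procedure for the simple case where θ is a population mean, so that each ŝe^{#b} has the closed form s^{#b}/√n and no nested bootstrap is needed. The data set WORK.SAMPLE, the analysis variable X, the seed, and B = 1000 are illustrative assumptions; this is not the thesis code of Appendix A.

/* Original-sample estimate and its standard error s/sqrt(n) */
proc means data=sample noprint;
   var x;
   output out=orig mean=theta_hat stderr=se_hat;
run;

/* Step 1: draw B = 1000 bootstrap samples (with replacement) */
proc surveyselect data=sample out=boot seed=20060501
   method=urs samprate=1 outhits reps=1000;
run;

/* Step 2: bootstrap replications and their standard errors */
proc means data=boot noprint;
   by replicate;
   var x;
   output out=bootstats mean=theta_b stderr=se_b;
run;

/* Studentized statistics Z = (theta_b - theta_hat)/se_b */
data z;
   if _n_ = 1 then set orig(keep=theta_hat);
   set bootstats;
   z = (theta_b - theta_hat) / se_b;
run;

/* Step 3: bootstrap quantiles of Z */
proc univariate data=z noprint;
   var z;
   output out=quant pctlpts=2.5 97.5 pctlpre=t_;
run;

/* Step 4: bootstrap-t 95% confidence limits */
data ci;
   merge orig quant;
   lower = theta_hat - t_97_5 * se_hat;
   upper = theta_hat - t_2_5  * se_hat;
run;

The 2.5% and 97.5% bootstrap quantiles of Z play the roles of t̂_{α/2} and t̂_{1−α/2} in Step 4.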

Two main disadvantages of the bootstrap-t algorithm are (1) costly computation as the result of two nested levels of bootstrap samples, and (2) erratic results in the case of a small-sample, nonparametric setting [3].

3.6 Bootstrap-based hypothesis testing

The bootstrap-based hypothesis test proposed by Efron is similar to the permutation test introduced by R.A. Fisher in the 1930s [3]. The procedure is designed to test the null hypothesis that the two probability distributions from which the samples are drawn are identical. Let X ∼ F and Y ∼ G, and observe independent random samples, of sizes n and m respectively, x = {x_1, x_2, ..., x_n} and y = {y_1, y_2, ..., y_m}. The null hypothesis is

H_0: F = G.

Here is a bootstrap algorithm:

Step 1 Calculate a test statistic (in this case, the difference of means): T(x, y) = x̄ − ȳ, where x̄ = Σ_{i=1}^n x_i/n and ȳ = Σ_{i=1}^m y_i/m.

Step 2 Form a new sample w of size n + m by combining the samples x and y:

w = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_m).

Step 3 Generate B bootstrap samples from w. In each bootstrap sample, let the first n observations form the bootstrap sample x^# and the remaining m observations form the bootstrap sample y^#:

w^{#b} = (w^{#b}_1, ..., w^{#b}_n, w^{#b}_{n+1}, ..., w^{#b}_{n+m}),   b = 1, ..., B,

where the first n components constitute x^# and the last m components constitute y^#.

Step 4 For each bootstrap sample b, evaluate a bootstrap replication of the test statistic (in this case, a difference in means): T(w^{#b}) = x̄^{#b} − ȳ^{#b}, where x̄^{#b} = (1/n) Σ_{i=1}^n w^{#b}_i and ȳ^{#b} = (1/m) Σ_{i=n+1}^{n+m} w^{#b}_i.

Step 5 Approximate the bootstrap P-value of the test by

P̂_boot = (1/B) Σ_{b=1}^B I(T(w^{#b}) ≥ T(x, y)),

where I(·) is the indicator function.

Step 6 Decision Rule: Reject H_0 if P̂_boot ≤ α, where α is some prespecified level of significance.
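A SAS sketch of this algorithm follows. Because the n + m draws in each w^{#b} are independent, splitting them into the first n and the last m components is equivalent to drawing two independent bootstrap samples of sizes n and m from the pooled data, which is what the sketch does. The data set TWOGRP with variables GROUP ('x'/'y') and VALUE, the seeds, and B = 1000 are illustrative assumptions, not the thesis code.

/* Observed test statistic T(x, y) = xbar - ybar */
proc means data=twogrp noprint nway;
   class group;
   var value;
   output out=gm mean=mean_;
run;

data obs;
   merge gm(where=(group='x') rename=(mean_=xbar))
         gm(where=(group='y') rename=(mean_=ybar));
   t_obs = xbar - ybar;
   keep t_obs;
run;

/* Group sizes n and m */
proc sql noprint;
   select count(*) into :n from twogrp where group='x';
   select count(*) into :m from twogrp where group='y';
quit;

/* B bootstrap samples of sizes n and m drawn from the pooled sample w */
proc surveyselect data=twogrp out=bootx method=urs n=&n outhits reps=1000 seed=111;
run;
proc surveyselect data=twogrp out=booty method=urs n=&m outhits reps=1000 seed=222;
run;

proc means data=bootx noprint; by replicate; var value; output out=mx mean=xbar_b; run;
proc means data=booty noprint; by replicate; var value; output out=my mean=ybar_b; run;

/* Bootstrap replications of the test statistic and the bootstrap P-value */
data tstar;
   merge mx my;
   by replicate;
   if _n_ = 1 then set obs;
   exceed = ((xbar_b - ybar_b) >= t_obs);
run;

proc means data=tstar mean;   /* the mean of EXCEED is the bootstrap P-value */
   var exceed;
run;

Drawing the two resamples with separate SURVEYSELECT calls avoids having to split each pooled resample by position within a replicate.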

Statement          Interpretation
P-value < .10      borderline evidence against H_0
P-value < .05      reasonably strong evidence against H_0
P-value < .025     strong evidence against H_0
P-value < .01      very strong evidence against H_0

Table 3.1: Evidence against H_0.

Note that a studentized statistic could also be used:

T(x, y) = (x̄ − ȳ) / (σ̄ √(1/n + 1/m)),

where σ̄ = √{ [Σ_{i=1}^n (x_i − x̄)² + Σ_{i=1}^m (y_i − ȳ)²] / (n + m − 2) }. Generally, one can adopt the convention given in Table 3.1 to interpret the P-value of the test.

3.7 Robustness and failure of bootstrap

Definition 9. A procedure is said to be robust if it is not heavily affected by violations of the assumptions made about the model. In other words, robustness signifies insensitivity to small deviations from the assumptions [5].

Nonparametric bootstrap methods, and the jackknife methods which will be described further on, are considered robust since, to perform well, they do not require theoretical assumptions about the model. Robust methods offer a remedy when theoretical assumptions about the model are violated; at the same time, they do not claim to be efficient when all assumptions hold for the model. For instance, the equality of error variances is one of the classical assumptions of a linear model Y_i = β_0 + β_1 X_1i + ... + β_k X_ki + ε_i; that is, the variance Var(ε_i) = σ² is constant for all i = 1, ..., n. In the event the model assumptions are correct, the bootstrap method will not be the most efficient. However, we can estimate the regression model coefficients by bootstrapping if the equality-of-variances assumption fails. While bootstrap methods are known to be robust and efficient, we should still give consideration to the cases when the bootstrap approach does fail. The major cases of bootstrap failure include: small sample size, distributions with infinite moments, and estimation of extreme values [7].

3.8 Jackknife as an approximation to Bootstrap

Jackknife samples and estimates

The jackknife is another popular resampling method used for estimating the bias and standard error of an estimate. Jackknife samples are obtained by removing one observation at a time.

This technique was introduced by Quenouille in 1949 and precedes the bootstrap, introduced by Efron in 1979.

Definition 10. Let x = (x_1, x_2, ..., x_n) be a random sample of size n. The i-th jackknife sample is the sample with the i-th observation left out of the original sample:

x_(i) = (x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_n), where i = 1, 2, ..., n.

Let θ be a parameter of interest and θ̂ its estimator. If θ̂_(i) = t(x_(i)) is a jackknife replication of θ̂, the jackknife estimate of bias is defined as

bias_jack = (n − 1)(θ̂_(·) − θ̂), where θ̂_(·) = (1/n) Σ_{i=1}^n θ̂_(i).

The jackknife estimate of standard error is defined by

ŝe_jack = [ ((n − 1)/n) Σ_{i=1}^n (θ̂_(i) − θ̂_(·))² ]^{1/2}.

Jackknife or Bootstrap?

Having introduced the two resampling techniques, we would like to outline further the advantages and the appropriateness of using one technique over the other. Let us introduce a few more definitions.

Definition 11. A statistic θ̂ is said to be a linear statistic if it can be written in the form

θ̂ = t(x) = µ + (1/n) Σ_{i=1}^n α(x_i),

where µ is a constant and α(·) is a function of the data.

Definition 12. A statistic θ̂ is said to be a quadratic statistic if it can be written in the form

θ̂ = t(x) = µ + (1/n) Σ_{1≤i≤n} α(x_i) + (1/n²) Σ_{1≤i≤j≤n} β(x_i, x_j),

where µ is a constant and α and β are functions of the data.

Examples of linear and nonlinear statistics are the mean and the correlation coefficient, respectively. If θ̂ is a linear statistic, the jackknife and bootstrap estimates of standard error agree, except for a factor of {(n − 1)/n}^{1/2} used by the jackknife. If θ̂ is a nonlinear statistic, the jackknife makes a linear approximation to the bootstrap. This means that if there is a certain linear statistic that approximates θ̂, then the jackknife will agree with the bootstrap (again, up to the factor noted above). Generally speaking, the accuracy of the jackknife estimate of the standard error of θ̂ depends on the degree of linearity of θ̂. If θ̂ is highly nonlinear, the jackknife can be very inefficient [3]. If θ̂ is a quadratic statistic, the jackknife and bootstrap estimates of bias essentially agree.

In the cases where the jackknife provides a good approximation to the bootstrap, there is an advantage to using the jackknife since it is easier to compute. However, if the statistic under consideration is not a smooth (differentiable) function of x, the jackknife estimate of standard error is inconsistent; that is, the estimator does not converge to the true standard error of θ̂.
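As a small illustration of the jackknife formulas of Section 3.8, the SAS/IML sketch below computes the jackknife standard-error estimate for the sample mean. The data set MYDATA and the variable X are hypothetical; SETDIF and SSQ are standard IML functions.

proc iml;
/* read one analysis variable into the column vector x (names are illustrative) */
use mydata;  read all var {x} into x;  close mydata;
n = nrow(x);

thetaJack = j(n, 1, .);               /* jackknife replications theta_(i) */
do i = 1 to n;
   idx = setdif(1:n, i);              /* leave observation i out */
   thetaJack[i] = sum(x[idx]) / (n - 1);
end;

thetaDot = sum(thetaJack) / n;        /* average of the leave-one-out estimates */
seJack = sqrt( (n - 1)/n * ssq(thetaJack - thetaDot) );
print seJack[label="jackknife estimate of the standard error"];
quit;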

Chapter 4 Proposed bootstrap multiple testing procedures

Probability is expectation founded upon partial knowledge. A perfect acquaintance with all the circumstances affecting the occurrence of an event would change expectation into certainty, and leave neither room nor demand for a theory of probabilities. George Boole

4.1 Microarrays as a Motivating Factor

Background

The objective of this subsection is to provide the reader with the most basic terminology used throughout this section. In no way is this intended as a crash course

46 36 in genetics. Much of what will be said in the upcoming example has to do with the expression of genes. Therefore, some background will certainly be helpful. A gene is a segment or region of DNA that encodes instructions, which allow a cell to produce a specific product. This product is typically a protein, such as an enzyme. Proteins are used to support the cell structure, break down chemicals, build new chemicals, transport items, and regulate production. Every human being has about 40, 000 putative genes that produce proteins. Many of these genes are always identical from one person to another, but others show variation in different people. The genes determine hair color, eye color, sex, personality, and many other traits that in combination make everyone a unique entity [10]. Every cell of an individual organism will contain the same DNA, carrying the same information. However, a liver cell will be obviously different from a muscle cell for example. The differentiation occurs because not all the genes are expressed in the same way in all cells. The differentiation between cells is given by different patterns of gene activations which in turn control the production of proteins. [10] A gene is active, or expressed, if the cell produces the protein encoded by the gene. If a lot of protein is produced, the gene is said to be highly expressed. If no protein is produced, the gene is not expressed or unexpressed. The objective of researchers is to detect and quantify gene expression levels under particular circumstances. One can

compare various tissues with each other, or a tumor tissue with healthy tissue. Gene expression can be used to understand phenomena related to aging or fetal development. While there have been methods available to look at the expression levels of genes, the problem with those methods was that only a few genes could be analysed at a time [10]. Microarrays, on the other hand, are a powerful technology that allows the simultaneous measurement of expression levels for up to tens of thousands of genes. The fact that microarrays can interrogate thousands of genes at the same time has led to the wide adoption of this technology, but it also creates a number of challenges associated with its use. The classical techniques (such as the chi-square test) that were designed to test whether there is a significant difference between the groups considered cannot be applied directly, because in microarray experiments the number of variables (usually thousands of genes) is much greater than the number of experiments (say, tens of experiments).

Microarray Experiment

Let us consider an experiment comparing gene expression levels in two different conditions, such as healthy tissue vs. tumor. Suppose in our experiment we would like to compare 20 genes using 5 tumor samples and 5 healthy tissue samples. The data have been pre-processed and normalized and are presented in Table 4.1. The last step in the normalization is division by the global maximum; thus, all values are between zero and one. The maximum value was an internal control, so the value one

does not actually appear in the data. The task here could be to find those genes that are differentially regulated between cancer patients and healthy subjects.

[Table 4.1 lists the normalized expression values of genes g1-g20 for the five tumor samples (T1-T5) and the five control samples (C1-C5); the numeric entries are not reproduced here.]

Table 4.1: Expression data from two groups of subjects: cancer patients and healthy controls. The data are already normalized [10].
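For data laid out as in Table 4.1, a resampling-based multiple testing analysis can be sketched in SAS as follows. The sketch assumes the table has been transposed so that each row is a subject, with a classification variable GROUP (tumor/control) and gene columns g1-g20; the data set name EXPR, the seed, and the number of resamples are illustrative, and this built-in procedure is only a rough stand-in for the procedures developed later in this chapter.

/* Gene-wise two-sample mean comparisons with bootstrap-resampled adjusted p-values */
proc multtest data=expr bootstrap nsample=5000 seed=2006 out=adjp;
   class group;              /* tumor vs. control */
   test mean(g1-g20);        /* one test per gene */
run;

proc print data=adjp;
run;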

4.2 Multiple Hypothesis Testing Model

Let P be a data generating distribution and M be a statistical model (parametric or non-parametric). Let X be a random g-vector such that X ∼ P ∈ M and X = (X(j) : j = 1, ..., g). Thus, X_1, ..., X_n are n iid random variables, each of which is a g-vector. In the light of the DNA microarray data presented in Table 4.1, for a patient i (i = 1, ..., n), let x_i = (x_i(1), x_i(2), ..., x_i(g)) be a realization of the random variable X_i. Then the data frame might be presented as the one given in Table 4.2. With a data set such as the one given in Table 4.1 (or the more general setting in Table 4.2), a researcher might be interested in comparing the mean expression levels of genes from tumor tissue and healthy (control) tissue. This example helps outline the challenges that arise in problems of statistical inference in genomic data analysis: (i) high-dimensional multivariate distributions, with typically unknown and intricate correlation patterns among variables; (ii) large parameter spaces; (iii) a number of variables (hypotheses) that is much larger than the sample size; and (iv) some nonnegligible proportion of false null hypotheses, i.e., true positives [1].

            Tumor (treatment) group                Control group
            1          2        ...   k            k+1         ...   n
X(1)     x_1(1)     x_2(1)    ...  x_k(1)       x_{k+1}(1)   ...  x_n(1)
X(2)     x_1(2)     x_2(2)    ...  x_k(2)       x_{k+1}(2)   ...  x_n(2)
X(3)     x_1(3)     x_2(3)    ...  x_k(3)       x_{k+1}(3)   ...  x_n(3)
...      ...        ...       ...  ...          ...          ...  ...
X(g)     x_1(g)     x_2(g)    ...  x_k(g)       x_{k+1}(g)   ...  x_n(g)

Table 4.2: n realizations of a random g-vector.

Null hypotheses

General definition

Let m be the number of null hypotheses and let {M_j}_{j=1}^m be a collection of submodels, that is, M_j ⊆ M for each j = 1, ..., m. Define m null hypotheses and corresponding alternative hypotheses as follows:

H_0j = I(P ∈ M_j),   H_1j = I(P ∉ M_j).

Here I is the indicator function.

Special case

Typically, corresponding null and alternative hypotheses are defined in terms of single parameters, which are functions of the data generating distribution. Consider an m-vector of parameters µ = (µ(j) : j = 1, ..., m), where each µ(j) = µ_j(P) ∈ R is a function of the unknown data generating distribution. Let µ_0(j) be the hypothesized null-values. There are two types of testing problems:

One-sided tests: H_0j = I(µ(j) ≤ µ_0(j)) versus H_1j = I(µ(j) > µ_0(j)), j = 1, ..., m.

Two-sided tests: H_0j = I(µ(j) = µ_0(j)) versus H_1j = I(µ(j) ≠ µ_0(j)), j = 1, ..., m.

Parameters of interest

Parameters of interest can be classified as follows:
Location parameters (means, differences in means, medians)
Scale parameters (standard deviations, covariances and correlations)
Regression parameters (slopes, main effects, interactions, parameters of the Cox proportional hazards model)
Parameters that refer to time-series models, dose-response models, etc.

Hypothesis testing

Notations and notions

Here we introduce some important conventions, notions and notations used in connection with hypothesis testing.
Fact 1 In hypothesis testing, each hypothesis is either true or false depending on the true (but unknown) data generating distribution P.
Fact 2 A testing procedure results in either rejecting a null hypothesis or failing to do so.

Type I error (false positive): rejecting a true null hypothesis; V_n is the number of Type I errors, V_n = |S_n ∩ S_0|.
Type II error (false negative): failing to reject a false null hypothesis; U_n is the number of Type II errors, U_n = |S_n^c ∩ S_0^c|.

Table 4.3: Type I and Type II errors

Recall that m represents the number of null hypotheses being tested. Now we introduce the following notation:

True null hypotheses. Let S_0 denote the set of true null hypotheses. Then S_0 = S_0(P) = {j : H_0j is true} and m_0 = |S_0| is the number of true null hypotheses.

False null hypotheses. Let S_0^c denote the set of false null hypotheses. Then S_0^c = S_0^c(P) = {j : H_0j is false} and m_1 = |S_0^c| is the number of false null hypotheses. Note that m_0 + m_1 = m.

Rejected null hypotheses. Let S_n denote the set of rejected null hypotheses. Then R_n = |S_n| is the number of rejected hypotheses.

Multiple testing procedure and types of errors that can be committed

The outcome of a multiple testing procedure is the set of rejected null hypotheses, S_n. Since S_n only estimates S_0^c, two types of errors can be committed, as outlined in Table 4.3: rejecting a true null hypothesis and failing to reject a false null hypothesis.

Type I error rates

The set of rejected hypotheses, S_n = S(T_n, Q_0, α), is a function of
1. the test statistics T_n (where T_n = (T_n(j) : j = 1, ..., m) are functions of the data X_1, ..., X_n),
2. the test statistics null distribution Q_0 (which is used to derive cut-offs), and
3. the desired upper bound for the Type I error rate (nominal level α).

Definition 13. Let F_{V_n} be the discrete cumulative distribution function on {0, 1, ..., m} of the number of Type I errors, V_n. A Type I error rate is defined as a parameter θ of the distribution of Type I errors, θ(F_{V_n}).

Let us formally introduce the Type I error rates commonly used in multiple hypothesis testing:

Definition 14. The per-comparison error rate is the expected proportion of Type I errors among the m tests: PCER ≡ E(V_n)/m = ∫ v dF_{V_n}(v)/m.

Definition 15. The per-family error rate is the expected number of Type I errors: PFER ≡ E(V_n) = ∫ v dF_{V_n}(v).

Definition 16. The median-based per-family error rate is the median number of Type I errors: mPFER ≡ Median(F_{V_n}) = F_{V_n}^{-1}(1/2).

Definition 17. The family-wise error rate is the probability of at least one Type I error: FWER ≡ Pr(V_n ≥ 1) = 1 − F_{V_n}(0).

Definition 18. The generalized family-wise error rate is the probability of at least (k + 1) Type I errors, k = 0, ..., m_0 − 1: gFWER ≡ Pr(V_n ≥ k + 1) = 1 − F_{V_n}(k).

Assumptions for the parameter θ

Given a parameter θ such that θ : F ↦ θ(F), we make the following assumptions:

Monotonicity. Given two c.d.f.'s F_1 and F_2 on {0, ..., m},
F_1 ≥ F_2 ⟹ θ(F_1) ≤ θ(F_2).

Uniform continuity. Given two c.d.f.'s F_1 and F_2 on {0, ..., m}, define the distance measure d by d(F_1, F_2) = max_{x ∈ {0,...,m}} |F_1(x) − F_2(x)|. For two sequences of c.d.f.'s, {F_n} and {G_n}, if d(F_n, G_n) → 0 as n → ∞, then |θ(F_n) − θ(G_n)| → 0.

Type I error rate control

Definition 19. We say that a multiple testing procedure S_n = S(T_n, Q_0, α) provides finite sample control of the Type I error rate θ(F_{V_n}) at level α ∈ (0, 1) if θ(F_{V_n}) ≤ α.

Definition 20. A multiple testing procedure S_n = S(T_n, Q_0, α) provides asymptotic control of the Type I error rate θ(F_{V_n}) at level α ∈ (0, 1) if lim sup_{n→∞} θ(F_{V_n}) ≤ α.

Approach to Type I error rate control

In a multiple testing procedure S_n = S(T_n, Q_0, α), we use the assumed null distribution Q_0 of the test statistics to derive the cut-offs for the rejection regions. Note that the unknown true distribution, denoted by Q_n = Q_n(P), of the test statistics T_n determines the number of false positives V_n. The choice of Q_0 is thus crucial, since we want to make sure the multiple testing procedure provides the required control of the Type I error rate under Q_n. For a distribution Q of the test statistics, let

R(S(T_n, Q_0, α) | Q) ≡ |S(T_n, Q_0, α)|, the number of rejected hypotheses, and
V(S(T_n, Q_0, α) | Q) ≡ |S(T_n, Q_0, α) ∩ S_0(P)|, the number of Type I errors,

when T_n ∼ Q.

Then

R_n ≡ R(S(T_n, Q_0, α) | Q_n),   R_0 ≡ R(S(T_n, Q_0, α) | Q_0),
V_n ≡ V(S(T_n, Q_0, α) | Q_n),   V_0 ≡ V(S(T_n, Q_0, α) | Q_0).

Control of a Type I error rate θ(F_{V_n}) is achieved by a three-step approach:

1. Null domination conditions for the Type I error rate. A null distribution should be selected so that
θ(F_{V_n}) ≤ θ(F_{V_0})   [finite sample control], or
lim sup_{n→∞} θ(F_{V_n}) ≤ θ(F_{V_0})   [asymptotic control].

2. Note that V_0 ≤ R_0 and hence F_{V_0} ≥ F_{R_0}. Thus, by the Monotonicity Assumption, θ(F_{V_0}) ≤ θ(F_{R_0}).

3. Control the parameter θ(F_{R_0}), corresponding to the observed number of rejected hypotheses R_0, under the null distribution Q_0; i.e., assuming T_n ∼ Q_0, require θ(F_{R_0}) ≤ α.

Steps 1, 2, and 3 lead to control of the Type I error rate as follows:

θ(F_{V_n}) ≤ θ(F_{V_0}) ≤ θ(F_{R_0}) ≤ α   [finite sample control],
lim sup_{n→∞} θ(F_{V_n}) ≤ θ(F_{V_0}) ≤ θ(F_{R_0}) ≤ α   [asymptotic control].

4.3 Multiple Testing Procedures

Single-step common-quantile procedure

Procedure 1. Single-step common-quantile procedure for control of general Type I error rates θ(F_{V_n}).

Given an m-variate null distribution Q_0 and δ ∈ [0, 1], define an m-vector, d(Q_0, δ) = (d_j(Q_0, δ) : j = 1, ..., m), of δ-quantiles,

d_j(Q_0, δ) ≡ Q_{0j}^{-1}(δ) = inf{z : Q_{0j}(z) ≥ δ},   j = 1, ..., m,

where the Q_{0j} denote the marginal cumulative distribution functions corresponding to Q_0. For a test of level α ∈ (0, 1), choose δ as

δ_0(α) ≡ inf{δ : θ(F_{R(d(Q_0,δ) | Q_0)}) ≤ α},

where R(d(Q_0, δ) | Q_0) denotes the number of rejected hypotheses for the common-quantile cut-offs d(Q_0, δ), under the null distribution Q_0 for the test statistics T_n. The single-step common-quantile multiple testing procedure for controlling the Type I error rate θ(F_{V_n}) at level α is defined in terms of the common-quantile cut-offs,

59 49 c(q 0, α) d(q 0, δ 0 (α)), by the following rule. Reject H 0j if T n (j) > d j (Q 0, δ 0 (α)), j = 1,..., m, that is, S(T n, Q 0, α) {j : T n (j) > d j (Q 0, δ 0 (α))}. Here, F Vn denotes the c.d.f. for the number of Type I errors, V n V (d(q 0, δ 0 (α)) Q n ), under the true distribution Q n = Q n (P ) for the test statistics T n. Theorem 1. [Asymptotic control of Type I error rate for single-step common-quantile Procedure 1] Assume that there exists a random m-vector Z Q 0 = Q 0 (P ), so that, for all c = (c j : j = 1,..., m) R m and x {0,..., m}, the joint distribution Q n = Q n (P ) of the test statistics T n satisfies the following asymptotic null domination property with respect to Q 0 lim inf n P r Q n ( j S 0 I(T n > c j ) x ) P r Q0 ( j S 0 I(Z(j) > c j ) x ) AQ0 In other words, the number of Type I errors, V n, under the true distribution Q n = Q n (P ) for the test statistics T n, is stochastically smaller in the limit than the corresponding number of Type I errors, V 0, under the null distribution Q 0 : lim inf n F Vn (x) F V0 (x) x. In addition, suppose that the mapping θ( ) defining the Type I error rate is such that monotonicity and uniform continuity assumptions hold. Then, single-step Procedure 1, with common-quantile cut-offs c(q 0, α) = d(q 0, δ 0 (α)), provides asymptotic control of the Type I error rate θ(f Vn ) at level α.

That is,

lim sup_{n→∞} θ(F_{V_n}) ≤ α,

where V_n denotes the number of Type I errors for T_n ∼ Q_n(P) (so V_n ≡ V(c(Q_0, α) | Q_n) = Σ_{j∈S_0} I(T_n(j) > c_j(Q_0, α))).

Single-step common-cut-off procedure

Procedure 2. Single-step common-cut-off procedure for control of general Type I error rates θ(F_{V_n}).

Given an m-variate null distribution Q_0 and a test of level α ∈ (0, 1), define a common cut-off e(Q_0, α) such that

e(Q_0, α) ≡ inf{c : θ(F_{R((c,...,c) | Q_0)}) ≤ α},

where we recall that R((c, ..., c) | Q_0) denotes the number of rejected hypotheses for the common cut-off c, under the null distribution Q_0 for the test statistics T_n. The single-step common-cut-off multiple testing procedure for controlling the Type I error rate θ(F_{V_n}) at level α is defined in terms of the common cut-offs c(Q_0, α) = (e(Q_0, α), ..., e(Q_0, α)), by the following rule: Reject H_0j if T_n(j) > e(Q_0, α), j = 1, ..., m.

That is,

S(T_n, Q_0, α) ≡ {j : T_n(j) > e(Q_0, α)}.

Here, F_{V_n} denotes the c.d.f. of the number of Type I errors, V_n ≡ V((e(Q_0, α), ..., e(Q_0, α)) | Q_n), under the true distribution Q_n = Q_n(P) for the test statistics T_n.

Proposed test statistics null distribution

Theorem 2. [General construction for the null distribution Q_0] Suppose there exist known m-vectors λ_0 ∈ R^m and τ_0 ∈ R_+^m of null-values, so that

lim sup_{n→∞} E[T_n(j)] ≤ λ_0(j) and lim sup_{n→∞} Var[T_n(j)] ≤ τ_0(j), for j ∈ S_0.

Let

ν_{0n}(j) ≡ min( 1, τ_0(j) / Var[T_n(j)] )

and define an m-vector Z_n by

Z_n(j) ≡ ν_{0n}(j) (T_n(j) + λ_0(j) − E[T_n(j)]),   j = 1, ..., m.

Suppose that Z_n converges in law to Z ~ Q_0(P). Then, for this choice of null distribution Q_0 = Q_0(P), and for all c = (c_j : j = 1, ..., m) ∈ R^m and x ∈ {0, ..., m},

lim inf_n Pr_{Q_n}( Σ_{j ∈ S_0} I(T_n(j) > c_j) ≤ x ) ≥ Pr_{Q_0}( Σ_{j ∈ S_0} I(Z(j) > c_j) ≤ x ),

so that the asymptotic null domination Assumption AQ0 in Theorem 1 holds.

4.4 Bootstrap-based single step procedures

Bootstrap estimation of the null distribution

The null distribution Q_0 can be estimated by the distribution of the null-value shifted and scaled bootstrap statistics

Z_n^#(j) ≡ min( 1, τ_0(j) / Var_{P_n}[T_n^#(j)] ) ( T_n^#(j) + λ_0(j) - E_{P_n}[T_n^#(j)] ),

where P_n is an estimator of the true data generating distribution P.

Procedure 3. Bootstrap estimation of the null distribution Q_0.

1. Generate B bootstrap samples, (X_1^b, ..., X_n^b), b = 1, ..., B. For the bth sample, the X_i^b, i = 1, ..., n, are n i.i.d. realizations of a random variable X^# ~ P_n.

2. For each bootstrap sample, compute an m-vector of test statistics, T_n^b = (T_n^b(j) : j = 1, ..., m). These can be arranged in an m × B matrix, T = (T_n^b(j)), with rows corresponding to the m hypotheses and columns to the B bootstrap samples.

3. Compute row means and variances of the matrix T to yield estimates of E[T_n(j)] and Var[T_n(j)], j = 1, ..., m.

4. Obtain an m × B matrix Z = (Z_n^b(j)) of null-value shifted and scaled bootstrap statistics Z_n^b(j), as in Theorem 2, by row-shifting and scaling the matrix T using the bootstrap estimates of E[T_n(j)] and Var[T_n(j)] and the user-supplied null-values λ_0(j) and τ_0(j).

5. The bootstrap estimate Q_0n of the null distribution Q_0 from Theorem 2 is the empirical distribution of the columns Z_n^b of the matrix Z.

Procedure 4. Bootstrap estimation of common quantiles for Procedure 1, for gFWER control.

1. Apply Procedure 3 to generate an m × B matrix Z = (Z_n^b(j)) of null-value shifted and scaled bootstrap statistics Z_n^b(j). The bootstrap estimate Q_0n of the null distribution Q_0 from Theorem 2 is the empirical distribution of the columns Z_n^b of the matrix Z.

2. For Procedure 1, the bootstrap common-quantile cut-offs are simply row quantiles of the matrix Z. That is, d_j(Q_0n, δ) is the δ-quantile of the B-vector

(Z_n^b(j) : b = 1, ..., B) of bootstrap statistics for H_0j:

d_j(Q_0n, δ) ≡ inf{ z : (1/B) Σ_{b=1}^B I(Z_n^b(j) ≤ z) ≥ δ }.

3. For a test with nominal level α ∈ (0, 1), δ is chosen as

δ_0n(α) ≡ inf{ δ : θ(F_{R(d(Q_0n,δ) | Q_0n)}) ≤ α }.

That is, δ_0n(α) corresponds to the smallest cut-offs d(Q_0n, δ) such that the value of the mapping θ(·), applied to the distribution of the number of rejections R(d(Q_0n, δ) | Q_0n) under the bootstrap distribution Q_0n, is at most α. In the case of gFWER control, and for a (limit) null distribution Q_0 with continuous and strictly monotone marginal distributions, (1 - δ_0n(α)) is the α-quantile of the bootstrap estimate of the distribution of the (k + 1)st ordered unadjusted p-value. Specifically, δ_0n(α) is obtained as follows.

(a) Compute an m × B matrix, P = (P_n^b(j)), of bootstrap unadjusted p-values by row-ranking the matrix Z, i.e., by replacing each Z_n^b(j) by its rank over the B bootstrap samples, where rank 1 corresponds to the largest value of Z_n^b(j) and rank B to the smallest.

(b) For each column of the matrix P, compute the (k + 1)st smallest p-value, P_n^b(k + 1). For FWER control (k = 0), simply compute the column minima.

(c) The estimate (1 - δ_0n(α)) is the α-quantile of the B-vector (P_n^b(k + 1) : b = 1, ..., B).
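To make the bootstrap steps concrete, here is a short Python sketch of Procedures 3 and 4 for one-sample tests about means with FWER control (k = 0). It is an illustrative re-implementation under simplifying assumptions (nonparametric bootstrap, t-statistics, null-values λ_0(j) = 0 and τ_0(j) = 1, p-values taken as rank/B), with function names of my own choosing; it is not the SAS code used for the experiments in this thesis.

```python
import numpy as np

def bootstrap_null_distribution(X, mu0, B=1000, lam0=0.0, tau0=1.0, seed=0):
    """Procedure 3 (sketch): bootstrap estimate of the null distribution Q0.

    X is an n-by-m data matrix and mu0 the m-vector of null values for the
    means.  Returns an m-by-B matrix Z of null-value shifted and scaled
    bootstrap t-statistics; its columns are draws from Q0n."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    T = np.empty((m, B))
    for b in range(B):                                 # step 1: B bootstrap samples from P_n
        Xb = X[rng.integers(0, n, size=n)]             # resample the n rows with replacement
        T[:, b] = (Xb.mean(axis=0) - mu0) / (Xb.std(axis=0, ddof=1) / np.sqrt(n))
    row_mean = T.mean(axis=1, keepdims=True)           # step 3: estimates of E[T_n(j)]
    row_var = T.var(axis=1, ddof=1, keepdims=True)     # step 3: estimates of Var[T_n(j)]
    nu = np.minimum(1.0, tau0 / row_var)               # step 4: scaling nu_0n(j) from Theorem 2
    return nu * (T + lam0 - row_mean)                  # steps 4-5: shifted and scaled matrix Z

def common_quantile_cutoffs(Z, alpha=0.05):
    """Procedure 4 (sketch) for FWER control (k = 0): common-quantile cut-offs."""
    m, B = Z.shape
    ranks = (-Z).argsort(axis=1).argsort(axis=1) + 1   # step (a): rank 1 = largest Z in its row
    P = ranks / B                                      # bootstrap unadjusted p-values
    p_min = P.min(axis=0)                              # step (b): column minima (k = 0)
    delta = 1.0 - np.quantile(p_min, alpha)            # step (c): alpha-quantile gives 1 - delta_0n(alpha)
    return np.quantile(Z, delta, axis=1)               # step 2: row delta-quantiles d_j(Q0n, delta)

# Hypothetical usage: reject H_0j whenever the observed T_n(j) exceeds cutoffs[j].
# Z = bootstrap_null_distribution(X, mu0=np.zeros(X.shape[1]))
# cutoffs = common_quantile_cutoffs(Z, alpha=0.05)
```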

Advantages of the Proposed Procedures

In this section we summarize the main features and advantages of the proposed procedures.

The key feature of the proposed procedures is the test statistics null distribution (T_n ~ Q_0) that is used to derive the cut-offs and the adjusted p-values. The existing multiple testing procedures, on the other hand, use a data generating null distribution (X ~ P_0) [1].

The single-step common-quantile and common-cut-off procedures control the Type I error rate for an arbitrary data generating distribution, under the asymptotic null domination condition for the null distribution. Therefore, there is no need for the subset pivotality condition [1].

The control of the Type I error rate is carried out under the true data generating distribution P, i.e., under the joint distribution Q_n = Q_n(P) of the test statistics T_n implied by P. Therefore, the notions of weak and strong control are irrelevant to the proposed procedures [1].

Finally, the proposed procedures, based on this construction of the test statistics null distribution, provide the desired asymptotic Type I error rate control in general testing problems, whereas the currently existing procedures can only be applied to a limited set of multiple testing problems.

Chapter 5

Simulation Studies

Anyone who attempts to generate random numbers by deterministic means is, of course, living in a state of sin.
John von Neumann

5.1 Formulation of the Problem

The preceding chapter, Chapter 4, Proposed bootstrap multiple testing procedures, outlines multiple testing procedures for the simultaneous testing of parameters (such as means) of an arbitrary data generating distribution. In particular, in this thesis we focus on the single-step common-quantile multiple testing procedure for the vector of mean values µ = (µ(j) : j = 1, ..., m).

For example, a collection of right-sided tests is stated as follows:

H_0j = I(µ(j) ≤ µ_0(j)) versus H_1j = I(µ(j) > µ_0(j)), j = 1, ..., m.

The procedure is claimed to asymptotically control the Type I error rate; for a rigorous mathematical proof of this theoretical result, please refer to [1]. The data generating distribution can be arbitrary, so that no particular data model assumptions must be made in advance. Also, the subset pivotality condition is not required, as was discussed previously (see Chapters 2 and 4). The procedure is based on a consistent estimator of the null distribution of the test statistics, generated with the bootstrap algorithm.

5.2 Objectives of Experiments

Using the multiple hypothesis testing procedures (MTP) implemented on the basis of the theoretical results outlined in this thesis, we would like to perform a series of experiments. All experiments are run with known theoretical probability models, for which the classical (theoretical) approach can be used to obtain the results. The experiments are performed in both univariate and multivariate settings, where the procedures are used to test a family of hypotheses about population means. Tests about other parameters (such as the median or the parameters of a linear regression model) are not demonstrated here due to time constraints.

The objectives of the experiments are to show that

- the experimental results obtained from the implemented procedures coincide with the known theoretical results,
- the MTPs work with various distributions (discrete or continuous),
- the constructed null distribution is (asymptotically) normal,
- the MTPs provide (asymptotic) control of the Type I error rate (FWER ≤ α).

5.3 Tests about the Mean

Normal Distribution Models

Normal Univariate Case

Let X be a random variable from N(0, 1), the univariate standard normal distribution with µ = 0 and σ^2 = 1. To simulate this model in our experiment we draw n = 400 independent realizations of X. The empirical distribution of X, based on the sample of n = 400 observations, is presented as a histogram in Figure 5.1. We would like to perform the right-sided test of hypothesis about the population mean:

H_0 : µ = 0 versus H_1 : µ > 0.

Normal Distribution          Theoretical Model    Simulated Model
Number of observations       N/A                  400
Mean                         0
Standard Deviation           1

Table 5.1: Summary of postulated and simulated models.

Null Distribution            Theoretical          Bootstrap Estimated
Number of observations       N/A                  1000
Mean                         0
Standard Deviation           1
α                            0.05                 0.05
Critical value Z             1.645                1.63
Test statistic               N/A                  0.83

Table 5.2: Summary of test statistic null distribution, theoretical and bootstrap estimated.

Under the null hypothesis, T = (X̄ - µ_0)/(σ/√n) is normally distributed, T ~ N(0, 1), and we can refer to a table for a critical value at each given level of significance α; for instance, for a level of significance α = 0.05 the table gives 1.645. In order to arrive at a conclusion for the test of hypothesis in our experiment, we do not rely on table values or any prior knowledge about the model, but rather use the proposed procedure to construct a bootstrap estimate of the null distribution of the test statistic. To compare the theoretical results with the experimental results, please refer to the brief summary of parameters and statistics in Tables 5.1 and 5.2. In particular, note that the experiment-based critical value obtained from the estimated null distribution is very close to the table value: 1.63 versus 1.645.
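For readers who wish to reproduce this comparison, the following Python sketch (my own illustrative re-implementation, not the SAS program used for the thesis; the exact numbers will differ slightly from Tables 5.1 and 5.2 because of random seeding) simulates the experiment and reads the bootstrap critical value off the estimated null distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, B, alpha, mu0 = 400, 1000, 0.05, 0.0

x = rng.standard_normal(n)                                # n = 400 draws from N(0, 1)
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))   # observed test statistic

# Procedure 3 with m = 1: bootstrap the test statistic, then shift and scale
# it using the null-values lambda_0 = 0 and tau_0 = 1.
t_boot = np.empty(B)
for b in range(B):
    xb = x[rng.integers(0, n, size=n)]
    t_boot[b] = (xb.mean() - mu0) / (xb.std(ddof=1) / np.sqrt(n))
z = np.minimum(1.0, 1.0 / t_boot.var(ddof=1)) * (t_boot - t_boot.mean())

crit = np.quantile(z, 1 - alpha)        # bootstrap critical value; should land near 1.645
print(round(t_obs, 2), round(crit, 2), t_obs > crit)
```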

Figure 5.1: Empirical Normal distribution.

Figure 5.2: Estimated bootstrap distribution.

The experiment was performed with B = 1000 bootstrap samples drawn from the n = 400 observations. As a result of the experiment, we can also see that in the case of the univariate normal model the procedure controls the FWER. (In this reduced case, since the family of hypotheses consists of only one hypothesis, the FWER, the probability of at least one false positive, is just the probability of a Type I error.) According to our experiment, the estimated FWER is less than or equal to the nominal level α = 0.05, FWER ≤ α.

The bootstrap estimated null distribution is shown in Figure 5.2. One of the objectives of the experiment is to show that this constructed null distribution of the test statistic is normal:

H_0 : the bootstrap estimated distribution is normal
H_1 : the bootstrap estimated distribution is not normal.

The goodness-of-fit tests presented in Table 5.3, according to the reported p-values, all agree in the conclusion: at the 5% level of significance, we fail to reject the assumption that the bootstrap estimated null distribution is normal.

Test                  Statistic    DF    p Value
Kolmogorov-Smirnov    D =                >
Cramer-von Mises      W-Sq =             >
Anderson-Darling      A-Sq =             >
Chi-Square            Chi-Sq =           >

Table 5.3: Goodness-of-fit tests for normality of the bootstrap estimated null distribution.

Normal Multivariate Case

Let X = (X(j) : j = 1, 2, ..., 20) be a random vector, where each component X(j) is from N(0, 1), j = 1, 2, ..., 20. To simulate this model in our experiment we draw n = 400 independent realizations of X. We would like to perform 20 right-sided tests of hypotheses about the population mean vector, µ = (µ(j) : j = 1, ..., 20):

H_0j : µ(j) = 0 versus H_1j : µ(j) > 0, j = 1, ..., 20.

Under the null hypothesis, the vector of test statistics T = (T(j) : j = 1, 2, ..., 20) is multivariate normal, N(µ, Σ), where T(j) ~ N(0, 1), µ = (0, 0, ..., 0) is the zero mean vector, and Σ = I is the 20 × 20 identity covariance matrix.

If all the X(j)'s are independent (the condition implied by the covariance matrix given) and the nominal level α is set to 0.05, then we can calculate the theoretical critical value. Let q be a critical value such that the probability of at least one false positive is 0.05. Since all the X(j)'s are i.i.d., let F(x) = P(X(j) ≤ x). Then

P(at least one X(j) > q) = 1 - (F(q))^20 = 0.05.

Solving for q results in the theoretical critical value q ≈ 2.80.

The experiment was performed with B = 1000 bootstrap samples drawn from the dataset. Table 5.4 presents the 20 values of the test statistics and the corresponding experimental critical values. Comparing the theoretically obtained critical value q with the experimentally obtained values, they are quite close. Based on the decision rule and the experimental critical values, none of the 20 hypotheses is rejected. The estimated FWER is found to be less than α.

In reality, we do not always have an independent and uncorrelated structure among the test statistics, so finding the theoretical cutoff value is not always feasible. Instead, we have to rely on the bootstrap multiple testing procedures to obtain the cutoff values from the estimated test statistics null distribution.

In Table 5.4, also note that we would have to reject Hypotheses 7 and 10 if we performed 20 single hypothesis tests, each at the α = 0.05 level of significance, ignoring the multiplicity effect. We would have announced a significant difference, whereas the effect (as we know from the initial conditions of our experiment) is merely due to chance.
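As a quick numerical check of the derivation above (an illustrative calculation, not part of the thesis code), the short Python snippet below solves 1 - (F(q))^20 = 0.05 for q and contrasts it with the single-test cutoff 1.645.

```python
from scipy.stats import norm

alpha, m = 0.05, 20
q_family = norm.ppf((1 - alpha) ** (1 / m))   # solves 1 - Phi(q)^m = alpha, giving q of about 2.80
q_single = norm.ppf(1 - alpha)                # per-test cutoff, about 1.645

print(round(q_family, 3), round(q_single, 3))
# Testing each hypothesis separately at level 0.05 with the smaller cutoff is
# what flags Hypotheses 7 and 10 by chance; the family-wise cutoff does not.
```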

Table 5.4: The results of the twenty simultaneous tests about the mean vector of the multivariate normal distribution. For each of the twenty hypotheses, the test statistic falls below its critical value and the decision is Fail to Reject.

Poisson Distribution Models

Poisson Univariate Case

Let X be a random variable from a Poisson distribution, with probability mass function f_X(x) = e^(-λ) λ^x / x!, x = 0, 1, 2, .... Then E(X) = λ and Var(X) = λ. We would like to simulate a Poisson model with λ = 2.5, so we generate n = 400 observations from Poisson(λ = 2.5). The empirical distribution of X is shown in Figure 5.3. To perform the right-sided test of hypothesis about the population mean, we set up the null and alternative hypotheses as follows:

H_0 : µ = 2.5 versus H_1 : µ > 2.5.

Under the null hypothesis, T = (X̄ - µ_0)/(σ/√n) is approximately normally distributed by the Central Limit Theorem, T approximately N(0, 1), and we can again refer to a table for a critical value at each given level of significance α; for α = 0.05 the table gives 1.645. Let us construct a bootstrap estimate of the null distribution of the test statistic and compare the theoretically expected critical value with the experimental one. Please refer to the brief summary of parameters and statistics in Tables 5.5 and 5.6. In particular, the experiment-based critical value obtained from the estimated null distribution is 1.48, versus the theoretical value of 1.645.
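The same bootstrap recipe as in the normal case carries over unchanged; the minimal sketch below (again only an illustration, with the rate λ = 2.5 and the random seed as assumptions) merely swaps the data generating step.

```python
import numpy as np

rng = np.random.default_rng(2)
n, B, alpha, mu0 = 400, 1000, 0.05, 2.5

x = rng.poisson(lam=2.5, size=n)             # discrete data; the procedure itself is unchanged
t_boot = np.empty(B)
for b in range(B):
    xb = x[rng.integers(0, n, size=n)]
    t_boot[b] = (xb.mean() - mu0) / (xb.std(ddof=1) / np.sqrt(n))
z = np.minimum(1.0, 1.0 / t_boot.var(ddof=1)) * (t_boot - t_boot.mean())
print(round(np.quantile(z, 1 - alpha), 2))   # bootstrap critical value, again close to 1.645
```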

Poisson Distribution         Theoretical Model    Simulated Model
Number of observations       N/A                  400
Mean                         2.5
Standard Deviation           1.58

Table 5.5: Summary of postulated and simulated models.

Null Distribution            Theoretical          Bootstrap Estimated
Number of observations       N/A                  1000
Mean                         0
Standard Deviation           1
α                            0.05                 0.05
Critical value Z             1.645                1.48
Test statistic               N/A

Table 5.6: Summary of test statistic null distribution (from Poisson), theoretical and bootstrap estimated.

Figure 5.3: Empirical Poisson distribution (λ = 2.5).

Figure 5.4: Estimated bootstrap distribution.

Poisson Multivariate Case

Let X = (X(j) : j = 1, 2, ..., 20) be a random vector, where each component X(j) is from Poisson(λ = 2.5), j = 1, 2, ..., 20. To simulate this model in our experiment we draw n = 400 independent realizations of X. We would like to perform 20 right-sided tests of hypotheses about the population mean vector, µ = (µ(j) : j = 1, ..., 20):

H_0j : µ(j) = 2.5 versus H_1j : µ(j) > 2.5, j = 1, ..., 20.
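A full run of this multivariate case can be sketched by combining the data generating step above with the hypothetical helper functions bootstrap_null_distribution and common_quantile_cutoffs from the sketch following Procedure 4 in Chapter 4 (again an illustration, not the thesis's SAS implementation).

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 400, 20, 0.05

X = rng.poisson(lam=2.5, size=(n, m))                    # 400 realizations of the 20-vector
mu0 = np.full(m, 2.5)
t_obs = (X.mean(axis=0) - mu0) / (X.std(axis=0, ddof=1) / np.sqrt(n))

Z = bootstrap_null_distribution(X, mu0, B=1000)          # helpers from the Chapter 4 sketch
cutoffs = common_quantile_cutoffs(Z, alpha)              # one cut-off per hypothesis
print(np.flatnonzero(t_obs > cutoffs))                   # rejected hypotheses (expected: none)
```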
