Journal of Statistical Software


JSS Journal of Statistical Software MMMMMM YYYY, Volume VV, Issue II. doi: 10.18637/jss.v000.

GroupTest: Multiple Testing Procedure for Grouped Hypotheses

Zhigen Zhao

Abstract

In modern Big Data analysis, testing multiple hypotheses simultaneously has become an important tool for analyzing data arising from scientific studies in genetics, astronomy, the social sciences, and many other fields. The hypotheses can often be grouped according to the nature of the scientific investigation. For instance, genes can be grouped according to gene pathways, and nearby pixels in an image can be grouped together. Sparsity appears at both the between-group and within-group levels: only a small number of groups are significant, and only a small number of hypotheses within a significant group are significant. To fully incorporate the group information, Liu, Sarkar, and Zhao (2015) proposed the BSG model, described in Section 2. They further provided a methodology for testing these grouped hypotheses that controls the total posterior false discovery rate and the within-group false discovery rate. The GroupTest package implements all of these procedures in the R programming environment. The usage of GroupTest is illustrated in this paper using a simulated data set and a real data set from Liu et al. (2015).

Keywords: local fdr, group, posterior false discovery rate.

1. Introduction

In this era of Big Data, multiple hypothesis testing has become an important tool for analyzing data arising from various scientific studies. A tremendous amount of research over the last decades has focused on the theory, methodology, and applications of multiple comparisons. In their seminal work, Benjamini and Hochberg (1995) introduced the false discovery rate (FDR), an error rate that is more liberal than the family-wise error rate. In the last two decades, the BH method has been extended to many research scenarios; to name a few, see Benjamini and Yekutieli (2001), Genovese and Wasserman (2002, 2004), Sun and Cai (2007, 2009), He, Sarkar, and Zhao (2015), Efron (2008), and Sarkar (2002, 2004, 2008).

However, many issues remain to be addressed when applying multiple testing procedures in scientific studies. One of them is group structure: the hypotheses are often naturally grouped according to, for instance, biological functions, gene pathways, or geological conditions. Ignoring the group structure when constructing multiple testing methods may result in misleading conclusions (Efron 2008). Liu et al. (2015) analyzed the Adequate Yearly Progress (AYP) data set to compare the academic performance of socioeconomically disadvantaged and socioeconomically advantaged students across elementary schools, which are naturally grouped by school district. Sparsity appears at both the between-group (school district) and within-group (schools within a district) levels. Identifying the school districts that contain at least one school with a significant difference in academic performance between these two groups of students is as important as, if not more important than, identifying the schools themselves. In that paper, they introduced the BSG model to capture the group structure. Based on a decision-theoretic approach, Liu et al. (2015) proposed a two-stage method that provides decisions at both the between-group and within-group levels while controlling the total posterior false discovery rate and the within-group false discovery rate. The package GroupTest provides all the functions needed to implement this two-stage method in the R programming environment.

This article is organized as follows. In Section 2, we restate the BSG model of Liu et al. (2015) for multiple testing with grouped hypotheses. In Section 3, we explain the structure of the data and introduce the AYP data set included in the package for further demonstration. In Section 4, we show the structure of the package and demonstrate how to use it to make decisions.
We conclude the article in Section 5.

2. Model and Hypotheses Testing

Denote the hypotheses to be tested as $H_{gj}$, where $g = 1, 2, \ldots, G$ and $j = 1, 2, \ldots, m_g$. Here $G$ is the total number of groups and $m_g$ is the number of hypotheses within the $g$-th group. Let $\theta_{gj}$ indicate whether the alternative hypothesis is true: $\theta_{gj} = 0$ if the null hypothesis corresponding to $H_{gj}$ is true and $\theta_{gj} = 1$ otherwise. Let $X_{gj}$ be the corresponding test statistic. The following model is introduced in Liu et al. (2015):

\[
\begin{aligned}
\theta_1, \ldots, \theta_G &\overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(\pi_1), \\
\theta_{gj} \mid \theta_g = 0 &\overset{\text{i.i.d.}}{\sim} \text{Bernoulli}(0) \quad (\text{i.e., } P(\theta_{gj} = 0 \mid \theta_g = 0) = 1), \\
\theta_{g1}, \ldots, \theta_{g m_g} \mid \theta_g = 1 &\sim \prod_{j=1}^{m_g} \left\{ (1 - \pi_{2|1})^{1 - \theta_{gj}} \, \pi_{2|1}^{\theta_{gj}} \right\} \frac{I\left(\sum_j \theta_{gj} > 0\right)}{1 - (1 - \pi_{2|1})^{m_g}}, \\
X_{gj} \mid \theta_{gj} &\overset{\text{ind}}{\sim} (1 - \theta_{gj}) f_0(x_{gj}) + \theta_{gj} f_1(x_{gj}),
\end{aligned}
\tag{1}
\]

where $f_0(x)$ and $f_1(x)$ are the density functions of the test statistic under the null and alternative hypotheses, respectively. They call this model, with truncated Bernoulli hidden states within each Significant Group, the BSG model. Here $\theta_g$, the indicator of whether the $g$-th group is significant, follows a Bernoulli distribution with success probability $\pi_1$; that is, the probability that a group is significant is $\pi_1$. If the group is not significant, then $\theta_{gj} = 0$ for all $j = 1, 2, \ldots, m_g$, implying that all the null hypotheses within this group are true. If a group is significant, the hidden states within this group follow a truncated Bernoulli distribution with success probability $\pi_{2|1}$, ensuring that at least one null hypothesis within this group is false.

They further modeled the distribution of the test statistic as follows. When $\theta_{gj} = 0$, the test statistic $X$ follows the standard normal distribution with density $\phi(x)$; when $\theta_{gj} = 1$, the test statistic $X$ follows a mixture of Gaussian distributions with $L$ components, namely

\[
f_1(x) = \sum_{l=1}^{L} c_l \frac{1}{\sigma_l} \phi\left( \frac{x - \mu_l}{\sigma_l} \right),
\tag{2}
\]

where $\sum_{l=1}^{L} c_l = 1$. The parameter of the model (1)-(2) is $\omega = (\pi_1, \pi_{2|1}, c_l, \mu_l, \sigma_l)$, which can be estimated using the EM algorithm outlined in the appendix of Liu et al. (2015).

Let $\delta_{gj} \in \{0, 1\}$ be a decision on the hypothesis $H_{gj}$, where $\delta_{gj} = 1$ if we reject the null hypothesis $H_{gj}$ and $\delta_{gj} = 0$ otherwise. Traditionally, the FDR can be written as

\[
\mathrm{FDR} = E\left[ \frac{\sum_{g=1}^{G} \sum_{j=1}^{m_g} (1 - \theta_{gj}) \delta_{gj}}{\left\{ \sum_{g=1}^{G} \sum_{j=1}^{m_g} \delta_{gj} \right\} \vee 1} \right],
\]

where the hidden states $\theta_{gj}$ are assumed to be fixed and the expectation is taken with respect to $X$ only. Liu et al. (2015) considered the total posterior FDR for the grouped hypotheses, a counterpart from the Bayesian perspective, defined as

\[
\mathrm{PFDR}_T(X) = E\left[ \left. \frac{\sum_{g=1}^{G} \sum_{j=1}^{m_g} (1 - \theta_{gj}) \delta_{gj}}{\left\{ \sum_{g=1}^{G} \sum_{j=1}^{m_g} \delta_{gj} \right\} \vee 1} \, \right| X \right].
\tag{3}
\]

For the within-group level, they further introduced the posterior FDR within a group, $\mathrm{PFDR}_{W_g}$, as

\[
\mathrm{PFDR}_{W_g}(X) = E\left[ \left. \frac{\sum_{j} \delta_{gj} (1 - \theta_{gj})}{\left\{ \sum_{j} \delta_{gj} \right\} \vee 1} \, \right| X \right].
\]

The total posterior FDR can be rewritten as a function of the following two quantities, which are crucial for the final proposed method:

\[
\mathrm{fdr}_g(x) = P(\theta_g = 0 \mid x), \qquad \mathrm{fdr}_{j|g}(x) = P(\theta_{gj} = 0 \mid \theta_g = 1, x).
\tag{4}
\]

The first quantity measures whether the $g$-th group is significant; the second measures whether the $j$-th hypothesis within the $g$-th group is significant given that this group is significant.

Definition 2.1 (Multiple Testing Procedure for Grouped Hypotheses).

Step 1. Estimate the parameter $\omega$ using the EM algorithm;

Step 2. Calculate $\mathrm{fdr}_g$ and $\mathrm{fdr}_{j|g}$ according to (4);

Step 3. For each $g$, let $\mathrm{fdr}_{(1)|g} \le \mathrm{fdr}_{(2)|g} \le \cdots \le \mathrm{fdr}_{(m_g)|g}$ be the ordered $\mathrm{fdr}_{j|g}$, with $H_{g(1)}, \ldots, H_{g(m_g)}$ being the corresponding hypotheses, and find

\[
R_g = \max\left\{ 1 \le k_g \le m_g : \frac{1}{k_g} \sum_{j=1}^{k_g} \mathrm{fdr}_{(j)|g} \le \eta \right\},
\]

given $0 < \eta \le \alpha < 1$. Mark the hypotheses $H_{g(1)}, \ldots, H_{g(R_g)}$ for possible rejection and go to the next step.

Step 4. Calculate $\eta_g = \frac{1}{R_g} \sum_{j=1}^{R_g} \mathrm{fdr}_{(j)|g}$ and define $\mathrm{fdr}^*_g = 1 - (1 - \eta_g)(1 - \mathrm{fdr}_g)$ for each $g$. Order these values as $\mathrm{fdr}^*_{(1)} \le \cdots \le \mathrm{fdr}^*_{(G)}$ and find

\[
l = \max\left\{ k : \frac{\sum_{g=1}^{k} R_{(g)} \, \mathrm{fdr}^*_{(g)}}{\sum_{g=1}^{k} R_{(g)}} \le \alpha \right\},
\]

with $R_{(g)}$ being the value of $R$ for the group that corresponds to $\mathrm{fdr}^*_{(g)}$. The hypotheses that were marked for possible rejection in the groups $(1), \ldots, (l)$ are ultimately rejected.

The proposed two-stage method first screens the hypotheses within each group and marks hypotheses for potential rejection by controlling the posterior FDR at the within-group level. The between-group decision is then made so as to guarantee control of the total posterior FDR. The final rejections are the hypotheses marked for rejection within the rejected groups. As shown in Liu et al. (2015), this method fully utilizes the group structure and can substantially improve on existing methods.

3. Package

The R package GroupTest contains the following four functions and two data sets:

1. GT.wrapper is the main function called by users to do multiple testing for grouped hypotheses;
2. GT.em is the function for estimating the parameter $\omega$;
3. GT.localfdr is the function for calculating the local fdr scores defined in (4);
4. GT.decision is the function making the decision based on Definition 2.1;
5. AYP, the Adequate Yearly Progress data of California, 2013;
6. GroupTest_simulate, a simulated data set.

3.1. Function GT.wrapper

This function is the main function to perform the two-stage testing for grouped hypotheses.
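As a rough illustration of the two-stage rule in Definition 2.1 (Steps 3 and 4), the following base R sketch takes the local fdr scores as given and returns the marked and rejected hypotheses. It is not the package's GT.decision code: the helper names `within_group_screen` and `between_group_decide`, the toy fdr values, and the convention that groups with $R_g = 0$ are set aside (with $\eta_g$ recorded as 0) before Step 4 are all assumptions made for this example.

```r
# Step 3 (assumed helper): within one group, find the largest k whose running
# mean of the sorted conditional local fdr scores stays at or below eta.
within_group_screen <- function(fdr_jg, eta) {
  ord <- order(fdr_jg)
  running_mean <- cumsum(fdr_jg[ord]) / seq_along(ord)
  ok <- which(running_mean <= eta)
  R_g <- if (length(ok) > 0) max(ok) else 0L
  marked <- rep(FALSE, length(fdr_jg))
  if (R_g > 0) marked[ord[seq_len(R_g)]] <- TRUE
  # Assumption: eta_g is recorded as 0 when nothing is marked.
  eta_g <- if (R_g > 0) running_mean[R_g] else 0
  list(R = R_g, eta_g = eta_g, marked = marked)
}

# Step 4 (assumed helper): combine group-level and within-group evidence into
# fdr*_g, sort, and reject the first l groups whose R-weighted mean of fdr*_g
# stays at or below alpha. Groups with R = 0 are skipped (an assumption).
between_group_decide <- function(fdr_g, eta_g, R, alpha) {
  fdr_star <- 1 - (1 - eta_g) * (1 - fdr_g)
  cand <- which(R > 0)
  ord <- cand[order(fdr_star[cand])]
  ratio <- cumsum(R[ord] * fdr_star[ord]) / cumsum(R[ord])
  ok <- which(ratio <= alpha)
  l <- if (length(ok) > 0) max(ok) else 0L
  rejected <- rep(FALSE, length(fdr_g))
  if (l > 0) rejected[ord[seq_len(l)]] <- TRUE
  list(fdr_star = fdr_star, rejected = rejected)
}

# Toy local fdr scores for three groups (made-up numbers, not package output).
fdr_g  <- c(0.01, 0.80, 0.02)
fdr_jg <- list(c(0.02, 0.90, 0.03), c(0.04, 0.70), c(0.50, 0.60))
eta <- alpha <- 0.05

stage1 <- lapply(fdr_jg, within_group_screen, eta = eta)
R      <- sapply(stage1, function(s) s$R)
eta_g  <- sapply(stage1, function(s) s$eta_g)
stage2 <- between_group_decide(fdr_g, eta_g, R, alpha)

# Final rejections: hypotheses marked in Step 3 that sit in a rejected group.
final <- Map(function(s, rej) s$marked & rej, stage1, stage2$rejected)
```

With this toy input only the first group is rejected, so its first and third hypotheses are the final rejections; the second group has one marked hypothesis but its $\mathrm{fdr}^*_g$ is too large for the group to be rejected.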
    GT.wrapper(TestStatistic, alpha = 0.05, eta = alpha, pi1.ini = 0.7,
               pi2.1.ini = 0.4, L = 2, mul.ini = c(-1, 1), sigmal.ini = c(1, 1),
               cl.ini = c(0.5, 0.5), DELTA = 0.001, sigma.known = FALSE)

It takes the following arguments:

- TestStatistic: a list of lists. Each inner list corresponds to one group and contains the test statistics, stored as X, and the group size, stored as mg;
- alpha: the targeted FDR level. By default, it is 0.05;
- eta: the targeted FDR level within each group. The default and recommended choice is alpha;
- pi1.ini: initial value of the probability that a group is significant. By default, it is 0.7;
- pi2.1.ini: initial value of the probability that an individual null hypothesis is false given that the group is significant. By default, it is 0.4;
- L: the number of Gaussian components under the alternative hypothesis. By default, it is 2;
- mul.ini: initial value, a vector of means for all the components of the Gaussian mixture. By default, it is -1 and 1;
- sigmal.ini: initial value, a vector of standard deviations for all the components of the Gaussian mixture. By default, it is 1 and 1;
- cl.ini: initial value, a vector of the probabilities of all the components of the Gaussian mixture. By default, it is 50% and 50%;
- DELTA: the stopping criterion of the EM algorithm. At each iteration, the algorithm calculates the maximum absolute difference between the current estimate and the previous estimate of the parameters. By default, it is 0.001;
- sigma.known: a Boolean variable indicating whether the variance is known. By default, it is FALSE.

This function returns the following two variables:

- parameter: a list consisting of the parameters estimated by the EM algorithm. The elements are $\pi_1$, $\pi_{2|1}$, $c_l$, $\mu_l$, $\sigma_l$;
- TSGroupTest[[g]]: the quantities regarding the $g$-th group, including the test statistics within this group, the individual conditional local fdr scores ($P(\theta_{gj} = 0 \mid x, \theta_g = 1)$), the group-wise local fdr score ($P(\theta_g = 0 \mid x)$), the between-group decision, and the within-group decisions.

3.2. Simulated Data Set

This is a simulated data set for demonstrating the structure of the data. The data set is a list of three lists, where each inner list corresponds to one group. Each inner list stores the number of individual hypotheses and the test statistics for the individual hypotheses within the group. The group sizes are 3, 4, and 5, respectively.
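To make this layout concrete, here is a minimal base R sketch that draws a toy data set from the BSG model (1)-(2) and stores it in the same list-of-lists form, with fields X and mg. The function `simulate_bsg` and its argument names are hypothetical and not part of GroupTest, and the parameter values are simply the default initial values of GT.wrapper listed above, borrowed for illustration only.

```r
set.seed(1)

# Hypothetical helper: simulate one data set from the BSG model.
#   mg     : vector of group sizes
#   pi1    : probability that a group is significant
#   pi2.1  : success probability of the truncated Bernoulli within a group
#   cl, mul, sigmal : weights, means, sds of the Gaussian mixture f1
simulate_bsg <- function(mg, pi1, pi2.1, cl, mul, sigmal) {
  lapply(mg, function(m) {
    theta_g <- rbinom(1, 1, pi1)          # group-level hidden state
    theta <- rep(0L, m)                   # all nulls true if theta_g == 0
    if (theta_g == 1) {
      # Truncated Bernoulli: redraw until at least one hidden state is 1.
      repeat {
        theta <- rbinom(m, 1, pi2.1)
        if (sum(theta) > 0) break
      }
    }
    # Mixture component for each hypothesis (used only where theta == 1).
    comp <- sample(seq_along(cl), m, replace = TRUE, prob = cl)
    x <- ifelse(theta == 1,
                rnorm(m, mean = mul[comp], sd = sigmal[comp]),  # draw from f1
                rnorm(m))                                        # draw from f0
    list(X = x, mg = m)
  })
}

toy <- simulate_bsg(mg = c(3, 4, 5), pi1 = 0.7, pi2.1 = 0.4,
                    cl = c(0.5, 0.5), mul = c(-1, 1), sigmal = c(1, 1))
str(toy)
```

Each element of `toy` carries the X and mg fields described above, so an object built this way has the layout described for the TestStatistic argument.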

    library(GroupTest)
    data("GroupTest_simulate")
    GroupTest_simulate

    > GroupTest_simulate
    [[1]]
    [[1]]$X
    [1]

    [[1]]$mg
    [1] 3

    [[2]]
    [[2]]$X
    [1]

    [[2]]$mg
    [1] 4

    [[3]]
    [[3]]$X
    [1]

    [[3]]$mg
    [1] 5

3.3. AYP data set

This data set is the Adequate Yearly Progress (AYP) study of California elementary schools in 2013. Liu et al. (2015) used this data set to compare the academic performance of socioeconomically advantaged (SEA) and socioeconomically disadvantaged (SED) students in the elementary schools of California. For each school, they compared the success rates in Math exams for these two groups of students. The focus is to discover the schools with an unusually small or large difference in performance and to identify the school districts that contain such schools. Let $p_{1i}$ and $p_{2i}$ be the success rates and $n_{1i}$ and $n_{2i}$ be the numbers of students in the groups of SEA and SED students, respectively, in the $i$-th school, $i = 1, \ldots, N$. A z-value for school $i$ is computed according to (Efron 2008; Cai and Sun 2009; Liu et al. 2015)

\[
z_i = \frac{p_{1i} - p_{2i} - \tau}{\sqrt{p_{1i}(1 - p_{1i})/n_{1i} + p_{2i}(1 - p_{2i})/n_{2i}}},
\]

where $\tau$ is the overall difference, $\text{median}(p_{1i}) - \text{median}(p_{2i})$, which is 18.4% in this AYP study. After removing the schools with extremely small or large test statistics and the schools with fewer than 20 students, there are 4118 ($= N$) elementary schools and 701 qualified school districts. The data are saved as a list of lists, with each inner list corresponding to one school district and containing the following three variables:

- X: the test statistics for the individual schools within this school district;
- mg: the number of schools within this school district;
- School.District: the name of the school district.

    library(GroupTest)
    data("AYP")
    length(AYP)
    AYP[[1]]

    > length(AYP)
    [1] 701
    > AYP[[1]]
    $mg
    [1] 16

    $X
     [1]
     [7]
    [13]

    $School.District
    [1] "ABC Unified"

4. Examples

In this section, we demonstrate how to use the functions from this package to analyze both the simulated data and the AYP data set.

4.1. Simulated Data

The function GT.em calculates the estimate of the parameter $\omega$ using the EM algorithm. The number of components of the Gaussian mixture is provided through the argument L, with a default of 2. The initial value of $\omega$ should be provided. Depending on the data set, one can assume $\sigma$ to be either known or unknown, as determined by the argument sigma.known. The function returns the estimate of $\omega$.

    library(GroupTest)
    data(GroupTest_simulate)

    em.estimate <- GT.em(GroupTest_simulate, L = 2, pi1.ini = 0.7, pi2.1.ini = 0.4,
                         mul.ini = c(-1, 1), sigmal.ini = c(1, 2), cl.ini = c(0.4, 0.6),
                         DELTA = 0.001, sigma.known = FALSE)

    > em.estimate
    $pi1
    [1]

    $pi2.1
    [1]

    $mul
    [1]

    $sigmal
    [1]

    $cl
    [1]

    $L
    [1] 2

Based on this output, $\pi_1$ and $\pi_{2|1}$ are estimated as 1 and 0.60 respectively, and $f_1(x)$, the density function under the alternative hypothesis, can be estimated as $\hat f_1(x)$ = 0.50N(.85, ) N(0.02, ).

To get the decisions on all the hypotheses while incorporating the group structure, one can apply the GT.wrapper function, which integrates the estimation of all the parameters and the calculation of the local fdr scores, and provides the final decisions.

    # eta is given numerically here; the name `alpha` is only visible inside
    # GT.wrapper, so `eta = alpha` works as a default but not in a call.
    GT.Test <- GT.wrapper(GroupTest_simulate, alpha = 0.05, eta = 0.05,
                          pi1.ini = 0.7, pi2.1.ini = 0.4, L = 2, mul.ini = c(-1, 1),
                          sigmal.ini = c(1, 2), cl.ini = c(0.4, 0.6),
                          DELTA = 0.001, sigma.known = FALSE)

    > GT.Test[[1]]
    $X
    [1]

    $mg
    [1] 3

    $f0x
    [1]

    $f1x

    [1]

    $fx
    [1]

    $fdr.g
    [1] e-07

    $fdr.j.g
    [1]

    $prob.theta.1.theta.j.g.1
    [1]

    $m.j.g
         [,1] [,2]
    [1,] e-15
    [2,] e-59
    [3,] e-01

    $within.group.rej
    [1]

    $eta.g
    [1] 0

    $fdr.star.g
    [1] e-07

    $between.group.rej
    [1] 0

    > GT.Test[[2]]
    $X
    [1]

    $mg
    [1] 4

    $f0x
    [1]

    $f1x
    [1]

    $fx
    [1]

    $fdr.g
    [1] e-08

    $fdr.j.g
    [1]

    $prob.theta.1.theta.j.g.1
    [1]

    $m.j.g
         [,1] [,2]
    [1,] e-04
    [2,] e-05
    [3,] e-01
    [4,] e-01

    $within.group.rej
    [1]

    $eta.g
    [1] 0

    $fdr.star.g
    [1] e-08

    $between.group.rej
    [1] 0

    > GT.Test[[3]]
    $X
    [1]

    $mg
    [1] 5

    $f0x
    [1]

    $f1x
    [1]

    $fx
    [1]

    $fdr.g
    [1] e-07

    $fdr.j.g
    [1]

    $prob.theta.1.theta.j.g.1
    [1]

    $m.j.g
         [,1] [,2]
    [1,] e-01
    [2,] e-30
    [3,] e-09
    [4,] e-14
    [5,] e-01

    $within.group.rej
    [1]

    $eta.g
    [1] 0

    $fdr.star.g
    [1] e-07

    $between.group.rej
    [1] 0

Note that this function returns a list of lists, with each element corresponding to one group, together with the estimates of the parameters. As an example, GT.Test[[3]] returns all the information regarding the third group. Here are some key quantities:

- X: the test statistics of all the hypotheses within the third group;
- mg: the number of hypotheses within this group;
- fdr.g: the between-group local fdr score $\mathrm{fdr}_g = P(\theta_g = 0 \mid x)$;
- fdr.j.g: the within-group local fdr scores $\mathrm{fdr}_{j|g} = P(\theta_{gj} = 0 \mid x, \theta_g = 1)$;
- between.group.rej: the between-group decision, $\delta_g$;
- within.group.rej: the within-group decisions, $\delta_{gj}$.

This function also returns the estimates of the parameters, which are the same as the output of GT.em().

    > GT.Test$parameter
    $pi1

    [1]

    $pi2.1
    [1]

    $mul
    [1]

    $sigmal
    [1]

    $cl
    [1]

    $L
    [1] 2

4.2. AYP Data Set

In this section, the AYP data set is analyzed using the functions provided by GroupTest.

    AYP.Test <- GT.wrapper(AYP, alpha = 0.1, sigma.known = TRUE)

    > AYP.Test$parameter
    $pi1
    [1]

    $pi2.1
    [1]

    $mul
    [1]

    $sigmal
    [1] 1 1

    $cl
    [1]

    $L
    [1] 2

When setting L = 2, note that $\hat\pi_1 = 0.533$, which implies that the estimated proportion of significant school districts is 53.3%. When a school district is significant, the chance that an individual school is significant is 59.4%. Under the alternative hypothesis, $f_1(x)$, the density function of the test statistic, can be estimated as $\hat f_1(x) = 0.792\,N(-1.878, 1) + 0.208\,N(2.644, 1)$, where the second weight follows from the constraint that the mixture weights sum to one.

    G <- length(AYP)
    G.rej <- array(0, G)
    # Collect the between-group decision of every school district.
    for (g in 1:G) G.rej[g] <- AYP.Test[[g]]$between.group.rej
    sum(G.rej)
    # Count the within-group rejections over all school districts.
    total.school <- 0
    for (g in 1:G) total.school <- total.school + sum(AYP.Test[[g]]$within.group.rej)

    > sum(G.rej)
    [1] 284
    > which(G.rej == 1)
     [1]
    [19]
    > total.school
    [1] 1085
    > AYP.Test[[1]]
    $mg
    [1] 16

    $X
     [1]
     [7]
    [13]

    $School.District
    [1] "ABC Unified"

    $fdr.g
    [1]

    $fdr.j.g
     [1]
     [7]
    [13]

    $within.group.rej
    [1]

    $fdr.star.g
    [1]

    $between.group.rej
    [1] 1

When setting $\eta = \alpha = 0.1$, based on the above output, the total number of rejected school districts is 284, including the ABC Unified school district, which is stored as the first school district in the AYP data set. Note that there are 16 schools within the ABC Unified school district and three of them are rejected (the 7th, 9th, and 10th schools). The total number of schools that are rejected is 1085. As reported in Liu et al. (2015), the proposed test is much more powerful than alternative methods that do not incorporate the group structure. For instance, the SC method (Sun and Cai 2007) results in 668 rejected schools and the adaptive BH method (Benjamini and Hochberg 2000) leads to 629 rejected schools; moreover, neither method provides any control at the between-group level.

5. Conclusion

In big data analysis involving grouped hypotheses, incorporating the group structure to develop valid and sharp testing procedures is an important, urgent, and challenging problem that needs immediate attention. Liu et al. (2015) built a theoretical framework for developing such testing procedures using Bayesian and empirical Bayesian methodologies. The R package GroupTest has been developed to implement the proposed method so that it can be easily applied in real applications. In this article, the author used a simulated example and the AYP data set to demonstrate how to use this package.

References

Benjamini Y, Hochberg Y (1995). Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B, 57(1).

Benjamini Y, Hochberg Y (2000). On the Adaptive Control of the False Discovery Rate in Multiple Testing with Independent Statistics. Journal of Educational and Behavioral Statistics, 25(1).

Benjamini Y, Yekutieli D (2001). The Control of the False Discovery Rate in Multiple Testing under Dependency. The Annals of Statistics.

Cai TT, Sun W (2009). Simultaneous Testing of Grouped Hypotheses: Finding Needles in Multiple Haystacks. Journal of the American Statistical Association, 104(488).

Efron B (2008). Microarrays, Empirical Bayes and the Two-Groups Model. Statistical Science, 23(1).

Genovese C, Wasserman L (2002). Operating Characteristics and Extensions of the False Discovery Rate Procedure. Journal of the Royal Statistical Society. Series B, 64(3).

Genovese C, Wasserman L (2004). A Stochastic Process Approach to False Discovery Control. The Annals of Statistics, 32(3).

He L, Sarkar SK, Zhao Z (2015). Capturing the Severity of Type II Errors in High-Dimensional Multiple Testing. Journal of Multivariate Analysis, 142.

Liu Y, Sarkar SK, Zhao Z (2015). A Decision Theoretic Approach to Multiple Testing of Grouped Hypotheses. Submitted.

Sarkar SK (2002). Some Results on False Discovery Rate in Stepwise Multiple Testing Procedures. The Annals of Statistics.

Sarkar SK (2004). FDR-Controlling Stepwise Procedures and Their False Negatives Rates. Journal of Statistical Planning and Inference, 125(1).

Sarkar SK (2008). On Methods Controlling the False Discovery Rate (with discussion). Sankhya, Ser. A, 70.

Sun W, Cai TT (2007). Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control. Journal of the American Statistical Association, 102(479).

Sun W, Cai TT (2009). Large-Scale Multiple Testing under Dependence. Journal of the Royal Statistical Society. Series B, 71(2).

Affiliation:

Zhigen Zhao
342 Speakman Hall
1801 N. 13th Street
Fox School of Business
Temple University
Philadelphia, PA
E-mail: zhaozhg@temple.edu

Journal of Statistical Software, published by the Foundation for Open Access Statistics. MMMMMM YYYY, Volume VV, Issue II. Submitted: yyyy-mm-dd. Accepted: yyyy-mm-dd.
