A moment-based method for estimating the proportion of true null hypotheses and its application to microarray gene expression data


Biostatistics (2007), 8, 4, pp
doi: /biostatistics/kxm002
Advance Access publication on January 22, 2007

YINGLEI LAI
Department of Statistics and Biostatistics Center, The George Washington University, Washington, DC 20052, USA
ylai@gwu.edu

© The Author. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

SUMMARY

Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. When these variables are simultaneously screened by a statistical test, it is necessary to consider the adjustment for multiple hypothesis testing. The false discovery rate has been proposed and widely used to address this issue. A related problem is the estimation of the proportion of true null hypotheses. A long-standing difficulty with this problem is the identifiability of the nonparametric model. In this study, we propose a moment-based method coupled with sample splitting for estimating this proportion. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the identifiability problem and achieves its optimal performance. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that identifiability can be achieved. Theoretical aspects of the approximation error are discussed. The proposed estimation method is completely nonparametric and simple, with an explicit formula. Simulation studies show the favorable performance of the proposed method when it is compared with other existing methods. Two microarray gene expression data sets are considered for applications.

Keywords: Microarray; Moment estimator; Proportion of true null hypothesis.

1. INTRODUCTION

Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. These data include microarray gene expression data (Hedenfalk and others, 2001), mass spectrometry data (Wu and others, 2003), and nuclear magnetic resonance spectral data (Tadesse and others, 2005). The sample sizes of these data sets are usually small because of their relatively high costs. These data sets can be collected for multiple sample groups, and a typical interest is to identify variables that significantly distinguish these groups, such as normal against disease groups. Statistically, we conduct a multisample comparison test for each of the measured variables. Because numerous variables are simultaneously screened, it is necessary to consider the adjustment for multiple hypothesis testing. The false discovery rate (FDR) has been proposed and widely used to address this issue (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). It evaluates the proportion of false positives among the identified positives. To efficiently evaluate FDRs, it is necessary to obtain an accurate estimate of the proportion of true null hypotheses $\pi_0$. For microarray data, it is equivalent to estimate the proportion of differentially expressed genes. This quantity is also crucial for the sample-size calculation in microarray experiment designs (Jung, 2005; Wang and Chen, 2004).

Many statistical methods have been proposed to estimate $\pi_0$, such as a mixture model proposed by Allison and others (2002), QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). These methods are not always efficient. They may give accurate estimation results in some cases but fail in other cases. If the distributions of test statistics or the related p-value distributions can be specified in parametric forms for both the null and the alternative hypotheses, then a model-based estimation approach, such as the mixture model proposed by Allison and others (2002) or BUM proposed by Pounds and Morris (2003), should perform favorably. However, it is generally difficult to validate these distribution assumptions, especially when sample sizes are small. For the nonparametric approach, a long-standing difficulty is the model identifiability (unique solution of model parameters), because observations are sampled from mixed distributions from the null and the alternative hypotheses. QVALUE (Storey and Tibshirani, 2003) and SPLOSH (Pounds and Cheng, 2004) first smooth the empirical p-value distribution and then estimate an upper bound of $\pi_0$. LBE proposed by Dalmasso and others (2005) estimates the upper bound of $\pi_0$ through a moment-based method. Recently, Pawitan and others (2005a,b) discussed the bias in the estimation of $\pi_0$ and the influence of sample sizes.
Moment-based estimation methods usually require no independence assumptions. Explicit formulas can generally be derived. The requirement of large sample sizes, which is necessary for the statistical efficiency of these methods, limits their usefulness in practice. However, when estimating $\pi_0$ for omics data, the sample size is the number of variables and is usually large. Therefore, we consider a moment-based method coupled with sample splitting for estimating $\pi_0$. By splitting the sample, we are able to understand the p-value distribution under different hypotheses by establishing the conditional independence structure of the joint p-value distribution. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the model identifiability problem and achieves its optimal performance. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that model identifiability can be achieved. The proposed method is completely nonparametric and simple, with an explicit formula.

In the following sections, we first propose the method for estimating $\pi_0$. Theoretical aspects of the approximation error are also presented. Then, we present analysis results for several simulated and experimental data sets to compare the performance of the proposed method with that of the other existing methods. Finally, the advantages and disadvantages of the proposed method are discussed.

2. A MOMENT-BASED ESTIMATION METHOD

2.1 Motivation

A typical situation when multiple hypothesis testing is performed for omics data (microarray data, mass spectrometry data, etc.) is that numerous p values are generated. A proportion of these p values are consistent with the null hypothesis and the rest are consistent with the alternative hypothesis. Our interest in this study is to estimate $\pi_0$, the proportion of true null hypotheses.
To provide an illustrative example for our proposed method, we simulate 2 independent data sets. Both data sets have the same 3000 variables and 2 sample groups with 5 samples in each group. In each data set, the first 1200 variables are independently simulated from the normal distributions N(0, 1) and N(1, 1) for the first and the second sample groups, respectively (40% nonnull), and the remaining 1800 variables are independently simulated from the normal distribution N(0, 1) for both groups (60% null). p values from the 2-sample Student's t-test are calculated for these simulated variables.
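The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: to stay within the standard library it replaces the 2-sample Student's t-test with a two-sided z-test that assumes a known unit variance, and the function names `z_test_p` and `simulate_p_values` as well as the seed are hypothetical.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def z_test_p(x, y):
    """Two-sided p value for a mean difference, ASSUMING known unit variance.
    (Stand-in for the t-test used in the text.)"""
    z = (fmean(y) - fmean(x)) / (1 / len(x) + 1 / len(y)) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

def simulate_p_values(rng, m=3000, m_nonnull=1200, n=5, shift=1.0):
    """One data set: the first m_nonnull variables have mean `shift` in group 2."""
    p_values = []
    for j in range(m):
        mu = shift if j < m_nonnull else 0.0
        group1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group2 = [rng.gauss(mu, 1.0) for _ in range(n)]
        p_values.append(z_test_p(group1, group2))
    return p_values

rng = random.Random(2007)          # arbitrary seed, for reproducibility only
p1 = simulate_p_values(rng)        # first data set
p2 = simulate_p_values(rng)        # second, independent data set
pairs = list(zip(p1, p2))          # one (P1, P2) pair per variable

null_mean = fmean(p1[1200:])       # null p values should average close to 1/2
nonnull_mean = fmean(p1[:1200])    # nonnull p values should average well below 1/2
```

Because the two data sets are generated independently, each pair in `pairs` is independent given whether the variable is null or nonnull, which is the structure the method relies on.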

The marginal histograms in Figure 1(a) illustrate the p-value distributions based on one data set. From these histograms, one may realize the problem of identifiability when estimating $\pi_0$. Although the null distribution is known to be uniform on [0, 1], the nonnull distribution is unknown. Without imposing any parametric or other assumptions on the nonnull distribution, we cannot obtain a unique solution for $\pi_0$ if only one data set is considered. However, if we have 2 independent data sets such that both data sets contain the same variables, then pairs of p values can be obtained for all variables, and these pairs are actually conditionally independent. The scatter plot in Figure 1(a) gives an illustration. From this plot, one may realize that it is possible to solve the identifiability problem and obtain a unique solution for $\pi_0$ under certain conditions.

In the following subsections, we first introduce an estimation method when 2 independent data sets are available. When there is only one data set, we propose a procedure to generate 2 independent data sets. A bootstrap procedure for confidence intervals and some theoretical aspects are also discussed.

2.2 Two data sets

At the beginning, we consider 2 independent data sets. Both data sets contain the same m variables and g sample groups. Their sample sizes may be different. Test statistics are chosen to test some specific hypotheses for each variable, such as $H_0$: the variable has the same population means in different sample groups versus $H_a$: the variable has different population means in different sample groups. (For simplicity, we skip the mathematical description of the data structure and the related test statistics.) The goal is to estimate $\pi_0$, the proportion of variables consistent with the null hypothesis. Suppose a test statistic T is chosen to test a specified hypothesis. Without loss of generality, we assume that T is continuous.
For each variable, we can obtain 2 corresponding p values from the 2 data sets. For data set k, k = 1, 2, the p value $P^{(k)}$ follows a uniform distribution U[0, 1] under the null hypothesis $H_0$. Under the alternative hypothesis $H_a$, there may be various distribution components (except U[0, 1]) for the p-value distribution. We use $I = \{1, 2, \ldots\}$ to denote the set containing the indices representing different nonnull distribution components. Generally, the set $I$ may contain many different components ($|I| > 1$, where $|I|$ is the number of elements in $I$). We propose that the null component and the different nonnull components can be approximated by 2 components: a null component and a nonnull component. Under this approximation, there is an approximated proportion of true null hypotheses $\tilde{\pi}_0$, which may be different from $\pi_0$ (however, if $|I| = 1$, then $\tilde{\pi}_0 = \pi_0$). Considering the moments of p values, we have

\[
\begin{aligned}
E[P^{(1)}] &= \tilde{\pi}_0 E[P^{(1)} \mid H_0] + (1 - \tilde{\pi}_0) E[P^{(1)} \mid H_a], \\
E[P^{(2)}] &= \tilde{\pi}_0 E[P^{(2)} \mid H_0] + (1 - \tilde{\pi}_0) E[P^{(2)} \mid H_a], \\
E[P^{(1)} P^{(2)}] &= \tilde{\pi}_0 E[P^{(1)} \mid H_0] E[P^{(2)} \mid H_0] + (1 - \tilde{\pi}_0) E[P^{(1)} \mid H_a] E[P^{(2)} \mid H_a].
\end{aligned}
\]

$E[P^{(k)} \mid H_0]$, $E[P^{(k)} \mid H_a]$, and $E[P^{(k)}]$ are the expected values of the p value following the null, nonnull, and marginal distributions in data set k, k = 1, 2, respectively. $E[P^{(1)} P^{(2)}]$ is the expected value of the product of $P^{(1)}$ and $P^{(2)}$ under the marginal joint distribution. Note that $E[P^{(k)} \mid H_0] = 1/2$ because the null distribution is known to be U[0, 1]. Furthermore, $E[P^{(1)} P^{(2)}]$, $E[P^{(1)}]$, and $E[P^{(2)}]$ can be estimated from the data (using the corresponding sample moments). Then, there are only 3 unknown parameters: $\tilde{\pi}_0$, $E[P^{(1)} \mid H_a]$, and $E[P^{(2)} \mid H_a]$. With the above 3 equations, we can obtain an explicit formula

\[
\tilde{\pi}_0 = \frac{E[P^{(1)} P^{(2)}] - E[P^{(1)}] E[P^{(2)}]}{E[P^{(1)} P^{(2)}] - E[P^{(2)}]/2 - E[P^{(1)}]/2 + 1/4}.
\]
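The algebra behind the explicit formula can be filled in as follows. This is a sketch of one possible route, not the proof from the supplementary material; the shorthand $a_k = E[P^{(k)}]$, $c = E[P^{(1)} P^{(2)}]$, and $\mu_k = E[P^{(k)} \mid H_a]$ is introduced here for illustration only, and the last step assumes $\tilde{\pi}_0 < 1$ and $\mu_k \neq 1/2$.

```latex
\begin{align*}
% Substitute E[P^{(k)} | H_0] = 1/2 into the two first-moment equations:
a_k - \tfrac{1}{2} &= (1 - \tilde{\pi}_0)\left(\mu_k - \tfrac{1}{2}\right), \qquad k = 1, 2, \\
% Combine the product-moment equation with the first moments:
c - \tfrac{a_1}{2} - \tfrac{a_2}{2} + \tfrac{1}{4}
  &= (1 - \tilde{\pi}_0)\left(\mu_1 - \tfrac{1}{2}\right)\left(\mu_2 - \tfrac{1}{2}\right), \\
% Subtract (a_1 - 1/2)(a_2 - 1/2) from the previous line:
c - a_1 a_2
  &= \tilde{\pi}_0 (1 - \tilde{\pi}_0)\left(\mu_1 - \tfrac{1}{2}\right)\left(\mu_2 - \tfrac{1}{2}\right), \\
% Divide the third line by the second:
\frac{c - a_1 a_2}{c - a_1/2 - a_2/2 + 1/4} &= \tilde{\pi}_0 .
\end{align*}
```

The unknown nonnull means $\mu_1$ and $\mu_2$ cancel in the ratio, which is why no assumption on the nonnull distribution is needed beyond the 2-component approximation.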

Fig. 1. (a) Scatter plot with marginal histograms for paired p values based on 2 independently simulated data sets (see Section 2.1 for details), in which the grey and black dots represent variables consistent with the null and the alternative hypotheses, respectively, and the dashed lines represent the proportion of true null hypotheses. (b) An artificial example for the data division scheme (see Procedure 1 for details), in which grey and black colors represent the first and the second sub data sets, respectively. (c,d) Estimation results based on the microarray gene expression data sets for (c) the breast cancer and (d) the blood studies. N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the p-value histograms, the lines with different characters represent the original estimates from different methods. The boxplots are based on the bootstrap estimates from different methods.

The mathematical proof is given as Lemma 1 in supplementary material available at Biostatistics online. Therefore, an estimator for $\pi_0$ is proposed as

\[
\hat{\pi}_0 = \max\left\{0, \min\left\{1, \frac{\sum_{j=1}^{m} P_j^{(1)} P_j^{(2)}/m - \left[\sum_{j=1}^{m} P_j^{(1)}/m\right]\left[\sum_{j=1}^{m} P_j^{(2)}/m\right]}{\sum_{j=1}^{m} P_j^{(1)} P_j^{(2)}/m - \sum_{j=1}^{m} P_j^{(1)}/(2m) - \sum_{j=1}^{m} P_j^{(2)}/(2m) + 1/4}\right\}\right\}, \tag{2.1}
\]

where $P_j^{(k)}$ is the calculated p value of the jth variable in data set k, j = 1, 2, ..., m, k = 1, 2. Boundary constraints are imposed since the proportion $\pi_0$ must be within [0, 1].

2.3 One data set

To estimate $\pi_0$ for a given data set, which contains m variables and g sample groups, we can first divide the data set into 2 parts and then use the method described above. The following procedure is proposed.

PROCEDURE 1
1) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
2) With a given test statistic T, calculate the p value for each part;
3) Repeat steps 1 and 2 for all variables and obtain the set of paired p values;
4) Use (2.1) to estimate $\pi_0$;
5) Repeat steps 1–4 R times and obtain R estimates of $\pi_0$;
6) Return the median of these R estimates.

There may be complicated dependence structures among the different variables in the data set. We perform the data division step (step 1) separately for each variable to reduce the impact of these dependence structures (see Figure 1(b) for an illustration). Although the proposed method is moment based and does not require any independence assumptions, it is still necessary to reduce this impact so that the estimation can be more statistically efficient. Because different random divisions of the data set result in different estimates, we repeat steps 1–4 R times to obtain a resample distribution of estimates. (In this study, we use R = 25. Based on some simulation studies [data not shown], 25 is an appropriate choice for the balance between estimation accuracy and computational burden.) Then, the median is reported for robustness.
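Estimator (2.1) and Procedure 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the author's R code: the names `moment_estimate`, `split_in_two`, `procedure_1`, and `z_test_p` are hypothetical, the p-value function is a known-variance z-test standing in for the t-test, and returning 1 when the denominator of (2.1) is exactly zero is a choice made here that the text does not specify.

```python
import random
from statistics import NormalDist, fmean, median

STD_NORMAL = NormalDist()

def moment_estimate(p1, p2):
    """Estimator (2.1) from paired p values, truncated to [0, 1]."""
    mean1, mean2 = fmean(p1), fmean(p2)
    mean12 = fmean(a * b for a, b in zip(p1, p2))
    numerator = mean12 - mean1 * mean2
    denominator = mean12 - mean1 / 2 - mean2 / 2 + 0.25
    if denominator == 0:
        return 1.0  # ASSUMPTION: no detectable signal is treated as all-null
    return max(0.0, min(1.0, numerator / denominator))

def split_in_two(values, rng):
    """Step 1: randomly divide one group's observations into two halves."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def procedure_1(data, p_value, R=25, seed=0):
    """data: one (group1, group2) pair of observation lists per variable.
    p_value: callable mapping (group1_part, group2_part) to a p value."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(R):                      # step 5: repeat R times
        p1, p2 = [], []
        for g1, g2 in data:                 # step 3: all variables
            g1a, g1b = split_in_two(g1, rng)
            g2a, g2b = split_in_two(g2, rng)
            p1.append(p_value(g1a, g2a))    # step 2: p value for each part
            p2.append(p_value(g1b, g2b))
        estimates.append(moment_estimate(p1, p2))   # step 4
    return median(estimates)                # step 6

def z_test_p(x, y):
    """Stand-in test: two-sided z-test ASSUMING known unit variance."""
    z = (fmean(y) - fmean(x)) / (1 / len(x) + 1 / len(y)) ** 0.5
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# Exact toy check: 4 "null" pairs with mean 1/2 and zero covariance,
# plus 4 "nonnull" pairs at (0, 0), so the true proportion is 0.5.
toy1 = [0.25, 0.25, 0.75, 0.75, 0.0, 0.0, 0.0, 0.0]
toy2 = [0.25, 0.75, 0.25, 0.75, 0.0, 0.0, 0.0, 0.0]
toy_estimate = moment_estimate(toy1, toy2)

# Simulated check: 2000 variables, 10 samples per group, 40% shifted by 1.5.
sim_rng = random.Random(1)
data = [([sim_rng.gauss(0, 1) for _ in range(10)],
         [sim_rng.gauss(1.5 if j < 800 else 0.0, 1) for _ in range(10)])
        for j in range(2000)]
sim_estimate = procedure_1(data, z_test_p, R=5)
```

With a single nonnull component, as in the simulated check, the estimate is expected to land near the true value 0.6; `R=5` is used here only to keep the example quick, whereas the text uses R = 25.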
2.4 Confidence interval

Theoretically, we can apply the Delta method (Casella and Berger, 2002, p. 240) to obtain formulas for the large sample variance and confidence intervals. However, these formulas may be invalid because of complicated dependence structures among the variables in omics data. Therefore, we use the bootstrap method (Efron, 1979) to obtain confidence intervals. For QVALUE, BUM, SPLOSH, and LBE, we can simply repeat sampling p values and estimating $\pi_0$ B times to obtain a resample distribution. For the proposed method, a resample distribution of estimates can be similarly obtained by the following procedure.

PROCEDURE 2
1) Run the following 3 steps R times to obtain R sets of paired p values:
   a) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
   b) With a given test statistic T, calculate the p value for each part;
   c) Repeat steps a and b for all variables and obtain the set of paired p values.
2) Sample m integers $\{b_1, b_2, \ldots, b_m\}$ with replacement from the set $\{1, 2, \ldots, m\}$ with probabilities $\{1/m, 1/m, \ldots, 1/m\}$.
3) Perform the following 2 steps for each set of paired p values: form a new set by selecting the $\{b_1, b_2, \ldots, b_m\}$th paired p values; use (2.1) to estimate $\pi_0$.
4) Record the median of these R estimates of $\pi_0$.
5) Return a resample distribution by repeating steps 2–4 B times.

2.5 Approximation error

The proposed estimation method is derived based on the approximated proportion $\tilde{\pi}_0$. It is necessary to study the approximation error. We can show that

\[
\tilde{\pi}_0 = \pi_0 + \frac{\sum_{i<j} \pi_i \pi_j \{E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]\}\{E[P^{(2)} \mid H_i] - E[P^{(2)} \mid H_j]\}}{\sum_{i \in I} \pi_i \{E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]\}\{E[P^{(2)} \mid H_0] - E[P^{(2)} \mid H_i]\}}, \tag{2.2}
\]

where $\pi_i$ is the proportion of nonnull distribution component $i \in I$ and $E[P^{(k)} \mid H_i]$ is the expected value of the p value following that component. The mathematical proof is given as Lemma 2 in supplementary material available at Biostatistics online.

The approximation will be close if $E[P^{(k)} \mid H_i] \approx E[P^{(k)} \mid H_j]$ for all $i, j \in I$ and any k = 1, 2. An ideal case is that all p values from the alternative hypothesis follow only one distribution ($|I| = 1$). In this situation, we have $E[P^{(k)} \mid H_i] = E[P^{(k)} \mid H_j]$ for all $i, j \in I$ and any k = 1, 2, and therefore $\tilde{\pi}_0 = \pi_0$.

The approximation will also be close if $E[P^{(k)} \mid H_i] \approx 0$ for all $i \in I$ and any k = 1, 2. An ideal case is that the number of samples in each group goes to infinity, in which case we have $E[P^{(k)} \mid H_i] \to 0$ for all $i \in I$ and any k = 1, 2, and therefore $\tilde{\pi}_0 \to \pi_0$.

To better understand the approximation error when the p values from the alternative hypothesis are heterogeneously distributed, we have the following discussion.
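Before that discussion, equation (2.2) can be checked numerically on a small example. The sketch below computes the approximated proportion twice, once from the population moments via the explicit formula of Section 2.2 and once from the right-hand side of (2.2). The function names and the example values (a true proportion of 0.6 and two nonnull components with mean p values 0.05 and 0.25 in both data sets) are illustrative assumptions, not values from the paper.

```python
def approx_from_moments(pi0, weights, means1, means2):
    """Explicit formula of Section 2.2 applied to population moments,
    assuming conditional independence within each component."""
    e1 = pi0 * 0.5 + sum(w * a for w, a in zip(weights, means1))
    e2 = pi0 * 0.5 + sum(w * b for w, b in zip(weights, means2))
    e12 = pi0 * 0.25 + sum(w * a * b for w, a, b in zip(weights, means1, means2))
    return (e12 - e1 * e2) / (e12 - e1 / 2 - e2 / 2 + 0.25)

def approx_from_2_2(pi0, weights, means1, means2):
    """Right-hand side of (2.2): pi0 plus the approximation error."""
    k = len(weights)
    numerator = sum(
        weights[i] * weights[j]
        * (means1[i] - means1[j]) * (means2[i] - means2[j])
        for i in range(k) for j in range(i + 1, k)      # pairs with i < j
    )
    denominator = sum(
        w * (0.5 - a) * (0.5 - b)
        for w, a, b in zip(weights, means1, means2)
    )
    return pi0 + numerator / denominator

pi0 = 0.6
weights = [0.2, 0.2]        # proportions of the two nonnull components
means = [0.05, 0.25]        # E[P | H_i], taken equal in both data sets

via_moments = approx_from_moments(pi0, weights, means, means)
via_2_2 = approx_from_2_2(pi0, weights, means, means)
error = via_2_2 - pi0       # about 0.030 for these values

# Upper bound for equal sample sizes, as derived in the discussion that follows.
bound = (1 - pi0) / 2 * (0.25 - 0.05) ** 2 / (0.5 - 0.25) ** 2
```

For these values both routes give about 0.630, so the approximated proportion overstates the true 0.6 by about 0.030, well inside the bound of 0.128.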
If the number of samples in each group in the first data set is the same as the corresponding one in the second data set, then we have $E[P^{(1)} \mid H_i] = E[P^{(2)} \mid H_i]$ for all $i \in I$ and

\[
\tilde{\pi}_0 - \pi_0 = \frac{\frac{1}{2} \sum_{i,j \in I} \pi_i \pi_j \{E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]\}^2}{\sum_{i \in I} \pi_i \{E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]\}^2} \geq 0.
\]

Since moment estimators are generally asymptotically efficient, $\pi_0$ will be asymptotically overestimated. An upper bound can be further derived:

\[
\tilde{\pi}_0 - \pi_0 \leq \frac{1 - \pi_0}{2} \cdot \frac{\max_{i,j \in I} \{|E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]|^2\}}{\min_{i \in I} \{|E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]|^2\}} = \text{factor} \times \frac{\text{numerator}}{\text{denominator}}.
\]

Based on this upper bound, the following conclusions can be drawn:

The approximation error depends on the factor (the smaller the better). It will be small if $\pi_0 \approx 1$. The estimation bias will be larger if $\pi_0$ is closer to 0 (or if the proportion of differentially expressed genes is larger).

The approximation error depends on the numerator (the smaller the better). It will be small if $\max_{i,j \in I} \{|E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]|\} \approx 0$ or, equivalently, $E[P^{(1)} \mid H_i] \approx E[P^{(1)} \mid H_j]$ for all $i, j \in I$. This case has been discussed above.

The approximation error depends on the denominator (the larger the better). For p values from the alternative hypothesis, we have $0 < E[P^{(1)} \mid H_i] < 1/2$. Since $E[P^{(1)} \mid H_0] = 1/2$, $0 < E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i] < 1/2$. Therefore, the approximation error will be small if $E[P^{(1)} \mid H_i] \approx 0$ for all $i \in I$. This case has also been discussed above.

3. SIMULATIONS AND APPLICATIONS

3.1 Comparison with other methods

A typical application of the proposed method is to estimate the proportion of differentially expressed genes in a given microarray gene expression data set. This proportion is actually $1 - \pi_0$. Therefore, it is equivalent to estimate $\pi_0$, which is the proportion of nondifferentially expressed genes. Many statistical methods have been proposed to estimate $\pi_0$, such as QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). In this section, we compare the proposed method with these existing statistical methods through simulations and applications. The simulations are conducted based on a microarray gene expression data set for a breast cancer study.

We use the 2-sample Student's t-test for hypothesis testing. For the experimental data set, we observe from quantile-quantile plots that the p values given by the t-distribution and the permutation procedure are consistent (data not shown). Therefore, we choose to use the t-distribution to assess p values because it gives unique results.

Statistical efficiencies can be compared in simulation studies since we know the truth. With a given $\pi_0$, we repeat the simulation and estimation procedures B = 100 times. Note that the proposed method requires much more computation time than these existing methods because of its repetition of random data division (R = 25). Although B = 100 is a relatively small number, it is adequate to compare the performance of different methods.
The root mean square error (RMSE), Bias, and standard deviation (SD) are used to compare different methods (estimators) including the proposed one. For an estimator $\hat{\pi}_0$, let $\hat{\pi}_0^{(i)}$ be the calculated estimate in the ith simulation. The Bias, SD, and RMSE are defined as

\[
\mathrm{Bias}(\hat{\pi}_0) = \sum_{i=1}^{B} \hat{\pi}_0^{(i)}/B - \pi_0, \quad
\mathrm{SD}(\hat{\pi}_0) = \sqrt{\sum_{i=1}^{B} \left[\hat{\pi}_0^{(i)} - \sum_{i=1}^{B} \hat{\pi}_0^{(i)}/B\right]^2 / (B - 1)}, \quad
\mathrm{RMSE}(\hat{\pi}_0) = \sqrt{\mathrm{SD}^2 + \mathrm{Bias}^2}.
\]

3.2 Simulation studies

Configurations. In general, there are complicated dependence structures in a microarray gene expression data set. Therefore, we conduct the following simulation studies with covariance matrices constructed based on a microarray gene expression data set (the first data set in Section 3.3). A gene expression data set is simulated with m = 3000 genes and 2 sample groups with sample sizes $n_1 = n_2 = 10$ (simulation studies 1 and 2) or 50 (simulation study 3). Data are simulated from normal distributions with an assumed proportion $1 - \pi_0$ of differentially expressed genes. Genes are grouped into 30 blocks with 100 genes in each block. For each block, we randomly select 100 genes from the experimental data set and calculate the correlation matrices $\Sigma_1$ and $\Sigma_2$ in the first and the second groups, respectively. For blocks of differentially expressed genes, we simulate data from the normal distributions $N(0, \Sigma_1)$ and $N(\mu, \Sigma_2)$ for the first and the second sample groups, respectively. For the remaining blocks, we simulate data from the normal distributions $N(0, \Sigma_1)$ and $N(0, \Sigma_2)$ for the first and the second sample groups, respectively. Here, $0$ and $\mu$ are (random) vectors. For each configuration, we repeat the simulation and estimation procedures B = 100 times.

Different statistical methods are used to estimate $\pi_0$. We run QVALUE, BUM, SPLOSH, and LBE with their default settings. For the proposed method, we divide each sample group into 2 parts with equal sample sizes: (5, 5)/(5, 5) for simulation studies 1 and 2, and (25, 25)/(25, 25) for simulation study 3.
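The Bias, SD, and RMSE defined in Section 3.1 translate directly into code. The sketch below is illustrative only: the function name `summarize` and the three example estimates are made up for the demonstration and are not results from the paper.

```python
from math import sqrt
from statistics import fmean, stdev

def summarize(estimates, true_pi0):
    """Bias, SD, and RMSE of B simulated estimates, as defined in Section 3.1."""
    bias = fmean(estimates) - true_pi0
    sd = stdev(estimates)               # sample SD, divides by B - 1
    rmse = sqrt(sd ** 2 + bias ** 2)
    return bias, sd, rmse

# Hypothetical estimates from B = 3 simulations with true proportion 0.6.
bias, sd, rmse = summarize([0.58, 0.62, 0.66], 0.6)
```

For this toy input the mean estimate is 0.62, so the Bias is 0.02, the SD is 0.04, and the RMSE is the square root of 0.002, roughly 0.045.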

Fig. 2. Estimation results based on the simulation studies 1 (left panel), 2 (middle panel), and 3 (right panel). The RMSEs (a–c), Biases (d–f), and SDs (g–i) (y-axes) against the true proportions (x-axes) are plotted. The solid lines with black dots, solid, dashed, dotted, and dot-dashed lines represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the boxplots (j–l) of estimated proportions, N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively, and the dashed lines represent the true value.

The results are summarized in Figure 2, in which RMSE, Bias, and SD are compared. We also compare boxplots of the estimation results from different methods when $\pi_0 = 0.6$.

Results. The first simulation study considers the situation in which there is only one p-value distribution component for differentially expressed genes. We fix $\mu = 1.5$ and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$. Generally, the sample size of a microarray data set is relatively small. Therefore, we set $n_1 = n_2 = 10$. As shown in Figure 2, for $\pi_0$ around 0.2, only BUM gives smaller RMSEs than the proposed method. For other values of $\pi_0$, the proposed method gives the lowest RMSEs. Note that the behavior of BUM is not stable. It gives the highest RMSEs when $\pi_0$ [...] or $\pi_0$ [...]. For different values of $\pi_0$, the proposed method consistently gives relatively low biases and the second lowest SDs.

The second simulation study considers a general situation in which p values of differentially expressed genes may follow different distribution components. We randomly sample $\mu$ from a uniform distribution U[1, 2] and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$ and $n_1 = n_2 = 10$. As shown in Figure 2, for $\pi_0 > 0.3$, the proposed method gives the lowest RMSEs. For $\pi_0$ around 0.2, only BUM gives lower RMSEs than the proposed method. Note again that the behavior of BUM is not stable. It gives the highest RMSEs when $\pi_0 > 0.3$. For $\pi_0$ around 0.1, QVALUE gives the lowest RMSEs, and the proposed method gives slightly higher RMSEs. For different values of $\pi_0$, the proposed method consistently gives relatively low biases and the second lowest SDs.

The third simulation study considers the situation in which the sample size of a microarray data set is relatively large. Therefore, we set $n_1 = n_2 = 50$. We still consider a general situation in which p values of differentially expressed genes may follow different distribution components. We randomly sample $\mu$ from a uniform distribution U[1, 2] and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$. As shown in Figure 2, the proposed method always gives the lowest RMSEs and biases and the second lowest SDs for different values of $\pi_0$.

Other simulations. Simulations for other configurations are also considered. Generally, the proposed method performs comparably well. However, if the sample size is very small (e.g. <8), the proposed method performs poorly. This is not surprising.
If the sample size of a given data set is very small, then the sample size of a divided subset will be even smaller, which significantly reduces the power to detect differential expression. This fact has also been discussed by Pawitan and others (2005a,b). Therefore, while enjoying the model identifiability gained through data division, we lose some statistical efficiency in estimation.

3.3 Applications

The above theoretical and simulation studies show the favorable performance of the proposed method, especially when (i) the sample size is relatively large, (ii) the p values from the alternative hypothesis are homogeneously distributed, or (iii) the proportion of differentially expressed genes is relatively small. In practice, it is difficult to find a microarray data set for the second or the third situation. However, there are many microarray data sets with relatively large sample sizes.

We consider 2 data sets for applications. The first one is the famous microarray gene expression data set for a breast cancer study. Hedenfalk and others (2001) used microarrays to compare 3226 gene expression profiles between 7 BRCA1 samples and 8 BRCA2 samples. The data set is publicly available at Supplement. A total of 56 genes were filtered out, because they had one or more expression measurements exceeding 20, which were considered not trustworthy (Storey and Tibshirani, 2003). Therefore, 3170 gene expression measurements for 15 samples are used in this study.

The second data set has a relatively large sample size. Wiestner and others (2003) used lymphochips to compare gene expression profiles between 79 Ig-mutated and 28 Ig-unmutated samples with chronic lymphocytic leukemia. The data set is publicly available at [...]. We use the k-nearest neighbors method (R package impute; Troyanskaya and others, 2001) to impute the missing values in the data set.

We use different statistical methods to estimate $\pi_0$. QVALUE, BUM, SPLOSH, and LBE are run with their default settings.
For the proposed method, we divide the data set into 2 subsets: (3, 4)/(4, 4) for the first data set and (39, 14)/(40, 14) for the second data set. We bootstrap B = 1000 times to obtain the resample distributions of estimates (see Section 2 for details). Since the p values from the null hypothesis follow a uniform distribution U[0, 1], $\pi_0$ is expected to be under the curve of the underlying empirical p-value distribution. ($f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p) = \pi_0 + (1 - \pi_0) f_1(p) \geq \pi_0$, where $f_0$, $f_1$, and $f$ are the null, nonnull, and marginal distributions of the p value, respectively.)

For the first data set, Figure 1(c) shows a histogram of p values and boxplots to compare estimates from different methods. Only the proposed method and BUM give estimates under the histogram. The proposed method gives the smallest estimated $\pi_0$. Among these 5 methods, BUM gives a relatively small variance and the other 4 give comparatively high variances. However, in the simulation studies (e.g. boxplots in Figure 2), some confidence intervals given by BUM do not contain the true value and are not meaningful. Therefore, the proposed method may give more reliable estimation results.

For the second data set, Figure 1(d) shows a histogram of p values and boxplots to compare estimates from different methods. Not only does the proposed method give the smallest estimates, but its whole boxplot is also under the histogram. Furthermore, its variance is relatively small among these 5 methods.

In the above simulation studies and applications, the variances of BUM are always the lowest among these 5 estimation methods. This comes from the simple model of BUM: the mixture of a beta distribution and a uniform distribution. However, it is difficult to validate this model in practice.

4. DISCUSSION

In the problem of estimating the proportion of true null hypotheses, the number of variables is the sample size of the study. Microarrays and other high-throughput technologies enable us to collect measurements for a large number of variables.
With these data, moment-based estimation methods can be considered, because they are generally asymptotically efficient. In this study, we proposed a moment-based estimation method coupled with sample splitting and discussed its theoretical properties. The simulation studies and the applications to microarray data showed the favorable performance of the proposed method when it was compared with the other existing methods. Since the t-test requires at least 2 samples in each group, the proposed method cannot be applied when a group sample size is less than 4. In such a situation, other statistical methods, such as QVALUE, should be considered.

From the above analyses, we observe that each method achieves its optimal performance only in certain situations. New methods for estimating $\pi_0$ are being proposed (Langaas and others, 2005). It is necessary to conduct more comprehensive reviews and systematic comparisons of different $\pi_0$-estimation methods.

We recently proposed a likelihood-based method coupled with an EM algorithm for estimating $\pi_0$ (Lai, 2006). Random data division was also used to achieve the model identifiability. Through simulations and applications to microarray gene expression data, we showed the favorable performance of this method (Lai, 2006). However, there are 2 disadvantages: (i) The method is likelihood based and assumes independence among different genes, which is unlikely to be true because genes interact with each other during cellular processes. (ii) The method uses an EM algorithm, which may provide unreliable estimation when the likelihood function is not regular. The moment-based method proposed in this study requires no independence assumption. In addition to its favorable performance, it is completely nonparametric and simple, with an explicit formula that gives a unique solution.

A future research topic is to generalize the proposed method so that estimation efficiency can be further improved. As shown in the simulation studies, the estimation variance tends to increase when the true proportion increases (Figure 2). In the second simulation study for the heterogeneous alternative, there is a considerable estimation bias when the true proportion is relatively small (Figure 2). It is necessary to pursue both theoretical and simulation studies so that more efficient estimation methods can be developed.

ACKNOWLEDGMENTS

I am grateful to Prof. Tapan Nayak, the editors, associate editors, and the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by a start-up fund from the George Washington University and the National Institutes of Health grant DK. The R codes are available at ylai/research/rdpm. Conflict of Interest: None declared.

REFERENCES

ALLISON, D. B., GADBURY, G. L., HEO, M., FERNANDEZ, J. R., LEE, C.-K., PROLLA, T. A. AND WEINDRUCH, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 39.
BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57.
CASELLA, G. AND BERGER, R. L. (2002). Statistical Inference, 2nd edition. Pacific Grove, CA: Duxbury.
DALMASSO, C., BROËT, P. AND MOREAU, T. (2005). A simple procedure for estimating the false discovery rate. Bioinformatics 21.
EFRON, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7.
HEDENFALK, I., DUGGAN, D., CHEN, Y., RADMACHER, M., BITTNER, M., SIMON, R., MELTZER, P., GUSTERSON, B., ESTELLER, M., KALLIONIEMI, O. P. and others (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine 344.
JUNG, S.-H. (2005). Sample size for FDR-control in microarray data analysis. Bioinformatics 21.
LAI, Y. (2006). A statistical method for estimating the proportion of differentially expressed genes. Computational Biology and Chemistry 30.
LANGAAS, M., LINDQVIST, B. H. AND FERKINGSTAD, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B 67.
PAWITAN, Y., MICHIELS, S., KOSCIELNY, S., GUSNANTO, A. AND PLONER, A. (2005a). False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21.
PAWITAN, Y., MURTHY, K. R. K., MICHIELS, S. AND PLONER, A. (2005b). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 20.
POUNDS, S. AND CHENG, C. (2004). Improving false discovery rate estimation. Bioinformatics 20.
POUNDS, S. AND MORRIS, S. W. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19.
STOREY, J. D. AND TIBSHIRANI, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100.
TADESSE, M. G., IBRAHIM, J. G., VANNUCCI, M. AND GENTLEMAN, R. (2005). Wavelet thresholding with Bayesian false discovery rate control. Biometrics 61.
TROYANSKAYA, O., CANTOR, M., SHERLOCK, G., BROWN, P., HASTIE, T., TIBSHIRANI, R., BOTSTEIN, D. AND ALTMAN, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17.
WANG, S.-J. AND CHEN, J. J. (2004). Sample size for identifying differentially expressed genes in microarray experiments. Journal of Computational Biology 11.
WIESTNER, A., ROSENWALD, A., BARRY, T. S., WRIGHT, G., DAVIS, R. E., HENRICKSON, S. E., ZHAO, H., IBBOTSON, R. E., ORCHARD, J. A., DAVIS, Z. and others (2003). ZAP-70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood 101.
WU, B., ABBOTT, T., FISHMAN, D., MCMURRAY, W., MOR, G., STONE, K., WARD, D., WILLIAMS, K. AND ZHAO, H. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19.

[Received August 4, 2006; revised January 5, 2007; accepted for publication January 17, 2007]
