A moment-based method for estimating the proportion of true null hypotheses and its application to microarray gene expression data


Biostatistics (2007), 8, 4, pp doi: /biostatistics/kxm002
Advance Access publication on January 22, 2007

YINGLEI LAI
Department of Statistics and Biostatistics Center, The George Washington University, Washington, DC 20052, USA
ylai@gwu.edu

SUMMARY

Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables. When these variables are simultaneously screened by a statistical test, it is necessary to consider the adjustment for multiple hypothesis testing. The false discovery rate has been proposed and widely used to address this issue. A related problem is the estimation of the proportion of true null hypotheses. A long-standing difficulty in this problem is the identifiability of the nonparametric model. In this study, we propose a moment-based method coupled with sample splitting for estimating this proportion. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the identifiability problem and performs at its best. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that identifiability can be achieved. Theoretical aspects of the approximation error are discussed. The proposed estimation method is completely nonparametric and simple, with an explicit formula. Simulation studies show the favorable performance of the proposed method when it is compared to other existing methods. Two microarray gene expression data sets are considered for applications.

Keywords: Microarray; Moment estimator; Proportion of true null hypothesis.

1. INTRODUCTION

Due to advances in experimental technologies, it is feasible to collect measurements for a large number of variables.
These data include microarray gene expression data (Hedenfalk and others, 2001), mass spectrometry data (Wu and others, 2003), and nuclear magnetic resonance spectral data (Tadesse and others, 2005). The sample sizes of these data sets are usually small because the experiments are relatively expensive. These data sets can be collected for multiple sample groups, and a typical interest is to identify variables that significantly distinguish these groups, such as normal against disease groups. Statistically, we conduct a multisample comparison test for each of the measured variables. Because numerous variables are simultaneously screened, it is necessary to consider the adjustment for multiple hypothesis testing. The false discovery rate (FDR) has been proposed and widely used to address this issue (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003). It evaluates the proportion of false positives among the identified positives. To efficiently evaluate FDRs, it is necessary to obtain an accurate estimate of the proportion of true null hypotheses $\pi_0$. For microarray data, this is equivalent to estimating the proportion of differentially expressed genes, $1 - \pi_0$. This quantity is also crucial for the sample-size calculation in microarray experiment designs (Jung, 2005; Wang and Chen, 2004).

Many statistical methods have been proposed to estimate $\pi_0$, such as a mixture model proposed by Allison and others (2002), QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). These methods are not always efficient. They may give accurate estimation results in some cases but fail in other cases. If the distributions of test statistics or the related p-value distributions can be specified in parametric forms for both the null and the alternative hypotheses, then a model-based estimation approach, such as the mixture model proposed by Allison and others (2002) or BUM proposed by Pounds and Morris (2003), should perform well. However, it is generally difficult to validate these distribution assumptions, especially when sample sizes are small. For the nonparametric approach, a long-standing difficulty is the model identifiability (unique solution of model parameters), because observations are sampled from a mixture of the null and the alternative distributions. QVALUE (Storey and Tibshirani, 2003) and SPLOSH (Pounds and Cheng, 2004) first smooth the empirical p-value distribution and then estimate an upper bound of $\pi_0$. LBE proposed by Dalmasso and others (2005) estimates the upper bound of $\pi_0$ through a moment-based method. Recently, Pawitan and others (2005a,b) discussed the bias in the estimation of $\pi_0$ and the influence of sample sizes.

© The Author. Published by Oxford University Press. All rights reserved. For permissions, please e-mail journals.permissions@oxfordjournals.org.
Moment-based estimation methods usually require no independence assumptions, and explicit formulas can generally be derived. The requirement of large sample sizes, which is necessary for the statistical efficiency of these methods, limits their usefulness in practice. However, when estimating $\pi_0$ for omics data, the sample size is the number of variables and is usually large. Therefore, we consider a moment-based method coupled with sample splitting for estimating $\pi_0$. By splitting the sample, we are able to understand the p-value distribution under different hypotheses by establishing the conditional independence structure of the joint p-value distribution. If the p values from the alternative hypothesis are homogeneously distributed, then the proposed method resolves the model identifiability problem and performs at its best. When the p values from the alternative hypothesis are heterogeneously distributed, we propose to approximate this mixture distribution so that model identifiability can be achieved. The proposed method is completely nonparametric and simple, with an explicit formula.

In the following sections, we first propose the method for estimating $\pi_0$. Theoretical aspects of the approximation error are also presented. Then, we present analysis results for several simulated and experimental data sets to compare the performance of the proposed method with that of other existing methods. Finally, the advantages and disadvantages of the proposed method are discussed.

2. A MOMENT-BASED ESTIMATION METHOD

2.1 Motivation

A typical situation when multiple hypothesis testing is performed for omics data (microarray data, mass spectrometry data, etc.) is that numerous p values are generated. A proportion of these p values are consistent with the null hypothesis and the rest are consistent with the alternative hypothesis. Our interest in this study is to estimate $\pi_0$, the proportion of true null hypotheses.
To provide an illustrative example for our proposed method, we simulate 2 independent data sets. Both data sets have the same 3000 variables and 2 sample groups with 5 samples in each group. In each data set, the first 1200 variables are independently simulated from the normal distributions N(0, 1) and N(1, 1) for the first and the second sample groups, respectively (40% nonnull), and the remaining 1800 variables are independently simulated from the normal distribution N(0, 1) for both groups (60% null). p values from the 2-sample Student's t-test are calculated for these simulated variables.
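The simulation just described can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: to stay within the standard library it replaces the 2-sample Student's t-test with a 2-sample z-test that assumes the unit variance is known, and the function names, the seed, and the variables `p1` and `p2` are assumptions introduced here.

```python
import math
import random

def z_test_pvalue(x, y, sigma=1.0):
    """Two-sided p value for equal means, assuming a KNOWN common SD `sigma`.

    Stand-in for the Student's t-test used in the text (assumption of this sketch).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    se = sigma * math.sqrt(1.0 / len(x) + 1.0 / len(y))
    z = (mean_y - mean_x) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))

def simulate_pvalues(rng, m=3000, m_nonnull=1200, n=5, shift=1.0):
    """One data set: the first `m_nonnull` variables have mean `shift` in group 2."""
    pvals = []
    for j in range(m):
        mu = shift if j < m_nonnull else 0.0
        group1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        group2 = [rng.gauss(mu, 1.0) for _ in range(n)]
        pvals.append(z_test_pvalue(group1, group2))
    return pvals

rng = random.Random(2007)      # arbitrary seed, for reproducibility only
p1 = simulate_pvalues(rng)     # p values from the first data set
p2 = simulate_pvalues(rng)     # p values from the second, independent data set

# Null p values should average about 1/2; nonnull p values should be smaller.
null_mean = sum(p1[1200:]) / 1800
nonnull_mean = sum(p1[:1200]) / 1200
print(round(null_mean, 3), round(nonnull_mean, 3))
```

Pairing `p1[j]` with `p2[j]` gives the paired p values shown in the scatter plot of Figure 1(a).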

The marginal histograms in Figure 1(a) illustrate the p-value distributions based on one data set. These histograms show the identifiability problem in estimating $\pi_0$. Although the null distribution is known to be uniform on [0, 1], the nonnull distribution is unknown. Without imposing parametric or other assumptions on the nonnull distribution, we cannot obtain a unique solution for $\pi_0$ if only one data set is considered. However, if we have 2 independent data sets that contain the same variables, then a pair of p values can be obtained for each variable, and the 2 p values in a pair are conditionally independent given whether the variable is null or nonnull. The scatter plot in Figure 1(a) gives an illustration. This plot suggests that it is possible to solve the identifiability problem and obtain a unique solution for $\pi_0$ under certain conditions.

In the following subsections, we first introduce an estimation method when 2 independent data sets are available. When there is only one data set, we propose a procedure to generate 2 independent data sets. A bootstrap procedure for confidence intervals and some theoretical aspects are also discussed.

2.2 Two data sets

We first consider 2 independent data sets. Both data sets contain the same $m$ variables and $g$ sample groups. Their sample sizes may be different. Test statistics are chosen to test some specific hypotheses for each variable, such as $H_0$: the variable has the same population means in different sample groups versus $H_a$: the variable has different population means in different sample groups. (For simplicity, we skip the mathematical description of the data structure and the related test statistics.) The goal is to estimate $\pi_0$, the proportion of variables consistent with the null hypothesis. Suppose a test statistic $T$ is chosen to test a specified hypothesis. Without loss of generality, we assume that $T$ is continuous.
For each variable, we can obtain 2 corresponding p values from the 2 data sets. For data set $k$, $k = 1, 2$, the p value $P^{(k)}$ follows a uniform distribution U[0, 1] under the null hypothesis $H_0$. Under the alternative hypothesis $H_a$, the p-value distribution may consist of various components (other than U[0, 1]). We use $I = \{1, 2, \ldots\}$ to denote the set of indices of the different nonnull distribution components. Generally, $I$ may contain many components ($|I| > 1$, where $|I|$ is the number of elements in $I$). We propose to approximate the mixture of the null component and the different nonnull components by 2 components: a null component and a single nonnull component. Under this approximation, there is an approximated proportion of true null hypotheses $\pi_0'$, which may differ from $\pi_0$ (however, if $|I| = 1$, then $\pi_0' = \pi_0$). Considering the moments of the p values, we have

$$E[P^{(1)}] = \pi_0' E[P^{(1)} \mid H_0] + (1 - \pi_0') E[P^{(1)} \mid H_a],$$

$$E[P^{(2)}] = \pi_0' E[P^{(2)} \mid H_0] + (1 - \pi_0') E[P^{(2)} \mid H_a],$$

$$E[P^{(1)} P^{(2)}] = \pi_0' E[P^{(1)} \mid H_0] E[P^{(2)} \mid H_0] + (1 - \pi_0') E[P^{(1)} \mid H_a] E[P^{(2)} \mid H_a].$$

Here $E[P^{(k)} \mid H_0]$, $E[P^{(k)} \mid H_a]$, and $E[P^{(k)}]$ are the expected values of the p value under the null, nonnull, and marginal distributions in data set $k$, $k = 1, 2$, respectively, and $E[P^{(1)} P^{(2)}]$ is the expected value of the product of $P^{(1)}$ and $P^{(2)}$ under the marginal joint distribution. Note that $E[P^{(k)} \mid H_0] = 1/2$ because the null distribution is known to be U[0, 1]. Furthermore, $E[P^{(1)} P^{(2)}]$, $E[P^{(1)}]$, and $E[P^{(2)}]$ can be estimated from the data (using the corresponding sample moments). Then, there are only 3 unknown parameters: $\pi_0'$, $E[P^{(1)} \mid H_a]$, and $E[P^{(2)} \mid H_a]$. With the above 3 equations, we can obtain the explicit formula

$$\pi_0' = \frac{E[P^{(1)} P^{(2)}] - E[P^{(1)}] E[P^{(2)}]}{E[P^{(1)} P^{(2)}] - E[P^{(2)}]/2 - E[P^{(1)}]/2 + 1/4}.$$
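A small numerical check may make the explicit formula more concrete. The sketch below builds the three population moments from an assumed null proportion and assumed nonnull mean p values, then solves for the proportion again. The function name `solve_pi0` and the example values 0.6, 0.2, and 0.3 are illustrative assumptions, not values from the study.

```python
def solve_pi0(e1, e2, e12):
    """Solve the three moment equations for the (approximated) null proportion.

    e1, e2 : marginal means E[P(1)], E[P(2)]
    e12    : mean of the product, E[P(1) P(2)]
    """
    numerator = e12 - e1 * e2
    denominator = e12 - e2 / 2 - e1 / 2 + 0.25
    return numerator / denominator

# Assumed example: 60% nulls, one nonnull component with mean p values a and b.
pi0_true, a, b = 0.6, 0.2, 0.3

# Population moments under a 2-component mixture (null mean p value is 1/2).
e1 = pi0_true * 0.5 + (1 - pi0_true) * a
e2 = pi0_true * 0.5 + (1 - pi0_true) * b
e12 = pi0_true * 0.25 + (1 - pi0_true) * a * b  # conditional independence

recovered = solve_pi0(e1, e2, e12)
print(recovered)
```

With a single nonnull component, the numerator and denominator share the factor (1/2 - a)(1/2 - b), which cancels and leaves the null proportion; this is why the nonnull means do not need to be known.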

Fig. 1. (a) Scatter plot with marginal histograms for paired p values based on 2 independently simulated data sets (see Section 2.1 for details), in which the grey and black dots represent variables consistent with the null and the alternative hypotheses, respectively, and the dashed lines represent the proportion of true null hypotheses. (b) An artificial example for the data division scheme (see Procedure 1 for details), in which grey and black colors represent the first and the second sub data sets, respectively. (c,d) Estimation results based on the microarray gene expression data sets for (c) the breast cancer and (d) the blood studies. N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the p-value histograms, the lines with different characters represent the original estimates from different methods. The boxplots are based on the bootstrap estimates from different methods.

The mathematical proof is given as Lemma 1 in supplementary material available at Biostatistics online. Therefore, an estimator for $\pi_0$ is proposed as

$$\hat{\pi}_0 = \max\left\{0, \min\left\{1, \frac{\sum_{j=1}^{m} P_j^{(1)} P_j^{(2)}/m - \left[\sum_{j=1}^{m} P_j^{(1)}/m\right]\left[\sum_{j=1}^{m} P_j^{(2)}/m\right]}{\sum_{j=1}^{m} P_j^{(1)} P_j^{(2)}/m - \sum_{j=1}^{m} P_j^{(1)}/(2m) - \sum_{j=1}^{m} P_j^{(2)}/(2m) + 1/4}\right\}\right\}, \qquad (2.1)$$

where $P_j^{(k)}$ is the calculated p value of the $j$th variable in data set $k$, $j = 1, 2, \ldots, m$, $k = 1, 2$. Boundary constraints are imposed since the proportion $\pi_0$ must be within [0, 1].

2.3 One data set

To estimate $\pi_0$ for a given data set, which contains $m$ variables and $g$ sample groups, we can first divide the data set into 2 parts and then use the method described above. The following procedure is proposed.

PROCEDURE 1
1) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
2) With a given test statistic $T$, calculate the p value for each part;
3) Repeat steps 1 and 2 for all variables and obtain the set of paired p values;
4) Use (2.1) to estimate $\pi_0$;
5) Repeat steps 1-4 $R$ times and obtain $R$ estimates of $\pi_0$;
6) Return the median of these $R$ estimates.

There may be complicated dependence structures among the different variables in the data set. We perform the data division step (step 1) separately for each variable to reduce the impact of these dependence structures (see Figure 1(b) for an illustration). Although the proposed method is moment based and does not require any independence assumptions, it is still necessary to reduce this impact so that the estimation can be more statistically efficient. Because different random divisions of the data set result in different estimates, we repeat steps 1-4 $R$ times to obtain a resample distribution of estimates. (In this study, we use $R = 25$. Based on some simulation studies [data not shown], 25 is an appropriate choice for the balance between estimation accuracy and computational burden.) The median is then reported for robustness.
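Procedure 1 together with estimator (2.1) can be sketched as follows. This is a hedged illustration rather than the author's R implementation: the data layout (one tuple of per-group observation lists per variable), the names `pi0_hat`, `split_in_two`, and `procedure1`, and the known-variance z-test used in place of the t-test are all assumptions made here to keep the example within the Python standard library.

```python
import math
import random
import statistics

def z_test_pvalue(x, y, sigma=1.0):
    """Two-sided z-test p value with known SD (stand-in for the t-test)."""
    se = sigma * math.sqrt(1.0 / len(x) + 1.0 / len(y))
    z = (sum(y) / len(y) - sum(x) / len(x)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def pi0_hat(p1, p2):
    """Estimator (2.1): plug in sample moments, then clip to [0, 1].

    Raises ZeroDivisionError in the degenerate case of a zero denominator.
    """
    m = len(p1)
    m1, m2 = sum(p1) / m, sum(p2) / m
    m12 = sum(a * b for a, b in zip(p1, p2)) / m
    ratio = (m12 - m1 * m2) / (m12 - m1 / 2 - m2 / 2 + 0.25)
    return max(0.0, min(1.0, ratio))

def split_in_two(obs, rng):
    """Randomly divide observations into 2 parts of (approximately) equal size."""
    shuffled = list(obs)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def procedure1(data, pvalue_fn, R=25, rng=None):
    """data: one tuple of per-group observation lists for each variable."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(R):                                           # step 5
        p1, p2 = [], []
        for groups in data:                                      # step 3
            halves = [split_in_two(g, rng) for g in groups]      # step 1
            p1.append(pvalue_fn(*[h[0] for h in halves]))        # step 2
            p2.append(pvalue_fn(*[h[1] for h in halves]))
        estimates.append(pi0_hat(p1, p2))                        # step 4
    return statistics.median(estimates)                          # step 6

# Usage on simulated data: 1500 variables, 10 samples per group, 60% nulls.
rng = random.Random(1)
data = []
for j in range(1500):
    mu = 1.5 if j < 600 else 0.0
    data.append(([rng.gauss(0.0, 1.0) for _ in range(10)],
                 [rng.gauss(mu, 1.0) for _ in range(10)]))
estimate = procedure1(data, z_test_pvalue, R=25, rng=rng)
print(round(estimate, 3))
```

Each variable is split independently inside the inner loop, which mirrors the per-variable division in step 1; splitting all variables with one shared partition of the samples would be a different scheme.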
2.4 Confidence interval

Theoretically, we can apply the delta method (Casella and Berger, 2002, p. 240) to obtain formulas for the large sample variance and confidence intervals. However, these formulas may be invalid because of complicated dependence structures among the variables in omics data. Therefore, we use the bootstrap method (Efron, 1979) to obtain confidence intervals. For QVALUE, BUM, SPLOSH, and LBE, we can simply repeat sampling p values and estimating $\pi_0$ $B$ times to obtain a resample distribution. For the proposed method, a resample distribution of estimates can be similarly obtained by the following procedure.

PROCEDURE 2
1) Run the following 3 steps $R$ times to obtain $R$ sets of paired p values:
   a) For a given variable, randomly divide its observations in each sample group into 2 parts with (approximately) equal sample sizes;
   b) With a given test statistic $T$, calculate the p value for each part;
   c) Repeat steps a and b for all variables and obtain the set of paired p values.
2) Sample $m$ integers $\{b_1, b_2, \ldots, b_m\}$ with replacement from the set $\{1, 2, \ldots, m\}$ with probabilities $\{1/m, 1/m, \ldots, 1/m\}$.
3) Perform the following 2 steps for each set of paired p values: form a new set by selecting the $\{b_1, b_2, \ldots, b_m\}$th paired p values; use (2.1) to estimate $\pi_0$.
4) Record the median of these $R$ estimates of $\pi_0$.
5) Return a resample distribution by repeating steps 2-4 $B$ times.

2.5 Approximation error

The proposed estimation method is derived based on the approximated proportion $\pi_0'$. It is necessary to study the approximation error. We can show that

$$\pi_0' = \pi_0 + \frac{\sum_{i<j} \pi_i \pi_j \{E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]\}\{E[P^{(2)} \mid H_i] - E[P^{(2)} \mid H_j]\}}{\sum_{i \in I} \pi_i \{E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]\}\{E[P^{(2)} \mid H_0] - E[P^{(2)} \mid H_i]\}}, \qquad (2.2)$$

where $\pi_i$ is the proportion of variables in the nonnull distribution component $i \in I$ and $E[P^{(k)} \mid H_i]$ is the expected value of the p value under that component. The mathematical proof is given as Lemma 2 in supplementary material available at Biostatistics online.

The approximation will be close if $E[P^{(k)} \mid H_i] \approx E[P^{(k)} \mid H_j]$ for all $i, j \in I$ and any $k = 1, 2$. An ideal case is that all p values from the alternative hypothesis follow only one distribution ($|I| = 1$). In this situation, we have $E[P^{(k)} \mid H_i] = E[P^{(k)} \mid H_j]$ for all $i, j \in I$ and any $k = 1, 2$, and therefore $\pi_0' = \pi_0$.

The approximation will also be close if $E[P^{(k)} \mid H_i] \approx 0$ for all $i \in I$ and any $k = 1, 2$. An ideal case is that the number of samples in each group goes to infinity, in which we have $E[P^{(k)} \mid H_i] \to 0$ for all $i \in I$ and any $k = 1, 2$, and therefore $\pi_0' \to \pi_0$.

The approximation error when the p values from the alternative hypothesis are heterogeneously distributed is discussed further in the remainder of this subsection.
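Equation (2.2) can be checked numerically against the explicit moment formula of Section 2.2. The sketch below does this for an assumed configuration with 2 nonnull components; the function names and the example proportions and mean p values are illustrative assumptions, and the component proportions are taken to be proportions among all variables, so that they sum with the null proportion to 1.

```python
def approx_pi0_from_moments(pi0, comps):
    """Approximated proportion via the explicit moment formula of Section 2.2.

    comps: list of (pi_i, a_i, b_i), where pi_i is the proportion of component i
    among all variables and a_i, b_i are its mean p values in data sets 1 and 2.
    """
    e1 = pi0 * 0.5 + sum(w * a for w, a, _ in comps)
    e2 = pi0 * 0.5 + sum(w * b for w, _, b in comps)
    e12 = pi0 * 0.25 + sum(w * a * b for w, a, b in comps)
    return (e12 - e1 * e2) / (e12 - e1 / 2 - e2 / 2 + 0.25)

def approx_pi0_from_2_2(pi0, comps):
    """Approximated proportion via equation (2.2)."""
    numerator = 0.0
    for i in range(len(comps)):
        for j in range(i + 1, len(comps)):       # sum over pairs i < j
            wi, ai, bi = comps[i]
            wj, aj, bj = comps[j]
            numerator += wi * wj * (ai - aj) * (bi - bj)
    denominator = sum(w * (0.5 - a) * (0.5 - b) for w, a, b in comps)
    return pi0 + numerator / denominator

# Assumed example: 60% nulls and 2 nonnull components of 20% each.
pi0 = 0.6
comps = [(0.2, 0.05, 0.05), (0.2, 0.25, 0.25)]

direct = approx_pi0_from_moments(pi0, comps)
via_2_2 = approx_pi0_from_2_2(pi0, comps)
print(round(direct, 6), round(via_2_2, 6))
```

In this example the two routes agree and the approximated proportion exceeds the true value 0.6, which illustrates the upward approximation error that arises when the nonnull components have different mean p values.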
If the number of samples in each group in the first data set is the same as the corresponding one in the second data set, then we have $E[P^{(1)} \mid H_i] = E[P^{(2)} \mid H_i]$ for all $i \in I$ and

$$\pi_0' - \pi_0 = \frac{\frac{1}{2} \sum_{i,j \in I} \pi_i \pi_j \{E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]\}^2}{\sum_{i \in I} \pi_i \{E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]\}^2} \ge 0.$$

Since moment estimators are generally asymptotically efficient, the estimator targets $\pi_0'$, and $\pi_0$ will therefore be asymptotically overestimated. An upper bound can be further derived:

$$\pi_0' - \pi_0 \le \frac{1 - \pi_0}{2} \times \frac{\max_{i,j \in I} \{|E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]|^2\}}{\min_{i \in I} \{|E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i]|^2\}} = \text{factor} \times \frac{\text{numerator}}{\text{denominator}}.$$

Based on this upper bound, the following conclusions can be drawn:

(i) The approximation error depends on the factor (the smaller the better). It will be small if $\pi_0 \approx 1$. The estimation bias will be larger if $\pi_0$ is closer to 0 (or if the proportion of differentially expressed genes is larger).

(ii) The approximation error depends on the numerator (the smaller the better). It will be small if $\max_{i,j \in I} \{|E[P^{(1)} \mid H_i] - E[P^{(1)} \mid H_j]|\} \approx 0$ or, equivalently, $E[P^{(1)} \mid H_i] \approx E[P^{(1)} \mid H_j]$ for all $i, j \in I$. This case has been discussed above.

(iii) The approximation error depends on the denominator (the larger the better). For p values from the alternative hypothesis, we have $0 < E[P^{(1)} \mid H_i] < 1/2$. Since $E[P^{(1)} \mid H_0] = 1/2$, we have $0 < E[P^{(1)} \mid H_0] - E[P^{(1)} \mid H_i] < 1/2$. Therefore, the approximation error will be small if $E[P^{(1)} \mid H_i] \approx 0$ for all $i \in I$. This case has also been discussed above.

3. SIMULATIONS AND APPLICATIONS

3.1 Comparison with other methods

A typical application of the proposed method is to estimate the proportion of differentially expressed genes in a given microarray gene expression data set. This proportion is $1 - \pi_0$. Therefore, it is equivalent to estimate $\pi_0$, which is the proportion of nondifferentially expressed genes. Many statistical methods have been proposed to estimate $\pi_0$, such as QVALUE (Storey and Tibshirani, 2003), BUM (Pounds and Morris, 2003), SPLOSH (Pounds and Cheng, 2004), and LBE (Dalmasso and others, 2005). In this section, we compare the proposed method with these existing statistical methods through simulations and applications. The simulations are conducted based on a microarray gene expression data set for a breast cancer study.

We use the 2-sample Student's t-test for hypothesis testing. For the experimental data set, we observe from quantile-quantile plots that the p values given by the t-distribution and the permutation procedure are consistent (data not shown). Therefore, we choose to use the t-distribution to assess p values because it gives unique results.

Statistical efficiencies can be compared in simulation studies since we know the truth. With a given $\pi_0$, we repeat the simulation and estimation procedures $B = 100$ times. Note that the proposed method requires much more computation time than these existing methods because of its repetition of random data division ($R = 25$). Although $B = 100$ is a relatively small number, it is adequate to compare the performance of different methods.
The root mean square error (RMSE), Bias, and standard deviation (SD) are used to compare different methods (estimators) including the proposed one. For an estimator $\hat{\pi}_0$, let $\hat{\pi}_0^{(i)}$ be the calculated estimate in the $i$th simulation. The Bias, SD, and RMSE are defined as

$$\mathrm{Bias}(\hat{\pi}_0) = \sum_{i=1}^{B} \hat{\pi}_0^{(i)}/B - \pi_0, \quad \mathrm{SD}(\hat{\pi}_0) = \sqrt{\sum_{i=1}^{B} \left[\hat{\pi}_0^{(i)} - \sum_{i'=1}^{B} \hat{\pi}_0^{(i')}/B\right]^2 / (B - 1)}, \quad \mathrm{RMSE}(\hat{\pi}_0) = \sqrt{\mathrm{SD}^2 + \mathrm{Bias}^2}.$$

3.2 Simulation studies

Configurations. In general, there are complicated dependence structures in a microarray gene expression data set. Therefore, we conduct the following simulation studies with covariance matrices constructed from a microarray gene expression data set (the first data set in Section 3.3). A gene expression data set is simulated with $m = 3000$ genes and 2 sample groups with sample sizes $n_1 = n_2 = 10$ (simulation studies 1 and 2) or 50 (simulation study 3). Data are simulated from normal distributions with an assumed proportion $1 - \pi_0$ of differentially expressed genes. Genes are grouped into 30 blocks with 100 genes in each block. For each block, we randomly select 100 genes from the experimental data set and calculate the correlation matrices $\Sigma_1$ and $\Sigma_2$ in the first and the second groups, respectively. For blocks of differentially expressed genes, we simulate data from the normal distributions $N(\mathbf{0}, \Sigma_1)$ and $N(\boldsymbol{\mu}, \Sigma_2)$ for the first and the second sample groups, respectively. For the remaining blocks, we simulate data from the normal distributions $N(\mathbf{0}, \Sigma_1)$ and $N(\mathbf{0}, \Sigma_2)$ for the first and the second sample groups, respectively. Here, $\mathbf{0}$ and $\boldsymbol{\mu}$ are (random) vectors. For each configuration, we repeat the simulation and estimation procedures $B = 100$ times.

Different statistical methods are used to estimate $\pi_0$. We run QVALUE, BUM, SPLOSH, and LBE with their default settings. For the proposed method, we divide each sample group into 2 parts with equal sample sizes: (5, 5)/(5, 5) for simulation studies 1 and 2, and (25, 25)/(25, 25) for simulation study 3.
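The Bias, SD, and RMSE defined earlier in this section can be computed from a list of simulation estimates as in the following sketch. The function name `summarize` and the three example estimates are assumptions made for illustration; they are not results from the study.

```python
import math

def summarize(estimates, truth):
    """Return (Bias, SD, RMSE) as defined in the text.

    SD uses the divisor B - 1, and RMSE is sqrt(SD^2 + Bias^2).
    """
    B = len(estimates)
    mean = sum(estimates) / B
    bias = mean - truth
    sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (B - 1))
    rmse = math.sqrt(sd ** 2 + bias ** 2)
    return bias, sd, rmse

# Hypothetical estimates from B = 3 simulations with true proportion 0.6.
bias, sd, rmse = summarize([0.58, 0.62, 0.66], truth=0.6)
print(round(bias, 4), round(sd, 4), round(rmse, 4))
```

Because the SD here uses the divisor B - 1, this RMSE differs slightly from the square root of the average squared error, which would use the divisor B; the difference is negligible for B = 100.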

Fig. 2. Estimation results based on the simulation studies 1 (left panel), 2 (middle panel), and 3 (right panel). The RMSEs (a-c), Biases (d-f), and SDs (g-i) (y-axes) against the true proportions (x-axes) are plotted. The solid lines with black dots, solid, dashed, dotted, and dot-dashed lines represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively. In the boxplots (j-l) of estimated proportions, N, Q, B, S, and L represent the proposed method, QVALUE, BUM, SPLOSH, and LBE, respectively, and the dashed lines represent the true value.

The results are summarized in Figure 2, in which RMSE, Bias, and SD are compared. We also compare boxplots of the estimation results from different methods when $\pi_0 = 0.6$.

Results. The first simulation study considers the situation in which there is only one p-value distribution component for differentially expressed genes. We fix $\mu = 1.5$ and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$. Generally, the sample size of a microarray data set is relatively small. Therefore, we set $n_1 = n_2 = 10$. As shown in Figure 2, for $\pi_0$ around 0.2, only BUM gives smaller RMSEs than the proposed method. For other values of $\pi_0$, the proposed method gives the lowest RMSEs. Note that the behavior of BUM is not stable. It gives the highest RMSEs when $\pi_0$ [...] or $\pi_0$ [...]. For different values of $\pi_0$, the proposed method consistently gives relatively low biases and the second lowest SDs.

The second simulation study considers a general situation in which the p values of differentially expressed genes may follow different distribution components. We randomly sample $\mu$ from a uniform distribution U[1, 2] and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$ and $n_1 = n_2 = 10$. As shown in Figure 2, for $\pi_0 > 0.3$, the proposed method gives the lowest RMSEs. For $\pi_0$ around 0.2, only BUM gives lower RMSEs than the proposed method. Note again that the behavior of BUM is not stable. It gives the highest RMSEs when $\pi_0 > 0.3$. For $\pi_0$ around 0.1, QVALUE gives the lowest RMSEs, and the proposed method gives slightly higher RMSEs. For different values of $\pi_0$, the proposed method consistently gives relatively low biases and the second lowest SDs.

The third simulation study considers the situation in which the sample size of a microarray data set is relatively large. Therefore, we set $n_1 = n_2 = 50$. We still consider a general situation in which the p values of differentially expressed genes may follow different distribution components. We randomly sample $\mu$ from a uniform distribution U[1, 2] and let $\pi_0 = 0.1, 0.2, \ldots, 0.9$. As shown in Figure 2, the proposed method always gives the lowest RMSEs and biases and the second lowest SDs for different values of $\pi_0$.

Other simulations. Simulations for other configurations are also considered. Generally, the proposed method performs comparably well. However, if the sample size is very small (e.g. <8), the proposed method performs poorly. This is not surprising.
If the sample size of a given data set is very small, then the sample size of a divided subset will be even smaller, which significantly reduces the power to detect differential expression. This fact has also been discussed by Pawitan and others (2005a,b). Therefore, while enjoying the model identifiability gained through data division, we lose some statistical efficiency in estimation.

3.3 Applications

The above theoretical and simulation studies show the favorable performance of the proposed method, especially when (i) the sample size is relatively large, (ii) the p values from the alternative hypothesis are homogeneously distributed, or (iii) the proportion of differentially expressed genes is relatively small. In practice, it is difficult to find a microarray data set for the second or the third situation. However, there are many microarray data sets with relatively large sample sizes.

We consider 2 data sets for applications. The first one is the well-known microarray gene expression data set for a breast cancer study. Hedenfalk and others (2001) used microarrays to compare 3226 gene expression profiles between 7 BRCA1 samples and 8 BRCA2 samples. The data set is publicly available online. A total of 56 genes were filtered out because they had one or more expression measurements exceeding 20, which were considered not trustworthy (Storey and Tibshirani, 2003). Therefore, 3170 gene expression measurements for 15 samples are used in this study. The second data set has a relatively large sample size. Wiestner and others (2003) used lymphochips to compare gene expression profiles between 79 Ig-mutated and 28 Ig-unmutated samples with chronic lymphocytic leukemia. The data set is publicly available online. We use the k-nearest neighbors method (R package impute; Troyanskaya and others, 2001) to impute the missing values in the data set.

We use different statistical methods to estimate $\pi_0$. QVALUE, BUM, SPLOSH, and LBE are run with their default settings.
For the proposed method, we divide the data set into 2 subsets: (3, 4)/(4, 4) for the first data set and (39, 14)/(40, 14) for the second data set. We bootstrap $B = 1000$ times to obtain the resample distributions of estimates (see Section 2 for details). Since the p values from the null hypothesis follow a uniform distribution U[0, 1], $\pi_0$ is expected to lie under the curve of the underlying empirical p-value distribution. ($f(p) = \pi_0 f_0(p) + (1 - \pi_0) f_1(p) = \pi_0 + (1 - \pi_0) f_1(p) \ge \pi_0$, where $f_0$, $f_1$, and $f$ are the null, nonnull, and marginal distributions of the p value, respectively.)

For the first data set, Figure 1(c) shows a histogram of p values and boxplots to compare estimates from different methods. Only the proposed method and BUM give estimates under the histogram. The proposed method gives the smallest estimated $\pi_0$. Among these 5 methods, BUM gives a relatively small variance and the other 4 give comparatively high variances. However, from the simulation studies (e.g. boxplots in Figure 2), some confidence intervals given by BUM do not contain the true value and are not meaningful. Therefore, the proposed method may give more reliable estimation results.

For the second data set, Figure 1(d) shows a histogram of p values and boxplots to compare estimates from different methods. Not only does the proposed method give the smallest estimates, but its whole boxplot is also under the histogram. Furthermore, its variance is relatively small among these 5 methods.

In the above simulation studies and applications, the variances of BUM are always the lowest among these 5 estimation methods. This comes from the simple model of BUM: the mixture of a beta distribution and a uniform distribution. However, it is difficult to validate this model in practice.

4. DISCUSSION

In the problem of estimating the proportion of true null hypotheses, the number of variables is the sample size of the study. Microarrays and other high-throughput technologies enable us to collect measurements for a large number of variables.
With these data, moment-based estimation methods can be considered, because they are generally asymptotically efficient. In this study, we proposed a moment-based estimation method coupled with sample splitting and discussed its theoretical properties. The simulation studies and the applications to microarray data showed the favorable performance of the proposed method when it was compared with other existing methods.

Since the t-test requires at least 2 samples in each group, the proposed method cannot be applied when a group sample size is less than 4. In such a situation, other statistical methods, such as QVALUE, should be considered. From the above analyses, we observe that each method achieves its optimal performance only in certain situations. New methods for estimating $\pi_0$ are being proposed (Langaas and others, 2005). It is necessary to conduct more comprehensive reviews and systematic comparisons of different $\pi_0$-estimation methods.

We recently proposed a likelihood-based method coupled with an EM algorithm for estimating $\pi_0$ (Lai, 2006). Random data division was also used to achieve model identifiability. Through simulations and applications to microarray gene expression data, we showed the favorable performance of this method (Lai, 2006). However, there are 2 disadvantages: (i) The method is likelihood based and assumes independence among different genes, which is unlikely to be true because genes interact with each other during cellular processes. (ii) The method uses an EM algorithm, which may provide unreliable estimation when the likelihood function is not regular. The moment-based method proposed in this study requires no independence assumption. In addition to its favorable performance, it is completely nonparametric and simple, with an explicit formula that gives a unique solution.

A future research topic is to generalize the proposed method so that estimation efficiency can be further improved.
As shown in the simulation studies, the estimation variance tends to increase when the true proportion increases (Figure 2). In the second simulation study for the heterogeneous alternative, there is a considerable estimation bias when the true proportion is relatively small (Figure 2). It is necessary to pursue both theoretical and simulation studies so that more efficient estimation methods can be developed.

ACKNOWLEDGMENTS

I am grateful to Prof. Tapan Nayak, the editors, associate editors, and the anonymous reviewers for their helpful comments and suggestions. This work was partially supported by a start-up fund from the George Washington University and the National Institutes of Health grant DK. The R codes are available at ylai/research/rdpm. Conflict of Interest: None declared.

REFERENCES

ALLISON, D. B., GADBURY, G. L., HEO, M., FERNANDEZ, J. R., LEE, C.-K., PROLLA, T. A. AND WEINDRUCH, R. (2002). A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 39.

BENJAMINI, Y. AND HOCHBERG, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57.

CASELLA, G. AND BERGER, R. L. (2002). Statistical Inference, 2nd edition. Pacific Grove, CA: Duxbury.

DALMASSO, C., BROËT, P. AND MOREAU, T. (2005). A simple procedure for estimating the false discovery rate. Bioinformatics 21.

EFRON, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7.

HEDENFALK, I., DUGGAN, D., CHEN, Y., RADMACHER, M., BITTNER, M., SIMON, R., MELTZER, P., GUSTERSON, B., ESTELLER, M., KALLIONIEMI, O. P. and others (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine 344.

JUNG, S.-H. (2005). Sample size for FDR-control in microarray data analysis. Bioinformatics 21.

LAI, Y. (2006). A statistical method for estimating the proportion of differentially expressed genes. Computational Biology and Chemistry 30.

LANGAAS, M., LINDQVIST, B. H. AND FERKINGSTAD, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. Journal of the Royal Statistical Society, Series B 67.

PAWITAN, Y., MICHIELS, S., KOSCIELNY, S., GUSNANTO, A. AND PLONER, A. (2005a). False discovery rate, sensitivity and sample size for microarray studies. Bioinformatics 21.

PAWITAN, Y., MURTHY, K. R. K., MICHIELS, S. AND PLONER, A. (2005b). Bias in the estimation of false discovery rate in microarray studies. Bioinformatics 20.

POUNDS, S. AND CHENG, C. (2004). Improving false discovery rate estimation. Bioinformatics 20.

POUNDS, S. AND MORRIS, S. W. (2003). Estimating the occurrence of false positives and false negatives in microarray studies by approximating and partitioning the empirical distribution of p-values. Bioinformatics 19.

STOREY, J. D. AND TIBSHIRANI, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America 100.

TADESSE, M. G., IBRAHIM, J. G., VANNUCCI, M. AND GENTLEMAN, R. (2005). Wavelet thresholding with Bayesian false discovery rate control. Biometrics 61.

TROYANSKAYA, O., CANTOR, M., SHERLOCK, G., BROWN, P., HASTIE, T., TIBSHIRANI, R., BOTSTEIN, D. AND ALTMAN, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17.

WANG, S.-J. AND CHEN, J. J. (2004). Sample size for identifying differentially expressed genes in microarray experiments. Journal of Computational Biology 11.

WIESTNER, A., ROSENWALD, A., BARRY, T. S., WRIGHT, G., DAVIS, R. E., HENRICKSON, S. E., ZHAO, H., IBBOTSON, R. E., ORCHARD, J. A., DAVIS, Z. and others (2003). ZAP-70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood 101.

WU, B., ABBOTT, T., FISHMAN, D., MCMURRAY, W., MOR, G., STONE, K., WARD, D., WILLIAMS, K. AND ZHAO, H. (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19.

[Received August 4, 2006; revised January 5, 2007; accepted for publication January 17, 2007]


More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling

Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Test (2008) 17: 461 471 DOI 10.1007/s11749-008-0134-6 DISCUSSION Rejoinder on: Control of the false discovery rate under dependence using the bootstrap and subsampling Joseph P. Romano Azeem M. Shaikh

More information

Comments on: Control of the false discovery rate under dependence using the bootstrap and subsampling

Comments on: Control of the false discovery rate under dependence using the bootstrap and subsampling Test (2008) 17: 443 445 DOI 10.1007/s11749-008-0127-5 DISCUSSION Comments on: Control of the false discovery rate under dependence using the bootstrap and subsampling José A. Ferreira Mark A. van de Wiel

More information

STATISTICS SYLLABUS UNIT I

STATISTICS SYLLABUS UNIT I STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution

More information

False discovery rates: a new deal

False discovery rates: a new deal Biostatistics (2017) 18, 2,pp. 275 294 doi:10.1093/biostatistics/kxw041 Advance Access publication on October 17, 2016 False discovery rates: a new deal MATTHEW STEPHENS Department of Statistics and Department

More information

discovery rate control

discovery rate control Optimal design for high-throughput screening via false discovery rate control arxiv:1707.03462v1 [stat.ap] 11 Jul 2017 Tao Feng 1, Pallavi Basu 2, Wenguang Sun 3, Hsun Teresa Ku 4, Wendy J. Mack 1 Abstract

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

Modification and Improvement of Empirical Likelihood for Missing Response Problem

Modification and Improvement of Empirical Likelihood for Missing Response Problem UW Biostatistics Working Paper Series 12-30-2010 Modification and Improvement of Empirical Likelihood for Missing Response Problem Kwun Chuen Gary Chan University of Washington - Seattle Campus, kcgchan@u.washington.edu

More information

KANSAS STATE UNIVERSITY Manhattan, Kansas

KANSAS STATE UNIVERSITY Manhattan, Kansas SEMIPARAMETRIC MIXTURE MODELS by SIJIA XIANG M.S., Kansas State University, 2012 AN ABSTRACT OF A DISSERTATION submitted in partial fulfillment of the requirements for the degree DOCTOR OF PHILOSOPHY Department

More information

Tweedie s Formula and Selection Bias. Bradley Efron Stanford University

Tweedie s Formula and Selection Bias. Bradley Efron Stanford University Tweedie s Formula and Selection Bias Bradley Efron Stanford University Selection Bias Observe z i N(µ i, 1) for i = 1, 2,..., N Select the m biggest ones: z (1) > z (2) > z (3) > > z (m) Question: µ values?

More information

Simultaneous Testing of Grouped Hypotheses: Finding Needles in Multiple Haystacks

Simultaneous Testing of Grouped Hypotheses: Finding Needles in Multiple Haystacks University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2009 Simultaneous Testing of Grouped Hypotheses: Finding Needles in Multiple Haystacks T. Tony Cai University of Pennsylvania

More information

Inverse Sampling for McNemar s Test

Inverse Sampling for McNemar s Test International Journal of Statistics and Probability; Vol. 6, No. 1; January 27 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Inverse Sampling for McNemar s Test

More information

Lecture on Null Hypothesis Testing & Temporal Correlation

Lecture on Null Hypothesis Testing & Temporal Correlation Lecture on Null Hypothesis Testing & Temporal Correlation CS 590.21 Analysis and Modeling of Brain Networks Department of Computer Science University of Crete Acknowledgement Resources used in the slides

More information