Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates


September 4, 2003

Xiangqin Cui, J. T. Gene Hwang, Jing Qiu, Natalie J. Blades, and Gary A. Churchill

The Jackson Laboratory, Bar Harbor, Maine, U.S.A.
Department of Statistical Science, Cornell University, Ithaca, NY, U.S.A.
Department of Mathematics, Cornell University, Ithaca, NY, U.S.A.

Corresponding author: Gary A. Churchill, The Jackson Laboratory, 600 Main Street, Bar Harbor, Maine, U.S.A.

Abstract

Combining information across genes in the statistical analysis of microarray data is desirable because of the relatively small number of data points obtained for each individual gene. Here we develop an estimator of the error variance that can borrow information across genes using the James-Stein-Lindley shrinkage concept. A new test statistic ($F_S$) is constructed using this estimator. The new statistic is compared with other statistics used to test for differential expression, namely the gene-specific F test ($F_1$), the pooled-variance F statistic ($F_3$), and a hybrid statistic ($F_2$) that uses the average of the individual and pooled variances. The $F_S$ test shows the best or nearly the best power for detecting differentially expressed genes over a wide range of simulated data in which the variance components associated with individual genes are either homogeneous or heterogeneous. Thus $F_S$ provides a powerful and robust approach to testing differential expression of genes that utilizes information not available in individual gene testing approaches and does not suffer from the biases of the pooled variance approach.

Keywords: shrinkage estimator, ANOVA model, linear mixed model, F statistic, permutation, variance components

1 Introduction

Microarray technology has become an important tool for simultaneously screening thousands of genes for changes in their patterns of expression. The large amount of data generated by microarray technology is due mainly to the large number of genes represented on the array. For each gene the number of RNA samples assayed is typically small. Therefore, the commonly used approach of testing for differential expression one gene at a time often has low power [1]. Assuming that all of the variances are equal and using the common variance estimator for testing all genes can substantially increase the power to detect differential expression [2], but at the risk of generating false detections when the common variance assumption is not true.
Cui and Churchill (2003) reviewed some methods for testing differential expression of genes in microarray experiments. In addition, they defined three test statistics based

on an analysis of variance (ANOVA) model. (Note that an ANOVA F test compares an estimate of variation across conditions to an estimate of error variance. The t-test is a special case when the number of conditions is two.) One test ($F_1$) uses only data from individual genes and is in fact the classical F statistic when testing is carried out one gene at a time. Another ($F_3$) assumes a common error variance across genes and uses a pooled estimator of the common variance. The third ($F_2$) reaches a compromise by using an average of gene-specific and pooled variance estimates. When applied to real and simulated data, the $F_2$ test seems to work well; however, we find it hard to justify taking the simple average of variance estimates.

The idea of modifying estimators of variance has been presented by others in similar contexts. The SAM t-test [4] adds a small constant to the gene-specific variance estimate to stabilize the small variances. The regularized t-test proposed by Baldi and Long [5] replaces the usual variance estimate with a Bayesian estimator based on a hierarchical prior distribution. Lönnstedt and Speed (2002) proposed an Empirical Bayes approach to testing that combines information across genes. Newton and Kendziorski [7, 8] considered a hierarchical gamma-gamma model to combine information across genes. Each of the Bayesian approaches uses hierarchical models with relatively strong prior assumptions about the distributions of the individual variances.

In this paper we propose a shrinkage variance estimator that makes no prior assumptions about the distribution of variances across genes. It is based on the James-Stein-Lindley estimator [9], and we use it to construct a test statistic called $F_S$. We show that the test based on $F_S$ has the highest or nearly the highest power among various F-like statistics and that it is robust, performing well under a wide range of assumptions about variance heterogeneity.
It behaves well when the variances are truly constant as well as when they vary extensively from gene to gene.

In section 2, we describe how to obtain the shrinkage estimator of variance components that provides gene-specific variances but also uses information across all of the genes in the data to improve estimation. In section 3, we show how to use shrinkage estimators of variances to construct F-like statistics for differential expression of genes in the context of the mixed model analysis of variance. In section 4, we validate the properties of the tests based on these statistics using simulations and real data. We

simulate a canonical case to consider the problem in its most general and abstracted form. We then look at simulations of a simple microarray experiment comparing 5 samples and a more complex microarray experiment with biological replicates (data available at [URL missing]).

2 Shrinking Variance Estimators

In this section, we show how to construct estimators of variance from an ensemble of data that shrink individual variance estimators toward the common (corrected geometric) mean. The amount of shrinkage depends on the variability of the individual variance estimators. When individual variance estimates are similar, indicating homogeneity, the shrinkage estimator effectively pools these estimates. When individual variance estimates are widely dispersed, indicating heterogeneity, the shrinkage estimator gives greater weight to the gene-specific contributions. The key result of this section is the expression in equation (3) below.

Let $X_g$ be the residual sum of squared errors (denoted by SSE) and $\sigma_g^2$ be the true variance of gene $g$. For $g = 1, \ldots, G$ (the number of genes), it is assumed that the $X_g/\sigma_g^2$ are independent, each having a chi-squared distribution with $\nu$ degrees of freedom. Such a random variable will be denoted by $\chi^2_\nu$. Therefore, we have $X_g = \sigma_g^2 \chi^2_\nu$. We take a natural logarithmic transformation of $X_g$ to obtain a common location problem:

$$\ln\frac{X_g}{\nu} = \ln\sigma_g^2 + \ln\frac{\chi^2_\nu}{\nu}. \tag{1}$$

Hence, if we denote the mean of $\ln(\chi^2_\nu/\nu)$ by $m$, then by subtracting $m$ from both sides we can write equation (1) as

$$X'_g = \ln\sigma_g^2 + \epsilon_g,$$

where $X'_g = \ln(X_g/\nu) - m$ and $\epsilon_g = \ln(\chi^2_\nu/\nu) - m$. Let $V$ be the variance of $\epsilon_g$. By a first-order Taylor expansion in equation (1), $\mathrm{Var}(\ln(\chi^2_\nu/\nu)) \approx \mathrm{Var}(\chi^2_\nu/\nu) = 2/\nu$. In Table 1, we give the ratio of $V$ to $2/\nu$, which eventually converges to one. When applied to

$X'_g$ ($1 \le g \le G$) in estimating $\ln\sigma_g^2$ ($1 \le g \le G$), the traditional James-Stein-Lindley estimator that shrinks toward the common mean $\bar{X}' = \sum_{g} X'_g / G$ is

$$\bar{X}' + \left(1 - \frac{(G-3)V}{\sum_{g}(X'_g - \bar{X}')^2}\right)_+ (X'_g - \bar{X}'), \tag{2}$$

where for any number $a$, $a_+$ denotes $\max(a, 0)$. The truncation enacted by the $+$ is necessary to avoid overshrinking. Transformation back to the original scale gives the shrinkage estimator for $\sigma_g^2$,

$$\tilde{\sigma}_g^2 = \left(\prod_{g=1}^{G} (X_g/\nu)\right)^{1/G} B \exp\left[\left(1 - \frac{(G-3)V}{\sum_{g}(\ln X_g - \overline{\ln X_g})^2}\right)_+ (\ln X_g - \overline{\ln X_g})\right], \tag{3}$$

where $\overline{\ln X_g} = \frac{1}{G}\sum_{g=1}^{G} \ln(X_g)$, and $B = \exp(-m)$ is a bias correction. Note that multiplying the geometric mean $\left(\prod_{g=1}^{G}(X_g/\nu)\right)^{1/G}$ by $B$ gives an unbiased estimator of $\sigma^2$ when $\sigma_g^2 = \sigma^2$ for all $g$. The values of $B$ (and also $V$) depend on $\nu$. They can be simulated easily, and values are given in Table 1. Note that $B$ is always larger than one; hence, the geometric mean without $B$ underestimates $\sigma^2$ when all $\sigma_g^2$ are equal to $\sigma^2$.

A Taylor expansion applied to the inverse log transformed estimator in equation (3) demonstrates that it is similar to Ghosh et al.'s estimator [10] (derivation not shown). If the collection of all $X_g$ ($g = 1, \ldots, G$) is represented by $X$, it has been shown that Ghosh et al.'s estimator dominates $X/(\nu + 2)$, which is better than $X/\nu$, the collection of individual variance estimators, according to the sum of squared invariant losses [10]. This provides a theoretical foundation suggesting that the estimator in equation (3) may work well as an estimator of variance. Extensive comparisons among several variations on this estimator show that the version presented here behaves best not only as an estimator of variance but also in the construction of test statistics as described in section 3. In particular, the estimators in (3) provide a test statistic with better performance than similar statistics based on the Ghosh et al.
(1984) estimator.

3 Constructing F-like Statistics

To illustrate how to construct F-like statistics using different variance estimators, we start with the general F statistic for a general linear mixed model and then introduce

the statistics based on shrinkage estimators. A general linear mixed model [11] can be written as

$$Y = X\beta + Zu + \epsilon, \tag{4}$$

where $Y$ is the vector of observations, $X$ is the design matrix of the fixed effects $\beta$, $Z$ is the design matrix of the random effects $u$, and $\epsilon$ is the vector of residuals. The variances of the random effects $u$ and residuals $\epsilon$ in equation (4) can be estimated using the restricted maximum likelihood (REML) method [11]. Estimates of the corresponding fixed effects ($\hat{\beta}$) and predictions of the random effects ($\hat{u}$) can be obtained through generalized least squares using the estimated variance components [11, 12]. The variance-covariance matrix of $\hat{\beta}$ and $\hat{u}$ can be computed as

$$\hat{C} = \begin{bmatrix} X^{\top}\hat{R}^{-1}X & X^{\top}\hat{R}^{-1}Z \\ Z^{\top}\hat{R}^{-1}X & Z^{\top}\hat{R}^{-1}Z + \hat{G}^{-1} \end{bmatrix}^{-}, \tag{5}$$

where the superscript $-$ denotes a generalized inverse, $\hat{R}$ is a matrix with the estimates of the residual variances on the diagonal and 0 elsewhere, and $\hat{G}$ is a matrix with the variance components estimated for the random effects $u$ on the diagonal and 0 elsewhere. Linear combinations of the fixed effects (denoted by $L$) in equation (4) can then be tested using an F statistic [13] constructed as

$$F = \frac{\hat{\beta}^{\top} L\,(L^{\top}\hat{C}L)^{-1} L^{\top}\hat{\beta}}{\mathrm{rank}(L)}. \tag{6}$$

When a linear mixed model is fit to microarray data one gene at a time, the design matrices $X$ and $Z$ are the same for all genes. Therefore, the general linear mixed model for gene $g$ can be expressed as

$$Y = X\beta_g + Zu_g + \epsilon_g. \tag{7}$$

The statistic defined in equation (6) can then be used to test the fixed effect $\beta_g$ directly for each gene, which is what is called a gene-specific F test ($F_1$) [14]. Because the variance components in this test are estimated based on the information from only one gene, the power of the test is likely to be low in experiments with only a few RNA samples. Other F-like statistics ($F_2$ and $F_3$) defined by Cui and Churchill [3] can

borrow information from other genes for estimating the variance components. $F_3$ uses the pooled variance estimator $\hat{\sigma}^2_{pool}$ for each variance component. For balanced designs, $\hat{\sigma}^2_{pool}$ is an average across genes of the individual variance estimates. $F_2$ uses the average of $\hat{\sigma}_g^2$ and $\hat{\sigma}^2_{pool}$ for each component. We define a new F-like statistic ($F_S$), which uses $\tilde{\sigma}_g^2$ from the shrinkage estimator in equation (3) as the variance component estimator for each gene. The variance component estimators are then used in equations (5) and (6) to compute the corresponding F statistics. Therefore, if we define the magnitude of the effects to be tested, such as the sum of squares of relative expression levels of the samples in experiments comparing gene expression among multiple samples, as $\Delta_g$, the four F tests for the $g$th gene can be written as

$$F_1 = \frac{\Delta_g}{\hat{\sigma}_g^2}, \quad F_2 = \frac{\Delta_g}{\tfrac{1}{2}(\hat{\sigma}_g^2 + \hat{\sigma}^2_{pool})}, \quad F_3 = \frac{\Delta_g}{\hat{\sigma}^2_{pool}}, \quad F_S = \frac{\Delta_g}{\tilde{\sigma}_g^2}. \tag{8}$$

The justification for choosing one of these four statistics depends on what we are willing to assume about the variability of the variances across genes. If all variance components are constant across genes (homogeneous variances), then $F_3$ is the right statistic. If the variance components are gene specific (heterogeneous variances), then $F_1$ is the right statistic. Statistics like $F_2$ and $F_S$ may be more efficient when there is limited information to estimate the gene-specific variance components. Comparisons of these tests in different situations are described in section 4.

For simple microarray experiments, fixed effects ANOVA models, a special case of the general linear mixed model with empty $Z$ and $u$ in equation (4), can be used for modeling and computational convenience. The error variance for each gene can be estimated using the residual mean square error (MSE), which is the residual sum of squared errors (SSE) divided by its degrees of freedom ($\nu$).
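As an illustration, the shrinkage estimator of equation (3) and the four statistics of equation (8) might be computed for a fixed-effects ANOVA model roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' software: the function names are hypothetical, `delta` stands for the effect magnitude in the numerator of equation (8), `sse` holds the per-gene residual sums of squares $X_g$, and the constants $m$ and $V$ are obtained by simulation rather than read from Table 1.

```python
import math
import random

def log_chi2_moments(nu, n_draws=100_000, seed=0):
    """Simulate m = E[ln(chi2_nu / nu)] and V = Var[ln(chi2_nu / nu)]."""
    rng = random.Random(seed)
    # A chi-squared variable with nu degrees of freedom is Gamma(nu/2, scale=2).
    draws = [math.log(rng.gammavariate(nu / 2.0, 2.0) / nu) for _ in range(n_draws)]
    m = sum(draws) / n_draws
    V = sum((d - m) ** 2 for d in draws) / (n_draws - 1)
    return m, V

def shrink_variances(sse, nu, m, V):
    """Shrinkage variance estimates in the form of equation (3)."""
    G = len(sse)
    log_x = [math.log(x) for x in sse]
    mean_log = sum(log_x) / G
    ss = sum((lx - mean_log) ** 2 for lx in log_x)
    # Positive-part shrinkage factor; identical SSEs (ss == 0) pool completely.
    factor = max(1.0 - (G - 3) * V / ss, 0.0) if ss > 0 else 0.0
    # Geometric mean of the gene-specific MSEs X_g / nu, times B = exp(-m).
    center = math.exp(mean_log - math.log(nu)) * math.exp(-m)
    return [center * math.exp(factor * (lx - mean_log)) for lx in log_x]

def f_like_statistics(delta, sse, nu, m, V):
    """F1, F2, F3 and FS of equation (8), with MSE-based denominators."""
    mse = [x / nu for x in sse]
    pooled = sum(mse) / len(mse)  # average of the individual estimates
    shrunk = shrink_variances(sse, nu, m, V)
    return {
        "F1": [d / s for d, s in zip(delta, mse)],
        "F2": [d / (0.5 * (s + pooled)) for d, s in zip(delta, mse)],
        "F3": [d / pooled for d in delta],
        "FS": [d / s for d, s in zip(delta, shrunk)],
    }
```

Two limiting cases may help when checking such a sketch: when all SSEs are identical, every gene receives the bias-corrected geometric mean, and when the SSEs are widely dispersed the factor approaches one, so each estimate approaches $B$ times the gene's own MSE.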
Thus, the denominators of $F_1$, $F_2$, $F_3$, and $F_S$ in equation (8) can be estimated based on these MSEs across the genes.

The null distributions of the modified F statistics are not readily available. The $F_1$ test for a fixed effect ANOVA model has a standard F distribution, and critical values could be obtained from the F tables under typical distributional assumptions; however,

when mixed effects ANOVA models are used, the $F_1$ in equation (8) does not strictly follow the F distribution, although a conservative approximation can be obtained [13]. Since $F_2$, $F_3$, and $F_S$ are not standard F statistics, their null distributions have to be established by permutation analysis [15]. In fact, because distributional assumptions are sometimes questionable in practice, it may be prudent to establish all critical values by permutation analysis.

Permutation is a nonparametric approach to establishing the null distribution of a test statistic. The key idea of permutation is to identify units that are exchangeable under the null hypothesis. In microarray experiments, if we allow for gene-specific variance heterogeneity, then the unit must be whole arrays. The array to be shuffled will depend on the design of the experiment and the factor(s) being tested. Two-color arrays are slightly more complex than single-color systems, as the pairing between the two channels of the array must be maintained in the permuted units. To execute the permutation analysis we generate random shuffles ($p = 1, \ldots, P$) of these units and compute a new set of statistics $F_g^{(p)}$ ($g = 1, \ldots, G$) for each shuffle. A common threshold for test statistics is established using percentiles of the entire collection of $F_g^{(p)}$ values over indices $p$ and $g$. Because of the heavy computational demand, we permute only 100 times, which requires about an hour on a 32-node Beowulf cluster for a 2000-gene experiment with 30 arrays. Computationally efficient methods are under investigation.

4 Simulation Studies

To compare the power of tests based on each of the four F statistics, we first simulated an abstracted canonical form and then simulated microarray experiments based on estimated parameters from real data sets. The first microarray experiment is based on a simple experiment comparing 5 samples and the second is based on a more complex experiment with biological replications.
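The permutation procedure described in section 3 can be sketched as follows. This is a simplified Python illustration under several assumptions: each exchangeable unit contributes one value per gene (in a two-color experiment the unit would be a whole array with both channels kept together), the function names are hypothetical, and the classical one-way F statistic stands in for whichever of the four statistics is being calibrated.

```python
import math
import random

def gene_specific_f(data, labels):
    """Classical one-way ANOVA F statistic for every gene (rows of data)."""
    groups = sorted(set(labels))
    stats = []
    for values in data:
        grand = sum(values) / len(values)
        between = 0.0
        within = 0.0
        for g in groups:
            members = [v for v, lab in zip(values, labels) if lab == g]
            mean_g = sum(members) / len(members)
            between += len(members) * (mean_g - grand) ** 2
            within += sum((v - mean_g) ** 2 for v in members)
        df_between = len(groups) - 1
        df_within = len(values) - len(groups)
        stats.append((between / df_between) / (within / df_within))
    return stats

def permutation_threshold(data, labels, statistic_fn, n_perm=100, alpha=0.05, seed=0):
    """Common critical value from the statistics pooled over shuffles and genes."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # the same shuffle of units is applied to all genes
        pooled.extend(statistic_fn(data, shuffled))
    pooled.sort()
    # Empirical (1 - alpha) quantile of the whole collection.
    index = min(math.ceil((1 - alpha) * len(pooled)) - 1, len(pooled) - 1)
    return pooled[index]
```

Passing the statistic as a function of the whole data matrix, rather than of one gene at a time, keeps the sketch usable for statistics that borrow information across genes, since those must be recomputed from all genes after every shuffle.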

4.1 Canonical simulation

To evaluate the tests based on the four F statistics in a general setting, we simulated data in a canonical form and studied the power of each test at several levels of variance heterogeneity, represented by the coefficient of variation (CV) of the variances, and degrees of freedom ($\nu$). We define the canonical form of this problem as

$$\hat{\theta}_{g,t} = \theta_{g,t} + \epsilon_{g,t}$$

for gene $g = 1, \ldots, G$ and treatment $t = 1, \ldots, T$, where $\theta_{g,t}$ represents the relative expression level of gene $g$ under treatment condition $t$, and $\epsilon_{g,t}$ is the gene-specific residual error ($\epsilon_{g,t} \sim N(0, \sigma_g^2)$) associated with estimating $\theta_{g,t}$. In this simulation, the residual variances, $\sigma_g^2$, were drawn randomly from the residual variance estimates from the tumor data set described in section 4.3. To vary the CV of these residual variances while keeping their geometric mean constant, we rescaled them using a tuning parameter $\tau$,

$$Z_g = \frac{\sigma_g^{2\tau}}{\mathrm{gm}(\sigma_g^{2\tau})}\,\mathrm{gm}(\sigma_g^2), \tag{9}$$

where gm stands for geometric mean. When $\tau = 0$, CV = 0, corresponding to the homogeneous variance case. We study four cases where $\tau$ is 0, 0.78, 1.5 and 2.3, which correspond to CVs of 0, 1, 4 and 20. The two middle cases are typical of real microarray data. The treatment effect for each gene can be written as

$$\Delta_g = \frac{1}{T-1}\sum_{t=1}^{T}(\hat{\theta}_{g,t} - \hat{\theta}_{g\cdot})^2. \tag{10}$$

This is also the common numerator for all four F statistics in equation (8). In this case, the denominators of all F statistics are obtained using the residual $MSE_g$ in place of $\hat{\sigma}_g^2$ in equation (8). The residual $MSE_g$ for each gene was generated from a chi-squared distribution according to ANOVA theory based on the true residual variance $Z_g$, $MSE_g \sim Z_g \chi^2_\nu/\nu$, where $\nu$ is the degrees of freedom for $MSE_g$ when fitting a fixed ANOVA model for each gene. We studied many different degrees of freedom but report only $\nu = 2$, 6, and 50 here to represent small, moderate and very large microarray experiments.
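The rescaling in equation (9) and the chi-squared draw of $MSE_g$ can be sketched as follows. This is a minimal Python illustration with hypothetical function names; the input variances passed to it would be arbitrary placeholders unless the residual variance estimates from the tumor data set are available.

```python
import math
import random

def geometric_mean(values):
    """Geometric mean of positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def rescale_variances(sigma2, tau):
    """Equation (9): raise each variance to the power tau, then rescale so
    that the geometric mean of the result equals that of the input."""
    powered = [s ** tau for s in sigma2]  # sigma_g^(2 tau) = (sigma_g^2)^tau
    scale = geometric_mean(sigma2) / geometric_mean(powered)
    return [p * scale for p in powered]

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return math.sqrt(var) / mean

def simulate_mse(z, nu, rng):
    """Draw MSE_g ~ Z_g * chi2_nu / nu for every gene."""
    # A chi-squared variable with nu degrees of freedom is Gamma(nu/2, scale=2).
    return [zg * rng.gammavariate(nu / 2.0, 2.0) / nu for zg in z]
```

Setting `tau = 0` collapses all variances to their geometric mean, which is the homogeneous case; larger values of `tau` spread the variances out while leaving the geometric mean unchanged, so the CV can be tuned without shifting the overall scale.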

To establish the null distribution for the F tests, we set $\theta_{g,t} = 0$ for all $g = 1, \ldots, 5000$, $t = 1, \ldots, 5$. We calculated $F_1$, $F_2$, $F_3$, and $F_S$ for each gene and then used the 95% quantiles as the critical values. To calculate the power for each F test, we generated 5000 non-zero $\theta_g$. Because the power of a test depends on the magnitude of the effect ($\Delta_g$), we study the power of the tests as a function of $\Delta_g$. Specifically, we let $Q_{g,t} \sim$ N(0, ) and $\theta_{g,t} = K Q_{g,t} / \sqrt{\sum_{t=1}^{5} Q_{g,t}^2}$; consequently, $K = \sqrt{\Delta_g (T-1)}$. By varying $K$, we can vary the treatment effect.

Figure 1 shows the power of the four tests as a function of $\sqrt{\Delta_g (T-1)}$ for degrees of freedom $\nu = 2$, 6, 50, and heterogeneity CV = 0, 1, 4, and 20. When all the treatments are identical, $\sqrt{\Delta_g (T-1)} = 0$ and the null hypothesis $H_0$ holds. In general, $F_1$ shows good power only when $\nu$ is large ($\nu > 6$). $F_3$ only has good power when variance heterogeneity is low (CV < 1). $F_2$ is similar to $F_3$ but more robust. It still has good power when the CV is about 4. The power of the $F_2$ and $F_3$ tests decreases when the CV increases. When the CV is larger than 10, $F_3$ loses power completely and $F_2$ loses most of its power. Compared with the other tests, $F_S$ is the most robust and is usually most powerful or nearly so. It is more powerful than $F_1$ in all the situations. The improvement is quite substantial when $\nu$ is small. It also has a large advantage over $F_2$ and $F_3$ when the CV is large. When the CV is small, the power of $F_S$ is still comparable to that of $F_2$ and $F_3$.

4.2 Analysis and simulation of a microarray experiment Case I: Technical replication

To compare the four tests in a simple microarray experiment, we applied them to experimental data and performed simulations based on the results of this experiment.
The experiment compared two human colon cancer cell lines, CACO2 and HCT116, and three human ovarian cancer cell lines, ES2, MDAH2774 and OV1063, using a loop design as shown in Figure 2A (unpublished data available at /labsite/datasets/index.html). Fluorescent dye labeled cDNA targets were hybridized to DNA microarrays containing 9600 human cDNA clones from the Research Genetics sequence verified human cDNA collection (Invitrogen, Carlsbad, CA) spotted in

duplicate. Slides were scanned using the GenePix4000 microarray scanner, and the median intensities of each spot were calculated using image processing software (Axon Instruments, Inc., Foster City, CA). To simplify the analysis, the two spots for the same gene on each array were averaged at the original signal level. The data were then intensity lowess transformed [16] and normalized before fitting the following ANOVA model to each gene,

$$y_{ijk} = \mu + S_i + D_j + A_k + \epsilon_{ijk}. \tag{11}$$

In this model, $\mu$ is the gene mean; $S_i$ ($i = 1, \ldots, 5$) is the sample effect; $D_j$ ($j = 1, 2$) is the dye effect; $A_k$ ($k = 1, \ldots, 10$) is the array effect; $\epsilon_{ijk}$ is the residual. Terms $\mu$, $S_i$ and $D_j$ are treated as fixed. Term $A_k$ is treated as random. To put this model in the context of the general linear mixed model (equation 7), $\mu$, $S_i$ and $D_j$ belong to $\beta$ and the dimension of the $X$ matrix is [missing]; $A_k$ belongs to $u$ and the dimension of $Z$ is [missing].

The variance components of $A_k$ and $\epsilon_{ijk}$ were estimated [11] for each gene and their distributions were compared (Figure 3A). The array variance is substantially larger than the residual variance but it has smaller heterogeneity (CV = 1.34) than the residual variance (CV = 1.79). We note that array variance has little impact on the F tests because of the experimental design [17]; therefore, the array effect can be treated as a fixed effect in simple experiments like this one for computational simplicity.

The four F test statistics were constructed for model (11) and their null distributions were established through permutation analysis [2, 15, 3]. The permutation unit in this case is the array (the two columns, Cy3 and Cy5, of data from each array). At a nominal significance level of 0.01, $F_1$, $F_2$, $F_S$ and $F_3$ detected 1588, 2012, 1896 and 981 significant genes, respectively. The volcano plot (Figure 3B) illustrates the differences among the four F tests.
The significant genes for $F_1$ are located above the horizontal line and those for $F_3$ are located right of the vertical line. The significant genes identified by $F_S$ and $F_2$ are indicated by red and yellow coloring respectively and are generally in the upper right corner. These two tests are largely concordant but $F_S$ is more similar to $F_1$, indicating that $F_S$ is sensitive to variance heterogeneity in these data.

To study the type I error rate and power of each F test, we simulated 10 data sets each with 1000 constant genes and 1000 differentially expressed genes based on this

design. The individual $S_i$ were drawn randomly from the distribution N(0, ). The $\mu$ and $D_j$ were drawn from the normal distributions N(0, ) and N(0, ), respectively. Fixed effects parameter values were held constant across all simulations. For each simulation, $A_k$ was drawn randomly from a normal distribution N(0, ) and the residuals ($\epsilon_{ijk}$) were drawn randomly from the normal distribution $N(0, \sigma_g^2)$, where the gene-specific variance $\sigma_g^2$ was drawn randomly from the 9600 estimates of residual variance of the above data set. The variability of the residual variances was controlled by $\tau$ in the same fashion as for the canonical simulation, but the value of $\tau$ was set to 0.8, 1, and 1.5 to cover only the range of variability that we have seen in real data sets. The corresponding CVs are about 1.2, 1.8, and 3.7.

The averaged results of the 10 simulations at a nominal significance level of 0.05 are shown in Table 2. Among the 1000 null model genes, fewer than 50 false positives were detected by each F test, which indicates that the actual type I error rate is somewhat lower than the expectation in each case. The numbers of false positives for $F_2$ and $F_3$ are much lower than expected, which indicates that these two F tests may be overly conservative. Among the 1000 differentially expressed genes, the majority were identified by all four F tests, but the number of identified true positives decreases as the CV increases. The rate of decrease for $F_1$ and $F_S$ is smaller than that for $F_2$ and $F_3$. $F_S$ identifies fewer true positives than $F_2$ when the CV is around 1.2 and 1.8, but it identifies more than $F_2$ when the CV is around 3.7. $F_2$ and $F_3$ identify fewer genes when the CV is around 3.7. $F_S$ identifies more true positives than $F_1$ and $F_3$ regardless of the degree of heterogeneity, a reflection that $F_S$ is more powerful than $F_1$ and $F_3$. The power plots of these F tests against the sample effect are shown in Supplemental Figure 1 (A to C).
The relationships among the four F tests are similar to those obtained in the canonical simulation. The results of individual simulations are shown in Supplemental Tables 1A-1C.

Comparison among the false discovery rates (FDR) of the F tests shows that $F_S$ has a relatively low FDR. In general, a more powerful test tends to have a smaller FDR. This is consistent with the fact that both the FDR proposed by Benjamini and Hochberg (1995) [18] and the positive false discovery rate (pFDR) proposed by Storey (2002) [19] tend to be smaller for smaller p values, a result achieved when a more

powerful test is used. In other words, when the type I error is controlled at a specified level, a more powerful test will detect more significant genes; therefore, the FDR of the detected gene list is smaller. On the other hand, if we control the FDR at a certain level, a more powerful test will give a longer significant gene list; however, the type I error rates of $F_2$ and $F_3$ are smaller than specified in these simulations, which results in smaller FDRs in some cases, although their powers are not necessarily higher.

4.3 Analysis and simulation of a microarray experiment Case II: Biological replication

A recent and promising trend in microarray experiments is to include biological replicates of samples in order to account for inherent biological variation. To accommodate this trend, mixed linear models with biological replicates treated as random effects are required. Here we analyze a representative data set and perform simulations based on this data set to compare the properties of the four F-like tests in this type of experimental setting.

The granulosa cell tumor microarray experiment was performed using eight-week-old SWXJ-9 mice. The effects of dietary androgenic supplementation (DHEA, testo and control) were assessed. RNA samples from each mouse were compared to the Stratagene reference RNA using two microarrays each with dye labeling reversed (Figure 2B). Fluorescent dye labeled cDNA targets were hybridized to DNA microarrays printed with the NIA clone set spotted in duplicate. Slides were scanned and the mean intensities of each spot calculated using the GenePix4400 microarray scanner and image processing software (unpublished data are at /index.html).
The raw data were preprocessed as described above before fitting the following mixed ANOVA model [14] for each gene,

$$y_{ijk} = \mu + A_i + D_j + T_k + M_l + R_h + \epsilon_{ij}, \tag{12}$$

with $\mu$ for the gene mean, $A_i$ for the array effect ($i = 1, \ldots, 30$), $D_j$ ($j = 1, 2$) for the dye effect, $T_k$ ($k = 1, 2, 3$) for the treatment effect, and $M_l$ ($l = 1, \ldots, 15$) for the mouse effect. $R_h$

is an indicator of reference ($h = 1$) versus tissue sample ($h = 2$), which is determined by the combination of array and dye. We treat $\mu$, $D_j$, $T_k$ and $R_h$ as fixed effects. The biological replicate effect, mouse ($M_l$), is treated as a random effect. Therefore, the mouse variance is included along with the error variance in tests that compare treatments ($T_k$) [20, 21, 3]. The array effect ($A_i$) is also considered a random effect, but it has a small effect on the F statistics, as mentioned above.

The variance components of mouse, array, and residual of this data set were estimated using REML [11]. Their distributions are shown in Figure 3C. The array variance is the largest component and has only moderate heterogeneity (CV = 1.5). The mouse variance is the smallest, but it has the greatest heterogeneity (CV = 3.4). Most of the genes have small mouse variance, but a small proportion of genes show large variation across individual mice. The residual variances are intermediate in size, between the array and mouse components, and have only moderate heterogeneity (CV = 1.7).

The four F statistics were computed for each gene and their null distributions were established using permutation analysis. The permutation unit in this case is mouse, which consists of a dye-swap pair of arrays, because mouse is the nested factor under the tested factor, treatment (Figure 2B). At a nominal significance level of 0.01, the $F_1$, $F_2$, $F_3$ and $F_S$ tests detect 295, 348, 252 and 333 genes respectively. The volcano plot of these F tests is shown in Figure 3D.

To study the type I error and power of the four tests in this experimental setting, we performed 10 simulations each having 1000 constant genes and 1000 differentially expressed genes based on the design and variance component estimates of this experiment. The simulations were similar to those in the previous subsection. The settings for the fixed effects $\mu$, $T_k$ and $D_j$ were the same as the corresponding fixed effects of model (11).
The settings of $A_i$ and $\epsilon_{ij}$ were the same as before except that the $\sigma_g^2$ for $\epsilon_{ij}$ was drawn randomly from the estimates of residual variance of this data set. The random mouse effect, $M_l$, was sampled using the mouse variance component estimates. The reference $R_h$ was drawn randomly from the distribution N(0, ).

The numbers (averaged over 10 simulations) of true and false positives for each F test at a nominal significance level of 0.05 are shown in Table 3. The numbers of false positives are all close to expectation (50), indicating that the type I error is controlled

at the specified level. The power of each F test decreases as the CV increases, especially for $F_2$ and $F_3$. Again, $F_S$ shows an advantage over $F_1$. More importantly, it shows more advantage over $F_2$ and $F_3$ at large CVs than observed from the microarray simulation without biological replication in Table 2, indicating that when biological variation is included in the computation of the error variance, $F_S$ could be advantageous. The power comparison among all four F tests against the treatment effect is shown in Supplemental Figure 1 (D to F). The results from each of the 10 simulations are shown in Supplemental Tables 1D-1F.

5 Discussion

Variance components in microarray experiments display varying degrees of heterogeneity across experiments, across variance components, and across genes within a variance component [17]. Assumptions of variance heterogeneity lead to the use of individual gene-specific tests, such as $F_1$, but these tests often suffer from low power due to small degrees of freedom. On the other hand, the assumption of common variance leads to powerful tests, such as $F_3$, but at the risk of generating false detections in the event that the common variance assumption is not true. A better approach is to use tests based on variance estimates that are gene specific but combine information across many genes. We gain power by utilizing more information in the data but can also avoid bias. Previous researchers have proposed related approaches such as the Empirical Bayes approach. However, the advantage in practice of many of these methods over the traditional approach, such as $F_1$, either has not been demonstrated or has been demonstrated to be small. In this paper, we apply James-Stein-Lindley shrinkage to improve estimated variance components in a linear mixed model.
We show that the resulting test statistic $F_S$ performs better than the standard gene-specific test $F_1$ and that the improvement in power can be substantial, especially when the degrees of freedom are small, a common situation for microarray experiments.

By taking a shrinkage approach to improve variance estimation, we make very weak prior assumptions about the distribution of the variance components. In some simple

settings, such as estimating a normal mean which has a normal prior, the Empirical Bayes approach and the shrinkage approach lead to exactly the same estimator [22]. But in this setting, the Empirical Bayes approach is very complicated. Our proposed statistic, $F_S$, has an explicit expression and is computationally simple.

In summary, we have proposed a simple variation on the general mixed model testing strategy. By using shrinkage estimates of variance components we have obtained a test statistic that is both powerful and robust in the face of variance heterogeneity.

Acknowledgment

We would like to thank Hao Wu for software support, Qian Li for assistance with data handling, and Ann Dorward at the Jackson Laboratory and John Quackenbush at TIGR for providing sample data sets. This research is supported by grants CA88327, HL66620, and HL55001 from the National Institutes of Health.

References

[1] Callow, M. J, Dudoit, S, Gong, E. L, Speed, T. P, & Rubin, E. M. (2000) Genome Research 10.
[2] Kerr, M. K, Martin, M, & Churchill, G. A. (2000) J Comput Biol 7.
[3] Cui, X & Churchill, G. A. (2003) Genome Biology 4, art210.
[4] Storey, J & Tibshirani, R. (2003) in The analysis of gene expression data: methods and software, eds. Parmigiani, G, Garrett, E. S, Irizarry, R. A, & Zeger, S. L. (Springer, New York).
[5] Baldi, P & Long, A. D. (2001) Bioinformatics 17.
[6] Lönnstedt, I & Speed, T. (2002) Statistica Sinica 12.
[7] Kendziorski, C. M, Newton, M. A, Lan, H, & Gould, M. N. (2003) On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles. newton/research/arrays.html.
[8] Newton, M. A, Noueiry, A, Sarkar, D, & Ahlquist, P. (2003) Detecting differential gene expression with a semiparametric hierarchical mixture method. newton/papers/abstracts/tr1074a.html.
[9] Lindley, D. V. (1962) Journal of the Royal Statistical Society, Series B 24.
[10] Ghosh, M, Hwang, J, & Tsui, K. (1984) Journal of Multivariate Analysis 14.
[11] Searle, S, Casella, G, & McCulloch, C. (1992) Variance components. (John Wiley and Sons, Inc., New York, NY).
[12] Witkovsky, V. (2002) Matlab algorithm mixed.m for solving Henderson's mixed model equations.
[13] Littell, R. C, Milliken, G, Stroup, W. W, & Wolfinger, R. D. (1996) SAS system for mixed models. (SAS Institute Inc., Cary, NC).
[14] Wolfinger, R. D, Gibson, G, Wolfinger, E. D, Bennett, L, Hamadeh, H, Bushel, P, Afshari, C, & Paules, R. S. (2001) J Comput Biol 8.
[15] Wu, H, Kerr, M. K, Cui, X, & Churchill, G. A. (2003) in The analysis of gene expression data: methods and software, eds. Parmigiani, G, Garrett, E. S, Irizarry, R. A, & Zeger, S. L. (Springer, New York).
[16] Cui, X, Kerr, M. K, & Churchill, G. A. (2003) Statistical Applications in Genetics and Molecular Biology 2, No. 1, art4.
[17] Cui, X & Churchill, G. A. (2003) in Methods of Microarray Data Analysis III, eds. Lin, S. M & Allred, E. T. (Kluwer Academic Publishers, New York).
[18] Benjamini, Y & Hochberg, Y. (1995) Journal of the Royal Statistical Society, Series B 57.
[19] Storey, J. D. (2002) Journal of the Royal Statistical Society, Series B 64.
[20] McLean, R. A, Sanders, W. L, & Stroup, W. (1991) The American Statistician 45.
[21] Churchill, G. A. (2002) Nat Genet 32 Suppl 2.
[22] Efron, B & Morris, C. (1973) Journal of the American Statistical Association 68.

Figure Legends

Figure 1. Power comparison among the four F tests using the canonical simulations. In each panel, the power of each F test is plotted against the treatment effect, (t 1). The variability of the individual variances is controlled by τ, shown on the top, and is reflected by the coefficient of variation (CV) shown at the upper left corner of each panel. The degrees of freedom (ν) (2, 6, and 50) are noted in each panel at the upper left corner. The nominal type I error rate of 0.05 is indicated by a solid blue line.

Figure 2. Illustration of the microarray designs used in the paper. Panel A is a double loop design comparing 5 samples (S1 to S5) used by the TIGR loop experiment. Panel B is a reference design used in the tumor experiment to compare three treatments (T1, T2, and T3) using 5 mice for each treatment and one dye-swap pair of arrays for each mouse. Each arrow represents an array with head pointing to Cy3 labeling and tail pointing to Cy5 labeling. R, reference sample.

Figure 3. Variance component plots and volcano plots of the TIGR loop and tumor data sets. Panel A shows the smoothed histograms of the two variance components of the TIGR loop data set when A k is treated as random (equation 11). Panel B is the volcano plot from the TIGR loop experiment. Panel C shows the smoothed histograms of the three variance components from the tumor data set. Panel D is the volcano plot of the tumor data set. In the volcano plots, the -log 10 of the F 1 p values based on permutation is plotted against the tested effect. Horizontal and vertical lines represent the 0.01 nominal significance level for F 1 and F 3, respectively. Red points, F 2 significant; yellow squares, F S significant.

Figure 4 (Supplemental Figure 1). The average power of four F tests from 10 microarray simulations. The data were simulated with the fixed effect ANOVA model (A, B, and C) or the mixed effects ANOVA model with two variance components (D, E, and F). The values of these variance components are randomly drawn from a typical real data set. The variability of the variances across genes is controlled by τ (0.8, 1, 1.5) and reflected by CV r and CV m. CV r, CV of the residual variance component; CV m, CV of the mouse variance component.

Figure 5 (Supplemental Figure 2). Representative volcano plots of the simulations based on the TIGR data set. The data were generated for analysis using the fixed ANOVA model (equation 11). CV = 2.2. Panels A, B, and C correspond to non-differential, differential, and all genes, respectively. Therefore, the selected genes in panel A are false positives and the unselected genes in panel B are false negatives. Horizontal and vertical lines represent the 0.01 nominal significance level for F 1 and F 3, respectively. Red, F 2 significant; orange diamond, F S significant.
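The legends above refer to the statistics F 1, F 2, and F 3, which the abstract distinguishes by the error variance they use: gene-specific, the average of gene-specific and pooled, and pooled. A minimal sketch of that distinction follows, assuming the simplest fixed-effects case in which each statistic is a treatment mean square divided by one error variance, and assuming the pooled variance is the plain average of the gene-specific estimates (equal degrees of freedom). The function and argument names are hypothetical; this is not the authors' software.

```python
from statistics import mean


def f_statistics(ms_treatment, gene_variances):
    """Return (F1, F2, F3), each a list with one value per gene.

    ms_treatment   : per-gene treatment mean squares (hypothetical input name)
    gene_variances : per-gene error variance estimates
    """
    # Assumption: pooled variance is the simple average across genes.
    pooled = mean(gene_variances)
    pairs = list(zip(ms_treatment, gene_variances))
    f1 = [ms / s2 for ms, s2 in pairs]                   # gene-specific denominator
    f2 = [ms / ((s2 + pooled) / 2) for ms, s2 in pairs]  # average of the two
    f3 = [ms / pooled for ms, _ in pairs]                # pooled denominator
    return f1, f2, f3


if __name__ == "__main__":
    f1, f2, f3 = f_statistics([4.0, 4.0], [0.5, 1.5])
    print(f1, f2, f3)
```

With equal numerators, a gene with a small variance estimate gets a large F 1 but an unremarkable F 3, which is why the two thresholds in the volcano plots select different genes.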

Table 1: Values of B (bias correction) and V/(2/ν) as a function of ν. Columns: ν, B, V/(2/ν) (table entries are not preserved in this transcription). These values are used in equation (3) to construct the estimates that shrink the unbiased estimators of the variances toward their corrected geometric mean. When ν is greater than 50, B and V/(2/ν) are virtually 1.
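Since the entries of Table 1 are missing here, the following sketch shows one way such constants could be computed. It assumes that B corrects the bias of the log of a scaled chi-square variable, so that B = exp(-E[ln(χ²_ν/ν)]), and that V is the variance of ln(χ²_ν/ν); the standard results E = ψ(ν/2) - ln(ν/2) and V = ψ'(ν/2) are then used. The function `shrink_variances` is a hypothetical Lindley-type positive-part shrinkage on the log scale, offered as an illustration and not as equation (3) itself.

```python
import math


def digamma(x):
    """psi(x) using psi(x) = psi(x + 1) - 1/x, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def trigamma(x):
    """psi'(x) using psi'(x) = psi'(x + 1) + 1/x^2, then an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return acc + inv + 0.5 * inv2 + inv * inv2 * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))


def bias_correction(nu):
    """Assumed B: exp(-E[ln(chi2_nu / nu)]) = exp(-(psi(nu/2) - ln(nu/2)))."""
    return math.exp(-(digamma(nu / 2) - math.log(nu / 2)))


def variance_ratio(nu):
    """Assumed V/(2/nu), with V = Var[ln(chi2_nu / nu)] = psi'(nu/2)."""
    return trigamma(nu / 2) / (2 / nu)


def shrink_variances(x, nu):
    """Hypothetical sketch: shrink unbiased variance estimates x (one per gene)
    toward their bias-corrected geometric mean on the log scale."""
    g = len(x)
    logs = [math.log(v) for v in x]
    centre = sum(logs) / g                      # log of the geometric mean
    ss = sum((l - centre) ** 2 for l in logs)   # spread of the log variances
    v = trigamma(nu / 2)
    # Positive-part shrinkage weight; 0 means full pooling, 1 means no shrinkage.
    keep = max(0.0, 1.0 - (g - 3) * v / ss) if ss > 0 else 0.0
    b = bias_correction(nu)
    return [b * math.exp(centre + keep * (l - centre)) for l in logs]


if __name__ == "__main__":
    for nu in (2, 6, 50):
        print(nu, round(bias_correction(nu), 4), round(variance_ratio(nu), 4))
```

Under these assumptions both quantities approach 1 as ν grows, in line with the caption; for ν = 2 they are about 1.78 and 1.64. Whether these agree with the original table entries cannot be checked from this transcription.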

Table 2: Average number of true and false differential genes identified by each F test from 10 simulations of model (equation 11). Rows: F 1, F 2, F S, F 3. Columns: TP, FP, and FDR at each of CV r = 1.2, CV r = 1.8, and CV r = 3.7 (table entries are not preserved in this transcription). Significance level is nominal. The total number of genes is 2000, with 1000 constant genes and 1000 differentially expressed genes. CV r, average CV of the residual variance; TP, true positives; FP, false positives. False discovery rate (FDR) is computed as FP/(FP+TP) for each simulation and the average is shown here. The results from individual simulations are shown in Supplemental Tables 1A-1C.

Table 3: Average number of true and false differential genes identified by each F test from 10 simulations of model (12). Rows: F 1, F 2, F S, F 3. Columns: TP, FP, and FDR at each of (CV r = 1.0, CV m = 2.2), (CV r = 1.6, CV m = 3.3), and (CV r = 4.7, CV m = 7.9) (table entries are not preserved in this transcription). Significance level is nominal. The total number of genes is 2000, with 1000 constant and 1000 differentially expressed. CV r, average CV of the residual variance; CV m, average CV of the mouse variances; TP, true positives; FP, false positives. False discovery rate (FDR) is computed as FP/(FP+TP) for each simulation and the average is shown here. The results from individual simulations are shown in Supplemental Tables 1D-1F.
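The captions state that FDR is computed as FP/(FP+TP) within each simulation and then averaged. The sketch below illustrates that calculation and contrasts it with pooling the counts first, which generally gives a slightly different number. The counts are invented for illustration and are not results from the paper; the function names are hypothetical.

```python
def average_fdr(results):
    """Mean of per-simulation FDR values.

    results: list of (TP, FP) pairs, one pair per simulation.
    A simulation with no selected genes contributes an FDR of 0.
    """
    fdrs = [fp / (fp + tp) if (fp + tp) > 0 else 0.0 for tp, fp in results]
    return sum(fdrs) / len(fdrs)


def pooled_fdr(results):
    """FDR from summed counts, shown only for contrast with average_fdr."""
    tp = sum(t for t, _ in results)
    fp = sum(f for _, f in results)
    return fp / (fp + tp) if (fp + tp) > 0 else 0.0


if __name__ == "__main__":
    sims = [(800, 40), (750, 50), (820, 30)]  # hypothetical (TP, FP) counts
    print(average_fdr(sims), pooled_fdr(sims))
```

Averaging per-simulation ratios weights every simulation equally, whereas pooling weights simulations by how many genes they select.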

Figure 1: (image not reproduced)

Figure 2: Panel A, loop design connecting samples S1 to S5. Panel B, reference design with treatments T1 to T3, mice M1 to M15, and the reference sample R on each array.

Figure 3: (image not reproduced)

Figure 4: (image not reproduced)

Figure 5: Panels A (non-differential genes), B (differential genes), and C (all genes). Vertical axis, -log 10 of the F 1 p value; horizontal axis, sqrt(sum(sample^2)).

Supplemental Tables 1A (τ = 0.8), 1B (τ = 1), and 1C (τ = 1.5). Columns: simulation number and mean. Rows: CV r; false positives for F 1, F 2, F S, and F 3; true positives for F 1, F 2, F S, and F 3.

Supplemental Tables 1D (τ = 0.8), 1E (τ = 1), and 1F (τ = 1.5). Columns: simulation number and mean. Rows: CV m; CV r; false positives for F 1, F 2, F S, and F 3; true positives for F 1, F 2, F S, and F 3.

(Table entries are not preserved in this transcription.)

Supplemental Table 1. Number of false and true positives detected by four F tests in the individual simulations. Supplemental Tables 1A to 1C are from the fixed effect ANOVA model simulation described in section 4.2. Supplemental Tables 1D to 1F are from the mixed effect ANOVA model simulation of section 4.3. Each simulation has 1000 non-differential genes and 1000 differential genes. The variability of individual variances is controlled by τ (0.8, 1, and 1.5). In different simulations, the same τ generates slightly different CVs because a random sub-population of 2000 variances was drawn from the variance estimates for mouse or residual. CV m, the CV (coefficient of variation) of the mouse variances; CV r, the CV of the residual variances. Significance level is nominal 0.05.
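The remark that the same τ yields slightly different CVs follows from sampling variation alone. A small sketch of this effect is given below, assuming a hypothetical log-normal population of variance estimates in place of the real mouse or residual estimates, and using the sample standard deviation divided by the mean as the CV. How τ rescales the variances is not shown here.

```python
import random
from statistics import mean, stdev


def cv(values):
    """Coefficient of variation: sample standard deviation over the mean."""
    return stdev(values) / mean(values)


def subsample_cvs(population, size, draws, seed=0):
    """CV of each of `draws` random sub-populations of `size` variances."""
    rng = random.Random(seed)
    return [cv(rng.sample(population, size)) for _ in range(draws)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical stand-in for a set of per-gene variance estimates.
    population = [rng.lognormvariate(0.0, 1.0) for _ in range(10000)]
    for value in subsample_cvs(population, 2000, 3):
        print(round(value, 3))
```

Each draw of 2000 variances comes from the same population, yet the printed CVs differ from draw to draw, which is the behaviour the caption describes.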


More information

Statistics Applied to Bioinformatics. Tests of homogeneity

Statistics Applied to Bioinformatics. Tests of homogeneity Statistics Applied to Bioinformatics Tests of homogeneity Two-tailed test of homogeneity Two-tailed test H 0 :m = m Principle of the test Estimate the difference between m and m Compare this estimation

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

A variance-stabilizing transformation for gene-expression microarray data

A variance-stabilizing transformation for gene-expression microarray data BIOINFORMATICS Vol. 18 Suppl. 1 00 Pages S105 S110 A variance-stabilizing transformation for gene-expression microarray data B. P. Durbin 1, J. S. Hardin, D. M. Hawins 3 and D. M. Roce 4 1 Department of

More information

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Inference with Transposable Data: Modeling the Effects of Row and Column Correlations Genevera I. Allen Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research

More information

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

October 1, Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study

October 1, Keywords: Conditional Testing Procedures, Non-normal Data, Nonparametric Statistics, Simulation study A comparison of efficient permutation tests for unbalanced ANOVA in two by two designs and their behavior under heteroscedasticity arxiv:1309.7781v1 [stat.me] 30 Sep 2013 Sonja Hahn Department of Psychology,

More information

Chapter 10. Semi-Supervised Learning

Chapter 10. Semi-Supervised Learning Chapter 10. Semi-Supervised Learning Wei Pan Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455 Email: weip@biostat.umn.edu PubH 7475/8475 c Wei Pan Outline

More information

Bumpbars: Inference for region detection. Yuval Benjamini, Hebrew University

Bumpbars: Inference for region detection. Yuval Benjamini, Hebrew University Bumpbars: Inference for region detection Yuval Benjamini, Hebrew University yuvalbenj@gmail.com WHOA-PSI-2017 Collaborators Jonathan Taylor Stanford Rafael Irizarry Dana Farber, Harvard Amit Meir U of

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol 21 no 11 2005, pages 2684 2690 doi:101093/bioinformatics/bti407 Gene expression A practical false discovery rate approach to identifying patterns of differential expression

More information

DIRECT VERSUS INDIRECT DESIGNS FOR edna MICROARRAY EXPERIMENTS

DIRECT VERSUS INDIRECT DESIGNS FOR edna MICROARRAY EXPERIMENTS Sankhyā : The Indian Journal of Statistics Special issue in memory of D. Basu 2002, Volume 64, Series A, Pt. 3, pp 706-720 DIRECT VERSUS INDIRECT DESIGNS FOR edna MICROARRAY EXPERIMENTS By TERENCE P. SPEED

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

Microarray Preprocessing

Microarray Preprocessing Microarray Preprocessing Normaliza$on Normaliza$on is needed to ensure that differences in intensi$es are indeed due to differen$al expression, and not some prin$ng, hybridiza$on, or scanning ar$fact.

More information

Some properties of Likelihood Ratio Tests in Linear Mixed Models

Some properties of Likelihood Ratio Tests in Linear Mixed Models Some properties of Likelihood Ratio Tests in Linear Mixed Models Ciprian M. Crainiceanu David Ruppert Timothy J. Vogelsang September 19, 2003 Abstract We calculate the finite sample probability mass-at-zero

More information

Association studies and regression

Association studies and regression Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

ESTIMATING THE PROPORTION OF TRUE NULL HYPOTHESES UNDER DEPENDENCE

ESTIMATING THE PROPORTION OF TRUE NULL HYPOTHESES UNDER DEPENDENCE Statistica Sinica 22 (2012), 1689-1716 doi:http://dx.doi.org/10.5705/ss.2010.255 ESTIMATING THE PROPORTION OF TRUE NULL HYPOTHESES UNDER DEPENDENCE Irina Ostrovnaya and Dan L. Nicolae Memorial Sloan-Kettering

More information

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018 Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals BayesMP Zhiguang Huo 1, Chi Song 2, George Tseng

More information

Analysis of variance, multivariate (MANOVA)

Analysis of variance, multivariate (MANOVA) Analysis of variance, multivariate (MANOVA) Abstract: A designed experiment is set up in which the system studied is under the control of an investigator. The individuals, the treatments, the variables

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

A nonparametric two-sample wald test of equality of variances

A nonparametric two-sample wald test of equality of variances University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 211 A nonparametric two-sample wald test of equality of variances David

More information

Confidence Interval Estimation of Small Area Parameters Shrinking Both Means and Variances

Confidence Interval Estimation of Small Area Parameters Shrinking Both Means and Variances Confidence Interval Estimation of Small Area Parameters Shrinking Both Means and Variances Sarat C. Dass 1, Tapabrata Maiti 1, Hao Ren 1 and Samiran Sinha 2 1 Department of Statistics & Probability, Michigan

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Identification of dispersion effects in replicated two-level fractional factorial experiments

Identification of dispersion effects in replicated two-level fractional factorial experiments Identification of dispersion effects in replicated two-level fractional factorial experiments Cheryl Dingus 1, Bruce Ankenman 2, Angela Dean 3,4 and Fangfang Sun 4 1 Battelle Memorial Institute 2 Department

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Semi-Penalized Inference with Direct FDR Control

Semi-Penalized Inference with Direct FDR Control Jian Huang University of Iowa April 4, 2016 The problem Consider the linear regression model y = p x jβ j + ε, (1) j=1 where y IR n, x j IR n, ε IR n, and β j is the jth regression coefficient, Here p

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Mixed effects models - Part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby

More information

Are a set of microarrays independent of each other?

Are a set of microarrays independent of each other? Are a set of microarrays independent of each other? Bradley Efron Stanford University Abstract Having observed an m n matrix X whose rows are possibly correlated, we wish to test the hypothesis that the

More information