Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components Estimates


September 4, 2003

Xiangqin Cui, J. T. Gene Hwang, Jing Qiu, Natalie J. Blades, and Gary A. Churchill

The Jackson Laboratory, Bar Harbor, Maine, U.S.A.
Department of Statistical Science, Cornell University, Ithaca, NY, U.S.A.
Department of Mathematics, Cornell University, Ithaca, NY, U.S.A.

Corresponding author: Gary A. Churchill, The Jackson Laboratory, 600 Main Street, Bar Harbor, Maine, U.S.A. (voice) (fax)

Abstract

Combining information across genes in the statistical analysis of microarray data is desirable because of the relatively small number of data points obtained for each individual gene. Here we develop an estimator of the error variance that can borrow information across genes using the James-Stein-Lindley shrinkage concept. A new test statistic ($F_S$) is constructed using this estimator. The new statistic is compared with other statistics used to test for differential expression, namely the gene-specific F test ($F_1$), the pooled-variance F statistic ($F_3$), and a hybrid statistic ($F_2$) that uses the average of the individual and pooled variances. The $F_S$ test shows the best or nearly the best power for detecting differentially expressed genes over a wide range of simulated data in which the variance components associated with individual genes are either homogeneous or heterogeneous. Thus $F_S$ provides a powerful and robust approach to testing differential expression of genes that utilizes information not available in individual gene testing approaches and does not suffer from the biases of the pooled variance approach.

Keywords: shrinkage estimator, ANOVA model, linear mixed model, F statistic, permutation, variance components

1 Introduction

Microarray technology has become an important tool for simultaneously screening thousands of genes for changes in their patterns of expression. The large amount of data generated by microarray technology is due mainly to the large number of genes represented on the array. For each gene the number of RNA samples assayed is typically small. Therefore, the commonly used approach of testing for differential expression one gene at a time often has low power [1]. Assuming that all of the variances are equal and using the common variance estimator for testing all genes can substantially increase the power to detect differential expression [2], but at the risk of generating false detections when the common variance assumption is not true.
Cui and Churchill (2003) reviewed some methods for testing differential expression of genes in microarray experiments. In addition, they defined three test statistics based

on an analysis of variance (ANOVA) model. (Note that an ANOVA F test compares an estimate of variation across conditions to an estimate of error variance. The t-test is the special case in which the number of conditions is two.) One test ($F_1$) uses only data from individual genes and is in fact the classical F statistic when testing is carried out one gene at a time. Another ($F_3$) assumes a common error variance across genes and uses a pooled estimator of the common variance. The third ($F_2$) reaches a compromise by using an average of the gene-specific and pooled variance estimates. When applied to real and simulated data, the $F_2$ test seems to work well; however, we find it hard to justify taking the simple average of variance estimates.

The idea of modifying estimators of variance has been presented by others in similar contexts. The SAM t-test [4] adds a small constant to the gene-specific variance estimate to stabilize the small variances. The regularized t-test proposed by Baldi and Long [5] replaces the usual variance estimate with a Bayesian estimator based on a hierarchical prior distribution. Lönnstedt and Speed (2002) proposed an Empirical Bayes approach to testing that combines information across genes. Newton and Kendziorski [7, 8] considered a hierarchical gamma-gamma model to combine information across genes. Each of the Bayesian approaches uses hierarchical models with relatively strong prior assumptions about the distributions of the individual variances.

In this paper we propose a shrinkage variance estimator that makes no prior assumptions about the distribution of variances across genes. It is based on the James-Stein-Lindley estimator [9], and we use it to construct a test statistic called $F_S$. We show that the test based on $F_S$ has the highest or nearly the highest power among various F-like statistics and that it is robust, performing well under a wide range of assumptions about variance heterogeneity.
It behaves well when the variances are truly constant as well as when they vary extensively from gene to gene.

In section 2, we describe how to obtain the shrinkage estimator of variance components that provides gene-specific variances but also uses information across all of the genes in the data to improve estimation. In section 3, we show how to use shrinkage estimators of variances to construct F-like statistics for differential expression of genes in the context of the mixed model analysis of variance. In section 4, we validate the properties of the tests based on these statistics using simulations and real data. We

simulate a canonical case to consider the problem in its most general and abstracted form. We then look at simulations of a simple microarray experiment comparing 5 samples and a more complex microarray experiment with biological replicates (data available at ).

2 Shrinking Variance Estimators

In this section, we show how to construct estimators of variance from an ensemble of data that shrink individual variance estimators toward the common (corrected geometric) mean. The amount of shrinkage depends on the variability of the individual variance estimators. When individual variance estimates are similar, indicating homogeneity, the shrinkage estimator effectively pools these estimates. When individual variance estimates are widely dispersed, indicating heterogeneity, the shrinkage estimator gives greater weight to the gene-specific contributions. The key result of this section is the expression in equation (3) below.

Let $X_g$ be the residual sum of squared errors (denoted by SSE) and $\sigma_g^2$ be the true variance of gene $g$. For $g = 1, \ldots, G$ (number of genes), it is assumed that the $X_g/\sigma_g^2$ are independent, each having a chi-squared distribution with $\nu$ degrees of freedom. Such a random variable will be denoted by $\chi^2_\nu$. Therefore, we have $X_g = \sigma_g^2 \chi^2_\nu$. We take a natural logarithmic transformation of $X_g$ to obtain a common location problem:

\[ \ln\frac{X_g}{\nu} = \ln\sigma_g^2 + \ln\frac{\chi^2_\nu}{\nu}. \tag{1} \]

Hence, if we denote the mean of $\ln(\chi^2_\nu/\nu)$ by $m$, then by subtracting $m$ from both sides we can write equation (1) as

\[ X'_g = \ln\sigma_g^2 + \epsilon_g, \]

where $X'_g = \ln(X_g/\nu) - m$ and $\epsilon_g = \ln(\chi^2_\nu/\nu) - m$. Let $V$ be the variance of $\epsilon_g$. By applying a first-order Taylor expansion to equation (1), $\mathrm{Var}(\ln(\chi^2_\nu/\nu)) \approx \mathrm{Var}(\chi^2_\nu/\nu) = 2/\nu$. In Table 1, we give the ratio of $V$ to $2/\nu$, which eventually converges to one. When applied to

$X'_g$ ($1 \le g \le G$) in estimating $\ln\sigma_g^2$ ($1 \le g \le G$), the traditional James-Stein-Lindley estimator that shrinks toward the common mean $\bar{X}' = \sum_g X'_g / G$ is

\[ \bar{X}' + \left(1 - \frac{(G-3)V}{\sum_g (X'_g - \bar{X}')^2}\right)_+ (X'_g - \bar{X}'), \tag{2} \]

where for any number $a$, $a_+$ denotes $\max(a, 0)$. The truncation enacted by the $+$ is necessary to avoid overshrinking. Transformation back to the original scale gives the shrinkage estimator for $\sigma_g^2$,

\[ \tilde\sigma_g^2 = \left(\prod_{g=1}^{G} (X_g/\nu)^{1/G}\right) B \exp\left[\left(1 - \frac{(G-3)V}{\sum_g (\ln X_g - \overline{\ln X_g})^2}\right)_+ (\ln X_g - \overline{\ln X_g})\right], \tag{3} \]

where $\overline{\ln X_g} = \frac{1}{G}\sum_g \ln(X_g)$, and $B = \exp(-m)$ is a bias correction. Note that multiplying the geometric mean $\left(\prod_{g=1}^{G} (X_g/\nu)\right)^{1/G}$ by $B$ gives an unbiased estimator of $\sigma^2$ when $\sigma_g^2 = \sigma^2$ for all $g$. The values of $B$ (and also $V$) depend on $\nu$. They can be simulated easily and values are given in Table 1. Note that $B$ is always larger than one; hence, the geometric mean without $B$ underestimates $\sigma^2$ when all $\sigma_g^2$ are equal to $\sigma^2$.

A Taylor expansion applied to the inverse log transformed estimator in equation (3) demonstrates that it is similar to Ghosh et al.'s estimator [10] (derivation not shown). If the collection of all $X_g$ ($g = 1, \ldots, G$) is represented by $X$, it has been shown that Ghosh et al.'s estimator dominates $X/(\nu + 2)$, which is better than $X/\nu$ from the collection of individual variance estimators, according to the sum of squared invariant losses [10]. This provides a theoretical foundation for expecting that the estimator in equation (3) may work well as an estimator of variance. Extensive comparisons among several variations on this estimator show that the version presented here behaves best not only as an estimator of variance but also in the construction of test statistics as described in section 3. In particular, the estimators in (3) provide a test statistic with better performance than similar statistics based on the Ghosh et al.
(1984) estimator.

3 Constructing F-like Statistics

To illustrate how to construct F-like statistics using different variance estimators, we start with the general F statistic for a general linear mixed model and then introduce

the statistics based on shrinkage estimators. A general linear mixed model [11] can be written as

\[ Y = X\beta + Zu + \epsilon, \tag{4} \]

where $Y$ is the vector of observations, $X$ is the design matrix of fixed effects $\beta$, $Z$ is the design matrix of random effects $u$, and $\epsilon$ is the vector of the residuals. The variances of the random effects $u$ and residuals $\epsilon$ in equation (4) can be estimated using the restricted maximum likelihood (REML) method [11]. Estimates of the corresponding fixed effects ($\hat\beta$) and predictions of the random effects ($\hat{u}$) can be obtained through generalized least squares using the estimated variance components [11, 12]. The variance-covariance matrix of $\hat\beta$ and $\hat{u}$ can be computed as

\[ \hat{C} = \begin{bmatrix} X'\hat{R}^{-1}X & X'\hat{R}^{-1}Z \\ Z'\hat{R}^{-1}X & Z'\hat{R}^{-1}Z + \hat{G}^{-1} \end{bmatrix}^{-}, \tag{5} \]

where the superscript $-$ denotes a generalized inverse, $\hat{R}$ is a matrix with the estimates of residual variances on the diagonal and 0 elsewhere, and $\hat{G}$ is a matrix with the variance components estimated for random effects $u$ on the diagonal and 0 elsewhere. Linear combinations of the fixed effects (denoted by $L$) in equation (4) can then be tested using an F statistic [13] constructed as

\[ F = \frac{\hat\beta' L\,(L'\hat{C}L)^{-1} L'\hat\beta}{\mathrm{rank}(L)}. \tag{6} \]

When a linear mixed model is fit to microarray data one gene at a time, the design matrices $X$ and $Z$ are the same for all genes. Therefore, the general linear mixed model for gene $g$ can be expressed as

\[ Y = X\beta_g + Zu_g + \epsilon_g. \tag{7} \]

The statistic defined in equation (6) can then be used to test the fixed effect $\beta_g$ directly for each gene, which is what is called a gene-specific F test ($F_1$) [14]. Because the variance components in this test are estimated based on the information from only one gene, the power of the test is likely to be low in experiments with only a few RNA samples. Other F-like statistics ($F_2$ and $F_3$) defined by Cui and Churchill [3] can

borrow information from other genes for estimating the variance components. $F_3$ uses the pooled variance estimator $\hat\sigma^2_{pool}$ for each variance component. For balanced designs, $\hat\sigma^2_{pool}$ is an average across genes of the individual variance estimates. $F_2$ uses the average of $\hat\sigma_g^2$ and $\hat\sigma^2_{pool}$ for each component. We define a new F-like statistic ($F_S$), which uses $\tilde\sigma_g^2$ from the shrinkage estimator in equation (3) as the variance component estimator for each gene. The variance component estimators are then used in equations (5) and (6) to compute the corresponding F statistics. Therefore, if we define the magnitude of the effects to be tested, such as the sum of squares of relative expression levels of the samples in experiments comparing gene expression among multiple samples, as $\Delta_g$, the four F tests for the $g$th gene can be written as

\[ F_1 = \frac{\Delta_g}{\hat\sigma_g^2}, \quad F_2 = \frac{\Delta_g}{\tfrac{1}{2}(\hat\sigma_g^2 + \hat\sigma^2_{pool})}, \quad F_3 = \frac{\Delta_g}{\hat\sigma^2_{pool}}, \quad F_S = \frac{\Delta_g}{\tilde\sigma_g^2}. \tag{8} \]

The justification for choosing one of these four statistics depends on what we are willing to assume about the variability of the variances across genes. If all variance components are constant across genes (homogeneous variances), then $F_3$ is the right statistic. If the variance components are gene specific (heterogeneous variances), then $F_1$ is the right statistic. Statistics like $F_2$ and $F_S$ may be more efficient when there is limited information to estimate the gene-specific variance components. Comparisons of these tests in different situations are described in section 4.

For simple microarray experiments, fixed effects ANOVA models, a special case of the general linear mixed model with empty $Z$ and $u$ in equation (4), can be used for modeling and computational convenience. The error variance for each gene can be estimated using the residual mean square error (MSE), which is the residual sum of squared errors (SSE) divided by its degrees of freedom ($\nu$).
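The estimator in equation (3) and the four statistics in equation (8) can be sketched in a few lines of Python for the fixed effects case. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the constants $m$, $V$ and $B$ are approximated by Monte Carlo rather than read from Table 1, the pooled variance is taken as the arithmetic mean of the per-gene MSEs (the balanced-design case), and $G > 3$ is assumed.

```python
import math
import random


def log_chi2_constants(nu, n_draws=100_000, seed=0):
    """Monte Carlo estimates of m, V and B for ln(chi2_nu / nu)."""
    rng = random.Random(seed)
    # A chi-squared variable with nu d.f. is Gamma(shape=nu/2, scale=2).
    draws = [math.log(rng.gammavariate(nu / 2.0, 2.0) / nu) for _ in range(n_draws)]
    m = sum(draws) / n_draws
    V = sum((d - m) ** 2 for d in draws) / (n_draws - 1)
    return m, V, math.exp(-m)  # B = exp(-m) is the bias correction


def shrink_variances(sse, nu, m, V):
    """Equation (3) applied to per-gene residual sums of squares X_g (assumes G > 3)."""
    G = len(sse)
    log_x = [math.log(x) for x in sse]
    mean_log_x = sum(log_x) / G
    spread = sum((lx - mean_log_x) ** 2 for lx in log_x)
    # Truncated factor (1 - (G - 3) V / spread)_+ ; zero spread means full pooling.
    weight = max(0.0, 1.0 - (G - 3) * V / spread) if spread > 0 else 0.0
    # Log of the bias-corrected geometric mean of X_g / nu.
    centre = mean_log_x - math.log(nu) - m
    return [math.exp(centre + weight * (lx - mean_log_x)) for lx in log_x]


def f_statistics(delta, mse, shrunk):
    """Equation (8) for each gene, given effect sizes and variance estimates."""
    pooled = sum(mse) / len(mse)  # assumed balanced design: mean of per-gene MSEs
    return {
        "F1": [d / s for d, s in zip(delta, mse)],
        "F2": [d / (0.5 * (s + pooled)) for d, s in zip(delta, mse)],
        "F3": [d / pooled for d in delta],
        "FS": [d / s for d, s in zip(delta, shrunk)],
    }
```

When the per-gene SSEs are all equal the weight is zero and every gene receives the corrected geometric mean; as the SSEs become more dispersed the weight approaches one and the estimates stay close to the gene-specific values.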
Thus, the denominators of $F_1$, $F_2$, $F_3$, and $F_S$ in equation (8) can be estimated from these MSEs across the genes.

The null distributions of the modified F statistics are not readily available. The $F_1$ test for a fixed effects ANOVA model has a standard F distribution, and critical values could be obtained from the F tables under typical distributional assumptions; however,

when mixed effects ANOVA models are used, the $F_1$ in equation (8) does not strictly follow the F distribution, although a conservative approximation can be obtained [13]. Since $F_2$, $F_3$, and $F_S$ are not standard F statistics, their null distributions have to be established by permutation analysis [15]. In fact, because distributional assumptions are sometimes questionable in practice, it may be prudent to establish all critical values by permutation analysis.

Permutation is a nonparametric approach to establishing the null distribution of a test statistic. The key idea of permutation is to identify units that are exchangeable under the null hypothesis. In microarray experiments, if we allow for gene-specific variance heterogeneity, then the unit must be whole arrays. The arrays to be shuffled will depend on the design of the experiment and the factor(s) being tested. Two-color arrays are slightly more complex than single-color systems, as the pairing between the two channels of the array must be maintained in the permuted units. To execute the permutation analysis we generate random shuffles ($p = 1, \ldots, P$) of these units and compute a new set of statistics $F_g^{(p)}$ ($g = 1, \ldots, G$) for each shuffle. A common threshold for the test statistics is established using percentiles of the entire collection of $F_g^{(p)}$ values over indices $p$ and $g$. Because of the heavy computational demand, we permute only 100 times, which requires about an hour on a 32-node Beowulf cluster for a 2000-gene experiment with 30 arrays. Computationally efficient methods are under investigation.

4 Simulation Studies

To compare the power of tests based on each of the four F statistics, we first simulated an abstracted canonical form and then simulated microarray experiments based on estimated parameters from real data sets. The first microarray simulation is based on a simple experiment comparing 5 samples and the second is based on a more complex experiment with biological replications.
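The pooled permutation threshold described in section 3 can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes single-channel data with one value per gene per array, shuffles array labels as whole units so that every gene sees the same shuffle, and uses an ordinary one-way F statistic as a stand-in for the statistics in equation (8). The function names and arguments are hypothetical.

```python
import math
import random


def one_way_f(values, labels):
    """Classical one-way ANOVA F statistic for one gene."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_threshold(data, labels, stat_fn, n_perm=100, alpha=0.05, seed=0):
    """Common critical value from the pooled F_g^(p) over all genes and shuffles.

    data   : list of genes, each a list with one value per array
    labels : one condition label per array (the exchangeable units)
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    pooled = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # relabel whole arrays, identically for every gene
        pooled.extend(stat_fn(gene, shuffled) for gene in data)
    pooled.sort()
    # Empirical (1 - alpha) quantile of the pooled null statistics.
    idx = min(len(pooled) - 1, math.ceil((1 - alpha) * len(pooled)) - 1)
    return pooled[idx]
```

A gene would then be declared significant when its observed statistic exceeds the returned threshold. For two-color arrays the shuffled unit would have to carry both channels together, which this sketch does not model.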

4.1 Canonical simulation

To evaluate the tests based on the four F statistics in a general setting, we simulated data in a canonical form and studied the power of each test at several levels of variance heterogeneity, represented by the coefficient of variation (CV) of the variances, and several degrees of freedom ($\nu$). We define the canonical form of this problem as $\hat\theta_{g,t} = \theta_{g,t} + \epsilon_{g,t}$ for gene $g = 1, \ldots, G$ and treatment $t = 1, \ldots, T$, where $\theta_{g,t}$ represents the relative expression level of gene $g$ under treatment condition $t$, and $\epsilon_{g,t}$ is the gene-specific residual error ($\epsilon_{g,t} \sim N(0, \sigma_g^2)$) associated with estimating $\theta_{g,t}$. In this simulation, the residual variances, $\sigma_g^2$, were drawn randomly from the residual variance estimates from the tumor data set described in section 4.3. To vary the CV of these residual variances while keeping their geometric mean constant, we rescaled them using a tuning parameter $\tau$:

\[ Z_g = \frac{\sigma_g^{2\tau}}{gm(\sigma_g^{2\tau})}\, gm(\sigma_g^2), \tag{9} \]

where $gm$ stands for geometric mean. When $\tau = 0$, CV = 0, corresponding to the homogeneous variance case. We study four cases where $\tau$ is 0, 0.78, 1.5 and 2.3, which correspond to CVs of 0, 1, 4 and 20. The two middle cases are typical of real microarray data. The treatment effect for each gene can be written as

\[ \Delta_g = \frac{1}{T-1} \sum_{t=1}^{T} (\hat\theta_{g,t} - \hat\theta_{g\cdot})^2. \tag{10} \]

This is also the common numerator for all four F statistics in equation (8). In this case, the denominators of all F statistics are obtained using the residual $MSE_g$ in the place of $\hat\sigma_g^2$ in equation (8). The residual $MSE_g$ for each gene was generated from a chi-square distribution according to ANOVA theory based on the true residual variance $Z_g$, $MSE_g \sim Z_g \chi^2_\nu/\nu$, where $\nu$ is the degrees of freedom for $MSE_g$ when fitting a fixed effects ANOVA model for each gene. We studied many different degrees of freedom but report only $\nu = 2$, 6, and 50 here to represent small, moderate and very large microarray experiments.
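The rescaling in equation (9), the effect size in equation (10) and the draw of $MSE_g$ can be sketched as below. This is a minimal sketch with hypothetical function names; the input variances in any usage would be placeholders, since the tumor data estimates are not reproduced here.

```python
import math
import random


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def rescale_variances(sigma2, tau):
    """Equation (9): change the spread of the variances, keep their geometric mean."""
    powered = [s ** tau for s in sigma2]  # sigma_g^(2 tau)
    scale = geometric_mean(sigma2) / geometric_mean(powered)
    return [p * scale for p in powered]


def coefficient_of_variation(values):
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return sd / mean


def treatment_effect(theta_hat):
    """Equation (10) for one gene: variance of the T estimated treatment levels."""
    T = len(theta_hat)
    mean = sum(theta_hat) / T
    return sum((t - mean) ** 2 for t in theta_hat) / (T - 1)


def simulate_mse(z, nu, rng):
    """Draw MSE_g ~ Z_g * chi2_nu / nu for every gene."""
    # A chi-squared variable with nu d.f. is Gamma(shape=nu/2, scale=2).
    return [zg * rng.gammavariate(nu / 2.0, 2.0) / nu for zg in z]
```

Setting `tau=0` collapses all variances to their geometric mean (CV of zero), `tau=1` leaves them unchanged, and larger values of `tau` increase the CV.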

To establish the null distribution for the F tests, we set $\theta_{g,t} = 0$ for all $g = 1, \ldots, 5000$, $t = 1, \ldots, 5$. We calculated $F_1$, $F_2$, $F_3$, and $F_S$ for each gene and then used the 95% quantiles as the critical values. To calculate the power for each F test, we generated 5000 non-zero $\theta_g$. Because the power of a test depends on the magnitude of the effect ($\Delta_g$), we study the power of the tests as a function of $\Delta_g$. Specifically, we let $Q_{g,t} \sim N(0, )$ and $\theta_{g,t} = K Q_{g,t} / \sqrt{\sum_{t=1}^{5} Q_{g,t}^2}$; consequently, $K = \sqrt{\Delta_g (T-1)}$. By varying $K$, we can vary the treatment effect.

Figure 1 shows the power of the four tests as a function of $\sqrt{\Delta_g (T-1)}$ for degrees of freedom $\nu = 2, 6, 50$ and heterogeneity CV = 0, 1, 5, and 20. When all the treatments are identical, $\sqrt{\Delta_g (T-1)} = 0$ and the null hypothesis $H_0$ holds. In general, $F_1$ shows good power only when $\nu$ is large ($\nu > 6$). $F_3$ has good power only when variance heterogeneity is low (CV < 1). $F_2$ is similar to $F_3$ but more robust. It still has good power when the CV is about 4. The power of the $F_2$ and $F_3$ tests decreases when the CV increases. When the CV is larger than 10, $F_3$ loses power completely and $F_2$ loses most of its power. Compared with the other tests, $F_S$ is the most robust and is usually the most powerful or nearly so. It is more powerful than $F_1$ in all the situations. The improvement is quite substantial when $\nu$ is small. It also has a large advantage over $F_2$ and $F_3$ when the CV is large. When the CV is small, the power of $F_S$ is still comparable to that of $F_2$ and $F_3$.

4.2 Analysis and simulation of a microarray experiment Case I: Technical replication

To compare the four tests in a simple microarray experiment, we applied them to experimental data and performed simulations based on the results of this experiment.
The experiment compared two human colon cancer cell lines, CACO2 and HCT116, and three human ovarian cancer cell lines, ES2, MDAH2774 and OV1063, using a loop design as shown in Figure 2A (unpublished data available at /labsite/datasets/index.html). Fluorescent dye labeled cDNA targets were hybridized to DNA microarrays containing 9600 human cDNA clones from the Research Genetics sequence verified human cDNA collection (Invitrogen, Carlsbad, CA) spotted in

duplicate. Slides were scanned using the GenePix4000 microarray scanner and the median intensities of each spot were calculated using image processing software (Axon Instruments, Inc., Foster City, CA). To simplify the analysis, the two spots for the same gene on each array were averaged at the original signal level. The data were then intensity lowess transformed [16] and normalized before fitting the following ANOVA model to each gene,

\[ y_{ijk} = \mu + S_i + D_j + A_k + \epsilon_{ijk}. \tag{11} \]

In this model, $\mu$ is the gene mean; $S_i$ ($i = 1, \ldots, 5$) is the sample effect; $D_j$ ($j = 1, 2$) is the dye effect; $A_k$ ($k = 1, \ldots, 10$) is the array effect; $\epsilon_{ijk}$ is the residual. Terms $\mu$, $S_i$ and $D_j$ are treated as fixed. Term $A_k$ is treated as random. To put this model in the context of the general linear mixed model (equation 7), $\mu$, $S_i$ and $D_j$ belong to $\beta$ and the dimension of the $X$ matrix is ; $A_k$ belongs to $u$ and the dimension of $Z$ is .

The variance components of $A_k$ and $\epsilon_{ijk}$ were estimated [11] for each gene and their distributions were compared (Figure 3A). The array variance is substantially larger than the residual variance but it has smaller heterogeneity (CV = 1.34) than the residual variance (CV = 1.79). We note that array variance has little impact on the F tests because of the experimental design [17]; therefore, the array effect can be treated as a fixed effect in simple experiments like this one for computational simplicity.

The four F test statistics were constructed for model (11) and their null distributions were established through permutation analysis [2, 3, 15]. The permutation unit in this case is the array (the two columns, Cy3 and Cy5, of data from each array). At a nominal significance level of 0.01, $F_1$, $F_2$, $F_S$ and $F_3$ detected 1588, 2012, 1896 and 981 significant genes, respectively. The volcano plot (Figure 3B) illustrates the differences among the four F tests.
The significant genes for $F_1$ are located above the horizontal line and those for $F_3$ are located to the right of the vertical line. The significant genes identified by $F_S$ and $F_2$ are indicated by red and yellow coloring respectively and are generally in the upper right corner. These two tests are largely concordant but $F_S$ is more similar to $F_1$, indicating that $F_S$ is sensitive to variance heterogeneity in these data.

To study the type I error rate and power of each F test, we simulated 10 data sets each with 1000 constant genes and 1000 differentially expressed genes based on this

design. The individual $S_i$ were drawn randomly from the distribution $N(0, )$. The $\mu$ and $D_j$ were drawn from normal distributions $N(0, )$ and $N(0, )$, respectively. Fixed effects parameter values were held constant across all simulations. For each simulation, $A_k$ was drawn randomly from a normal distribution $N(0, )$ and the residuals ($\epsilon_{ijk}$) were drawn randomly from the normal distribution $N(0, \sigma_g^2)$, where the gene-specific variance $\sigma_g^2$ was drawn randomly from the 9600 estimates of residual variance of the above data set. The variability of the residual variances was controlled by $\tau$ in the same fashion as for the canonical simulation, but the value of $\tau$ was set to 0.8, 1, and 1.5 to cover only the ranges of variability that we have seen in real data sets. The corresponding CVs are about 1.2, 1.8, and 3.7.

The averaged results of the 10 simulations at a nominal significance level of 0.05 are shown in Table 2. Among the 1000 null model genes, fewer than 50 false positives were detected by each F test, which indicates that the actual type I error rate is somewhat lower than the expectation in each case. The numbers of false positives for $F_2$ and $F_3$ are much lower than expected, which indicates that these two F tests may be overly conservative. Among the 1000 differentially expressed genes, the majority were identified by all four F tests, but the number of identified true positives decreases as the CV increases. The rate of decrease for $F_1$ and $F_S$ is smaller than that for $F_2$ and $F_3$. $F_S$ identifies fewer true positives than $F_2$ when the CV is around 1.2 and 1.8, but it identifies more than $F_2$ when the CV is around 3.7. $F_2$ and $F_3$ identify fewer genes when the CV is around 3.7. $F_S$ identifies more true positives than $F_1$ and $F_3$ regardless of the degree of heterogeneity, a reflection that $F_S$ is more powerful than $F_1$ and $F_3$. The power plots of these F tests against the sample effect are shown in Supplemental Figure 1 (A to C).
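The counts reported for each simulated data set (false positives among the constant genes, true positives among the differentially expressed genes) and the realized false discovery proportion of the detected list can be tallied with a short helper. The sketch below assumes per-gene permutation p values and a known truth label for each simulated gene; the function name and the returned keys are illustrative only.

```python
def tally_detections(p_values, is_null, alpha=0.05):
    """Count false and true positives at level alpha for one simulated data set."""
    detected = [p <= alpha for p in p_values]
    false_pos = sum(d and null for d, null in zip(detected, is_null))
    true_pos = sum(d and not null for d, null in zip(detected, is_null))
    n_null = sum(is_null)
    n_alt = len(is_null) - n_null
    n_detected = false_pos + true_pos
    return {
        "false_positives": false_pos,
        "true_positives": true_pos,
        # Realized type I error rate: false positives among the null genes.
        "type_I_error": false_pos / n_null if n_null else 0.0,
        # Realized power: true positives among the non-null genes.
        "power": true_pos / n_alt if n_alt else 0.0,
        # Share of the detected list that is false (realized FDR).
        "fdr": false_pos / n_detected if n_detected else 0.0,
    }
```

With 1000 null genes and a nominal level of 0.05, about 50 false positives would be expected if the type I error were controlled exactly at the nominal level.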
The relationships among all four F tests are similar to those obtained in the canonical simulation. The results of individual simulations are shown in Supplemental Tables 1A-1C.

Comparison among the false discovery rates (FDR) of the F tests shows that $F_S$ has a relatively low FDR. In general, a more powerful test tends to have a smaller FDR. This is consistent with the fact that both the FDR proposed by Benjamini and Hochberg (1995) [18] and the positive false discovery rate (pFDR) proposed by Storey (2002) [19] tend to be smaller for smaller p values, a result achieved when a more

powerful test is used. In other words, when the type I error is controlled at a specified level, a more powerful test will detect more significant genes, so the FDR of the detected gene list is smaller. On the other hand, if we control the FDR at a certain level, a more powerful test will give a longer significant gene list; however, the type I error rates of $F_2$ and $F_3$ are smaller than specified in these simulations, which results in smaller FDRs in some cases, although their powers are not necessarily higher.

4.3 Analysis and simulation of a microarray experiment Case II: Biological replication

A recent and promising trend in microarray experiments is to include biological replicates of samples in order to account for inherent biological variation. To accommodate this trend, mixed linear models with biological replicates treated as random effects are required. Here we analyze a representative data set and perform simulations based on this data set to compare the properties of the four F-like tests in this type of experimental setting.

The granulosa cell tumor microarray experiment was performed using eight-week-old SWXJ-9 mice. The effects of dietary androgenic supplementation (DHEA, testo and control) were assessed. RNA samples from each mouse were compared to the Stratagene reference RNA using two microarrays each with dye labeling reversed (Figure 2B). Fluorescent dye labeled cDNA targets were hybridized to DNA microarrays printed with the NIA clone set spotted in duplicate. Slides were scanned and the mean intensities of each spot calculated using the GenePix4400 microarray scanner and image processing software (unpublished data are at /index.html).
The raw data were preprocessed as described above before fitting the following mixed ANOVA model [14] for each gene,

\[ y_{ijk} = \mu + A_i + D_j + T_k + M_l + R_h + \epsilon_{ij}, \tag{12} \]

with $\mu$ for the gene mean, $A_i$ for the array effect ($i = 1, \ldots, 30$), $D_j$ ($j = 1, 2$) for the dye effect, $T_k$ ($k = 1, 2, 3$) for the treatment effect, and $M_l$ ($l = 1, \ldots, 15$) for the mouse effect. $R_h$

is an indicator of reference ($h = 1$) versus tissue sample ($h = 2$), which is determined by the combination of array and dye. We treat $\mu$, $D_j$, $T_k$ and $R_h$ as fixed effects. The effect of the biological replicate, mouse ($M_l$), is treated as a random effect. Therefore, the mouse variance is included along with the error variance in tests that compare treatments ($T_k$) [3, 20, 21]. The array effect ($A_i$) is also considered a random effect but it has a small effect on the F statistics, as mentioned above.

The variance components of mouse, array, and residual for this data set were estimated using REML [11]. Their distributions are shown in Figure 3C. The array variance is the largest component and has only moderate heterogeneity (CV = 1.5). The mouse variance is the smallest, but it has the greatest heterogeneity (CV = 3.4). Most of the genes have small mouse variance, but a small proportion of genes show large variation across individual mice. The residual variances are intermediate in size, between the array and mouse components, and have only moderate heterogeneity (CV = 1.7).

The four F statistics were computed for each gene and their null distributions were established using permutation analysis. The permutation unit in this case is the mouse, which consists of a dye-swap pair of arrays, because mouse is the factor nested under the tested factor, treatment (Figure 2B). At a nominal significance level of 0.01, the $F_1$, $F_2$, $F_3$ and $F_S$ tests detect 295, 348, 252 and 333 genes respectively. The volcano plot of these F tests is shown in Figure 3D.

To study the type I error and power of the four tests in this experimental setting, we performed 10 simulations each having 1000 constant genes and 1000 differentially expressed genes based on the design and variance component estimates of this experiment. The simulations were similar to those in the previous subsection. The settings for the fixed effects $\mu$, $T_k$ and $D_j$ were the same as the corresponding fixed effects of model (11).
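The mouse-level permutation used for the null distributions in this section can be sketched as below: treatment labels are shuffled among mice, and both arrays of a mouse's dye-swap pair inherit the mouse's new label. The data layout (a mapping from mouse to treatment and a list giving the mouse for each array) and the function names are assumptions made for illustration.

```python
import random


def permute_mouse_treatments(mouse_treatment, rng):
    """Shuffle treatment labels across mice, the exchangeable units under H0."""
    mice = sorted(mouse_treatment)
    labels = [mouse_treatment[m] for m in mice]
    rng.shuffle(labels)
    return dict(zip(mice, labels))


def array_treatments(array_mouse, mouse_treatment):
    """Treatment label per array; both arrays of a dye-swap pair follow their mouse."""
    return [mouse_treatment[m] for m in array_mouse]


if __name__ == "__main__":
    # 15 mice, 5 per treatment, each measured on a dye-swap pair of arrays.
    design = {mouse: "T%d" % (mouse // 5 + 1) for mouse in range(15)}
    arrays = [mouse for mouse in range(15) for _ in range(2)]
    shuffled = permute_mouse_treatments(design, random.Random(0))
    print(array_treatments(arrays, shuffled))
```

Shuffling individual arrays instead would break up the dye-swap pairs and ignore the mouse variance, so the resulting null distribution would not match the test of treatment.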
The settings of $A_i$ and $\epsilon_{ij}$ were the same as before except that the $\sigma_g^2$ for $\epsilon_{ij}$ was drawn randomly from the estimates of residual variance of this data set. The variance of the random mouse effect, $M_l$, was sampled from the mouse variance component estimates. The reference effect $R_h$ was drawn randomly from the distribution $N(0, )$.

The numbers (averaged over 10 simulations) of true and false positives for each F test at a nominal significance level of 0.05 are shown in Table 3. The numbers of false positives are all close to the expectation (50), indicating that the type I error is controlled

at the specified level. The power of each F test decreases as the CV increases, especially for $F_2$ and $F_3$. Again, $F_S$ shows an advantage over $F_1$. More importantly, it shows a greater advantage over $F_2$ and $F_3$ at large CVs than was observed in the microarray simulation without biological replication in Table 2, indicating that when biological variation is included in the computation of the error variance, $F_S$ could be advantageous. The power comparison among all four F tests against the treatment effect is shown in Supplemental Figure 1 (D to F). The results from each of the 10 simulations are shown in Supplemental Tables 1D-1F.

5 Discussion

Variance components in microarray experiments display varying degrees of heterogeneity across experiments, across variance components, and across genes within a variance component [17]. Assumptions of variance heterogeneity lead to the use of individual gene-specific tests, such as $F_1$, but these tests often suffer from low power due to small degrees of freedom. On the other hand, the assumption of common variance leads to powerful tests, such as $F_3$, but at the risk of generating false detections in the event that the common variance assumption is not true. A better approach is to use tests based on variance estimates that are gene specific but combine information across many genes. We gain power by utilizing more information in the data but can also avoid bias.

Previous researchers have proposed related approaches such as the Empirical Bayes approach. However, the advantage of many of these methods over the traditional approach such as $F_1$ either has not been demonstrated or has been demonstrated to be small. In this paper, we apply James-Stein-Lindley shrinkage to improve estimated variance components in a linear mixed model.
We show that the resulting test statistic $F_S$ performs better than the standard gene-specific test $F_1$ and that the improvement in power can be substantial, especially when the degrees of freedom are small, a common situation for microarray experiments.

By taking a shrinkage approach to improve variance estimation, we make very weak prior assumptions about the distribution of the variance components. In some simple

settings, such as estimating a normal mean which has a normal prior, the Empirical Bayes approach and the shrinkage approach lead to exactly the same estimator [22]. But in this setting, the Empirical Bayes approach is very complicated. Our proposed statistic, $F_S$, has an explicit expression and is computationally simple.

In summary, we have proposed a simple variation on the general mixed model testing strategy. By using shrinkage estimates of variance components we have obtained a test statistic that is both powerful and robust in the face of variance heterogeneity.

Acknowledgment

We would like to thank Hao Wu for software support, Qian Li for assistance with data handling, and Ann Dorward at the Jackson Laboratory and John Quackenbush at TIGR for providing sample data sets. This research is supported by grants CA88327, HL66620, and HL55001 from the National Institute of Health.

References

[1] Callow, M. J, Dudoit, S, Gong, E. L, Speed, T. P, & Rubin, E. M. (2000) Genome Research 10,
[2] Kerr, M. K, Martin, M, & Churchill, G. A. (2000) J. Comput Biol 7,
[3] Cui, X & Churchill, G. A. (2003) Genome Biology 4, art210.
[4] Storey, J & Tibshirani, R. (2003) in The analysis of gene expression data: methods and software, eds. Parmigiani, G, Garrett, E. S, Irizarry, R. A, & Zeger, S. L. (Springer, New York).
[5] Baldi, P & Long, A. D. (2001) Bioinformatics 17,
[6] Lönnstedt, I & Speed, T. (2002) Statistica Sinica 12,
[7] Kendziorski, C. M, Newton, M. A, Lan, H, & Gould, M. N. (2003) On parametric empirical bayes methods for comparing multiple groups using replicated gene expression profiles. newton/research/arrays.html.

[8] Newton, M. A, Noueiry, A, Sarkar, D, & Ahlquist, P. (2003) Detecting differential gene expression with a semiparametric hierarchical mixture method. newton/papers/abstracts/tr1074a.html.
[9] Lindley, D. V. (1962) Journal of the Royal Statistical Society Series B 24,
[10] Ghosh, M, Hwang, J, & Tsui, K. (1984) Journal of Multivariate Analysis 14,
[11] Searle, S, Casella, G, & McCulloch, C. (1992) Variance components. (John Wiley and Sons, Inc., New York, NY).
[12] Witkovsky, V. (2002) Matlab algorithm mixed.m for solving Henderson's mixed model equations.
[13] Littell, R. C, Milliken, G, Stroup, W. W, & Wolfinger, R. D. (1996) SAS system for mixed models. (SAS Institute Inc., Cary, NC).
[14] Wolfinger, R. D, Gibson, G, Wolfinger, E. D, Bennett, L, Hamadeh, H, Bushel, P, Afshari, C, & Paules, R. S. (2001) J Comput Biol 8,
[15] Wu, H, Kerr, M. K, Cui, X, & Churchill, G. A. (2003) in The analysis of gene expression data: methods and software, eds. Parmigiani, G, Garrett, E. S, Irizarry, R. A, & Zeger, S. L. (Springer, New York).
[16] Cui, X, Kerr, M. K, & Churchill, G. A. (2003) Statistical Applications in Genetics and Molecular Biology 2, No.1 art4.
[17] Cui, X & Churchill, G. A. (2003) in Methods of Microarray Data Analysis III, eds. Lin, S. M & Allred, E. T. (Kluwer Academic Publishers, New York).
[18] Benjamini, Y & Hochberg, Y. (1995) Journal of the Royal Statistical Society, Series B 57,
[19] Storey, J. D. (2002) Journal of the Royal Statistical Society, Series B 64,
[20] McLean, R. A, Sanders, W. L, & Stroup, W. (1991) The American Statistician 45,
[21] Churchill, G. A. (2002) Nat Genet 32 Suppl 2,

[22] Efron, B & Morris, C. (1973) Journal of the American Statistical Association 68,

Figure Legends

Figure 1. Power comparison among the four F tests using the canonical simulations. In each panel, the power of each F test is plotted against the treatment effect (t 1). The variability of the individual variances is controlled by τ, shown at the top, and is reflected by the coefficient of variation (CV) shown at the upper left corner of each panel. The degrees of freedom ν (2, 6, and 50) are also noted at the upper left corner of each panel. The nominal type I error rate of 0.05 is indicated by a solid blue line.

Figure 2. Illustration of the microarray designs used in the paper. Panel A is a double loop design comparing 5 samples (S1 to S5) used by the TIGR loop experiment. Panel B is a reference design used in the tumor experiment to compare three treatments (T1, T2, and T3) using 5 mice for each treatment and one dye-swap pair of arrays for each mouse. Each arrow represents an array, with the head pointing to the Cy3-labeled sample and the tail pointing to the Cy5-labeled sample. R, reference sample.

Figure 3. Variance component plots and volcano plots of the TIGR loop and tumor data sets. Panel A shows the smoothed histograms of the two variance components of the TIGR loop data set when A k is treated as random (equation 11). Panel B is the volcano plot from the TIGR loop experiment. Panel C shows the smoothed histograms of the three variance components from the tumor data set. Panel D is the volcano plot of the tumor data set. In the volcano plots, the -log 10 of the permutation-based F 1 p values is plotted against the tested effect. Horizontal and vertical lines represent the 0.01 nominal significance level for F 1 and F 3, respectively. Red points, F 2 significant; yellow squares, F S significant.

Figure 4 (Supplemental Figure 1). The average power of the four F tests from 10 microarray simulations. The data were simulated with the fixed effect ANOVA model (A, B, and C) or the mixed effects ANOVA model with two variance components (D, E, and F). The values of these variance components are randomly drawn from a typical real data set. The variability of the variances across genes is controlled by τ (0.8, 1, 1.5) and reflected by CV r and CV m. CV r, CV of the residual variance component; CV m, CV of the mouse variance component.

Figure 5 (Supplemental Figure 2). Representative volcano plots of the simulations based on the TIGR data set. The data were generated for analysis with the fixed ANOVA model (equation 11). CV = 2.2. Panels A, B, and C correspond to non-differential, differential, and all genes, respectively. Therefore, the selected genes in panel A are false positives and the unselected genes in panel B are false negatives. Horizontal and vertical lines represent the 0.01 nominal significance level for F 1 and F 3, respectively. Red, F 2 significant; orange diamonds, F S significant.

ν    B    V/(2/ν)

Table 1: Values of B (bias correction) and V/(2/ν) as a function of ν. These values are used in equation (3) to construct the estimates that shrink the unbiased estimators of the variances toward their corrected geometric mean. When ν is greater than 50, B and V/(2/ν) are virtually 1.
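The entries of Table 1 can be recomputed from standard facts about the log chi-square distribution. The sketch below assumes that B = exp(-E[ln(χ²_ν/ν)]) and V = Var[ln(χ²_ν/ν)], which matches the caption's description of B as a bias correction for the geometric mean; with that reading, E[ln(χ²_ν/ν)] = ψ(ν/2) - ln(ν/2) and V = ψ'(ν/2), where ψ is the digamma function. The function names are hypothetical, and `shrink_variances` is only an assumed positive-part Lindley form of equation (3), which is not reproduced in this section.

```python
import math


def digamma(x: float) -> float:
    """psi(x) for x > 0: recurrence up to x >= 6, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return acc + math.log(x) - 0.5 / x - series


def trigamma(x: float) -> float:
    """psi'(x) for x > 0: recurrence up to x >= 6, then asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))
    return acc + inv + 0.5 * inv2 + inv * series


def bias_correction(nu: float) -> float:
    """Assumed B: exp(-E[ln(chi2_nu / nu)])."""
    return math.exp(-(digamma(nu / 2) - math.log(nu / 2)))


def log_variance(nu: float) -> float:
    """Assumed V: Var[ln(chi2_nu / nu)] = trigamma(nu / 2)."""
    return trigamma(nu / 2)


def shrink_variances(s2, nu):
    """Shrink unbiased variance estimates toward their bias-corrected
    geometric mean on the log scale (assumed form of equation (3))."""
    logs = [math.log(v) for v in s2]
    n_genes = len(logs)
    mean_log = sum(logs) / n_genes
    ss = sum((x - mean_log) ** 2 for x in logs)
    # Positive-part weight: 0 means full shrinkage to the common target.
    weight = max(0.0, 1.0 - (n_genes - 3) * log_variance(nu) / ss) if ss > 0 else 0.0
    target = math.exp(mean_log) * bias_correction(nu)
    return [target * math.exp(weight * (x - mean_log)) for x in logs]


if __name__ == "__main__":
    for nu in (2, 6, 50):
        print(nu, round(bias_correction(nu), 4), round(log_variance(nu) / (2 / nu), 4))
```

Under these assumptions both B and V/(2/ν) approach 1 as ν grows, which agrees with the caption's remark about ν greater than 50.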

        CV r = 1.2        CV r = 1.8        CV r = 3.7
        TP  FP  FDR       TP  FP  FDR       TP  FP  FDR
F 1
F 2
F S
F 3

Table 2: Average number of true and false differential genes identified by each F test from 10 simulations of the model in equation 11. Significance level is nominal. The total number of genes is 2000, with 1000 constant genes and 1000 differentially expressed genes. CV r, average CV of the residual variance; TP, true positives; FP, false positives. False discovery rate (FDR) is computed as FP/(FP+TP) for each simulation and the average is shown here. The results from individual simulations are shown in Supplemental Tables 1A-1C.

        CV r = 1.0, CV m = 2.2    CV r = 1.6, CV m = 3.3    CV r = 4.7, CV m = 7.9
        TP  FP  FDR               TP  FP  FDR               TP  FP  FDR
F 1
F 2
F S
F 3

Table 3: Average number of true and false differential genes identified by each F test from 10 simulations of the model in equation 12. Significance level is nominal. The total number of genes is 2000, with 1000 constant genes and 1000 differentially expressed genes. CV r, average CV of the residual variance; CV m, average CV of the mouse variances; TP, true positives; FP, false positives. False discovery rate (FDR) is computed as FP/(FP+TP) for each simulation and the average is shown here. The results from individual simulations are shown in Supplemental Tables 1D-1F.
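The FDR column in Tables 2 and 3 is an average of per-simulation ratios, which is generally not the same as the ratio of the averaged counts. A minimal sketch of the difference follows; the function names and the counts in the usage example are made up for illustration and are not taken from the tables.

```python
def mean_fdr(counts):
    """Average of per-simulation FDR = FP / (FP + TP).

    counts: list of (TP, FP) pairs, one per simulation.
    A simulation with no detections contributes an FDR of 0 (an assumption).
    """
    rates = [fp / (fp + tp) if (fp + tp) > 0 else 0.0 for tp, fp in counts]
    return sum(rates) / len(rates)


def pooled_fdr(counts):
    """FDR computed from counts summed over all simulations, for contrast."""
    total_tp = sum(tp for tp, _ in counts)
    total_fp = sum(fp for _, fp in counts)
    return total_fp / (total_fp + total_tp)


if __name__ == "__main__":
    # Hypothetical (TP, FP) counts for two simulations.
    example = [(90, 10), (10, 10)]
    print(mean_fdr(example), pooled_fdr(example))
```

The two summaries diverge when the number of detections varies across simulations, so the averaging order stated in the captions matters when comparing against other reports.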

Figure 1:

Figure 2: Panel A, loop design over samples S1 to S5. Panel B, reference design with treatments T1 to T3, mice M1 to M15, and the reference sample R on every array.

Figure 3:

Figure 4:

Figure 5: Panel A, non-differential genes; panel B, differential genes; panel C, all genes. Horizontal axis, sqrt(sum(sample^2)); vertical axis, log 10 Pvalpg pool(f1).

Supplemental Table 1A (τ = 0.8). Columns: Simulation Number, Mean. Rows: CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1B (τ = 1). Columns: Simulation Number, Mean. Rows: CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1C (τ = 1.5). Columns: Simulation Number, Mean. Rows: CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1D (τ = 0.8). Columns: Simulation Number, Mean. Rows: CV m; CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1E (τ = 1). Columns: Simulation Number, Mean. Rows: CV m; CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1F (τ = 1.5). Columns: Simulation Number, Mean. Rows: CV m; CV r; F 1, F 2, F S, and F 3 false pos; F 1, F 2, F S, and F 3 true pos.

Supplemental Table 1. Number of false and true positives detected by the four F tests in the individual simulations. Supplemental Tables 1A to 1C are from the fixed effect ANOVA model simulation described in section 4.2. Supplemental Tables 1D to 1F are from the mixed effect ANOVA model simulation of section 4.3. Each simulation has 1000 non-differential genes and 1000 differential genes. The variability of individual variances is controlled by τ (0.8, 1, and 1.5). In different simulations, the same τ generates slightly different CVs because a random sub-population of 2000 variances was drawn from the variance estimates for mouse or residual. CV m, the CV (coefficient of variation) of the mouse variances; CV r, CV of the residual variances. Significance level is nominal 0.05.
