BIOINFORMATICS ORIGINAL PAPER

Vol. 21 no. , pages , doi:10.1093/bioinformatics/bti407

Gene expression

A practical false discovery rate approach to identifying patterns of differential expression in microarray data

Gregory R. Grant, Junmin Liu and Christian J. Stoeckert Jr.
Center for Bioinformatics, University of Pennsylvania, 1429 Blockley Hall, 423 Guardian Drive, Philadelphia, PA, USA

Received on December 15, 2004; revised on March 16, 2005; accepted on March 22, 2005
Advance Access publication March 29, 2005

ABSTRACT

Summary: Searching for differentially expressed genes is one of the most common applications for microarrays, yet statistically there are difficult hurdles to achieving adequate rigor and practicality. False discovery rate (FDR) approaches have become relatively standard; however, how to define and control the FDR has been hotly debated. Permutation estimation approaches such as SAM and PaGE can be effective; however, they leave much room for improvement. We pursue the permutation estimation method and describe a convenient definition of the FDR that can be estimated in a straightforward manner. We then discuss issues regarding the choice of statistic and data transformation. It is impossible to optimize the power of any statistic for thousands of genes simultaneously, and we look at the practical consequences of this. For example, the log transform can both help and hurt at the same time, depending on the gene. We examine issues surrounding the SAM fudge factor parameter, and how to handle these issues by optimizing with respect to power.
Availability: Java and Perl implementations are available at www.cbil.upenn.edu/page
Contact: ggrant@pcbi.upenn.edu

1 INTRODUCTION

A common use of microarrays is to find differentially expressed genes between two experimental conditions. The statistical problem of controlling the error rates has proven to be difficult using straightforward classical statistics, and instead the relatively new false discovery rate (FDR) approaches
(Benjamini and Hochberg, 1995) have become widely accepted as appropriate (Ge et al., 2003). Therefore we will not argue the merits of the FDR approach, but will take it as our starting point. The FDR approach is to accept some false positives, while attempting to control their proportion in the set of all genes predicted. Benjamini and Hochberg (1995) did not provide a general and powerful method for achieving FDR control. Instead, their original method relies on strong assumptions, and even when those hold the method can be quite conservative. The method starts with any gene-by-gene p-values; then, for a desired FDR Q, it adjusts the p-values and determines a cutoff so that the genes whose adjusted p-values are smaller than the cutoff have an expected proportion of false positives no more than Q. By describing one method as having greater power than another, we will mean that the method predicts a larger set of genes at the desired FDR.

Many methods have been proposed to make the Benjamini and Hochberg method more general and more powerful (e.g. Benjamini and Yekutieli, 2001; Pounds and Cheng, 2004); however, there are always assumptions required to obtain the gene-by-gene p-values. Typical assumptions are normality of intensity distributions, independence of p-value distributions across genes or identically distributed t-statistics. For the most part, the degree to which these assumptions affect the results on real data is unknown. The methods work well on simulated data, but since we do not yet understand the nature of real data well enough to properly simulate them, it is difficult to compare methods in a meaningful way. When there are only a few replicates available in each condition, and thousands of genes on the array, permutation p-values, or rank-based p-values such as those from the Wilcoxon rank sum statistic, which would overcome some of the parametric assumptions, are too granular to be useful. For example,
with three replicates in each condition, there are only 20 permutations, so an array with 20,000 genes would necessarily clump 1000 of them into the highest significance value, 0.05. The empirical Bayes method of Efron and Tibshirani (2002) is based on the Wilcoxon statistic and so requires a similar number of replicates. PaGE is a permutation-based method which attempts to avoid as many parametric assumptions as possible, while also avoiding the granularity of p-values and rank-based statistics altogether. PaGE (Manduchi et al., 2001) and SAM (Tusher et al., 2001) are methods that attempt to control the FDR without the use of corrected p-values. SAM and PaGE use methods of permutation estimation, as described below.(1) SAM has become a popular application; however, it relies on some heuristics whose consequences are not well documented and which in fact present several limitations, as we will show below. We propose a slightly different definition of the FDR and show why it is more straightforward for permutation estimation. We will then describe a permutation method to control the FDR. The most serious issue regarding any method is the choice of statistic. SAM has relied on a modified t-statistic which depends on an extra parameter. We will show how sensitive results can be to this parameter, and that the SAM method of setting this parameter does not generally have good power properties. The current PaGE approach to this problem is different, as described in Section 8.

(Footnote 1: PaGE 1.0 does not use permutations; the original PaGE algorithm was replaced by a permutation algorithm in a subsequent release.)

(c) The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oupjournals.org

Other issues relating to the statistic and data transformations are discussed. The conclusion is that no statistic is best for finding all differentially expressed genes, and relying on one statistic over another generally involves a tradeoff.

PaGE can be used to find differentially expressed genes between two conditions, and to generate patterns across several conditions. PaGE was introduced by Manduchi et al. (1999), and though the algorithm has changed significantly, the general approach of generating discrete patterns using an FDR-based confidence measure has remained unchanged. Implementations of PaGE are available, in Java and Perl, at www.cbil.upenn.edu/page. A complete description of the PaGE algorithm as implemented is available in the technical manual documentation.

2 DIFFERENTIAL EXPRESSION

We assume that there are two well-defined experimental conditions, and that each gene has a measurable expression intensity that follows some (unknown) distribution in each condition. Differential expression of a gene means that these distributions differ between the two conditions. The distributions can differ in any possible way, but the statistics we use are designed to be sensitive primarily to a difference in the means (e.g. the t-statistic). Even so, the hypotheses being tested are of equality of distributions. This is a necessary consequence of using the permutation methods that we do.

The data are assumed to consist of multiple quantified microarray experiments in each condition. There are many ways to design a study for comparative analysis. We will focus here on the two-sample case where there is a separate measurement in each condition, for each gene. Another popular design, the direct comparison design, is to hybridize the two conditions to the two channels of a two-channel array, respectively, and to generate some number of replicate arrays. The PaGE software handles all of these designs; however, for concision, the direct comparison
design will not be discussed here, nor other cases such as paired or reference designs; those cases are discussed in detail in the technical manual. The theory for all cases is similar.

We begin by considering just two experimental conditions, called condition 0 and condition 1. Condition 0 will be referred to as the reference condition. Up-regulation of a gene will mean the gene's mean intensity is higher in condition 1 as compared to condition 0, and analogously for down-regulation.

3 THE DATA

The data will be assumed to consist of some number m of replicate arrays in condition 0 and some number n of replicates in condition 1. We put the data in a matrix as in Equation (1) below:

         C_1   C_2   ...  C_m   D_1   D_2   ...  D_n
    G_1  c_11  c_12  ...  c_1m  d_11  d_12  ...  d_1n
    G_2  c_21  c_22  ...  c_2m  d_21  d_22  ...  d_2n
    G_3  c_31  c_32  ...  c_3m  d_31  d_32  ...  d_3n
    ...
    G_g  c_g1  c_g2  ...  c_gm  d_g1  d_g2  ...  d_gn      (1)

The m + n columns in Equation (1) correspond to hybridizations and the g rows correspond to genes. The columns labeled with the C_i's correspond to hybridizations from condition 0, and the columns labeled with the D_i's correspond to hybridizations from condition 1.

A permutation of the data consists of choosing some number k of columns from condition 0, then choosing the same number of columns from condition 1, and switching them. We then obtain two new conditions from the first m and the last n columns of the permuted data matrix. What is important in a permutation is which columns end up in which condition, not the order in which they happen to be listed. Therefore there are a total of (m+n choose n) possible permutations in the two-sample case. Denote a row of the data matrix by r. If p is a permutation of the columns, we denote the correspondingly permuted row by r_p.

4 THE STATISTICS

We denote by S any two-class statistic, and think of it as a function which maps rows of the data matrix to real numbers. Suppose there is some center point c such that S > c indicates up-regulation and S < c indicates down-regulation. The statistics
we have in mind are the following.

1. The modified t-statistic

    S((c_1, c_2, ..., c_m), (d_1, d_2, ..., d_n)) = (µ_1 − µ_0) / (α + σ),      (2)

where µ_0 is the mean of (c_1, c_2, ..., c_m), µ_1 is the mean of (d_1, d_2, ..., d_n), and

    σ = sqrt[ (σ_0^2 (m − 1) + σ_1^2 (n − 1)) / (m + n − 2) ],

where σ_0^2 = (1/(m − 1)) Σ_{j=1}^m (c_j − µ_0)^2 and σ_1^2 = (1/(n − 1)) Σ_{j=1}^n (d_j − µ_1)^2. In this case the center c equals 0. When α = 0 this is the standard two-sample t-statistic. As we will see (Section 8), results can be extremely sensitive to the value of α; therefore we refer to α as the t-statistic tuning parameter.(2)

2. The second statistic is the ratio of the means in the two conditions:

    S((c_1, c_2, ..., c_m), (d_1, d_2, ..., d_n)) = µ_1 / µ_0,

where µ_0 is the mean of (c_1, c_2, ..., c_m) and µ_1 is the mean of (d_1, d_2, ..., d_n). In this case the center c = 1.

A statistic need not make sense in all cases. For example, if there are negative intensities in the unlogged data then the ratio statistic is not sensible. We could apply the t-statistic to the unlogged data, to the logged data, or to any transformation of the data. Therefore, even with fixed α, the t-statistic is not one statistic but a family of statistics. This will be discussed further in Section 9. The ratio statistic makes the most sense when the data are on a multiplicative scale, such as two-channel ratios. The log ratio is similar to the modified t-statistic with a large value of α, so we do not consider it separately.

Let M be the data matrix, p a permutation of the columns and M_p the permuted data matrix. Denote the value of the statistic S on row r of M by S_r and on row r of M_p by S_rp.

(Footnote 2: This is analogous to the t-statistic fudge factor introduced by Tusher et al. (2001).)
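For concreteness, the two statistics can be written out in a few lines of numpy. This is our own sketch of Equation (2) and the ratio statistic, not code from the PaGE distribution; the function names are ours:

```python
import numpy as np

def modified_t(c, d, alpha=0.0):
    """Modified two-sample t-statistic (mu1 - mu0) / (alpha + sigma),
    where sigma is the pooled standard deviation.  alpha = 0 gives the
    standard two-sample t-statistic; alpha is the tuning parameter."""
    c, d = np.asarray(c, float), np.asarray(d, float)
    m, n = len(c), len(d)
    # pooled variance: (sigma0^2 (m-1) + sigma1^2 (n-1)) / (m + n - 2)
    sigma = np.sqrt((c.var(ddof=1) * (m - 1) + d.var(ddof=1) * (n - 1))
                    / (m + n - 2))
    return (d.mean() - c.mean()) / (alpha + sigma)

def ratio_of_means(c, d):
    """Ratio statistic mu1 / mu0; center c = 1.  Only sensible for
    positive (unlogged) intensities."""
    return np.mean(d) / np.mean(c)
```

Note that, following the paper's formula, the denominator omits the sqrt(1/m + 1/n) factor of the textbook t-statistic; only the relative ranking of genes matters here.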

5 THE FDR

We will focus on up-regulation in condition 1 versus condition 0. The case of down-regulation follows by switching the roles of conditions 0 and 1. We will also assume that larger values of S are more significant. This is the case for all of the statistics above. The case where smaller values are more significant follows by switching the direction of the inequalities.

For any row r we take the null hypothesis H_r^0 to be that the distribution for any observation in condition 0 is identical to the distribution for any observation in condition 1. Suppose that for g_0 of the rows the null hypothesis is true. Let g_1 = g − g_0. For each real number k > c, let G_k be the set of rows r of M such that S_r ≥ k. G_k is the set of predictions if we use k as the cutoff for the statistic. Let R_k be the size of G_k. Let V_k be the number of rows in G_k for which the null hypothesis is true. With this set-up, we will have made V_k false predictions out of R_k total predictions. Provided R_k > 0, we call the ratio V_k / R_k the false discovery proportion of this set of predictions. We choose k so that this proportion is controlled to a desired level.

There are many ways to define what it means to control this proportion, and our FDR definition differs from the original one of Benjamini and Hochberg (1995), as well as that of Storey (2002) and Storey and Tibshirani (2003). Throughout, we define the FDR of the procedure as

    FDR = E(V_k) / R_k  if R_k > 0,  and 0 if R_k = 0.      (3)

This differs from the original definition of Benjamini and Hochberg (1995), which is given by

    FDR = E(V_k / R_k)  if R_k > 0,  and 0 if R_k = 0.      (4)

The advantage of the original definition (4) is that it takes into account the dependence between V_k and R_k. However, an advantage of definition (3) is that it can be more realistically estimated via permutation distributions.(3) The goal is to find the least conservative (i.e. smallest) value of k so that (3) is acceptably low. Sometimes one is willing to tolerate a relatively high FDR, such as
0.5; other times a low FDR such as 0.05 is desired.

6 FDR ESTIMATION

For each permutation p of the data matrix we obtain a value V_k^p, the number of rows whose permutation statistic S_rp ≥ k. Thus we obtain a permutation distribution D_k of V_k under the complete null hypothesis (that is, when all null hypotheses are true). Note that the distribution D_k depends on the joint distribution of the S_r over all genes, and the joint distribution of the S_r restricted to the null genes is maintained by permuting the data matrix in columns. Let µ_k be the mean of D_k.

(Footnote 3: Indeed, to estimate definition (4) one needs to know something about the random properties of V/R. If we permute the columns of the data matrix (1), we can obtain some kind of approximation to an observation of V under the complete null hypothesis, but this tells us nothing about V/R under the true distribution of the data. The bootstrap distribution obtained by sampling with replacement from the two conditions separately will give us an approximation to the distribution of R, but again tells us nothing about V/R. To obtain information about V/R most authors have had to make strong assumptions about the data.)

Typically permutation distributions are used to derive p-values, for which there is substantial theory. Here, however, we are interested in actually estimating E(V_k) from the permutation distribution, and this requires some justification. Note that similar permutation estimates are utilized in the SAM theory of Storey and Tibshirani (2003); however, they do not address their properties (Pan, 2003).

There are two problems in using µ_k as an estimate of E(V_k). First, since it is calculated under the complete null hypothesis, it is at best a measure of how many hypotheses would be falsely rejected if they were all true. So, assuming that some hypotheses are false, µ_k would be an overestimate. Second, unless all hypotheses are true, the false hypotheses can cause the distribution of V_k to be different from what it would be if we could consider only the true hypotheses in defining D_k. Since we do not know which hypotheses are true, we must allow the false hypotheses to contribute to the counts involved in D_k.

Regarding the second issue, we argue that µ_k is conservative. Suppose that null hypothesis r is false. Then for permutations p which switch only one or a few columns between conditions, the false hypothesis r will tend to have large values of the statistic S_rp, and will therefore tend to contribute to the count V_k^p more than it would if H_r^0 were true. Similarly, down-regulated genes will tend to over-contribute to the count V_k^p for those permutations which switch most or all of the columns between the conditions. Therefore the estimate µ_k will tend to be larger than the true value of E(V_k). The more hypotheses that are false, the more conservative µ_k will be.

Turning to the first issue, since µ_k is an overestimate of E(V_k),

    R_k − µ_k      (5)

is an underestimate of the number of true positives (the rows in G_k for which the null hypothesis is false). Therefore g − (R_k − µ_k) is an overestimate of the number of true hypotheses. Originally µ_k was calculated as an estimate of V_k assuming all hypotheses are true. If we recalculate assuming there are g − (R_k − µ_k) true hypotheses, then we obtain

    µ_k(1) = (µ_k / g) [g − (R_k − µ_k)].

Since g − (R_k − µ_k) is an overestimate of the number of true hypotheses, µ_k(1) is still an overestimate of E(V_k); however, it is a better estimate than µ_k. Using the same logic, we calculate

    µ_k(2) = (µ_k / g) [g − (R_k − µ_k(1))]

and in general

    µ_k(i + 1) = (µ_k / g) [g − (R_k − µ_k(i))].

This sequence is decreasing and bounded below, and therefore converges. In fact it converges quickly, and PaGE takes µ_k(n) as its final estimate for V_k, where |µ_k(n) − µ_k(n − 1)| < 0.0001. We denote µ_k(n) by Ṽ_k. We take as estimate of the FDR

    FDR_k = Ṽ_k / R_k.

It is useful to also define the quantity CONF_k = 1 − FDR_k. CONF_k is an estimate of the probability that any gene taken at random from G_k is a true positive.
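The estimation procedure of this section can be condensed into a short sketch. The helper below is our own illustration, not the PaGE implementation; it enumerates every column assignment, which is only feasible for small m + n, and assumes larger statistic values are more significant:

```python
import numpy as np
from itertools import combinations

def fdr_estimate(data0, data1, stat, k):
    """Permutation FDR estimate for cutoff k: compute mu_k from the
    permutation distribution, iterate to the corrected estimate, and
    return FDR_k (CONF_k = 1 - FDR_k)."""
    m, n = data0.shape[1], data1.shape[1]
    M = np.hstack([data0, data1])
    g = M.shape[0]
    S = np.array([stat(row[:m], row[m:]) for row in M])
    R_k = int((S >= k).sum())
    if R_k == 0:
        return 0.0
    # V_k^p for every assignment of m of the m + n columns to condition 0;
    # mu_k is the mean of this permutation distribution D_k
    V = []
    for cols0 in combinations(range(m + n), m):
        cols1 = [j for j in range(m + n) if j not in cols0]
        Sp = np.array([stat(row[list(cols0)], row[cols1]) for row in M])
        V.append((Sp >= k).sum())
    mu = float(np.mean(V))
    # iterate mu_k(i+1) = (mu_k / g) * (g - (R_k - mu_k(i))) to convergence
    v = mu
    while True:
        v_new = (mu / g) * (g - (R_k - v))
        if abs(v_new - v) < 1e-4:
            break
        v = v_new
    return min(v_new / R_k, 1.0)
```

Any row statistic from Section 4 can be passed as `stat`, e.g. a plain difference of means.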

We assign confidences not just to the sets G_k, but to the rows of the data matrix themselves, by

    CONF_r = min_{k such that r ∈ G_k} CONF_k,

where r is a row of the data matrix. In this way, we have confidence of at least γ that a row r with CONF_r = γ represents a truly differentially expressed gene.

7 LEVELS AND PATTERNS

SAM offers a multi-class mode which tests the hypothesis that the means in all conditions are equal. This does not, however, give an indication of what exactly is going on in which conditions, or a way of sorting the genes by their behavior across the conditions. In order to visualize the behavior of genes across multiple conditions, PaGE was designed to perform a series of pairwise comparisons and to generate patterns from the results (Manduchi et al., 2000). One condition is chosen as a reference and the remaining conditions are compared to this reference condition, which we call condition 0. If there are n conditions including condition 0, then a simple way to generate patterns of length n − 1 would be to put +1 in position i if there is up-regulation at the desired confidence in condition i versus the reference, and similarly −1 if there is down-regulation; if there is no differential regulation at the desired confidence, then put 0. This would allow one to see where the differential expression is happening.

However, in order to make this strategy more descriptive, we do not restrict to patterns of 0 and ±1, but use the rest of the integers as well. Higher levels represent a higher confidence of differential expression than lower levels. The levels are determined as follows. First, the user chooses a confidence γ. A cutoff C for the statistic is determined by setting it to the minimum k for which CONF_k > γ, if such a C exists (CONF_k is defined in Section 6). The set of all rows r such that S_r > C then gives the least conservative set of predictions with a confidence of at least γ. Depending on the data, there
may not be any such value of C that achieves confidence γ, in which case C is set to be infinite. Now, depending on whether the statistic is on an additive scale (such as the t-statistic) or on a multiplicative scale (such as the ratio statistic), the levels are created differently. In the additive case, if the statistic S_r < C, then row r is given level 0. If C ≤ S_r < 2C, then it is given level 1. If 2C ≤ S_r < 3C, then it is given level 2. In general, if nC ≤ S_r < (n + 1)C, then it is given level n. If the statistic is on a multiplicative scale, then row r is given level n if C^n ≤ S_r < C^(n+1). When the data are multiplicative and the ratio statistic is used, the patterns take on the intuitive meaning of fold-change.

The parameter γ is referred to as the level confidence. As one raises the level confidence, fewer levels are produced and the genes assigned to the levels have higher confidence. We have found this to be a convenient way of visualizing the results of a study with many conditions.

8 THE T-STATISTIC TUNING PARAMETER

When using the t-statistic, the number of genes found at the desired confidence can be dramatically affected by the value of the tuning

Table 1. The effect of the t-statistic tuning parameter α on the number of genes predicted. [Table values not preserved in this transcription.] The data consist of three replicates, direct comparison simulated data of 1000 rows, with 50 differentially expressed (up-regulated) genes. Each column gives the number of genes found at confidence = 0.5 for the corresponding value of α. The data are available at www.cbil.upenn.edu/page/doc/testdata0.txt

Table 2. Similar to Table 1, the effect of the t-statistic tuning parameter on real data consisting of four replicates per condition, log-transformed two-class mouse pancreas cells 0 and 2 h post-partial hepatectomy [data published in White et al. (2005)]. [Table values not preserved in this transcription.] Differential expression results were obtained at confidence = 0.8. The data are available at www.cbil.upenn.edu/page/doc/0vs2.txt

Table 3. Similar to Tables 1 and 2, the effect of the t-statistic tuning
parameter on real data consisting of eight replicates of direct comparison data. [Table values not preserved in this transcription.] The cells were taken from pig heart valve endothelial tissue, comparing regions subject to two different kinds of hemodynamic flow [data published in Simmons et al. (2005)]. Differential expression results were obtained at confidence = 0.8. The data are available at www.cbil.upenn.edu/page/doc/pig1.txt

parameter α in Equation (2). This is particularly true when there are only a few replicates per condition. The FDR is conservatively estimated regardless of the value of α, so choosing a value of α which maximizes the number of results at the desired confidence is desirable. If too many genes are found, the confidence can be raised to find a smaller set. We observe this effect empirically in a wide range of datasets, both real and simulated; see for example the data in Tables 1-3. In these tables the number of genes found differentially expressed, at a fixed confidence, is given as a function of α. Table 1 represents simulated data while Tables 2 and 3 are from real data.

For the simulated data, 50 genes are up-regulated out of 1000 genes. The intensities in each row for each condition are given by beta distributions. With beta distributions, by varying the two parameters as well as the range, we can produce heterogeneous behavior: from unimodal, to bimodal, to highly skewed. Since gene expression data are highly heterogeneous, the beta provides a better model than the normal with respect to the heterogeneity of distributions. For the complete details of the simulation engine, see Grant et al. (2005; www.cbil.upenn.edu/expression_simulator/).
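The sensitivity illustrated by Tables 1-3 can be reproduced in miniature. In the sketch below the replicate values are hypothetical and `mod_t` is our own helper, not PaGE code; it shows how the choice of α reorders two genes, with a small-difference, small-variance gene dominating at α = 0 and a large-difference, large-variance gene dominating once α is large:

```python
import numpy as np

def mod_t(c, d, alpha):
    """Modified t of Equation (2): (mu1 - mu0) / (alpha + pooled sd)."""
    c, d = np.asarray(c, float), np.asarray(d, float)
    pooled = np.sqrt((c.var(ddof=1) * (len(c) - 1) + d.var(ddof=1) * (len(d) - 1))
                     / (len(c) + len(d) - 2))
    return (d.mean() - c.mean()) / (alpha + pooled)

# Gene A: tiny mean difference and tiny variance (hypothetical values)
a0, a1 = [10.00, 10.01, 9.99], [10.10, 10.11, 10.09]
# Gene B: large mean difference and large variance
b0, b1 = [10.0, 30.0, 20.0], [60.0, 80.0, 70.0]

# At alpha = 0, gene A outranks gene B (t = 10 vs 5); at alpha = 10
# the ordering reverses (t is about 0.01 vs 2.5)
```

No single α can favor both genes at once, which is exactly the tradeoff the text describes.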

In any case, the sensitivity to α shown in Table 1 was not dependent on the arbitrary parameter choices used to simulate the data, but is a general phenomenon we found to varying degrees regardless of whether the dataset was real or simulated. Though the nature of the dependence on α varies from dataset to dataset, the dependence itself is typical.

The reason for this dependence on α is that the standard t-statistic (α = 0) is large for those genes with small variances, and the algorithm is forced to be more conservative in its predictions to avoid picking them up. As α grows, this effect is minimized; however, when α is too large, then for the differentially expressed genes with small mean difference and small variance, α dominates the statistic, which tends to obscure the differential expression of these genes in the noise. Therefore some genes get lost as α goes down, while other genes get lost as α goes up. What value to set α to depends on the nature of the differentially expressed genes as well as the non-differentially expressed genes. There is no known formula that can be applied to the data matrix to determine the value of α which maximizes the power. However, since the (expected) confidence of the set of predicted genes is the same regardless of α, a power criterion to determine α is desirable, and we attempt to choose α to maximize the power. PaGE tries a range of 10 values of α, from small to large, and chooses the one which gives the greatest number of results. This is the default value of α. Other values of α can, however, find genes that the default value misses, as we will see below. Therefore it is important that the user has control over this parameter.

Since we are taking the maximum over 10 values of α, we potentially introduce another multiple testing issue. If these were 10 independent runs this problem might become serious, but they are highly dependent, even for very different values of α, and simulation studies indicate that the confidence is
not significantly affected. The extreme at which this effect might have a significant impact on the FDR is when there are no differentially expressed genes and a high FDR is requested. In this case a false positive set might consist of one or a very few genes, and the maximization can have a significant effect on the proportion because the numbers involved are small. Because of this we do not recommend raising the FDR much higher than 0.5 when very few genes are found.

The SAM (Tusher et al., 2001) algorithm uses an approach which depends on a smoothing criterion to select α. The rationale is that the t-statistic distributions should be identical for all (null) genes, so they impose a uniformity criterion on the t-statistics to determine α. They do not, however, present theory showing that their criterion achieves reasonably optimal power, and in fact this method can give values of α that are quite far from optimal with regard to the power of the results. This could be due to the fact that they are smoothing over all genes and not just the null ones.

To illustrate this we generated several simulated datasets, using again the simulation engine from Grant et al. (2005). The first dataset has 5000 genes, 300 of which are differentially expressed. Differentially expressed genes have varying mean differences. The first 25 rows have mean difference 1.2 between the two conditions. The next 25 rows have mean difference 1.3. The next have 1.4, etc. Specifically, the 12 blocks of 25 have mean differences 1.2, 1.3, 1.4, 1.5, 2, 2.2, 2.4, 2.6, 4, 4.3, 4.6 and 4.9, respectively. This gives us 100 with relatively small mean differences, 100 with medium mean differences and 100 with relatively large mean differences. These are the rows numbered 0-299. The remaining 4700 non-differentially

Table 4. The power of the results as a function of the t-statistic tuning parameter. [Table values not preserved in this transcription.] The realized confidence is the actual confidence achieved.

Table 5. The effect of the t-statistic tuning parameter on different spectra
of differentially expressed genes. [Table values not preserved in this transcription.]

expressed genes were generated by 4700 randomly chosen beta distributions with randomly chosen parameters. Similarly to the data simulated for Table 1, the intensities in each condition are given by beta distributions. For the differentially expressed genes, the variances of the distributions increase over each block of 25 from very low to very high. So row 0 has µ_1 − µ_0 = 1.2 and very low variances, row 1 has µ_1 − µ_0 = 1.2 but slightly higher variances, up to row 24 which has µ_1 − µ_0 = 1.2 and high variances. Row 25 then has µ_1 − µ_0 = 1.3 with low variances, row 26 has µ_1 − µ_0 = 1.3 with slightly higher variances, etc. In this way a full range of mean differences and variances is represented in the data. The full dataset can be downloaded at www.cbil.upenn.edu/page/doc/testdata1.txt

Using the first three replicates of each condition, we ran PaGE with five different values of α and also ran SAM. Table 4 has a summary of the results obtained with an 80% confidence cutoff for five values of α and SAM. The realized confidence is the actual confidence achieved, which can be determined since this is simulated data and we know exactly which genes are the true positives and true negatives. Table 5 breaks down the results by type of differentially expressed gene. Note that the number of genes found in each row of the table is maximized on the diagonal of the shaded portion. Therefore small values of α find genes with small mean differences and small variance; large values of α find genes with large mean differences and large variance.

The above example is far from the worst case. If the non-differentially expressed genes have bimodal distributions and the differentially expressed genes have large variance, then SAM performance can be quite far from optimal. The page www.cbil.upenn.edu/page/doc/example1.htm has the complete results for the genes found at 0.8 confidence or higher. Columns represent different runs for different values of α. The final column gives the SAM results. The top three rows of the output consist of the actual values of R − V (the number of true positives), R (the total number of predictions) and the realized confidence. The 300 differentially expressed genes are listed, numbered 0-299, and an X means that that gene was found in that run.
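The simulated data described above can be approximated with a simple generator. This is a rough stand-in for the simulation engine of Grant et al. (2005), not the engine itself; the rescaled beta and the default shape parameters are our own assumptions:

```python
import numpy as np

def beta_intensities(mean, spread, n, rng, a=2.0, b=2.0):
    """Draw n intensities from a beta(a, b) distribution rescaled to
    [mean - spread/2, mean + spread/2].  With a = b the distribution is
    symmetric, so the sample mean is near `mean`; skewed or bimodal
    shapes come from other (a, b) choices."""
    lo = mean - spread / 2.0
    return lo + spread * rng.beta(a, b, size=n)

def de_gene(base_mean, mean_diff, spread, m, n, rng):
    """One differentially expressed row: m replicates in condition 0,
    n replicates in condition 1, with the given mean difference."""
    c = beta_intensities(base_mean, spread, m, rng)
    d = beta_intensities(base_mean + mean_diff, spread, n, rng)
    return c, d
```

Stacking 300 such rows (blocks of 25 with increasing mean differences and variances) plus 4700 null rows with random (a, b) reproduces the structure of the dataset described in the text.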

Table 6. Simulated dataset of 1000 genes, with 100 differentially expressed, and with the remaining genes each having a bimodal distribution. [Table values (numbers of PaGE and SAM predictions and true positives, by confidence) not preserved in this transcription.]

Table 7. Comparison of results using the ratio of means statistic versus the t-statistic, on simulated data. [Table values (confidence under each statistic, by gene ID) not preserved in this transcription.]

SAM produces results very close to the setting α = 0.1. The power is maximized around α = 2.5. The PaGE default is α = 3.5. At this value 160 genes are reported, as opposed to SAM's 145. But the overlap between the two sets is only 125 genes. SAM finds 20 that PaGE does not with the default α, and PaGE finds 35 that SAM does not. Note that it is not necessary to use the PaGE default, and in the software the user has control over the value of α. The default is simply designed to maximize the number of results.

It is important to keep in mind that an FDR only makes sense within the context of a set of predictions. Even though we apply confidences to the individual genes in the sets, these confidences mean the chance of making a mistake when considering genes at random from among the set of predictions made. Therefore, if one merges sets of genes found from different runs, the FDR may be increased. If two methods produce sets with the same FDR, but the results overlap proportionally more on the true positives than the false positives, then the FDR of the union of the two sets will be greater than the FDR of either set individually. So when looking for more genes, it is not generally a good idea to use all possible choices of parameter settings and transformations and merge the results, but rather to find the one or few that work best, and always take the meaning of the confidences in the context of the separate sets of results.

The dataset and parameter settings in the previous example were not particular; we repeated this
with many variations, and the same behavior was seen in general. In fact, it is possible to generate simple datasets for which SAM performs much worse. To demonstrate this, we generated a simulated dataset of 1000 genes, with 100 differentially expressed, and with the non-differentially expressed genes each having a bimodal distribution. The dataset, SAM q-values and the PaGE confidences can be obtained at www.cbil.upenn.edu/page/doc/files/bmod.html. The q-values are equal to one minus the confidence. The lowest q-values produced by SAM on this dataset are 0.5 (14 genes). In contrast, PaGE reports 17 genes at confidence > 0.8, all but one of which are true positives (Table 6).

9 THE CHOICE OF STATISTIC, TRANSFORMATION AND OTHER PARAMETERS

9.1 Using the ratio of means versus the t-statistic

For most datasets the user will probably want to start with the t-statistic option. If the t-statistic does not return many results, then, keeping in mind the caveats of the previous section about multiple runs, one can try the other statistic or the log transformation options.

(Note to Table 7: Genes with ID 0 are low-intensity differentially expressed. Genes with ID 1 are high-intensity differentially expressed. Genes with IDs 2-99 are medium-intensity non-differentially expressed. The bottom row shows the maximum confidence achieved by all other genes. Each method picks up one of the two differentially expressed genes at high confidence.)

Table 8. Comparison of results using the logged versus unlogged data with the t-statistic. Same data as in Table 7. [Table values not preserved in this transcription.] The bottom row shows the maximum confidence achieved by all other genes. Using the unlogged data was much better at finding gene 1, while using the logged data performed better on gene 0 and completely lost gene 1 in the noise.

Even if there are many genes found, however, different statistics can pick up different kinds of differential expression, as we saw in the previous section regarding using the t-statistic
with different values of the tuning parameter α. To illustrate this further, we generated a simulated dataset with two conditions, 100 genes and four replicates per condition. Two of the genes are differentially expressed. The 98 non-differentially expressed genes have moderate intensity (beta-distributed, with mean 50 and spread 35). Gene 0 is differentially expressed in the low-intensity range (means of 4 and 9 in the two conditions, respectively). Gene 1 is differentially expressed in the high-intensity range (means of 400 and 450 in the two conditions, respectively). The data are available at cbil.upenn.edu/page/doc/testdata2.txt. Table 7 shows the results using the ratio of means statistic (left) and the t-statistic (right). Using the ratio of means, the low-intensity differentially expressed gene (gene 0) is much more significant than the high-intensity differentially expressed gene (gene 1). Conversely, when using the t-statistic, the high-intensity differentially expressed gene is much more significant than the low-intensity one.

9.2 Using logged versus unlogged data

The caveats about the choice of statistic apply also to the different possible data transformations one can perform. Perhaps the most common is the log transformation. PaGE offers the option of performing this transformation when one is using the t-statistic. Using the same test dataset as above, Table 8 shows what happens to the confidences of genes 0 and 1 when the data are logged versus unlogged: with the logged data the confidence of gene 0 goes up, while the confidence of gene 1 goes down.
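The intensity dependence described for Table 7 can be sketched in a few lines of code. The replicate values below are hypothetical stand-ins, not the paper's actual testdata2 file: gene 0 mimics the low-intensity case (means 4 vs 9, with relatively large noise) and gene 1 the high-intensity case (means 400 vs 450):

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # Unbiased sample variance.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def t_stat(a, b):
    # Two-sample t-statistic with unequal variances (Welch form).
    return (mean(b) - mean(a)) / math.sqrt(var(a) / len(a) + var(b) / len(b))

def ratio_of_means(a, b):
    return mean(b) / mean(a)

# Hypothetical replicates (four per condition), invented for illustration:
# gene 0 differentially expressed at low intensity, gene 1 at high intensity.
gene0_a, gene0_b = [2, 4, 6, 4], [7, 9, 11, 9]
gene1_a, gene1_b = [390, 400, 410, 400], [440, 450, 460, 450]

print(ratio_of_means(gene0_a, gene0_b))        # 2.25  -- gene 0 wins on ratio
print(ratio_of_means(gene1_a, gene1_b))        # 1.125
print(round(t_stat(gene0_a, gene0_b), 2))      # 4.33
print(round(t_stat(gene1_a, gene1_b), 2))      # 8.66  -- gene 1 wins on t
```

With these numbers the ratio of means ranks gene 0 far above gene 1 (2.25 vs 1.125), while the t-statistic reverses the order (about 4.33 vs 8.66), matching the behavior the text describes for Table 7.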

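The log-transform trade-off of Section 9.2 can also be checked arithmetically: on logged data the t-statistic is driven by the difference of mean log-intensities, which is the log of a ratio of geometric means. The sketch below uses invented replicate values (not the paper's dataset): a low-intensity differentially expressed gene, a high-intensity one, and a null gene with no true change but typical replicate noise at moderate intensity:

```python
import math

def log_effect(a, b):
    # Difference of mean log-intensities = log of the geometric-mean ratio,
    # the quantity the t-statistic numerator measures on logged data.
    la = sum(math.log(x) for x in a) / len(a)
    lb = sum(math.log(x) for x in b) / len(b)
    return lb - la

# Hypothetical replicates, invented for illustration.
gene0 = ([2, 4, 6, 4], [7, 9, 11, 9])                  # low intensity, means 4 vs 9
gene1 = ([390, 400, 410, 400], [440, 450, 460, 450])   # high intensity, 400 vs 450
null_gene = ([30, 45, 60, 55], [50, 70, 40, 65])       # no true change, noisy

print(round(log_effect(*gene0), 3))      # about 0.87  -- large log effect
print(round(log_effect(*gene1), 3))      # about 0.118 -- small log effect
print(round(log_effect(*null_gene), 3))  # about 0.179 -- noise alone exceeds gene 1
```

The null gene's noise-driven log effect exceeds gene 1's true log effect, so on the log scale gene 1 sinks into the noise, while gene 0's log effect stands well clear of it.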
Thus one cannot trust either approach to perform better for all genes simultaneously. This happens because applying logs to the data and then applying the t-statistic, which focuses on differences, is similar to first taking ratios and then taking logs. Gene 1 in the above example, whose intensities are in the high range of the spectrum, has a relatively small ratio compared to the null genes, whose intensities are in a lower range of the spectrum.

9.3 Other statistics

Statistics that are invariant to monotonic transformations, for example the Wilcoxon rank-sum statistic, would allow us to avoid some of the issues above. However, without sufficiently many replicates, the Wilcoxon statistic suffers from the same granularity problems as permutation p-values. Once there are enough replicates for it to work, the null genes have a higher chance of appearing significant with the Wilcoxon, because it puts small and large differences on an equal footing and we are comparing the statistic across many genes. This can obscure the true signal, and as a result we were unable to find any datasets for which this method gave superior results. We experimented with other statistics as well, including using p-values themselves as statistics, and even adjusted p-values. We were unable to find a significant example where these gave improved results, so we do not include them as options in the PaGE implementation.

Ultimately there is no best statistic or transformation. A statistic can often be optimized for a single test, and there is substantial statistical theory about how to do this; but a statistic cannot typically be optimized for thousands of genes at once. Unfortunately each dataset is particular and must be treated as a special case, but by starting with the defaults the user can home in on reasonable parameter settings to suit their needs.

ACKNOWLEDGEMENTS

We thank Dr Elisabetta Manduchi and Warren Ewens for valuable comments and discussions. This work was supported in
part by NIH grant K25-HG A1.

REFERENCES

Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57.
Benjamini,Y. and Yekutieli,D. (2001) The control of the false discovery rate in multiple testing under dependency. Ann. Statist., 29.
Efron,B. and Tibshirani,R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23.
Ge,Y. et al. (2003) Resampling-based multiple testing for DNA microarray data analysis. Test, 12, 1-44.
Grant,G.R., Sokolowski,S. and Stoeckert,C.J., Jr (2005) Performance analysis of differential expression prediction algorithms using simulated array data. Technical Report.
Manduchi,E. et al. (2000) Generation of patterns from gene expression data by assigning confidence to differentially expressed genes. Bioinformatics, 16.
Pan,W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19.
Pounds,S. and Cheng,C. (2004) Improving false discovery rate estimation. Bioinformatics, 20.
Simmons,C.A. et al. (2005) Spatial heterogeneity of endothelial phenotypes correlates with side-specific vulnerability to calcification in normal porcine aortic valves. Circ. Res., in press.
Storey,J.D. (2002) A direct approach to false discovery rates. J. R. Statist. Soc. B, 64.
Storey,J.D. and Tibshirani,R. (2003) In Parmigiani,G., Garrett,E.S., Irizarry,R.A. and Zeger,S. (eds), The Analysis of Gene Expression Data. Springer, New York.
Tusher,V.G. et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98.
White,P. et al. (2005) Identification of transcriptional networks during liver regeneration. J. Biol. Chem., 280.


More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1)

The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) The Model Building Process Part I: Checking Model Assumptions Best Practice (Version 1.1) Authored by: Sarah Burke, PhD Version 1: 31 July 2017 Version 1.1: 24 October 2017 The goal of the STAT T&E COE

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

DIC: Deviance Information Criterion

DIC: Deviance Information Criterion (((( Welcome Page Latest News DIC: Deviance Information Criterion Contact us/bugs list WinBUGS New WinBUGS examples FAQs DIC GeoBUGS DIC (Deviance Information Criterion) is a Bayesian method for model

More information

Chapter 7: Simple linear regression

Chapter 7: Simple linear regression The absolute movement of the ground and buildings during an earthquake is small even in major earthquakes. The damage that a building suffers depends not upon its displacement, but upon the acceleration.

More information

Unit 14: Nonparametric Statistical Methods

Unit 14: Nonparametric Statistical Methods Unit 14: Nonparametric Statistical Methods Statistics 571: Statistical Methods Ramón V. León 8/8/2003 Unit 14 - Stat 571 - Ramón V. León 1 Introductory Remarks Most methods studied so far have been based

More information

Physics 509: Non-Parametric Statistics and Correlation Testing

Physics 509: Non-Parametric Statistics and Correlation Testing Physics 509: Non-Parametric Statistics and Correlation Testing Scott Oser Lecture #19 Physics 509 1 What is non-parametric statistics? Non-parametric statistics is the application of statistical tests

More information

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D.

DEPARTMENT OF ENGINEERING MANAGEMENT. Two-level designs to estimate all main effects and two-factor interactions. Pieter T. Eendebak & Eric D. DEPARTMENT OF ENGINEERING MANAGEMENT Two-level designs to estimate all main effects and two-factor interactions Pieter T. Eendebak & Eric D. Schoen UNIVERSITY OF ANTWERP Faculty of Applied Economics City

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,

More information

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs

Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Summary and discussion of: Controlling the False Discovery Rate via Knockoffs Statistics Journal Club, 36-825 Sangwon Justin Hyun and William Willie Neiswanger 1 Paper Summary 1.1 Quick intuitive summary

More information

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses Amit Zeisel, Or Zuk, Eytan Domany W.I.S. June 5, 29 Amit Zeisel, Or Zuk, Eytan Domany (W.I.S.)Improving

More information

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo

PROCEDURES CONTROLLING THE k-fdr USING. BIVARIATE DISTRIBUTIONS OF THE NULL p-values. Sanat K. Sarkar and Wenge Guo PROCEDURES CONTROLLING THE k-fdr USING BIVARIATE DISTRIBUTIONS OF THE NULL p-values Sanat K. Sarkar and Wenge Guo Temple University and National Institute of Environmental Health Sciences Abstract: Procedures

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods

Hypothesis Testing with the Bootstrap. Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Hypothesis Testing with the Bootstrap Noa Haas Statistics M.Sc. Seminar, Spring 2017 Bootstrap and Resampling Methods Bootstrap Hypothesis Testing A bootstrap hypothesis test starts with a test statistic

More information

Unit 19 Formulating Hypotheses and Making Decisions

Unit 19 Formulating Hypotheses and Making Decisions Unit 19 Formulating Hypotheses and Making Decisions Objectives: To formulate a null hypothesis and an alternative hypothesis, and to choose a significance level To identify the Type I error and the Type

More information

Lecture 27. December 13, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 27. December 13, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data

A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data A Practical Approach to Inferring Large Graphical Models from Sparse Microarray Data Juliane Schäfer Department of Statistics, University of Munich Workshop: Practical Analysis of Gene Expression Data

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information