Selecting an Optimal Rejection Region for Multiple Testing

A decision theory alternative to FDR control, with an application to microarrays

David R. Bickel
Office of Biostatistics and Bioinformatics, Medical College of Georgia, Augusta, GA 30912-4900
dbickel@mail.mcg.edu, www.davidbickel.com
November 26, 2002
Archives: arxiv.org, mathpreprints.com

Abstract. As a measure of error in testing multiple hypotheses, the decisive false discovery rate (dFDR), the ratio of the expected number of false discoveries to the expected total number of discoveries, has advantages over the false discovery rate (FDR) and positive FDR (pFDR). The dFDR can be optimized and often controlled using decision theory, and some previous estimators of the FDR can estimate the dFDR without assuming weak dependence or the randomness of hypothesis truth values. While it is suitable in frequentist analyses, the dFDR is also exactly equal to a posterior probability under random truth values, even without independence. The test statistic space in which null hypotheses are rejected, called the rejection region, can be determined by a number of multiple testing procedures, including those controlling the family-wise error rate (FWER) or the FDR at a specified level. An alternate approach, which makes use of the dFDR, is to reject null hypotheses to maximize a desirability function, given the cost of each false discovery and the benefit of each true discovery. The focus of this method is on discoveries, unlike related approaches based on false nondiscoveries. A method is provided for achieving the highest possible desirability under general conditions, without relying on density estimation. A Bayesian treatment of differing costs and benefits associated with different tests is related to the weighted FDR when there is a common rejection region for all tests. An application to DNA microarrays of patients with different types of leukemia illustrates the proposed use of the dFDR in decision theory. Comparisons between more than two groups of patients do not call for changes in the results, as when the FWER is strictly controlled by adjusting p-values for multiplicity.

Introduction

In multiple hypothesis testing, the ratio of the number of false discoveries (false positives) to the total number of discoveries is often of interest. Here, a discovery is the rejection of a null hypothesis and a false discovery is the rejection of a true null hypothesis. To quantify this ratio statistically, consider m hypothesis tests, with the ith test yielding H_i = 0 if the null hypothesis is true or H_i = 1 if it is false and the corresponding alternative hypothesis is true. Let R_i = 1 if the ith null hypothesis is rejected and R_i = 0 otherwise, and define V_i = (1 - H_i) R_i, so that Σ_i R_i is the total number of discoveries and Σ_i V_i is the total number of false discoveries. Storey (2002) pointed out that the intuitively appealing quantity E(Σ_i V_i / Σ_i R_i) is not a viable definition of a false discovery rate since it usually cannot be assumed that P(Σ_i R_i > 0) = 1. He listed three possible candidates for such a definition:

\mathrm{FDR} \equiv E\left( \frac{\sum_i V_i}{\sum_i R_i} \,\middle|\, \sum_i R_i > 0 \right) P\left( \sum_i R_i > 0 \right);   (1)

Q \equiv E\left( \frac{\sum_i V_i}{\sum_i R_i} \,\middle|\, \sum_i R_i > 0 \right);   (2)

D \equiv \frac{E\left( \sum_i V_i \right)}{E\left( \sum_i R_i \right)}.   (3)

Benjamini and Hochberg (1995) called the quantity of Eq. (1) the false discovery rate (FDR), since null hypotheses can be rejected in such a way that FDR ≤ α is guaranteed for any significance level α ∈ (0, 1). This control of the FDR is accomplished by choosing a suitable rejection region Γ ⊆ S, where S is the test-statistic space, and by rejecting the ith null hypothesis if and only if the ith test statistic, t_i ∈ S, is an element of Γ. The FDR was originally controlled under independence of the test statistics (Benjamini and Hochberg, 1995), but can also be controlled for more general cases (Benjamini and Yekutieli, 2001). The other two candidates, Q and D, cannot be controlled in this way since, if all of the null hypotheses are true, then Q = 1 > α and D = 1 > α (Storey 2002). Nonetheless, Storey (2002) recommended the use of Q, which he called the positive false discovery rate (pFDR), since the multiplicand P(Σ_i R_i > 0) of the FDR makes it harder to interpret than the pFDR. Although D of Eq. (3) is also easier to interpret than the FDR, Storey (2002) complained that D fails to describe the simultaneous fluctuations in Σ_i V_i and Σ_i R_i, in spite of its attraction as a simple measure. However, it will be seen that D, unlike the FDR and Q, can be optimized using decision theory without dependence assumptions and that the optimal value of D is automatically controlled under general conditions. Because these advantages are based on decision theory, I call D the decisive false discovery rate (dFDR). It will also become clear that certain estimators of the FDR and pFDR are more exact when considered as estimators of the dFDR.
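
To make the distinction between the three candidates of Eqs. (1)-(3) concrete, the following minimal simulation sketch (not part of the original analysis; the normal mixture, the sample sizes, and the threshold are illustrative assumptions) estimates the FDR, pFDR, and dFDR by Monte Carlo for a fixed rejection region of the form [tau, infinity).

```python
import numpy as np

def error_rates(n_tests=200, p0=0.8, effect=2.5, tau=2.0,
                n_sims=2000, seed=0):
    """Monte Carlo estimates of FDR (1), pFDR (2), and dFDR (3)
    for the rejection region [tau, infinity)."""
    rng = np.random.default_rng(seed)
    ratios = []          # V/R for simulations with at least one rejection
    any_rejection = 0    # count of simulations with R > 0
    total_V, total_R = 0.0, 0.0
    for _ in range(n_sims):
        H = rng.random(n_tests) > p0          # H_i = 1 with probability 1 - p0
        t = rng.normal(loc=effect * H)        # null: N(0,1); alternative: N(effect,1)
        R = t >= tau                          # rejections (discoveries)
        V = R & ~H                            # false discoveries
        total_V += V.sum()
        total_R += R.sum()
        if R.sum() > 0:
            any_rejection += 1
            ratios.append(V.sum() / R.sum())
    pfdr = np.mean(ratios)                    # estimates E[V/R | R > 0]
    fdr = pfdr * any_rejection / n_sims       # estimates E[V/R | R > 0] P(R > 0)
    dfdr = total_V / total_R                  # estimates E[V] / E[R]
    return fdr, pfdr, dfdr

# With these settings P(sum R_i > 0) is essentially 1 and the statistics are
# independent, so the three estimates nearly agree.
print(error_rates())
```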

A possible objection to the use of the dFDR is that it is undefined for E(Σ_i R_i) = 0. However, that does not occur as long as the rejection region is a non-null subset of the test statistic space, in which case P(Σ_i R_i > 0) > 0. In other words, the dFDR is meaningful as long as a rejection region is not selected in such a trivial way that the number of rejected null hypotheses is almost always zero. For example, rejecting a null hypothesis only if a t-statistic were exactly equal to some value t_0 would make all hypothesis testing procedures useless, including those based on the FDR, pFDR, or dFDR.

Denote by b_i the benefit of rejecting the ith null hypothesis when it is false, where the benefit can be economic, such as an amount of money, or some other positive desirability in the sense of Jeffrey (1983). Let c_i be the cost of rejecting the ith null hypothesis when it is true, with cost here playing the role of a negative desirability, so that we have b_i > 0, c_i > 0, and the vectors b ≡ (b_i) and c ≡ (c_i). Then the net desirability is

d(b, c) = \sum_i \left( b_i (R_i - V_i) - c_i V_i \right) = \sum_i b_i R_i - \sum_i (b_i + c_i) V_i.   (4)

If the costs and benefits are independent of the hypothesis, then the net desirability is

d = b_1 \left( \sum_i R_i - \left( 1 + \frac{c_1}{b_1} \right) \sum_i V_i \right) = b_1 \left( 1 - \left( 1 + \frac{c_1}{b_1} \right) \frac{\sum_i V_i}{\sum_i R_i} \right) \sum_i R_i

and its expectation value is

E(d) \approx b_1 \left( 1 - \left( 1 + \frac{c_1}{b_1} \right) Q \right) E\left( \sum_i R_i \right)   (5)

if P(Σ_i R_i > 0) ≈ 1. (Genovese and Wasserman (2001) instead utilized a marginal risk function that is effectively equivalent to -E(d)/m when b_1 = c_1.) Thus, the rejection region can be chosen such that the pFDR (or FDR) and the expected number of rejections maximize the approximate expected desirability, given a cost-to-benefit ratio, assuming the probability of no rejections is close to zero. However, these approximations are not needed since there is an exact result in terms of the dFDR:

E(d) = b_1 E\left( \sum_i R_i - \left( 1 + \frac{c_1}{b_1} \right) \sum_i V_i \right) = b_1 \left( 1 - \left( 1 + \frac{c_1}{b_1} \right) D \right) E\left( \sum_i R_i \right).   (6)

When all of the null hypotheses are true, D = 1 for all nontrivial rejection regions and it follows from Eq. (6) that E(d) is optimized when E(Σ_i R_i) → 0, as expected. Likewise, when all of the null hypotheses are false, D = 0 for all rejection regions and E(Σ_i R_i) = m is obviously optimal; in this case, the rejection region would be the whole test statistic space. More generally, if the rejection region is chosen such that E(d) is maximal and if that maximum value is positive, then

D < \left( 1 + \frac{c_1}{b_1} \right)^{-1},   (7)

as plotted in Fig. 1. Thus, under conditions that make the rejection of at least one null hypothesis desirable, optimizing E(d) automatically controls the dFDR. Consequently, desirability optimization can be seen as a generalization of dFDR control that applies to wider sets of problems. The dFDR inequality (7) can be used to select a cost-to-benefit ratio when none is suggested by the problem at hand. For example, analogous to the usual Type I significance level of 5%, optimizing the desirability with a cost-to-benefit ratio of 19 would control the dFDR at the 5% level, provided that the expected desirability is positive for some rejection region.
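
The algebra behind Eqs. (6) and (7) is easy to check numerically. The fragment below is purely illustrative (the grid of candidate dFDR values and expected rejection counts is made up): whenever the maximal expected desirability is positive, the dFDR at the maximizer stays below (1 + c_1/b_1)^{-1}, e.g. below 1/20 = 0.05 for a cost-to-benefit ratio of 19.

```python
import numpy as np

def expected_desirability(D, ER, b1=1.0, c1=19.0):
    """Eq. (6): E(d) = b1 * (1 - (1 + c1/b1) * D) * E(sum R_i)."""
    return b1 * (1.0 - (1.0 + c1 / b1) * D) * ER

# Hypothetical candidate rejection regions, each summarized by its dFDR D
# and its expected number of rejections E(sum R_i).
D_grid = np.array([0.001, 0.01, 0.03, 0.05, 0.10, 0.25])
ER_grid = np.array([40, 200, 450, 800, 1500, 3000])

values = expected_desirability(D_grid, ER_grid)
best = np.argmax(values)
bound = 1.0 / (1.0 + 19.0)           # Eq. (7) with c1/b1 = 19
if values[best] > 0:
    assert D_grid[best] < bound      # dFDR automatically controlled at 5%
print(D_grid[best], values[best], bound)
```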

[Figure 1. Upper bound of the dFDR for max E(d) if max E(d) > 0, plotted against the cost-to-benefit ratio (D on the vertical axis).]

Estimation of the dFDR and optimization of desirability

The dFDR can be written in terms of p_0, the proportion of null hypotheses that are true:

D = \frac{E\left( \sum_i (1 - H_i) R_i \right)}{E\left( \sum_i R_i \right)} = \frac{p_0 \, P(t_i \in \Gamma \mid H_i = 0)}{P(t_i \in \Gamma)}.   (8)

For ease of notation, let Γ = [t, ∞), denote by F the distribution of the test statistics, and denote by F^{(0)} the distribution of the test statistics under each null hypothesis. (The same method applies to more general rejection regions, but a single-interval rejection region avoids the bias mentioned by Storey (2002) of choosing a multiple-interval region from the data and then using that region to obtain estimates.) Then

D(t) = \frac{p_0 \, P(t_i \ge t \mid H_i = 0)}{P(t_i \ge t)} = \frac{p_0 \left( 1 - F^{(0)}(t) \right)}{1 - F(t)},   (9)

which is naturally estimated by

\hat{D}(t) \equiv \frac{\hat{p}_0 \sum_{i=1}^{mM} I\!\left( T_i^{(0)} \ge t \right) / (mM)}{\sum_{i=1}^{m} I\!\left( T_i \ge t \right) / m},   (10)

where p̂_0 is an estimate of p_0, I(x) = 1 if x is true and I(x) = 0 if x is false, T_i is the observed test statistic of the ith test, and T_i^{(0)} is the ith test statistic out of mM test statistics calculated by a resampling procedure under the null hypothesis. (Of course, such resampling is not needed when F^{(0)} is known, as when 1 - T_i is the p-value of the ith test; Storey (2002) and Genovese and Wasserman (2002) examined this special case.)
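
A direct transcription of the estimator (10) might look like the following sketch (function and variable names are illustrative; null_stats stands for the pooled mM resampled null statistics):

```python
import numpy as np

def dfdr_hat(t, obs_stats, null_stats, p0_hat):
    """Eq. (10): estimated dFDR for the rejection region [t, infinity).

    obs_stats  : array of the m observed test statistics T_i
    null_stats : array of the m*M resampled null statistics T_i^(0)
    p0_hat     : estimate of the proportion of true null hypotheses
    """
    null_tail = np.mean(null_stats >= t)   # estimates 1 - F^(0)(t)
    obs_tail = np.mean(obs_stats >= t)     # estimates 1 - F(t)
    if obs_tail == 0:
        return np.nan                      # no discoveries: dFDR undefined here
    return p0_hat * null_tail / obs_tail
```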

Making use of the fact that a large proportion of the test statistics that are less than some threshold λ would be associated with true null hypotheses, Efron et al. (2001) and Storey (2002) estimated p_0 by variants of

\hat{p}_0(\lambda) = \frac{\sum_{i=1}^{m} I\!\left( T_i < \lambda \right) / m}{\sum_{i=1}^{mM} I\!\left( T_i^{(0)} < \lambda \right) / (mM)}.   (11)

Storey (2002) offers some interesting ways to optimize λ, but in the present application to gene expression data, I will follow the more straightforward approach of Efron et al. (2001) for simplicity. They transformed the observed test statistics to follow Φ, the standard normal distribution, then counted the numbers of transformed statistics falling in the range (-1/2, 1/2). The example analyses below use Eq. (11) to obtain similar results, without transformation, by choosing λ such that the proportion of observed test statistics that are less than λ is as close as possible to Φ(1/2) - Φ(-1/2) ≈ 0.382925. Given a value of λ and a test-independent cost-to-benefit ratio, Eq. (6) suggests selecting the value of t ∈ {T_i} that maximizes

\hat{d}(t) = b_1 \left( 1 - \left( 1 + \frac{c_1}{b_1} \right) \hat{D}(t) \right) \sum_i R_i(t).   (12)

Clearly, the best rejection region can be found without knowledge of b_1, as long as the ratio c_1/b_1 is specified. Like estimation of the FDR and pFDR, estimation of the dFDR does not require restrictive assumptions about F. The dFDR has the additional advantage that the same methods of estimation, such as the one given here, can be used without assumptions about correlations between the test statistics. It will be seen in the next section that certain methods that require independence or weak dependence for the estimation of the FDR or pFDR can estimate the dFDR without such restrictions.
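
The following sketch strings Eqs. (11) and (12) together with the dfdr_hat function from the previous sketch (again illustrative; the rule for λ mimics the Efron et al. (2001) choice described in the text):

```python
import numpy as np
from scipy.stats import norm

def p0_hat(obs_stats, null_stats):
    """Eq. (11), with lambda chosen so that the proportion of observed
    statistics below lambda is as close as possible to
    Phi(1/2) - Phi(-1/2) ~= 0.3829."""
    target = norm.cdf(0.5) - norm.cdf(-0.5)
    lam = np.quantile(obs_stats, target)
    num = np.mean(obs_stats < lam)    # proportion of observed statistics < lambda
    den = np.mean(null_stats < lam)   # proportion of null statistics < lambda
    return num / den

def best_threshold(obs_stats, null_stats, cost_benefit_ratio=19.0):
    """Eq. (12): choose t among the observed statistics to maximize
    d_hat(t) = (1 - (1 + c1/b1) * D_hat(t)) * (number of discoveries),
    taking b1 = 1 without loss of generality."""
    p0 = p0_hat(obs_stats, null_stats)
    best_t, best_d = None, -np.inf
    for t in np.unique(obs_stats):
        D = dfdr_hat(t, obs_stats, null_stats, p0)   # from the earlier sketch
        n_disc = np.sum(obs_stats >= t)
        d = (1.0 - (1.0 + cost_benefit_ratio) * D) * n_disc
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d, p0
```

For data of the size analyzed below, a sorted or vectorized search over thresholds would be preferable, but the explicit loop keeps the correspondence with Eqs. (11) and (12) easy to follow.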

Bayesian formalism: the dFDR as a posterior probability

While the above definition of the dFDR and estimation procedure were given in terms agreeable with a frequentist perspective in which the truth or falsehood of each null hypothesis is fixed, they are equally valid in a Bayesian construction. Efron et al. (2001), Genovese and Wasserman (2002), and Storey (2002) considered the case in which each H_i is a random variable and p_0 is the prior probability that a null hypothesis is true, so that the marginal distribution F is a mixture of F^{(0)} and F^{(1)}, the distributions of the statistics conditioned on the truth and falsehood of the null hypothesis: F = p_0 F^{(0)} + (1 - p_0) F^{(1)}. Storey (2002) pointed out that the dFDR is equal to the posterior probability that a null hypothesis is true:

D = \frac{P(H = 0) \, P(t \in \Gamma \mid H = 0)}{P(t \in \Gamma)} = \frac{P(H = 0, \, t \in \Gamma)}{P(t \in \Gamma)} = P(H = 0 \mid t \in \Gamma).   (13)

He also proved that this holds for the pFDR, assuming that the elements of {t_i} and {H_i} are independent, with the corollary that D = Q for the same rejection region Γ. Thus, under approximate independence,

Q \approx P(H = 0 \mid t \in \Gamma),   (14)

with equality only holding under exact independence. Storey (2002) used relation (14) to motivate the methods of estimating the FDR and pFDR that inspired the variant method of Eqs. (10) and (11). (He adjusted his estimates for the conditionality of the pFDR on Σ_i R_i > 0, but the adjustment is not needed to estimate the FDR or dFDR.) Storey (2002) also proved that such estimation procedures are valid asymptotically (m → ∞) under weak dependence. However, these techniques more directly estimate the dFDR since Eq. (13) holds exactly, without any assumption about the nature of dependence. Far from detracting from the methods of estimation, this observation broadens their application, since one can use them to make inferences about the dFDR whether or not they apply to the pFDR for a particular data set.

Other methods of optimization

The maximization of d̂(t) determines a rejection region using the dFDR and the cost and benefit of a false or true discovery. Genovese and Wasserman (2001) and Storey (2002), on the other hand, presented cost functions that use the concept of a false nondiscovery, a failure to reject a false null hypothesis. The approach taken herein has much more in common with a Bayes error approach of Storey (2002) than with the approach based on a loss function of Genovese and Wasserman (2001), also considered in Storey (2002). Each will be considered in turn.

Storey (2002) introduced a method based on the cost of a false nondiscovery instead of on the benefit from a true discovery. Table 1 clarifies the difference between his formulation and that described above:

Change in desirability d_i(c, b, c̃)    R_i = 0    R_i = 1
H_i = 0                                 0          -c
H_i = 1                                 -c̃         +b

Table 1. Contribution of each cost or benefit to the net desirability.

The approach of Eqs. (4), (5), and (6) requires that c > 0, b > 0, and c̃ = 0, whereas that of Storey (2002) requires that c > 0, b = 0, and c̃ > 0; he pointed out that a scientist can feasibly decide on c̃/c for a microarray experiment. If the net desirability is defined as d(c, b, c̃) ≡ Σ_i d_i(c, b, c̃), then the Bayes error that Storey minimized, BE(c, 0, c̃), is equal to -E(d(c, 0, c̃))/m. Since maximizing E(d(c, b, 0)) and minimizing -E(d(c, 0, b))/m yield the same optimal rejection region, the methods have the same goal. Thus, stating the problem in terms of false nondiscoveries as opposed to true discoveries is largely a matter of taste, but the latter is simpler and is more in line with the intuition that motivates reporting false discovery rates, which concerns false and true discoveries, not nondiscoveries. Unlike the technique used here, Storey's (2002) algorithm of optimization relies on a knowledge or estimation of probability densities, which can be problematic. Another advantage of maximizing d̂(t) is that it does not require a Bayesian framework, but is also consistent with fixed values of H_i.
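
To see why the two formulations share the same optimum, a tiny sketch (illustrative, with made-up counts) can encode Table 1 and check that maximizing E(d(c, b, 0)) and minimizing the Bayes error -E(d(c, 0, b))/m rank candidate rejection regions identically.

```python
import numpy as np

def net_desirability(n_true_disc, n_false_disc, n_false_nondisc, c, b, c_tilde):
    """Table 1: sum over tests of the change in desirability d_i(c, b, c_tilde)."""
    return b * n_true_disc - c * n_false_disc - c_tilde * n_false_nondisc

# Hypothetical candidate rejection regions, summarized by their expected counts
# of true discoveries, false discoveries, and false nondiscoveries, out of
# m = 1000 tests of which 200 null hypotheses are false.
m = 1000
candidates = [(60, 2, 140), (120, 10, 80), (170, 40, 30), (195, 120, 5)]

c, b = 19.0, 1.0
gains = [net_desirability(td, fd, fn, c, b, 0.0) for td, fd, fn in candidates]
bayes_errors = [-net_desirability(td, fd, fn, c, 0.0, b) / m
                for td, fd, fn in candidates]

# The maximizer of E(d(c, b, 0)) is also the minimizer of BE(c, 0, b).
assert np.argmax(gains) == np.argmin(bayes_errors)
```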

Genovese and Wasserman (2001) considered a linear combination of the FDR and the false nondiscovery rate (FNR), and Storey (2002) optimized the corresponding linear combination of the pFDR and the positive false nondiscovery rate (pFNR). The method of Storey (2002) would be more exact using the dFDR and the decisive false nondiscovery rate (dFNR), the ratio of the expected number of false nondiscoveries to the expected total number of nondiscoveries. However, the maximization of d̂(t) or minimization of the estimated Bayes error is preferred since they are based on c/b, the ratio of the cost of a false discovery to the benefit of a true discovery, or, equivalently, on c/c̃. The linear combination methods, by contrast, are based on the ratio of the discovery rates. The problem is that the two ratios are not the same in general since the discovery rates have different denominators. Since it is unrealistic in most cases to ask an investigator to specify a ratio of discovery rates, the straightforward approach of optimizing a total desirability or Bayes error is more widely applicable.

Test-dependent costs and benefits

The use of the dFDR and decision theory does not require all tests to have equal costs of false discoveries and benefits of true discoveries. If the costs and benefits associated with each test are equal to those of a sufficiently large number of other tests, as in the following application to microarray data, then the estimators of Eqs. (10), (11), and (12) and the maximization of d̂(t) can be applied separately to each subset of tests with the same costs and benefits, yielding a different rejection region for each subset; this per-subset optimization is sketched below. In the notation used above, c_i = c_j and b_i = b_j whenever i and j belong to the same subset I_k, where the subsets I_1, ..., I_L are disjoint, their union is {1, 2, ..., m}, and L ≤ m. It is assumed that each set I_k has enough members that reliable estimates can be obtained by replacing {1, 2, ..., m} with I_k in the sums of Eqs. (10) and (11) and replacing the null statistics T_i^{(0)} with test statistics of the null distribution generated from I_k. Let d̂_k(t_k) be the resulting estimate of the net desirability associated with I_k for threshold t_k. Maximizing d̂_k(t_k) independently for each I_k maximizes the estimated expected desirability, Σ_{k=1}^{L} d̂_k(t_k), but if a common rejection region is desired, then the estimate d̂_{b+c}(t) can be maximized instead by finding the optimal common threshold t, using p̂_0(b+c), a common estimate of p_0, found as described below. Since Σ_{k=1}^{L} max d̂_k(t_k) ≥ max d̂_{b+c}(t) if all estimates of p_0 and all null distributions are equal, the former approach tends to give a higher desirability, but the latter has an interesting connection with the weighted false discovery rate (WFDR) of Benjamini and Hochberg (1997). In analogy with the WFDR and the dFDR, I define the weighted decisive false discovery rate (WdFDR) as

D_w \equiv \frac{E\left( \sum_i w_i V_i \right)}{E\left( \sum_i w_i R_i \right)},   (15)

where w ≡ (w_i) is a vector of positive weights.
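
A minimal sketch of the per-subset optimization mentioned above (illustrative; it simply reuses the best_threshold function from the earlier sketch on each cost/benefit class):

```python
def per_subset_thresholds(subsets, cost_benefit_ratios):
    """Apply the threshold selection of Eq. (12) separately to each subset I_k
    of tests sharing a cost-to-benefit ratio c_k / b_k.

    subsets             : dict mapping subset label k to a pair
                          (obs_stats_k, null_stats_k) of arrays
    cost_benefit_ratios : dict mapping subset label k to c_k / b_k
    """
    results = {}
    for k, (obs_k, null_k) in subsets.items():
        t_k, d_k, p0_k = best_threshold(obs_k, null_k, cost_benefit_ratios[k])
        results[k] = {"threshold": t_k, "desirability": d_k, "p0_hat": p0_k}
    return results
```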

From Eq. (4), the expected net desirability can be expressed as

E(d) = E\left( \sum_i (b_i + c_i) R_i \right) \left[ \frac{E\left( \sum_i b_i R_i \right)}{E\left( \sum_i (b_i + c_i) R_i \right)} - D_{b+c} \right],   (16)

where D_{b+c} is the WdFDR with each weight set to w_i = b_i + c_i. Finding the threshold t that maximizes d̂_{b+c}(t), the estimate of E(d), thus yields an optimal estimate of D_{b+c}:

\hat{D}_{b+c}(t) \equiv \frac{\hat{p}_0(b+c) \sum_i (b_i + c_i) \, I\!\left( T_i^{(0)} \ge t \right) / (mM)}{\sum_i (b_i + c_i) \, I\!\left( T_i \ge t \right) / m},   (17)

with p̂_0(b+c) and d̂_{b+c}(t) similarly modified from Eqs. (11) and (12). Here, the null statistics are generated using a pool of all of the data, unlike the method in which each d̂_k(t_k) is independently maximized. Eqs. (16) and (17) indicate that D̂_{b+c}(t) can be seen as an approximate estimate of the WdFDR when each weight is set to the sum of the cost and benefit for the corresponding test. In addition, there is a Bayesian interpretation of the WFDR. Given that the distribution of the test statistics is the mixture F̃ = Σ_i w_i F_i / Σ_j w_j, the dFDR of test statistics distributed as F̃ [Eq. (9)] is equal to the WdFDR of test statistics the ith of which is distributed as F_i:

\tilde{D}(t) = \frac{p_0 \left( 1 - \tilde{F}^{(0)}(t) \right)}{1 - \tilde{F}(t)} = \frac{p_0 \sum_i w_i \left( 1 - F_i^{(0)}(t) \right) / \sum_j w_j}{\sum_i w_i \left( 1 - F_i(t) \right) / \sum_j w_j} = \frac{E\left( \sum_i w_i V_i \right)}{E\left( \sum_i w_i R_i \right)} = D_w.   (18)

Thus, the WdFDR is a posterior probability associated with t ~ F̃, according to Eq. (13). It follows that giving one test more weight than another is equivalent to increasing its representation in the sample by a proportional amount. Through the mixture F̃, the FDR and pFDR have parallel relationships with their weighted counterparts. Benjamini and Hochberg (1997) proposed the WFDR to take into account differing values or economic considerations among hypotheses, just as traditional Type I error rate control can give different weights for p-values or different significance levels to different hypotheses. However, without a compelling reason to have a common rejection region, optimizing d̂_k(t_k) independently for each I_k seems to better accomplish the goal of Benjamini and Hochberg (1997), in light of the above inequality.
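
For a common rejection region, the weighted estimator of Eq. (17) only turns the indicator sums of the earlier dfdr_hat sketch into weighted sums. A minimal illustrative version follows, where weights holds b_i + c_i for each observed statistic and null_weights holds the weight of the test that generated each resampled null statistic (both names are assumptions for this sketch):

```python
import numpy as np

def weighted_dfdr_hat(t, obs_stats, null_stats, weights, null_weights, p0_hat):
    """Eq. (17): WdFDR estimate for threshold t, with weights w_i = b_i + c_i."""
    num = np.sum(null_weights * (null_stats >= t)) / null_stats.size   # weighted null tail
    den = np.sum(weights * (obs_stats >= t)) / obs_stats.size          # weighted observed tail
    if den == 0:
        return np.nan
    return p0_hat * num / den
```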

Application to gene expression in cancer patients

The optimization of d̂(t) was applied to the public microarray data set of Golub et al. (1999). Levels of gene expression were measured for 7129 genes in 38 patients with B-cell acute lymphoblastic leukemia (ALL) and in 25 patients with acute myeloid leukemia (AML). These preprocessing steps were applied to the raw "average difference" (AD) levels of expression: each AD value was divided by the median of all AD values of the subject to reduce subject-specific effects, then the normalized AD values were transformed by AD → (AD / |AD|) ln(1 + |AD|), as per Bickel (2002). The ith null hypothesis is that there is no stochastic ordering in transformed expression between groups for the ith gene; if it is rejected, then the gene is considered differentially expressed. The goal is to reject null hypotheses such that the expected desirability is maximized, given a cost-to-benefit ratio. That ratio is set to 19 to ensure that the dFDR is controlled at the 5% level by inequality (7), provided that the maximum expected desirability is positive. The computations used the specific values c = 19 and b = 1, without loss of generality. (Alternately, a biologist could estimate the ratio of the cost of investigating a false discovery to the benefit of finding a truly differentially expressed gene.) To achieve the goal, the absolute value of a two-sample t-statistic,

T_i = \left| \bar{x}_{\mathrm{ALL},i} - \bar{x}_{\mathrm{AML},i} \right| \Big/ \left( s_{\mathrm{ALL},i}^2 / 38 + s_{\mathrm{AML},i}^2 / 25 \right)^{1/2},   (19)

was computed for each gene for both the original preprocessed data and for M = 1000 random permutations of the columns of the 7129 × (38 + 25) data matrix, following the unbalanced permutation method used by Dudoit et al. (2000). (The whole columns were permuted, instead of the expression values for each gene independently, to preserve the gene-gene correlations.) Finally, the threshold t was found such that d̂(t) is maximized, using the estimates computed from Eqs. (10), (11), and (12). Fig. 2 displays d̂(t) for these data. Table 2 compares the results to those obtained by choosing t such that the highest number of discoveries was made under the constraint D̂(t) ≤ 5%, as an approximation to FDR control. Since the method of Benjamini and Hochberg (1995) effectively uses p_0 = 1 (Storey, 2002), as does the algorithm of Bickel (2002), Table 2 also shows the effect of conservatively replacing p̂_0 with 1.
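
A sketch of the test-statistic and resampling step (illustrative; expr stands for the preprocessed 7129 × 63 expression matrix, with the 38 B-cell ALL columns first, as in the text):

```python
import numpy as np

def abs_t_stats(expr, n1=38):
    """Eq. (19): absolute two-sample t-statistics; group 1 = first n1 columns."""
    g1, g2 = expr[:, :n1], expr[:, n1:]
    num = np.abs(g1.mean(axis=1) - g2.mean(axis=1))
    den = np.sqrt(g1.var(axis=1, ddof=1) / g1.shape[1] +
                  g2.var(axis=1, ddof=1) / g2.shape[1])
    return num / den

def permutation_null_stats(expr, n_perm=1000, seed=0):
    """Null statistics from random permutations of whole columns,
    preserving the gene-gene correlations."""
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_perm):
        shuffled = expr[:, rng.permutation(expr.shape[1])]
        null.append(abs_t_stats(shuffled))
    return np.concatenate(null)   # m * M pooled null statistics
```

Feeding abs_t_stats(expr) and the pooled null statistics into the best_threshold sketch given earlier would carry out the same kind of threshold search that is summarized in Table 2, up to the illustrative simplifications noted above.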

[Figure 2. Estimated desirability d̂(t) for the ALL-AML comparison, Eq. (12), using the estimate p̂_0 ≈ 0.59 from Eq. (11).]

                                     max d̂(t),     max d̂(t),    D̂(t) ≤ 5%,    D̂(t) ≤ 5%,
                                     p̂_0 ≈ 0.59    p̂_0 ≡ 1      p̂_0 ≈ 0.59    p̂_0 ≡ 1
Threshold, t                         3.14          3.37         2.44          2.73
Number of discoveries                503           428          1496          1212
Desirability d̂(t) with p̂_0 ≈ 0.59   683           656          1.4           500
Desirability d̂(t) with p̂_0 ≡ 1      524           578          -1043         2.3
dFDR D̂(t) with p̂_0 ≈ 0.59           0.0125        0.0073       0.0500        0.0294
dFDR D̂(t) with p̂_0 ≡ 1              0.0212        0.0124       0.0849        0.0499

Table 2. Comparison of estimates of the proposed method (maximizing d̂(t) with p̂_0 from Eq. (11)) to three alternate methods. In each column, the quantity maximized or constrained by that column's method attains its optimal value.

Comparisons between other groups can easily be added to these results. For example, one might be interested not only in knowing which genes are differentially expressed between B-cell ALL and AML patients, but also which ones are differentially expressed between the B-cell ALL group and the 9 patients of Golub et al. (1999) with T-cell ALL. A family-wise error rate control approach, such as that used by Dudoit et al. (2000), would require additional adjustments to the p-values, so that fewer genes would be found to be differentially expressed as the number of group-group comparisons increases, but the approach of independent rejection regions, given by t_1 and t_2, does not require any modification to the results of Table 2.

Suppose that the benefit of a true discovery of differential expression between the B-cell ALL and T-cell ALL groups is twice that between the B-cell ALL and AML groups, and that the cost of investigating a false discovery is the same for all comparisons. Then the only difference in the method applied to the new set of comparisons is that the benefit of each true discovery is set to 2 instead of 1. In the above notation, I_1 specifies the t-statistics corresponding to the comparisons to the AML group and I_2 specifies those corresponding to the comparisons to the T-cell ALL group. In microarray experiments, this kind of multiplicity of subject groups as well as a multiplicity of genes is common, and the dFDR and decision theory together provide a flexible, powerful framework for such situations. The computed threshold is t_2 ≈ 3.64, yielding 177 genes considered differentially expressed and a dFDR estimate of 0.0418; these values correspond to t_1 ≈ 3.14, 503 genes, and a dFDR estimate of 0.0125 from the first column of Table 2. The two thresholds are close enough that forcing a common threshold, as in Eqs. (17) and (18), would not have a drastic effect on the conclusions drawn in this case. By contrast, the marked difference between the two dFDR estimates is a reminder that traditional FDR control does not maximize the desirability. Even so, each estimate is controlled in the sense of satisfying inequality (7).

In estimating the pFDR, Storey (2002) considered the expression patterns of genes to be weakly dependent, but that may presently be impossible to test statistically since the number of genes is typically much greater than the number of cases. Storey (2002) hypothesized weak dependence since genes work in biochemical pathways of small groups of interacting genes. However, weak dependence would only follow if there were no interaction between one pathway and another, an assumption that could only hold approximately at best. Indeed, the expression of thousands of genes in eukaryotes is controlled by highly connected networks of transcriptional regulatory proteins (Lee et al., 2002). Thus, I hypothesize instead that while weak dependence is a good approximation for prokaryotes, there is stronger dependence between eukaryotic genes. Fractal statistics can often conveniently describe biological long-range correlations (Bickel and West, 1998) and may aid the study of genome-wide interactions as more data become available. At any rate, the investigator can safely report the dFDR, since that is what is directly estimated, noting that it would be approximately equal to the FDR or pFDR under certain conditions.

References

Benjamini, Y. and Hochberg, Y. (1995) "Controlling the false discovery rate: A practical and powerful approach to multiple testing," Journal of the Royal Statistical Society B 57, 289-300

Benjamini, Y. and Hochberg, Y. (1997) "Multiple hypotheses testing with weights," Scandinavian Journal of Statistics 24, 407-418

Benjamini, Y. and Yekutieli, D. (2001) "The control of the false discovery rate in multiple testing under dependency," Annals of Statistics 29, 1165-1188

Bickel, D. R. and West, B. J. (1998) "Molecular evolution modeled as a fractal renewal point process in agreement with the dispersion of substitutions in mammalian genes," Journal of Molecular Evolution 47, 551-556

Bickel, D. R. (2002) "Microarray gene expression analysis: Data transformation and multiple comparison bootstrapping," Computing Science and Statistics (to appear); available from www.mathpreprints.com

Dudoit, S., Yang, Y. H., Callow, M. J., and Speed, T. P. (2000) "Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments," Technical Report #578, Department of Statistics, University of California at Berkeley, http://www.stat.berkeley.edu/users/terry/zarray/html/matt.html

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001) "Empirical Bayes analysis of a microarray experiment," Journal of the American Statistical Association 96, 1151-1160

Genovese, C. and Wasserman, L. (2001) "Operating characteristics and extensions of the FDR procedure," Technical report, Department of Statistics, Carnegie Mellon University

Genovese, C. and Wasserman, L. (2002) "Bayesian and frequentist multiple testing," unpublished draft of 4/12/02

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999) "Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring," Science 286, 531-537

Jeffrey, R. C. (1983) The Logic of Decision, second edition, University of Chicago Press: Chicago

Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I., Zeitlinger, J., Jennings, E. G., Murray, H. L., Gordon, D. B., Ren, B., Wyrick, J. J., Tagne, J.-B., Volkert, T. L., Fraenkel, E., Gifford, D. K., and Young, R. A. (2002) "Transcriptional regulatory networks in Saccharomyces cerevisiae," Science 298, 799-804

Storey, J. D. (2002) False Discovery Rates: Theory and Applications to DNA Microarrays, PhD dissertation, Department of Statistics, Stanford University