A Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments

Jie Chen^1
Merck Research Laboratories, P. O. Box 4, BL3-2, West Point, PA 19486, U.S.A. Telephone: (484)

Sanat K. Sarkar^2
Department of Statistics, Temple University, Philadelphia, PA 19422, U.S.A. Telephone: (215)

Short title: Bayesian Threshold for Differential Gene Expression

1. jie chen@merck.com
2. sanat@temple.edu. The research is supported by NSF Grant DMS

Abstract

The original definitions of false discovery rate (FDR) and false non-discovery rate (FNR) can be understood as the frequentist risks of false rejections and false non-rejections, respectively, conditional on the unknown parameter, while the Bayesian posterior FDR and posterior FNR are conditioned on the data. From a Bayesian point of view, it seems natural to take into account the uncertainties in both the parameter and the data. In this spirit, we propose averaging out the frequentist risks of false rejections and false non-rejections with respect to some prior distribution of the parameters to obtain the Average FDR (AFDR) and Average FNR (AFNR), respectively. A linear combination of the AFDR and AFNR, called the Average Bayes Error Rate (ABER), is considered as an overall risk. Some useful formulas for the AFDR, AFNR and ABER are developed for normal samples with hierarchical mixture priors. The idea of finding threshold values by minimizing the ABER or controlling the AFDR is illustrated using a gene expression data set. Simulation studies show that the proposed approaches are more powerful and robust than the widely used FDR method.

Keywords: Average false discovery rate, Average false non-discovery rate, Average Bayes error rate, Hierarchical mixture model, Microarray experiment.

1 Introduction

The emergence of DNA microarray technology allows for the study of sequence, structure and expression of thousands of genes simultaneously. Microarrays are being used increasingly in a wide variety of areas, such as toxicological research (toxicogenomics), gene discovery, disease diagnosis, and drug discovery (pharmacogenomics) (Nuwaysir, Bittner, Trent, Barrett, and Afshari 1999; Afshari, Nuwaysir, and Barrett 1999; Callow, Dudoit, Gong, Speed, and Rubin 2000).

A typical DNA microarray generates a massive amount of data concerning gene regulation and interactions. A natural question that arises in such a study is whether differential expression of a gene is associated with a certain condition, such as the tumor type in breast cancer. This question is commonly posed as a multiple testing problem, with the null hypothesis for each gene representing no association of expression level with the condition. With a large number of hypothesis tests performed simultaneously, the probability of misidentifying a gene as differentially expressed when it is not can increase sharply. The traditional concept of familywise error rate (FWER) is too restrictive to adopt in such a multiple testing situation. Instead, the false discovery rate (FDR) of Benjamini and Hochberg (1995) (referred to as BH-FDR hereafter) and related measures seem more appropriate, as they offer less stringent criteria and thus provide more powerful methods for dealing with this magnitude of multiplicity. For an overview of multiple hypothesis testing in gene expression analysis, the reader is referred to Dudoit, Shaffer, and Boldrick (2003), Reiner, Yekutieli, and Benjamini (2003), and Ge, Dudoit, and Speed (2003). Suppose that we have $k$ genes, with the $i$th gene having the test statistic or differential expression measurement $D_i$, which has a probability distribution depending on an unknown parameter $\theta_i$, $i = 1, \ldots, k$. Let $D = \{D_1, \ldots, D_k\}$ and $\theta = \{\theta_1, \ldots, \theta_k\}$. The underlying hypotheses of interest are $H_i: \theta_i = \theta_0$ against the alternatives $\theta_i \neq \theta_0$, $i = 1, \ldots, k$, for some known $\theta_0$. Decisions on these $k$ hypotheses, that is, rejection or acceptance of the null hypotheses, are usually based on the magnitudes of the corresponding test statistics $D_i$, $i = 1, \ldots, k$. Table 1 shows all possible outcomes for the $k$ hypothesis tests.
The FDR is defined as the expected proportion of false rejections among the set of rejected hypotheses, i.e., $\mathrm{FDR} = E\{V/(R \vee 1)\}$, where $V$ is the number of false rejections, $R$ is the total number of rejections, and $R \vee 1 = \max\{1, R\}$. Storey (2002, 2003) introduces a modified version of the FDR, called the positive false discovery rate (pFDR), defined as $\mathrm{pFDR} = E[V/R \mid R > 0]$, and argues that it is often an appropriate error measure. Storey (2003) showed that the pFDR can be written as a Bayesian posterior probability, which is asymptotically true under fairly general conditions. An analog of the FDR in terms of false non-rejections, called the FNR, is introduced by Genovese and Wasserman (2002b) and Sarkar (2005). While Genovese and Wasserman call it the False Non-Discovery Rate, Sarkar calls it the False Negatives Rate. It is the expected proportion of false non-rejections among the set of non-rejected hypotheses, i.e., $\mathrm{FNR} = E\{T/(A \vee 1)\}$, where $T$ is the number of false non-rejections and $A$ is the total number of non-rejected hypotheses. Sarkar (2005) developed some results on the FDR and FNR in single-step procedures for dependent test statistics under both a model where the number of true null hypotheses is assumed fixed but unknown and a mixture model where different configurations of true and false null hypotheses are assumed to have certain probabilities. He extended some previously known results developed under the assumption of independence and explained how an FDR- or FNR-controlling single-step procedure, such as the Bonferroni or Šidák procedure, can potentially be improved using an estimate of $k_0$, the number of true null hypotheses.

Multiple testing can be viewed as a decision-making process through which one rejects or retains the null hypotheses by controlling some error or risk. A frequentist measures this error or risk by considering a loss function and averaging it over different possible realizations of the data $D$ conditional on the unknown parameter $\theta$. A Bayesian, on the other hand, defines this risk by averaging the loss over different possible states of nature conditional on the data $D$. For instance, the FDR and FNR can be understood as frequentist risks of false rejections and false non-rejections, respectively.
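As an illustration of the quantities inside these expectations, the following minimal sketch computes the realized proportions $V/(R \vee 1)$ and $T/(A \vee 1)$ for a single set of decisions. The function name `error_proportions` and the toy inputs are assumptions made for this example, and the true null status is treated as known, which is only possible in a simulation. The FDR and FNR are the expectations of these proportions over repeated data.

```python
import numpy as np

def error_proportions(is_null, rejected):
    """Return (V / max(R, 1), T / max(A, 1)) for one set of decisions.

    is_null  : True where the null hypothesis is actually true
    rejected : True where the null hypothesis was rejected
    """
    is_null = np.asarray(is_null, dtype=bool)
    rejected = np.asarray(rejected, dtype=bool)
    V = int(np.sum(rejected & is_null))      # false rejections
    R = int(np.sum(rejected))                # all rejections
    T = int(np.sum(~rejected & ~is_null))    # false non-rejections
    A = int(np.sum(~rejected))               # all non-rejections
    return V / max(R, 1), T / max(A, 1)

# Hypothetical toy example: five hypotheses, the first three are true nulls.
is_null = [True, True, True, False, False]
rejected = [True, False, False, True, False]
fdp, fnp = error_proportions(is_null, rejected)
print(fdp, fnp)
```

In this toy example one of the two rejections is false and one of the three non-rejections is false, so the two proportions are 1/2 and 1/3.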
Genovese and Wasserman (2002a) gave the first fully Bayesian exposition of the false discovery rate and introduced the Bayesian posterior FDR (PFDR), defined as $\mathrm{PFDR} = E_{\theta \mid D}[V/(R \vee 1)]$, assuming a certain prior distribution of $\theta$. They also introduced the posterior FNR (PFNR), defined as $\mathrm{PFNR} = E_{\theta \mid D}[T/(A \vee 1)]$.

In this article, we address the multiple testing problem in a microarray experiment by determining a threshold or critical value for each gene from a Bayesian perspective. It seems natural that, when a Bayesian approach is considered, one should take into account the uncertainties in both parameter and data in determining risks. In this spirit, we propose the idea of controlling the Average FDR (AFDR), which is the average of the frequentist risk of false rejections with respect to the prior distribution of $\theta$. The AFDR can also be seen as the expected Bayesian posterior risk of false rejections with respect to the marginal distribution of the data $D$. The AFDR approach provides an alternate view on controlling the error rate involving false positives and hence is useful in multiple testing problems, including those that arise in gene expression data analysis. An analog of the AFDR in terms of false non-rejections, called the Average FNR (AFNR), is also developed to control the error rate involving false negatives from a Bayesian viewpoint. An overall Bayes risk is defined as a linear combination of the AFDR and AFNR. We call this the Average Bayes Error Rate (ABER) and propose to determine the threshold value of the test statistics by minimizing the ABER. Our simulation studies show that the proposed approach of minimizing the ABER or controlling the AFDR is more powerful (in terms of average power) and more robust to influential data points, as well as to the choice of priors, than the method controlling the BH-FDR.

The paper is organized as follows. In Section 2, we describe a hierarchical mixture model that will be used in identifying differentially expressed genes between two sample types. The concept of AFDR is formally defined in Section 3, and some formulas associated with this measure in a single-step procedure are developed. Section 4 is devoted to the similar development of the AFNR. The ABER is described in Section 5.
In Section 6, we apply our ABER approach to a breast cancer data set and illustrate how the threshold is determined for detection of differentially expressed genes in terms of minimizing the ABER. Section 7 is devoted to simulation studies comparing our approach of minimizing the ABER to the BH-FDR controlling method. We also add in the simulations the approach where only the AFDR is controlled. The AFDR, ABER and BH-FDR approaches are compared in terms of average power and robustness to influential data points and to the choice of priors. Some possible alternative definitions of measures of false positives and false negatives are discussed in Section 8.

2 Hierarchical Mixture Model for Gene Expressions

We present in this section the development of our procedure for identifying differentially expressed genes based on a hierarchical mixture model. The mixture model approach has been taken by Efron, Tibshirani, Storey, and Tusher (2001), Storey (2003) and Sarkar (2005). We slightly extend this approach by using hyper-prior distributions. We assume that the microarray data have been preprocessed or normalized to adjust for any bias and systematic variation other than the factor under consideration, and are ready for statistical analysis of significance. For discussions on pre-treatment of microarray data, the reader is referred to Finkelstein, Gollub, and Cherry (2001), Yang, Dudoit, Luu, and Speed (2001), and Chen, Kodell, Sistare, Thompson, Morris, and Chen (2003).

We restrict our attention to two-sample comparisons, i.e., to comparisons of gene expression levels from two different types of samples (treatment vs control, disease vs non-disease, two different tumor types, etc.). Let $X_{ijl}$ be the $l$th (normalized) expression measurement for the $i$th gene from the $j$th type of sample, with $X_{ijl} \sim N(\eta_{ij}, \sigma^2)$ for $l = 1, \ldots, n_{ij}$, $j = 1, 2$ and $i = 1, \ldots, k$. One often assigns a prior distribution to $\sigma^2$. However, for simplicity we obtain the unbiased estimator $\hat\sigma^2$ of $\sigma^2$, from which the variance of the difference in average expression levels between the two sample types can be easily derived. The unbiased estimator $\hat\sigma^2$ of $\sigma^2$ is simply

\[ \hat\sigma^2 = \frac{1}{n - 2k} \sum_{i=1}^{k} \sum_{j=1}^{2} \sum_{l=1}^{n_{ij}} \left( X_{ijl} - \bar X_{ij} \right)^2, \tag{2.1} \]

where $\bar X_{ij} = \sum_{l=1}^{n_{ij}} X_{ijl}/n_{ij}$ and $n = \sum_{i=1}^{k} \sum_{j=1}^{2} n_{ij}$. Let $D_i = \bar X_{i1} - \bar X_{i2}$ and $\theta_i = \eta_{i1} - \eta_{i2}$ be, respectively, the differences of sample and population means of expression levels for gene $i$ between the two sample types, $i = 1, \ldots, k$. Suppose $n_{i1} = n_1$ and $n_{i2} = n_2$ for all $i = 1, \ldots, k$, i.e., the sample sizes within each of the two sample types are the same for all the genes. This gives the estimated variance of $D_i$ as $\hat\sigma_D^2 = \hat\sigma^2 (1/n_1 + 1/n_2)$.

The problem of identifying differentially expressed genes that are associated with a certain condition is typically approached by simultaneously testing the null hypotheses $H_i: \theta_i = 0$ against the complementary alternatives $\theta_i \neq 0$, $i = 1, \ldots, k$. Towards this goal, we assume the following hierarchical mixture model:

\[ D_i \mid \theta_i \sim N(\theta_i, \hat\sigma_D^2), \quad i = 1, \ldots, k; \]
\[ \theta_i \mid \mu, \tau^2 \sim \pi_0 I(\theta_i = 0) + (1 - \pi_0) N(\mu, \tau^2) I(\theta_i \neq 0), \quad i = 1, \ldots, k; \]
\[ \mu \mid \xi, \tau^2 \sim N(0, \xi\tau^2), \]

where $\pi_0$ is the prior probability of the null hypothesis being true and $(\xi, \tau^2)$ may follow some distribution $g_1(\xi, \tau^2)$ or sometimes are assigned arbitrary values. That is, the $D_i$'s are conditionally independent given the $\theta_i$'s, which are also conditionally independent given $\mu$, $\tau^2$ and $\xi$. Under this distributional setup, it can be shown that the marginal distribution of the data $D = \{D_1, \ldots, D_k\}$, conditional on $\tau^2$ and $\xi$, is the following mixture of normals:

\[ D \sim \pi_0 N_k(0, \hat\sigma_D^2 I_k) + (1 - \pi_0) N_k(0, \psi(\tau, \xi)), \tag{2.2} \]

where

\[ \psi(\tau, \xi) = \begin{pmatrix} v(\xi, \tau^2) & c(\xi, \tau^2) & \cdots & c(\xi, \tau^2) \\ c(\xi, \tau^2) & v(\xi, \tau^2) & \cdots & c(\xi, \tau^2) \\ c(\xi, \tau^2) & c(\xi, \tau^2) & \cdots & v(\xi, \tau^2) \end{pmatrix}, \]

$v(\xi, \tau^2) = \hat\sigma_D^2 + (1 + \xi)\tau^2$ and $c(\xi, \tau^2) = \xi\tau^2 / v(\xi, \tau^2)$. Berger (1985) and Schervish (1995) provide a detailed proof for normal hierarchical models with non-mixture structure.

Some recent articles adopt Bayesian hierarchical mixture model approaches to identifying differentially expressed genes from microarray experiments (Baldi and Long 2001; Broët, Richardson, and Radvanyi 2002; Ibrahim, Chen, and Gray 2002; Ishwaran and Rao 2003). These procedures, however, are based solely on the posterior distribution of the parameter, and the FDRs so derived are Bayesian posterior FDRs conditional on the data (Ishwaran and Rao 2003; Newton, Noueiry, Sarkar, and Ahlquist 2003). To account for the uncertainties in both parameter and data, the concepts of the AFDR and AFNR are introduced, and some useful formulas under the above hierarchical mixture model are developed in the next two sections.

3 Average False Discovery Rate

We define the AFDR in a similar way to Benjamini and Hochberg (1995), except that an additional expectation is taken with respect to some prior distribution of the parameter.
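Before the formal definition, the hierarchical mixture model of Section 2 can be made concrete with a short simulation. The sketch below draws one data set from the three-level hierarchy for given values of $\xi$ and $\tau^2$. The function name `simulate_mixture` and all numerical values in the usage lines (including `sigma2_D = 0.1` and the threshold 1.5) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def simulate_mixture(k, pi0, sigma2_D, tau2, xi, rng):
    """Draw one data set (D, theta, is_null) from the hierarchical mixture model."""
    mu = rng.normal(0.0, np.sqrt(xi * tau2))            # mu | xi, tau2 ~ N(0, xi * tau2)
    is_null = rng.random(k) < pi0                       # theta_i = 0 with probability pi0
    theta_alt = rng.normal(mu, np.sqrt(tau2), size=k)   # N(mu, tau2) draws for the alternatives
    theta = np.where(is_null, 0.0, theta_alt)
    D = rng.normal(theta, np.sqrt(sigma2_D))            # D_i | theta_i ~ N(theta_i, sigma2_D)
    return D, theta, is_null

# Hypothetical usage with illustrative parameter values.
rng = np.random.default_rng(2024)
D, theta, is_null = simulate_mixture(k=3226, pi0=0.9, sigma2_D=0.1, tau2=1.0, xi=1.0, rng=rng)
rejected = np.abs(D) >= 1.5     # single-step rule: reject H_i when |D_i| >= c, here c = 1.5
print(int(rejected.sum()), "rejections out of", D.size)
```

Because a single $\mu$ is shared by all non-null genes, the $D_i$ are dependent marginally but independent once $\mu$, $\xi$ and $\tau^2$ are fixed, which is the property used in the formulas that follow.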

Definition 1. The Average False Discovery Rate (AFDR) among the set of rejected hypotheses is defined to be

\[ \mathrm{AFDR} = E_{\theta}\left[ E_{D \mid \theta}\left( \frac{V}{R \vee 1} \right) \right]. \tag{3.1} \]

This quantity is the Bayes risk of false rejections among the set of rejected hypotheses. Note that by reversing the order of integration in (3.1), we obtain the alternate form

\[ \mathrm{AFDR} = E_{D}\left[ E_{\theta \mid D}\left( \frac{V}{R \vee 1} \right) \right], \tag{3.2} \]

which is the expected posterior risk with respect to the marginal distribution of the data $D$. Therefore, $\mathrm{AFDR} = E_D(\mathrm{PFDR})$.

Suppose that a large absolute value of $D_i$ compared to a threshold value $c$ leads to the rejection of $H_i$, identifying the corresponding gene as either under- or over-expressed. Let $|D|^{(-i)}_{1:k-1} \le \cdots \le |D|^{(-i)}_{k-1:k-1}$ be the ordered components of $\{|D_j| : j \in J^{(-i)}\}$ with $J^{(-i)} = J \setminus \{i\}$ and $J = \{1, \ldots, k\}$. Define $|D|_{0:k} = -\infty$ and $|D|_{k+1:k} = \infty$. Then, as in Sarkar (2005), the AFDR of this procedure can be written as

\[ \mathrm{AFDR} = \sum_{i=1}^{k} \left[ P\{|D_i| \ge c, \theta_i = 0\} - \sum_{j=1}^{k-1} \frac{P\{|D|^{(-i)}_{j:k-1} \ge c, |D_i| \ge c, \theta_i = 0\}}{(k-j)(k-j+1)} \right]. \tag{3.3} \]

If $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid, (3.3) reduces to

\[ \mathrm{AFDR} = P\{R > 0\} \, P\{\theta_1 = 0 \mid |D_1| \ge c\}; \tag{3.4} \]

see also Storey (2003). Under the hierarchical mixture model, notice that $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid conditional on $\mu$, $\xi$ and $\tau^2$. Therefore, conditional on $\mu$, $\xi$ and $\tau^2$, the AFDR can be written as

\[ \mathrm{AFDR} = \frac{\left[1 - \nu^k\right] 2\pi_0 \left[1 - \Phi(c_0)\right]}{1 - \nu} = 2\pi_0 \left[1 - \Phi(c_0)\right] \sum_{j=0}^{k-1} \nu^j, \tag{3.5} \]

where $\Phi$ is the c.d.f. of the standard normal distribution,

\[ \nu = \pi_0 \left[2\Phi(c_0) - 1\right] + (1 - \pi_0)\left[\Phi(c_2) - \Phi(c_1)\right], \]
\[ c_0 = \frac{c}{\hat\sigma_D}, \quad c_1 = \frac{-c - \mu}{\sqrt{\hat\sigma_D^2 + \tau^2}}, \quad \text{and} \quad c_2 = \frac{c - \mu}{\sqrt{\hat\sigma_D^2 + \tau^2}}. \]

The AFDR is the integral of (3.5) with respect to the prior distribution of $\mu$, $\xi$ and $\tau^2$. It is a nonincreasing function of $c$. This can be seen from the following two results, conditionally given $\mu$, $\xi$ and $\tau^2$. First, $P\{R > 0\} = 1 - [P\{|D_1| < c\}]^k$ is nonincreasing in $c$. Second,

\[ P\{\theta_1 = 0 \mid |D_1| \ge c\} = \left[ 1 + \frac{1 - \pi_0}{\pi_0} \int \frac{P\{|D_1| \ge c \mid \theta_1\}}{P\{|D_1| \ge c \mid \theta_1 = 0\}} \, \phi(\theta_1; \mu, \tau^2) \, d\theta_1 \right]^{-1} = \left[ 1 + \frac{1 - \pi_0}{\pi_0} \int \frac{P\{\chi_1^2(\lambda) > c_0^2\}}{P\{\chi_1^2 > c_0^2\}} \, \phi(\theta_1; \mu, \tau^2) \, d\theta_1 \right]^{-1}, \tag{3.6} \]

where $\phi(x; \mu, \tau^2)$ is the density of $N(\mu, \tau^2)$, $\chi_1^2$ is the central chi-squared random variable with 1 degree of freedom, and $\chi_1^2(\lambda)$ is the non-central chi-squared random variable with 1 degree of freedom and non-centrality parameter $\lambda = \theta_1^2/\hat\sigma_D^2$. This is also nonincreasing in $c$ because the ratio $P\{\chi_1^2(\lambda) > c_0^2\} / P\{\chi_1^2 > c_0^2\}$ is known to be nondecreasing in $c_0^2$ (DasGupta and Sarkar 1984) and hence in $c$.
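Formula (3.5) is straightforward to evaluate numerically. The following is a minimal sketch, assuming given values of $\mu$ and $\tau^2$ ($\xi$ enters only through the prior of $\mu$). The function name `afdr_conditional` and the parameter values in the usage lines are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def afdr_conditional(c, k, pi0, sigma2_D, mu, tau2):
    """Conditional AFDR of the rule |D_i| >= c, following (3.5)."""
    Phi = NormalDist().cdf                    # standard normal c.d.f.
    s = sqrt(sigma2_D + tau2)                 # marginal s.d. of D_i under the alternative
    c0 = c / sqrt(sigma2_D)
    c1 = (-c - mu) / s
    c2 = (c - mu) / s
    # nu = P(|D_1| < c): null part plus alternative part
    nu = pi0 * (2.0 * Phi(c0) - 1.0) + (1.0 - pi0) * (Phi(c2) - Phi(c1))
    # Geometric-sum form of (3.5); it avoids dividing by 1 - nu when nu is close to 1.
    return 2.0 * pi0 * (1.0 - Phi(c0)) * sum(nu ** j for j in range(k))

# Hypothetical usage: the conditional AFDR shrinks as the threshold grows.
for c in (0.5, 1.0, 1.5):
    print(c, afdr_conditional(c, k=100, pi0=0.9, sigma2_D=0.1, mu=0.0, tau2=1.0))
```

At $c = 0$ every hypothesis is rejected, $\nu = 0$, and the function returns $\pi_0$, the prior proportion of true nulls, which is a convenient sanity check.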

4 Average False Non-Discovery Rate

The AFDR defined above is only one part of the Bayes risk of misclassifications. Another quantity, measuring the error rate of false non-rejections, is the average false non-discovery rate, which is defined as follows.

Definition 2. The Average False Non-Discovery Rate (AFNR) among the set of non-rejected hypotheses is defined to be

\[ \mathrm{AFNR} = E_{\theta}\left[ E_{D \mid \theta}\left( \frac{T}{A \vee 1} \right) \right]. \tag{4.1} \]

In other words, the AFNR is the average risk of non-rejections when the hypotheses are false. This quantity can also be seen as the expected posterior risk of false non-rejections with respect to the marginal distribution of the data $D$. The AFNR of a single-step procedure that rejects $H_i$ if $|D_i|$ is large compared to the threshold value $c$ can be written as

\[ \mathrm{AFNR} = \sum_{i=1}^{k} \left[ P\{|D_i| < c, \theta_i \neq 0\} - \sum_{j=1}^{k-1} \frac{P\{|D|^{(-i)}_{j:k-1} < c, |D_i| < c, \theta_i \neq 0\}}{j(j+1)} \right]; \tag{4.2} \]

see Sarkar (2005). Again, if $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid, then (4.2) reduces to

\[ \mathrm{AFNR} = P\{A > 0\} \, P\{\theta_1 \neq 0 \mid |D_1| < c\}, \tag{4.3} \]

see also Storey (2003). Thus, under the above hierarchical mixture model and conditional on $\mu$, $\xi$, and $\tau^2$, the AFNR can be written as

\[ \mathrm{AFNR} = \frac{\left[1 - \{1 - \nu\}^k\right] (1 - \pi_0) \left[\Phi(c_2) - \Phi(c_1)\right]}{\nu}, \tag{4.4} \]

where $\nu$, $c_1$ and $c_2$ are as defined in (3.5).
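Formula (4.4) can be evaluated in the same way as (3.5). The sketch below is a minimal illustration for given $\mu$ and $\tau^2$. The function name `afnr_conditional`, the explicit handling of $\nu = 0$, and the parameter values in the usage lines are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist

def afnr_conditional(c, k, pi0, sigma2_D, mu, tau2):
    """Conditional AFNR of the rule |D_i| >= c, following (4.4)."""
    Phi = NormalDist().cdf                    # standard normal c.d.f.
    s = sqrt(sigma2_D + tau2)
    c0 = c / sqrt(sigma2_D)
    c1 = (-c - mu) / s
    c2 = (c - mu) / s
    alt = (1.0 - pi0) * (Phi(c2) - Phi(c1))   # P(|D_1| < c, theta_1 != 0)
    nu = pi0 * (2.0 * Phi(c0) - 1.0) + alt    # P(|D_1| < c)
    if nu == 0.0:                             # c = 0: nothing is accepted, so T = 0
        return 0.0
    return (1.0 - (1.0 - nu) ** k) * alt / nu

# Hypothetical usage: the conditional AFNR grows with the threshold.
for c in (0.5, 1.0, 1.5):
    print(c, afnr_conditional(c, k=100, pi0=0.9, sigma2_D=0.1, mu=0.0, tau2=1.0))
```

For a very large threshold nothing is rejected and the value approaches $1 - \pi_0$, the prior proportion of false nulls.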

The AFNR is the integral of (4.4) with respect to the prior distribution of $\mu$, $\xi$, and $\tau^2$. It is a nondecreasing function of $c$, which can be proved using the same arguments as used in the case of the AFDR.

5 Combining the AFDR and AFNR

The AFDR and AFNR together constitute the Bayes risk of misclassifications. Our idea here is to determine the threshold that minimizes the Bayes risk in some sense. We consider a weighted linear combination of the AFDR and AFNR, defined as the Average Bayes Error Rate (ABER) of false rejections and false non-rejections, i.e.,

\[ \mathrm{ABER} = w \, \mathrm{AFDR} + (1 - w) \, \mathrm{AFNR}, \tag{5.1} \]

with the weight $0 \le w \le 1$ on the AFDR being determined by the importance of false rejections relative to false non-rejections, and find the threshold that minimizes the ABER. This is in the spirit of Storey (2003) and Genovese and Wasserman (2002b), who considered similar combinations in terms of the FDR and FNR. Storey (2003) points out that there are two approaches that can be taken for the FDR: fix the FDR at the acceptable level $\alpha$ first and estimate the rejection region, or fix the rejection region first and provide an estimate of the FDR over that region. These approaches are practically useful, since the FDR is a monotonic function of $c$. Although we focus here on minimizing the ABER and finding its corresponding threshold, one can alternatively consider fixing the ABER, AFDR or AFNR and then estimating the threshold. By doing this, however, one may not be able to achieve the minimum of the ABER, since the minimization process requires the input of threshold values. In what follows, we illustrate the ABER minimization approach in a gene expression example and a simulation study where the thresholds at which the ABER is minimized will be provided.

6 An Application to Gene Expression Data

Hereditary breast cancer is known to be associated with mutations in BRCA1 and BRCA2 proteins. Hedenfalk et al. (2001) report that a group of genes are differentially expressed between tumors with BRCA1 mutations and tumors with BRCA2 mutations. The data, which are publicly available from the web site, consist of 22 breast cancer samples, among which $n_1 = 7$ are BRCA1 mutants, $n_2 = 8$ are BRCA2 mutants, and $n_3 = 7$ are sporadic (not used in this illustration). Expression levels, in terms of fluorescent intensity ratios of a tumor sample to a common reference sample, are measured for 3226 genes using cDNA microarrays. As usual, the base 2 logarithmic transformation of the ratios is performed, from which $\hat\sigma^2$ and $\hat\sigma_D^2$ are estimated to be and , respectively. We then compute the common two-sample t test statistic ($t = D/\hat\sigma_D$, with 13 d.f.) and its corresponding raw p-value for each gene. Without multiplicity adjustment, there are 378 genes (out of 3226) whose raw p-values However, the most conservative Bonferroni-adjustment method suggests only 2 rejections at FWER $\le 0.05$, and the BH-FDR procedure declares 15 differentially expressed genes (adjusted p-value $\le 0.05$).

Before applying our procedures to this data set, we first assume $\pi_0 = 0.90$, which, as Ishwaran and Rao (2003) point out, represents a fairly realistic scenario for gene expression data. Then we define the prior distribution $g_1(\xi, \tau^2) = (1/\tau^2) g_2(\xi)$; thus $\tau^2$ is given the usual noninformative prior and $\xi > 0$ is given

\[ g_2(\xi) = \frac{1}{\sqrt{2\pi}} \, \xi^{-3/2} \exp\left( -\frac{1}{2\xi} \right), \tag{6.1} \]

an inverse gamma density $IG(\tfrac{1}{2}, \tfrac{1}{2})$. This prior results in $f(\mu \mid \tau^2) = \mathrm{Cauchy}(0, \tau^2)$ by integrating over $\xi$; see Berger, Boukai, and Wang (1997) for more discussion on the choice of this prior.

The integrations with respect to $\mu$, $\tau^2$ and $\xi$ in the calculations of the AFDR and AFNR under the hierarchical mixture model are carried out using the Monte Carlo integration method. Specifically, we sample $\tau^2$, $\xi$, and $\mu$ from their respective prior distributions and substitute these values into (3.5) and (4.4). The AFDR, AFNR and ABER are then obtained at a given $c$ value by averaging over 5000 iterations. The AFDR and AFNR across $c$-values for the breast cancer data are graphically displayed (Figure 1). As one would expect, the AFDR is decreasing and the AFNR is increasing in $c$. The ABERs were obtained and plotted against $c$ for weights $w$ = 0.5, 0.6, 0.7, 0.8 and 0.9 (Figure 2). Clearly, one can always find a $c$ value that minimizes the ABER for a given $w$. Note that all the ABERs for various $w$'s cross at c = at which the AFDR and AFNR are approximately equal.

We estimate the critical value $c$ that minimizes the ABER and then apply these critical values to the breast cancer data (Table 2). For instance, given $\pi_0 = 0.9$ and $w = 0.9$, there are 28 genes with $|D_i| \ge 1.48$ that are declared differentially expressed between BRCA1 mutation tumors and BRCA2 mutation tumors. The AFDR and AFNR at $c = 1.48$ are and , respectively. The mean differences in gene expression levels between the two cancer types, together with the results of our approach and the BH-FDR procedure, are shown in Figure 3. It is noted that the ABER approach picks up 13 more genes than the BH-FDR method, which is due to the higher power of the ABER approach, as will be shown in the simulation studies of the next section.

Since the ABER minimization results depend on the variance of $D_i$ and the prior probability $\pi_0$, we provide the critical value $c$ and the corresponding ABER for $\hat\sigma_D^2$ from 0.05 to 0.15 and $\pi_0$ = 0.85, 0.90, 0.95 (Table 3). It can be seen that for the breast cancer data, if $\hat\sigma_D^2$ decreases from to 0.08, i.e., the number of expression measurements increases to 10 for each tumor type of each gene, then the critical value $c$ decreases from 1.48 to 1.26, resulting in more rejections and a smaller misclassification rate (ABER drops from to ). Thus, Table 3 can also be used in designing a microarray experiment.

7 Simulations

In this section we compare our proposed AFDR and ABER approaches with the BH-FDR method in terms of average power using simulation studies. Specifically, we study the average power of AFDR, ABER and BH-FDR under the setup of the hierarchical model, and then investigate the influence of outliers and prior parameters on the performance of the methods, i.e., the robustness of the procedures to influential data points and prior information.

7.1 Power Comparison

The average power is defined in the frequentist context as the expected proportion of false null hypotheses that are correctly rejected and has been widely used in comparing multiple testing procedures (Benjamini and Hochberg 1995; Shaffer 1999; Benjamini and Liu 1999; Storey 2002). The setup of this simulation for average power comparison is as follows:

\[ k = 10, 50, 100, 500, 1000; \quad \sigma_D^2 = 0.1, 0.2, 0.3; \quad \pi_0 = 0.2, 0.5, 0.8, \]

and a simulation is conducted for each combination of $k$, $\sigma_D^2$ and $\pi_0$. Under the null hypothesis, i.e., $\theta_i = 0$, a data point is drawn from $N(0, \sigma_D^2)$, and under the alternative hypothesis, i.e., $\theta_i \neq 0$, a data point is drawn from $N(\theta_i, \sigma_D^2)$, where $\theta_i$ is an independent draw from the $N(\mu, \tau^2)$ distribution and $\mu$ follows the $\mathrm{Cauchy}(0, \tau^2)$ distribution given $\tau^2$. A total of $k\pi_0$ null data points and $k(1 - \pi_0)$ alternative data points are randomly drawn for each set of $k$ hypotheses. The average powers are obtained over 5000 simulations and plotted against $k$ for AFDR $\le 0.05$, ABER with $w$ = 0.5, 0.7, 0.9 and BH-FDR $\le 0.05$ (Figure 4). It is clear that, as one would expect, the average power for all methods decreases with the increase in the number of hypotheses $k$, the variance of observed data $\sigma_D^2$, and the proportion of true null hypotheses $\pi_0$. The main point of the plot, however, is that our proposed approach, either AFDR $\le 0.05$ or minimum ABER with different $w$'s, is more powerful than the BH-FDR method. This is due to the fact that the BH-FDR procedure assigns the same FDR to all rejected hypotheses, i.e., the FDR is the same for all hypotheses with test statistics in the rejection region. On the other hand, the AFDR or ABER is the weighted average of the BH-FDR (and FNR) with the prior density of the parameter as the weight; consequently, a rejected null hypothesis with a more extreme test statistic is more likely to be given less weight according to the prior density. Therefore, by controlling the error rate at the same level, the AFDR or ABER approach would result in more rejections and hence is more powerful than the BH-FDR method.

7.2 Robustness

We study the robustness of the AFDR, ABER and BH-FDR approaches to some influential data points and to various choices of prior information. To simplify the investigation, we first fix $k = 1000$, $\sigma_D^2 = 0.1$, $\pi_0 = 0.80$ and let the prior for $\xi$ vary. The prior for $\tau^2$ is still the conventional non-informative $1/\tau^2$, as it is not only computationally convenient but also practically indistinguishable from a subjective prior when there are sufficient data (Berger and Deely 1988). Towards this goal, we first generate influential or outlying data points,

\[ D_i \sim \delta N(\theta_i, \sigma_D^2) + (1 - \delta) N(\theta_i^{*}, \sigma_D^2), \quad i = 1, \ldots, k, \tag{7.1} \]

where $\delta$ is a random binary variable with $P(\delta = 1) = \gamma$ and $\theta_i^{*}$ follows the $N(\mu, c\tau^2)$ distribution with some pre-specified variance inflation factor $c > 1$ (not to be confused with the threshold). The hyper-prior for $\xi$ is chosen as an inverse gamma density $IG(\tfrac{1}{2}, \beta)$ with $\beta$ being specified below. Note that $IG(\xi; \tfrac{1}{2}, \beta)$ leads to $f(\mu \mid \tau^2) = \mathrm{Cauchy}(0, 2\tau^2/\beta)$. The following setup is considered for the robustness simulation:

\[ \gamma = 0.90, 0.95, 0.99; \quad c = 2, 4, 8; \quad \beta = 0.25, 0.5, 1. \]

As in the previous subsection, we consider all configurations of $\gamma$, $c$ and $\beta$. The resulting average powers are obtained from 5000 simulations and plotted against $\beta$ for each combination of $\gamma$ and $c$ (Figure 5). Notice that a large value of $\beta$ makes the $IG(\tfrac{1}{2}, \beta)$ density skew to the left, leading to an inflated variance of $\theta$ and, consequently, a decrease in average power, which is seen for all of the approaches. However, the average power for the AFDR and ABER approaches is relatively flat; hence these approaches are more robust to the choice of the prior for $\xi$ than the BH-FDR method. Also, there is no practical impact of influential data points on the average power for any of the procedures, which is due to the fact that all of the influential data points are generated from the alternative population and thus are more likely to be rejected.
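The threshold search used in Sections 6 and 7 (Monte Carlo averaging of (3.5) and (4.4) over the priors, followed by minimization of the ABER over $c$) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation. In particular, the improper prior $1/\tau^2$ cannot be sampled directly, so the sketch treats $\tau^2$ as a fixed, user-supplied value (`tau2`, hypothetical). It draws $\xi$ from $IG(\tfrac{1}{2}, \tfrac{1}{2})$ as the reciprocal of a squared standard normal and then $\mu \mid \xi, \tau^2 \sim N(0, \xi\tau^2)$. The function names, the grid and the value `sigma2_D = 0.1` are likewise illustrative.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.

def conditional_rates(c, k, pi0, sigma2_D, mu, tau2):
    """Return (AFDR, AFNR) for given mu and tau2, following (3.5) and (4.4)."""
    s = math.sqrt(sigma2_D + tau2)
    tail0 = 2.0 * (1.0 - PHI(c / math.sqrt(sigma2_D)))       # P(|D_1| >= c | theta_1 = 0)
    alt = (1.0 - pi0) * (PHI((c - mu) / s) - PHI((-c - mu) / s))
    nu = pi0 * (1.0 - tail0) + alt                           # P(|D_1| < c)
    geom = k if nu >= 1.0 else (1.0 - nu ** k) / (1.0 - nu)  # sum of nu^j, j = 0..k-1
    afdr = pi0 * tail0 * geom
    afnr = 0.0 if nu <= 0.0 else (1.0 - (1.0 - nu) ** k) * alt / nu
    return afdr, afnr

def aber_curve(c_grid, w, k, pi0, sigma2_D, tau2, n_draws=5000, seed=0):
    """Monte Carlo estimate of ABER(c) = w * AFDR + (1 - w) * AFNR on a grid of c."""
    rng = random.Random(seed)
    mus = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        xi = 1.0 / max(z * z, 1e-300)                        # xi ~ IG(1/2, 1/2)
        mus.append(rng.gauss(0.0, math.sqrt(xi * tau2)))     # mu | xi, tau2 ~ N(0, xi * tau2)
    curve = []
    for c in c_grid:
        # The same prior draws are reused for every c, which keeps the curve smooth.
        rates = [conditional_rates(c, k, pi0, sigma2_D, mu, tau2) for mu in mus]
        afdr = sum(r[0] for r in rates) / n_draws
        afnr = sum(r[1] for r in rates) / n_draws
        curve.append(w * afdr + (1.0 - w) * afnr)
    return curve

# Hypothetical usage: grid search for the ABER-minimizing threshold.
c_grid = [0.5 + 0.05 * i for i in range(41)]                 # c from 0.5 to 2.5
curve = aber_curve(c_grid, w=0.9, k=3226, pi0=0.9, sigma2_D=0.1, tau2=1.0, n_draws=1000)
c_star = c_grid[curve.index(min(curve))]
print("threshold minimizing the estimated ABER:", c_star)
```

A grid search is used rather than a numerical optimizer because the estimated curve is cheap to evaluate and the location of the minimum is easy to inspect. The resulting threshold depends on the assumed `tau2` and `sigma2_D`, so it should not be compared directly with the values reported for the breast cancer data.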

8 Discussion

Although we have illustrated in this article our Bayesian approaches to identifying differentially expressed genes from a microarray experiment, they can also be applied to other multiple testing situations.

In an attempt to come up with a Bayesian measure of Type I error rate, we have started with the proportion of Type I errors among the total number of rejections, i.e., the proportion of false discoveries, before averaging it with respect to the distributions of data and prior. There is, however, another way one can measure Type I error rate, by averaging the proportion of Type I errors among the hypotheses that are true over data and parameters. In other words, one might consider the following alternative measure of Type I error rate, which we call the Bayesian False Positive Rate (BFPR):

\[ \mathrm{BFPR} = E_{\theta}\left[ E_{D \mid \theta}\left( \frac{V}{k_0 \vee 1} \right) \right]. \tag{8.1} \]

It seems that a Bayesian would prefer (8.1) to the AFDR as a measure of Type I error rate, since it is based on the ratio measuring how many of the null hypotheses believed to be true are rejected by the data. Similarly, the Bayesian False Negative Rate (BFNR), defined as

\[ \mathrm{BFNR} = E_{\theta}\left[ E_{D \mid \theta}\left( \frac{T}{k_1 \vee 1} \right) \right], \tag{8.2} \]

seems to be a more appropriate measure of Type II error rate to a Bayesian than the AFNR. If we use the hierarchical mixture model considered in Section 2 with the same definitions of parameters (i.e., $\pi_0 = 0.9$ and $w = 0.9$), then a linear combination of the BFPR and BFNR is minimized at $c = 1.24$, resulting in 54 rejections.

Our simulations have shown that the AFDR or ABER approach is not only more powerful, but also more robust than the BH-FDR method to some influential data points and to the choice of prior density. Hence it is advantageous to apply the proposed approach in controlling errors when some prior information is available.

Acknowledgements

We would like to thank A. Lawrence Gould of Merck Research Laboratories for practical suggestions on the simulations, and the Editor and the anonymous referee for constructive comments that have greatly improved the presentation of this paper.

References

Afshari, C. A., Nuwaysir, E. F., and Barrett, J. C. (1999). Application of complementary DNA microarray technology to carcinogen identification, toxicology, and drug safety evaluation. Cancer Research 59,

Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17,

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57,

Benjamini, Y. and Liu, W. (1999). A step-down multiple hypothesis testing procedure that controls the false discovery rate under independence. Journal of Statistical Planning and Inference 82,

Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag.

Berger, J. O., Boukai, B., and Wang, Y. (1997). Unified frequentist and Bayesian testing of a precise hypothesis. Statistical Science 12,

Berger, J. O. and Deely, J. (1988). A Bayesian approach to ranking and selection of related means with alternatives to analysis of variance methodology. Journal of the American Statistical Association 83,

Broët, P., Richardson, S., and Radvanyi, F. (2002). Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. Journal of Computational Biology 9,

Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P., and Rubin, E. M. (2000). Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research 10,

Chen, Y. J., Kodell, R., Sistare, F., Thompson, K. L., Morris, S., and Chen, J. J. (2003). Normalization methods for analysis of microarray gene-expression data. Journal of Biopharmaceutical Statistics 13,

DasGupta, S. and Sarkar, S. K. (1984). On TP2 and log-concavity. In Y. L. Tong (Ed.), Inequalities in Statistics and Probability, pp. Institute of Mathematical Statistics.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18,

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96,

Finkelstein, D. B., Gollub, J., and Cherry, J. M. (2001). Normalization and systematic measurement error in cDNA microarray data. In ASA Proceedings of the Joint Statistical Meetings.

Ge, Y., Dudoit, S., and Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis. Test 12,

Genovese, C. and Wasserman, L. (2002a). Bayesian and frequentist multiple testing. Technical Report, Carnegie Mellon University.

Genovese, C. and Wasserman, L. (2002b). Operating characteristics and extensions of the false discovery rate procedure. Journal of the Royal Statistical Society, Series B 64.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine 344.

Ibrahim, J. G., Chen, M. H., and Gray, R. J. (2002). Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association 97.

Ishwaran, H. and Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. Journal of the American Statistical Association 98 (462).

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2003). Detecting differential gene expression with a semiparametric hierarchical mixture method. Technical Report #1074, Department of Statistics, University of Wisconsin-Madison.

Nuwaysir, E. F., Bittner, M., Trent, J., Barrett, J. C., and Afshari, C. A. (1999). Microarrays and toxicology: The advent of toxicogenomics. Molecular Carcinogenesis 24.

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19.

Sarkar, S. K. (2005). False discovery and false non-discovery rates in single-step multiple testing procedures. Annals of Statistics, to appear.

Schervish, M. J. (1995). Theory of Statistics. New York: Springer-Verlag.

Shaffer, J. P. (1999). A semi-Bayesian study of Duncan's Bayesian multiple comparison procedures. Journal of Statistical Planning and Inference 82.

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64.

Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics 31.

Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001, January). Normalization for cDNA microarray data. Technical Report 589, Department of Statistics, UC Berkeley.

[Figure 1 plot: AFDR and AFNR curves against the critical value c]

Figure 1: The AFDR and AFNR as functions of critical value c for the breast cancer data with π0 =

[Figure 2 plot: ABER curve against the critical value c]

Figure 2: The ABER as a function of critical value c for the breast cancer data with π0 =

[Figure 3 plot: mean difference d_i (vertical axis) against gene (horizontal axis), with points labelled A or B]

Figure 3: Detection of differentially expressed genes for the breast cancer data with π0 = 0.90 and w = 0.9. B: declared differentially expressed by both the ABER and BH FDR methods; A: declared by the ABER method but not by the BH FDR method; remaining genes: not declared by either method. The two dashed horizontal lines represent the critical values c = ±1.48.

Table 1: Outcomes of k Hypothesis Tests

True State                Accepted   Rejected   Total
Null Hypotheses           U          V          k0
Alternative Hypotheses    T          S          k1
Total                     A          R          k
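The counts in Table 1 can be turned into the realized error proportions that underlie the FDR and FNR. The following is a minimal sketch, assuming the conventional definitions V / max(R, 1) for the false discovery proportion and T / max(A, 1) for the false non-discovery proportion; the function name and the example counts are hypothetical and are not taken from the paper.

```python
def error_proportions(U, V, T, S):
    """Return (FDP, FNP) for one realized outcome table.

    U: true nulls accepted, V: true nulls rejected,
    T: false nulls accepted, S: false nulls rejected
    (notation of Table 1).
    """
    A = U + T  # total number of acceptances
    R = V + S  # total number of rejections
    # max(., 1) makes the proportion zero when nothing is rejected/accepted
    fdp = V / max(R, 1)
    fnp = T / max(A, 1)
    return fdp, fnp


# Hypothetical counts for k = 1000 tests (illustration only)
fdp, fnp = error_proportions(U=880, V=20, T=40, S=60)
print(f"FDP = {fdp:.4f}, FNP = {fnp:.4f}")
```

Averaging such proportions over repeated data sets gives frequentist FDR and FNR estimates; averaging further over draws of the parameters from a prior would approximate the averaged quantities discussed in this paper, although that step is not shown here.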

Table 2: Critical value c at which the ABER is minimized, the corresponding estimated AFDR and AFNR, and the number of rejections for the breast cancer data with π0 = 0.9

w    c    AFDR    AFNR    ABER    No. rejections

Table 3: Critical value c and the corresponding ABER (in parentheses) for various combinations of π0 and σ̂_D²

π0    σ̂_D²    w = 0.5        w = 0.6        w = 0.7        w = 0.8        w =
                   (0.0158)  0.84 (0.0129)  0.87 (0.0099)  0.90 (0.0068)  0.95 (0.0035)
                   (0.0170)  0.92 (0.0139)  0.95 (0.0107)  0.99 (0.0073)  1.04 (0.0038)
                   (0.0181)  0.99 (0.0149)  1.03 (0.0114)  1.06 (0.0078)  1.12 (0.0041)
                   (0.0192)  1.06 (0.0157)  1.10 (0.0121)  1.14 (0.0083)  1.20 (0.0043)
                   (0.0201)  1.12 (0.0165)  1.16 (0.0127)  1.21 (0.0087)  1.27 (0.0045)
                   (0.0210)  1.19 (0.0172)  1.22 (0.0132)  1.27 (0.0090)  1.34 (0.0047)
                   (0.0218)  1.24 (0.0179)  1.29 (0.0137)  1.33 (0.0094)  1.41 (0.0049)
                   (0.0226)  1.30 (0.0185)  1.34 (0.0142)  1.40 (0.0097)  1.47 (0.0050)
                   (0.0233)  1.35 (0.0191)  1.40 (0.0146)  1.45 (0.0100)  1.53 (0.0052)
                   (0.0240)  1.40 (0.0196)  1.45 (0.0151)  1.51 (0.0103)  1.59 (0.0053)
                   (0.0247)  1.46 (0.0202)  1.51 (0.0155)  1.57 (0.0106)  1.65 (0.0055)

                   (0.0107)  0.90 (0.0087)  0.92 (0.0067)  0.95 (0.0046)  0.99 (0.0024)
                   (0.0116)  0.98 (0.0094)  1.01 (0.0072)  1.04 (0.0050)  1.09 (0.0026)
                   (0.0123)  1.06 (0.0101)  1.09 (0.0077)  1.12 (0.0053)  1.18 (0.0027)
                   (0.0131)  1.13 (0.0107)  1.16 (0.0082)  1.20 (0.0056)  1.26 (0.0029)
                   (0.0137)  1.20 (0.0112)  1.23 (0.0086)  1.27 (0.0059)  1.33 (0.0030)
                   (0.0143)  1.26 (0.0117)  1.30 (0.0090)  1.34 (0.0061)  1.41 (0.0032)
                   (0.0149)  1.32 (0.0122)  1.36 (0.0093)  1.41 (0.0064)  1.48 (0.0033)
                   (0.0154)  1.38 (0.0126)  1.43 (0.0096)  1.47 (0.0066)  1.55 (0.0034)
                   (0.0159)  1.44 (0.0130)  1.49 (0.0100)  1.54 (0.0068)  1.61 (0.0035)
                   (0.0164)  1.50 (0.0134)  1.54 (0.0103)  1.60 (0.0070)  1.67 (0.0036)
                   (0.0169)  1.55 (0.0138)  1.60 (0.0105)  1.66 (0.0072)  1.74 (0.0037)

                   (0.0056)  0.98 (0.0045)  1.00 (0.0035)  1.03 (0.0024)  1.07 (0.0012)
                   (0.0060)  1.07 (0.0049)  1.09 (0.0038)  1.12 (0.0026)  1.17 (0.0013)
                   (0.0065)  1.15 (0.0053)  1.18 (0.0040)  1.21 (0.0027)  1.26 (0.0014)
                   (0.0068)  1.23 (0.0056)  1.26 (0.0043)  1.30 (0.0029)  1.35 (0.0015)
                   (0.0072)  1.31 (0.0058)  1.34 (0.0045)  1.38 (0.0030)  1.43 (0.0016)
                   (0.0075)  1.38 (0.0061)  1.41 (0.0047)  1.45 (0.0032)  1.51 (0.0016)
                   (0.0078)  1.45 (0.0064)  1.48 (0.0049)  1.53 (0.0033)  1.59 (0.0017)
                   (0.0081)  1.51 (0.0066)  1.55 (0.0050)  1.60 (0.0034)  1.66 (0.0018)
                   (0.0084)  1.58 (0.0068)  1.62 (0.0052)  1.67 (0.0035)  1.73 (0.0018)
                   (0.0086)  1.64 (0.0070)  1.68 (0.0053)  1.73 (0.0036)  1.80 (0.0019)
                   (0.0088)  1.70 (0.0072)  1.74 (0.0055)  1.80 (0.0037)  1.87 (0.0019)

[Figure 4 plot: average power (%) against the number of hypotheses tested, in panels by π0 and σ_D²; curves for AFDR, ABER with w = 0.5, 0.7 and 0.9, and BH-FDR]

Figure 4: Average power (the proportion of false null hypotheses that are correctly rejected) for the ABER with different weights w (= 0.5, 0.7, 0.9), the AFDR at level 0.05 and the BH-FDR at level 0.05, at various combinations of π0 and σ_D².

[Figure 5 plot: average power (%) against β, in panels by γ and c'; curves for AFDR, ABER with w = 0.5, 0.7 and 0.9, and BH-FDR]

Figure 5: Average power for the ABER with different weights w (= 0.5, 0.7, 0.9), the AFDR at level 0.05 and the BH-FDR at level 0.05, at various combinations of the choice of β, γ and c.
