Bayesian Determination of Threshold for Identifying Differentially Expressed Genes in Microarray Experiments

Jie Chen (1)
Merck Research Laboratories, P. O. Box 4, BL3-2, West Point, PA 19486, U.S.A. Telephone: (484)

Sanat K. Sarkar (2)
Department of Statistics, Temple University, Philadelphia, PA 19422, U.S.A. Telephone: (215)

Short title: Bayesian Threshold for Differential Gene Expression

(1) jie chen@merck.com
(2) sanat@surfer.sbm.temple.edu. The research is supported by NSF Grant DMS

Abstract

The original definition of the false discovery rate (FDR) can be understood as the frequentist risk of false rejections conditional on the unknown parameter, while the Bayesian posterior FDR is conditioned on the data, a particular realization of an experiment. From a Bayesian point of view, it seems natural to take into account the uncertainty in both the parameter and the data. In this spirit, we propose the Average FDR (AFDR) and Average FNR (AFNR) approaches, in which the frequentist risks of false rejections and false non-rejections are averaged with respect to a prior distribution of the parameter. A linear combination of the AFDR and AFNR, called the Average Bayes Error Rate (ABER), is considered as an overall risk. Some useful formulas for the AFDR, AFNR and ABER are developed for normal samples with hierarchical mixture priors. The idea of finding threshold values that minimize the ABER is illustrated using a gene expression data set. Simulation studies show that the proposed approaches are more powerful and relatively more robust than the widely used FDR method.

Keywords: Average false discovery rate, Average false non-discovery rate, Average Bayes error rate, Hierarchical mixture model, Microarray experiment.

1 Introduction

The emergence of DNA microarray technology allows for the study of the sequence, structure and expression of thousands of genes simultaneously. Microarrays are being used increasingly in a wide variety of areas, ranging from toxicological research (toxicogenomics), gene discovery and disease diagnosis to drug discovery (pharmacogenomics) (Nuwaysir, Bittner, Trent, Barrett, and Afshari 1999; Afshari, Nuwaysir, and Barrett 1999; Callow, Dudoit, Gong, Speed, and Rubin 2000). A typical DNA microarray generates a massive amount of data concerning gene regulations and interactions. A natural question that arises in such a study is whether differential expression of a gene is associated with a certain condition, such as the tumor type of a breast cancer. This question is commonly posed as a multiple testing problem, with the null hypothesis for each gene representing no association of expression level with the condition.

With a large number of hypothesis tests performed simultaneously, the probability of misidentifying a gene as differentially expressed when it is not can increase sharply. The traditional concept of familywise error rate (FWER) is too restrictive to adopt in such a multiple testing situation. Instead, the false discovery rate (FDR) of Benjamini and Hochberg (1995) (BH-FDR hereafter) and related measures seem more appropriate, as they offer less stringent criteria and thus provide more powerful methods for dealing with this magnitude of multiplicity. For an overview of multiple hypothesis testing in gene expression analysis, the reader is referred to Dudoit, Shaffer, and Boldrick (2003), Reiner, Yekutieli, and Benjamini (2003), and Ge, Dudoit, and Speed (2003).

Suppose we have $k$ genes, with the $i$th gene having the test statistic or differential expression measurement $D_i$, which has a probability distribution depending on an unknown parameter $\theta_i$, $i = 1, \ldots, k$. Let $D = \{D_1, \ldots, D_k\}$ and $\theta = \{\theta_1, \ldots, \theta_k\}$. The underlying hypotheses of interest are $H_i: \theta_i = \theta_0$ against $\theta_i \neq \theta_0$, $i = 1, \ldots, k$, for some known $\theta_0$. Decisions on these $k$ hypotheses, that is, rejection or acceptance of the null hypotheses, are usually based on the magnitudes of the corresponding test statistics $D_i$, $i = 1, \ldots, k$. Table 1 shows all possible outcomes for the $k$ hypothesis tests. The FDR is defined as the expected proportion of false rejections among the set of rejected hypotheses, i.e., $FDR = E\{V/(R \vee 1)\}$, where $V$ is the number of false rejections and $R$ is the total number of rejections.

Storey (2002, 2003) introduces a modified version of the FDR, called the positive false discovery rate (pFDR), defined as $pFDR = E[V/R \mid R > 0]$, and argues that the pFDR is a more appropriate and useful error measure, as the probability of at least one rejection can be less than 1. Storey (2003) showed that the pFDR can be written as a Bayesian posterior probability, which is asymptotically true under fairly general conditions. An analog of the FDR in terms of false non-rejections, called the false non-discovery rate (FNR), is introduced by Genovese and Wasserman (2002b). It is the expected proportion of false non-rejections among the set of non-rejected hypotheses, i.e., $FNR = E\{T/(A \vee 1)\}$, where $T$ is the number of false non-rejections and $A$ is the total number of non-rejections.

Sarkar (2003) developed some results on the FDR and FNR in single-step procedures for dependent test statistics, under both a model where the number of true null hypotheses is assumed fixed and a mixture model where different configurations of true and false null hypotheses are assumed to have certain probabilities. He extended some previously known results developed under the assumption of independence and explained precisely how an FDR- or FNR-controlling single-step procedure, such as the Bonferroni or Šidák procedure, can potentially be improved using an estimate of $k_0$.

Multiple testing can be viewed as a decision making process through which one rejects or retains the null hypotheses by controlling some error or risk. A frequentist measures this error or risk by considering a loss function and averaging it over different possible realizations of the data $D$ conditional on the parameter $\theta$; a Bayesian, on the other hand, defines this risk by averaging the loss over different possible states of nature conditional on $D$. For instance, the FDR and FNR can be understood as frequentist risks of false rejections and false non-rejections, respectively. Genovese and Wasserman (2002a) gave the first fully Bayesian exposition of the false discovery rate and introduced the Bayesian posterior FDR (PFDR), defined as $PFDR = E_{\theta \mid D}[V/(R \vee 1)]$, assuming a certain prior distribution of $\theta$. They also introduced the posterior FNR (PFNR), defined as $PFNR = E_{\theta \mid D}[T/(A \vee 1)]$.
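The ratios inside these expectations can be made concrete with a small sketch. It is only an illustration: the function name and the boolean-list inputs are assumptions introduced here, and the truth labels would of course be known only in a simulation. For a single realization it computes the proportions $V/(R \vee 1)$ and $T/(A \vee 1)$, whose expectations define the FDR and FNR.

```python
def error_proportions(is_null, rejected):
    """Return (V / max(R, 1), T / max(A, 1)) for one set of k decisions.

    is_null[i]  -- True if the i-th null hypothesis is actually true
    rejected[i] -- True if the i-th null hypothesis was rejected
    """
    pairs = list(zip(is_null, rejected))
    V = sum(1 for null, rej in pairs if null and rej)          # false rejections
    R = sum(1 for _, rej in pairs if rej)                      # all rejections
    T = sum(1 for null, rej in pairs if not null and not rej)  # false non-rejections
    A = sum(1 for _, rej in pairs if not rej)                  # all non-rejections
    # max(., 1) plays the role of the "v 1" in the definitions: the
    # proportion is taken to be 0 when the denominator count is 0.
    return V / max(R, 1), T / max(A, 1)


fdp, fnp = error_proportions(
    is_null=[True, True, False, False],
    rejected=[True, False, True, False],
)
```

Averaging the first component over repeated data sets with $\theta$ held fixed approximates the frequentist FDR; averaging over draws of $\theta$ as well is the idea behind the average measures proposed in this paper.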

In this article, we address the multiple testing problem in a microarray experiment by determining a threshold or critical value for each gene from a Bayesian perspective. It seems natural that, when a Bayesian approach is taken, one should take into account the uncertainty in both parameter and data in determining risks. In this spirit, we propose the Average FDR (AFDR) approach, in which the frequentist risk of false rejections is averaged with respect to the prior distribution of $\theta$. This can also be seen as the expected Bayesian posterior risk of false rejections with respect to the marginal distribution of the data $D$. The AFDR approach provides an alternate view on controlling an error rate involving false positives and hence is useful in multiple testing problems, including those arising in gene expression data analysis. An analog of the AFDR in terms of false non-rejections, called the Average FNR (AFNR), is also developed to control an error rate involving false negatives from a Bayesian viewpoint. An overall Bayes risk is defined as a linear combination of the AFDR and AFNR. We call this the Average Bayes Error Rate (ABER), and propose to determine the threshold value of the test statistics by minimizing the ABER. Our simulation studies show that the proposed approaches are more powerful (in terms of average power) than the BH-FDR method and relatively more robust to influential data points as well as to the choice of priors.

The paper is organized as follows. In Section 2, we describe a hierarchical mixture model that will be used in identifying differentially expressed genes between two sample types. The concept of the AFDR is formally defined in Section 3, and some formulas for this measure in a single-step procedure are developed. Section 4 is devoted to the similar development of the AFNR. The ABER is described in Section 5. In Section 6, we apply our AFDR, AFNR and ABER approaches to a breast cancer data set and illustrate how the threshold is determined for detection of differentially expressed genes in terms of minimum ABER. Section 7 is devoted to simulation studies investigating the average power of AFDR, ABER and BH-FDR, as well as the robustness of these methods to influential data points and to the choice of priors. Some other possible alternatives for defining false positives and false negatives are discussed in Section 8.

2 Hierarchical Mixture Model for Gene Expressions

We present in this section a hierarchical mixture model for identifying differentially expressed genes. Mixture model approaches have been taken by Efron, Tibshirani, Storey, and Tusher (2001), Storey (2003) and Sarkar (2003); we extend these approaches by using hyper-prior distributions, on which our proposed procedure is based. We assume that the microarray data have been pre-processed or normalized to adjust for any bias and systematic variation other than the factor under consideration, and are ready for statistical analysis of significance. For discussions on pre-treatment of microarray data, the reader is referred to Finkelstein, Gollub, and Cherry (2001), Yang, Dudoit, Luu, and Speed (2001), and Chen, Kodell, Sistare, Thompson, Morris, and Chen (2003).

We restrict attention to two-sample comparisons, i.e., comparisons of gene expression levels from two different types of samples (treatment vs control, disease vs non-disease, two different tumor types, etc.). Let $X_{ijl}$ be the $l$th (normalized) expression measurement for the $i$th gene from the $j$th type of sample, and let $X_{ijl}$ have a $N(\eta_{ij}, \sigma^2)$ distribution, $l = 1, \ldots, n_{ij}$, $j = 1, 2$ and $i = 1, \ldots, k$. One often assigns a prior distribution to $\sigma^2$. However, for simplicity we use the unbiased estimator $\hat\sigma^2$ of $\sigma^2$, from which the variance of the difference in average expression levels between the two sample types can be easily derived. The unbiased estimator $\hat\sigma^2$ of $\sigma^2$ is simply

$$\hat\sigma^2 = \frac{1}{n - 2k} \sum_{i=1}^{k} \sum_{j=1}^{2} \sum_{l=1}^{n_{ij}} \left( X_{ijl} - \bar X_{ij} \right)^2, \tag{2.1}$$

where $\bar X_{ij} = \sum_{l=1}^{n_{ij}} X_{ijl} / n_{ij}$ and $n = \sum_{i=1}^{k} \sum_{j=1}^{2} n_{ij}$. Let $D_i = \bar X_{i1} - \bar X_{i2}$ and $\theta_i = \eta_{i1} - \eta_{i2}$ be, respectively, the differences of the sample and population means of expression levels for gene $i$ between the two sample types, $i = 1, \ldots, k$. Suppose $n_{i1} = n_1$ and $n_{i2} = n_2$ for all $i = 1, \ldots, k$, i.e., the sample sizes within a sample type are the same for all of the genes. This gives the estimated variance of $D_i$ as $\hat\sigma_D^2 = \hat\sigma^2 (1/n_1 + 1/n_2)$.

The problem of identifying differentially expressed genes that are associated with a certain condition is typically approached by simultaneously testing the null hypotheses $H_i: \theta_i = 0$ against the complementary alternatives $\theta_i \neq 0$, $i = 1, \ldots, k$. Towards this goal, we assume the following hierarchical mixture model:

$$D_i \mid \theta_i \sim N\left(\theta_i, \hat\sigma_D^2\right), \quad i = 1, \ldots, k;$$
$$\theta_i \mid \mu, \tau^2 \sim \pi_0\, I(\theta_i = 0) + (1 - \pi_0)\, N\left(\mu, \tau^2\right) I(\theta_i \neq 0), \quad i = 1, \ldots, k;$$
$$\mu \mid \xi, \tau^2 \sim N\left(0, \xi\tau^2\right),$$

where $\pi_0$ is the prior probability of the null hypothesis being true and $(\xi, \tau^2)$ may follow some distribution $g_1(\xi, \tau^2)$ or may be assigned arbitrary values. That is, the $D_i$'s are conditionally independent given the $\theta_i$'s, which are in turn conditionally independent given $\mu$, $\tau^2$ and $\xi$. Under this distributional setup, it can be shown that the marginal distribution of the data $D = \{D_1, \ldots, D_k\}$, conditional on $\tau^2$ and $\xi$, is given by

$$D \sim N_k\left(0, \psi(\tau, \xi)\right), \tag{2.2}$$

where

$$\psi(\tau, \xi) = \begin{pmatrix} v(\xi, \tau^2) & c(\xi, \tau^2) & \cdots & c(\xi, \tau^2) \\ c(\xi, \tau^2) & v(\xi, \tau^2) & \cdots & c(\xi, \tau^2) \\ \vdots & \vdots & \ddots & \vdots \\ c(\xi, \tau^2) & c(\xi, \tau^2) & \cdots & v(\xi, \tau^2) \end{pmatrix},$$

$v(\xi, \tau^2) = \hat\sigma_D^2 + (1 - \pi_0)^2 (1 + \xi)\tau^2$ and $c(\xi, \tau^2) = (1 - \pi_0)^2 \xi\tau^2$. Berger (1985) and Schervish (1995) provide a detailed proof for normal hierarchical models with a similar structure. The property of positive correlation among the components of $D$ in (2.2) is useful in establishing some characteristics of the AFDR and AFNR, as will be seen in Sections 3 and 4.

Some recent articles adopt Bayesian hierarchical mixture model approaches to identifying differentially expressed genes from microarray experiments (Baldi and Long 2001; Broët, Richardson, and Radvanyi 2002; Ibrahim, Chen, and Gray 2002; Ishwaran and Rao 2003). These procedures, however, are based solely on the posterior distribution of the parameter, and the FDRs so derived are Bayesian posterior FDRs conditional on the data (Ishwaran and Rao 2003; Newton, Noueiry, Sarkar, and Ahlquist 2003). To account for the uncertainty in both parameter and data, the concepts of the AFDR and AFNR are introduced, and some useful formulas under the above hierarchical mixture model are developed, in the next two sections.

3 Average False Discovery Rate

We define the AFDR in a similar way as in Benjamini and Hochberg (1995), except that an additional expectation is taken with respect to a prior distribution of the parameter.

Definition 1. The Average False Discovery Rate (AFDR) among the set of rejected hypotheses is defined to be

$$AFDR = E_\theta\left[ E_{D \mid \theta}\left( \frac{V}{R \vee 1} \right) \right]. \tag{3.1}$$

This quantity is the Bayes risk of false rejections among the set of rejected hypotheses. Reversing the order of integration in (3.1), we obtain the alternate form

$$AFDR = E_D\left[ E_{\theta \mid D}\left( \frac{V}{R \vee 1} \right) \right], \tag{3.2}$$

which is the expected posterior risk with respect to the marginal distribution of the data $D$. Therefore, $AFDR = E_D(PFDR)$.

Suppose that a large absolute value of $D_i$ compared to a threshold value $c$ leads to the rejection of $H_i$, identifying the corresponding gene as either under- or over-expressed. Let $|D|^{(-i)}_{1:k-1} \le \cdots \le |D|^{(-i)}_{k-1:k-1}$ be the ordered components of $\{|D_j| : j \in J^{(-i)}\}$, with $J^{(-i)} = J - \{i\}$ and $J = \{1, \ldots, k\}$. Define $|D|_{0:k} = -\infty$ and $|D|_{k+1:k} = \infty$. Then the AFDR of this procedure can be written as

$$AFDR = \frac{1}{k} \sum_{i=1}^{k} P\{|D_i| \ge c,\ \theta_i = 0\} \left[ 1 + k \sum_{j=1}^{k-1} \frac{P\left\{ |D|^{(-i)}_{j:k-1} < c \;\middle|\; |D_i| \ge c,\ \theta_i = 0 \right\}}{(k - j)(k - j + 1)} \right]. \tag{3.3}$$

If $D$ is multivariate normal with non-negative correlations, as is the case under the above hierarchical mixture model, then the AFDR is no more than the right hand side of (3.3); see Sarkar (2003). If $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid, (3.3) reduces to

$$AFDR = P\{R > 0\}\, P\{\theta_1 = 0 \mid |D_1| \ge c\}; \tag{3.4}$$

see also Storey (2003). Under the hierarchical mixture model, notice that $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid conditional on $\mu$, $\xi$ and $\tau^2$. Therefore, conditional on $\mu$, $\xi$ and $\tau^2$, the AFDR can be written as

$$AFDR = \left[ 1 - \nu^k \right] \frac{2\pi_0 \left[ 1 - \Phi(c_0) \right]}{1 - \nu} = 2\pi_0 \left[ 1 - \Phi(c_0) \right] \sum_{j=0}^{k-1} \nu^j, \tag{3.5}$$

where $\Phi$ is the c.d.f. of the standard normal distribution,

$$\nu = \pi_0 \left[ 2\Phi(c_0) - 1 \right] + (1 - \pi_0) \left[ \Phi(c_2) - \Phi(c_1) \right],$$

$$c_0 = \frac{c}{\hat\sigma_D}, \qquad c_1 = \frac{-c - \mu}{\sqrt{\hat\sigma_D^2 + \tau^2}}, \qquad \text{and} \qquad c_2 = \frac{c - \mu}{\sqrt{\hat\sigma_D^2 + \tau^2}}.$$

Here $\nu = P\{|D_1| < c\}$ is the probability that a single hypothesis is not rejected. The AFDR is the integral of (3.5) with respect to $\mu$, $\xi$ and $\tau^2$. It is a nonincreasing function of $c$. This can be seen from the following two results, which hold conditionally on $\mu$, $\xi$ and $\tau^2$. First, $P\{R > 0\} = 1 - [P\{|D_1| < c\}]^k$ is nonincreasing in $c$. Second,

$$P\{\theta_1 = 0 \mid |D_1| \ge c\} = \left[ 1 + \frac{1 - \pi_0}{\pi_0} \int \frac{P\{|D_1| \ge c \mid \theta_1\}}{P\{|D_1| \ge c \mid \theta_1 = 0\}}\, \phi(\theta_1; \mu, \tau^2)\, d\theta_1 \right]^{-1} = \left[ 1 + \frac{1 - \pi_0}{\pi_0} \int \frac{P\{\chi_1^2(\lambda) > c_0^2\}}{P\{\chi_1^2 > c_0^2\}}\, \phi(\theta_1; \mu, \tau^2)\, d\theta_1 \right]^{-1}, \tag{3.6}$$

where $\phi(x; \mu, \tau^2)$ is the density of $N(\mu, \tau^2)$, $\chi_1^2$ is the central chi-squared random variable with 1 degree of freedom, and $\chi_1^2(\lambda)$ is the non-central chi-squared random variable with 1 degree of freedom and non-centrality parameter $\lambda = \theta_1^2 / \hat\sigma_D^2$. The quantity in (3.6) is also nonincreasing in $c$, because the ratio $P\{\chi_1^2(\lambda) > c_0^2\} / P\{\chi_1^2 > c_0^2\}$ is known to be nondecreasing in $c_0^2$ (Gupta and Sarkar 1984) and hence in $c$.
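As a numerical companion to (3.5), the sketch below evaluates the conditional AFDR for fixed $\mu$ and $\tau^2$ and uses the monotonicity in $c$ stated above to locate, by bisection, a threshold at which the conditional AFDR first drops to a target level. The function names, the bisection search and the parameter values in the usage lines are illustrative assumptions, not part of the paper; only the formula itself comes from (3.5).

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def conditional_afdr(c, mu, tau2, sigma2_d, pi0, k):
    """Conditional AFDR of (3.5) for given mu and tau^2."""
    s = sqrt(sigma2_d + tau2)
    c0 = c / sqrt(sigma2_d)
    c1 = (-c - mu) / s
    c2 = (c - mu) / s
    # nu = P{|D_1| < c}: mixture of the null and alternative components.
    nu = pi0 * (2.0 * PHI(c0) - 1.0) + (1.0 - pi0) * (PHI(c2) - PHI(c1))
    # Geometric-sum form of (3.5); it avoids dividing by 1 - nu.
    return 2.0 * pi0 * (1.0 - PHI(c0)) * sum(nu ** j for j in range(k))


def threshold_for_afdr(alpha, mu, tau2, sigma2_d, pi0, k, tol=1e-6):
    """Bisection for a threshold c with conditional AFDR <= alpha (alpha > 0).

    Relies on the conditional AFDR being nonincreasing in c.
    """
    lo, hi = 0.0, sqrt(sigma2_d)
    while conditional_afdr(hi, mu, tau2, sigma2_d, pi0, k) > alpha:
        hi *= 2.0  # expand the bracket until the level is met
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_afdr(mid, mu, tau2, sigma2_d, pi0, k) > alpha:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical parameter values, chosen only to exercise the functions.
c_star = threshold_for_afdr(0.05, mu=0.3, tau2=1.0, sigma2_d=0.1, pi0=0.9, k=50)
```

At $c = 0$ every hypothesis is rejected and the function returns $\pi_0$, which is a convenient sanity check. The unconditional AFDR would additionally require averaging over $\mu$, $\xi$ and $\tau^2$.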

4 Average False Non-discovery Rate

The AFDR defined above is only one part of the Bayes risk of misclassifications. Another quantity, measuring the error rate of false non-rejections, is the average false non-discovery rate, which is defined as follows.

Definition 2. The Average False Non-discovery Rate (AFNR) among the set of non-rejected hypotheses is defined to be

$$AFNR = E_\theta\left[ E_{D \mid \theta}\left( \frac{T}{A \vee 1} \right) \right]. \tag{4.1}$$

In words, the AFNR is the average risk of non-rejections when the hypotheses are false. This quantity can also be seen as the expected posterior risk of false non-rejections with respect to the marginal distribution of the data $D$. The AFNR of a single-step procedure that rejects $H_i$ if $|D_i|$ is large compared to the threshold value $c$ can be written as

$$AFNR = \frac{1}{k} \sum_{i=1}^{k} P\{|D_i| < c,\ \theta_i \neq 0\} \left[ 1 + k \sum_{j=1}^{k-1} \frac{P\left\{ |D|^{(-i)}_{j:k-1} \ge c \;\middle|\; |D_i| < c,\ \theta_i \neq 0 \right\}}{j(j + 1)} \right]. \tag{4.2}$$

If $D$ is multivariate normal with non-negative correlations, as is the case under the above hierarchical mixture model, then the AFNR is no more than the right hand side of (4.2); see Sarkar (2003). If $(D_i, \theta_i)$, $i = 1, \ldots, k$, are iid, then (4.2) reduces to

$$AFNR = P\{A > 0\}\, P\{\theta_1 \neq 0 \mid |D_1| < c\}; \tag{4.3}$$

see also Storey (2003). Thus, under the above hierarchical mixture model and conditional on $\mu$, $\xi$ and $\tau^2$, the AFNR can be written as

$$AFNR = \left[ 1 - (1 - \nu)^k \right] \frac{(1 - \pi_0) \left[ \Phi(c_2) - \Phi(c_1) \right]}{\nu}, \tag{4.4}$$

where $\nu$, $c_1$ and $c_2$ are as defined in (3.5). The AFNR is the integral of (4.4) with respect to $\mu$, $\xi$ and $\tau^2$. It is a nondecreasing function of $c$, which can be proved using the same arguments as in the case of the AFDR.

5 Combining the AFDR and AFNR

The AFDR and AFNR together constitute the Bayes risk of misclassifications. Our idea here is to determine the threshold that minimizes the Bayes risk in some sense. We consider a weighted linear combination of the AFDR and AFNR, defined as the Average Bayes Error Rate (ABER) of false rejections and false non-rejections, i.e.,

$$ABER = w\, AFDR + (1 - w)\, AFNR, \tag{5.1}$$

with the weight $0 \le w \le 1$ on the AFDR being determined by the importance of false rejections relative to false non-rejections, and find the threshold that minimizes the ABER. This is in the spirit of Storey (2003) and Genovese and Wasserman (2002b), who considered similar combinations in terms of the FDR and FNR.

Storey (2003) points out that there are two approaches that can be taken for the FDR: fix the FDR at an acceptable level $\alpha$ first and estimate the rejection region, or fix the rejection region first and provide an estimate of the FDR over that region. These approaches are practically useful, since the FDR is a monotonic function of $c$. Although we focus here on minimizing the ABER and finding its corresponding threshold, one can alternatively consider fixing the ABER, AFDR or AFNR and then estimating the threshold. By doing this, however, one may not be able to achieve the minimum of the ABER, since the minimization process requires the input of threshold values. In what follows we illustrate the ABER minimization approach in a gene expression example and a simulation study, where the thresholds at which the ABER is minimized are provided.

6 An Application to Gene Expression Data

Hereditary breast cancer is known to be associated with mutations in BRCA1 and BRCA2 proteins. Hedenfalk et al. (2001) report that a group of genes are differentially expressed between tumors with BRCA1 mutations and tumors with BRCA2 mutations. The data, which are publicly available from the web site consist of 22 breast cancer samples, among which $n_1 = 7$ are BRCA1 mutants, $n_2 = 8$ are BRCA2 mutants, and $n_3 = 7$ are sporadic (not used in this illustration). Expression levels, in terms of fluorescent intensity ratios of a tumor sample to a common reference sample, are measured for 3226 genes using cDNA microarrays. As usual, the base 2 logarithmic transformation of the ratios was performed, from which $\hat\sigma^2$ and $\hat\sigma_D^2$ were estimated to be and , respectively. We then computed the common two-sample $t$ test statistic ($t = D/\hat\sigma_D$, with 13 d.f.) and its corresponding raw p-value for each gene. Without multiplicity adjustment, there are 378 genes (out of 3226) whose raw p-values However, the most conservative Bonferroni-adjustment method suggests only 2 rejections at FWER 0.05, and the BH-FDR procedure declares 15 differentially expressed genes (adjusted p-value 0.05).

Before applying our procedures to this data set, we first assume $\pi_0 = 0.90$, which, as Ishwaran and Rao (2003) pointed out, represents a fairly realistic scenario for gene expression data. Then we define the prior distribution $g_1(\xi, \tau^2) = (1/\tau^2)\, g_2(\xi)$; thus $\tau^2$ is given the usual noninformative prior and $\xi > 0$ is given

$$g_2(\xi) = \frac{1}{\sqrt{2\pi}}\, \xi^{-3/2} \exp\left( -\frac{1}{2\xi} \right). \tag{6.1}$$
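The pieces defined so far, namely the conditional formulas (3.5) and (4.4), the combination (5.1) and the prior (6.1), can be put together in a rough Monte Carlo sketch. Several choices below are assumptions made only for illustration: $\tau^2$ is held at a fixed value (the improper $1/\tau^2$ prior cannot be sampled directly), the function names and parameter values are hypothetical, and the same prior draws are reused for every $c$ so that the estimated curves are smooth in $c$. The density (6.1) is that of the reciprocal of a chi-squared variable with 1 degree of freedom, so $\xi$ is drawn as $1/Z^2$ with $Z$ standard normal.

```python
import random
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def conditional_rates(c, mu, tau2, sigma2_d, pi0, k):
    """Return (AFDR, AFNR) from (3.5) and (4.4) for given mu and tau^2."""
    s = sqrt(sigma2_d + tau2)
    c0 = c / sqrt(sigma2_d)
    alt = PHI((c - mu) / s) - PHI((-c - mu) / s)  # Phi(c2) - Phi(c1)
    nu = pi0 * (2.0 * PHI(c0) - 1.0) + (1.0 - pi0) * alt
    tail = 2.0 * pi0 * (1.0 - PHI(c0))
    # Guard the two divisions at the boundary values nu = 1 and nu = 0.
    afdr = tail * ((1.0 - nu ** k) / (1.0 - nu) if nu < 1.0 else k)
    afnr = (1.0 - (1.0 - nu) ** k) * (1.0 - pi0) * alt / nu if nu > 0.0 else 0.0
    return afdr, afnr


def monte_carlo_aber(c_grid, w, tau2, sigma2_d, pi0, k, n_draws=2000, seed=0):
    """Average (3.5) and (4.4) over prior draws of mu; tau^2 is held fixed."""
    rng = random.Random(seed)
    mus = []
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        xi = 1.0 / max(z * z, 1e-300)             # xi has density (6.1)
        mus.append(rng.gauss(0.0, sqrt(xi * tau2)))  # mu | xi, tau^2 ~ N(0, xi tau^2)
    rows = []
    for c in c_grid:
        rates = [conditional_rates(c, mu, tau2, sigma2_d, pi0, k) for mu in mus]
        afdr = sum(r[0] for r in rates) / n_draws
        afnr = sum(r[1] for r in rates) / n_draws
        rows.append((c, afdr, afnr, w * afdr + (1.0 - w) * afnr))  # ABER of (5.1)
    return rows


# Hypothetical settings; the grid minimizer is only as fine as the grid.
grid = [0.1 * i for i in range(1, 31)]
rows = monte_carlo_aber(grid, w=0.9, tau2=1.0, sigma2_d=0.1, pi0=0.9, k=100)
c_min = min(rows, key=lambda row: row[3])[0]
```

Because each conditional curve is monotone in $c$ and the draws are shared across the grid, the averaged AFDR and AFNR estimates inherit that monotonicity, and the ABER minimizer can be read off the grid directly. The numbers produced depend on the assumed value of $\tau^2$ and should not be compared with the tables in the paper.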

The prior (6.1) results in $f(\mu \mid \tau^2) = \text{Cauchy}(0, \tau)$ after integrating over $\xi$; see Berger, Boukai, and Wang (1997) for more discussion on the choice of this prior. The integrations with respect to $\mu$, $\tau^2$ and $\xi$ in the calculations of the AFDR and AFNR under the hierarchical mixture model are carried out using Monte Carlo integration. Specifically, we sample $\tau^2$, $\xi$ and $\mu$ from their respective prior distributions and substitute these values into (3.5) and (4.4). The AFDR, AFNR and ABER are then obtained at a given value of $c$ by averaging over 5000 iterations.

The AFDR and AFNR across $c$-values for the breast cancer data are graphically displayed in Figure 1. As expected, the AFDR is decreasing and the AFNR is increasing in $c$. The ABERs were obtained and plotted against $c$ for weights $w$ = 0.5, 0.6, 0.7, 0.8 and 0.9 (Figure 2). Clearly, one can always find a value of $c$ that minimizes the ABER for a given $w$. Note that the ABERs for the various $w$'s all cross at c = at which the AFDR and AFNR are approximately equal. We estimated the critical value $c$ that minimizes the ABER and then applied these critical values to the breast cancer data (Table 2). For instance, with $\pi_0 = 0.9$ and $w = 0.9$ there are 28 genes with $|D_i| \ge 1.48$ that are declared differentially expressed between BRCA1 mutation tumors and BRCA2 mutation tumors. The AFDR and AFNR at $c = 1.48$ are and , respectively. The mean differences in gene expression levels between the two cancer types, together with the results of our approach and the BH-FDR procedure, are shown in Figure 3. The ABER approach picks up 13 more genes than the BH-FDR method, which is due to the higher power of the approach, as will be shown in the simulation studies of the next section.

Since the ABER minimization results depend on the variance of $D_i$ and the prior probability $\pi_0$, we provide the critical value $c$ and the corresponding ABER for $\hat\sigma_D^2$ from 0.05 to 0.15 and $\pi_0$ = 0.85, 0.90, 0.95 (Table 3). It can be seen that for the breast cancer data, if $\hat\sigma_D^2$ decreases from to 0.08, i.e., the number of expression measurements increases to 10 for each tumor type of each gene, then the critical value $c$ decreases from 1.48 to 1.26, resulting in more rejections and a smaller misclassification rate (ABER drops from to ). Thus, Table 3 can also be used in designing a microarray experiment.

7 Simulations

We compare in this section our proposed AFDR and ABER approaches with the BH-FDR method in terms of power, using simulation studies. Specifically, we study the average power of AFDR, ABER and BH-FDR under the setup of the hierarchical model, and then investigate the influence of outliers and of prior (hyper-prior) parameters on the performance of the methods, i.e., the robustness of the procedures to influential data points and prior information.

7.1 Power comparison

The average power is defined in the frequentist context as the expected proportion of false null hypotheses that are correctly rejected, and it has been widely used in comparing multiple testing procedures (Benjamini and Hochberg 1995; Shaffer 1999; Benjamini and Liu 1999; Storey 2002). The setup of this simulation for average power comparison is as follows:

$k$ = 10, 50, 100, 5000, 1000;
$\sigma_D^2$ = 0.1, 0.2, 0.3;
$\pi_0$ = 0.2, 0.5, 0.8,

and a simulation is conducted for each combination of $k$, $\sigma_D^2$ and $\pi_0$. Under the null hypothesis, i.e., $\theta_i = 0$, a data point is drawn from $N(0, \sigma_D^2)$, and under the alternative hypothesis, i.e., $\theta_i \neq 0$, a data point is drawn from $N(\theta_i, \sigma_D^2)$, where $\theta_i$ is an independent draw from the $N(\mu, \tau^2)$ distribution and $\mu$ follows the Cauchy$(0, \tau^2)$ distribution given $\tau^2$. A total of $k\pi_0$ null data points and $k(1 - \pi_0)$ alternative data points are randomly drawn for each set of $k$ hypotheses. The average powers are obtained over 5000 simulations and plotted against $k$ for AFDR 0.05, ABER with $w$ = 0.5, 0.7, 0.9 and BH-FDR 0.05 (Figure 4).

As expected, the average power for all methods decreases with an increase in the number of hypotheses $k$, the variance of the observed data $\sigma_D^2$, and the proportion of true null hypotheses $\pi_0$. The main point of the plot, however, is that our proposed approach, either AFDR 0.05 or minimum ABER with different $w$'s, is more powerful than the BH-FDR method. This is due to the fact that the BH-FDR procedure assigns the same FDR to all rejected hypotheses, i.e., the FDR is the same for all hypotheses with test statistics in the rejection region. On the other hand, the AFDR or ABER is the weighted average of the BH-FDR (and FNR), with the prior density of the parameter as the weight. When the error rate is controlled at the same level, the AFDR or ABER approach results in more rejections and hence is more powerful than the BH-FDR method.

7.2 Robustness

We study the robustness of the AFDR, ABER and BH-FDR approaches to some influential data points as well as to various choices of hyper-prior information. To simplify the investigation, we first fix $k = 1000$, $\sigma_D^2 = 0.1$, $\pi_0 = 0.80$ and let the prior for $\xi$ vary. The prior for $\tau^2$ is still the conventional non-informative $1/\tau^2$, as it is not only computationally convenient but also practically indistinguishable from a subjective prior when there are sufficient data (Berger and Deely 1988).
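For reference, the BH-FDR comparator and the average power criterion used in Section 7.1 can be sketched as follows. This is a generic implementation of the Benjamini and Hochberg (1995) step-up rule applied to two-sided normal p-values; the function names, the fixed value of $\mu$ and the parameter values in the usage lines are assumptions for illustration and do not reproduce the simulation design of the paper.

```python
import random
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal c.d.f.


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up BH rule: reject the hypotheses with the `last` smallest p-values,
    where `last` is the largest rank r with p_(r) <= r * alpha / k."""
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    last = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / k:
            last = rank
    rejected = [False] * k
    for i in order[:last]:
        rejected[i] = True
    return rejected


def average_power(is_null, rejected):
    """Proportion of false null hypotheses that are rejected."""
    hits = [rej for null, rej in zip(is_null, rejected) if not null]
    return sum(hits) / len(hits) if hits else 0.0


def simulate_once(k, pi0, sigma2_d, mu, tau2, rng):
    """One data set: k*pi0 nulls at theta = 0, the rest with theta ~ N(mu, tau^2)."""
    n_null = round(k * pi0)
    is_null = [True] * n_null + [False] * (k - n_null)
    sd = sqrt(sigma2_d)
    p_values = []
    for null in is_null:
        theta = 0.0 if null else rng.gauss(mu, sqrt(tau2))
        d = rng.gauss(theta, sd)
        p_values.append(2.0 * (1.0 - PHI(abs(d) / sd)))  # two-sided p-value
    return is_null, p_values


rng = random.Random(0)
is_null, p_values = simulate_once(k=100, pi0=0.8, sigma2_d=0.1, mu=1.0, tau2=1.0, rng=rng)
power = average_power(is_null, benjamini_hochberg(p_values, alpha=0.05))
```

Repeating `simulate_once` many times and averaging `power` gives a Monte Carlo estimate of the average power of BH-FDR for one configuration; the same loop could be run with a fixed threshold $c$ in place of the BH rule to compare a threshold-based procedure.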

Towards this goal, we first generate influential or outlying data points,

$$D_i \sim \delta\, N(\theta_i, \sigma_D^2) + (1 - \delta)\, N(\theta_i^*, \sigma_D^2), \quad i = 1, \ldots, k, \tag{7.1}$$

where $\delta$ is a random binary variable with $P(\delta = 1) = \gamma$ and $\theta_i^*$ follows the $N(\mu, c^*\tau^2)$ distribution with some pre-specified $c^* > 1$. The hyper-prior for $\xi$ is chosen as an inverse gamma density $IG(\tfrac{1}{2}, \beta)$, with $\beta$ specified below. Note that $IG(\xi; \tfrac{1}{2}, \beta)$ leads to $f(\mu \mid \tau^2) = \text{Cauchy}(0, \tau\sqrt{2/\beta})$. The following setup is considered for the robustness simulation:

$\gamma$ = 0.90, 0.95, 0.99;
$c^*$ = 2, 4, 8;
$\beta$ = 0.25, 0.5, 1.

As in the previous section, we considered all configurations of $\gamma$, $c^*$ and $\beta$. The resulting average powers are obtained from 5000 simulations and plotted against $\beta$ for each combination of $\gamma$ and $c^*$ (Figure 5). A large $\beta$ makes the $IG(\tfrac{1}{2}, \beta)$ density skew to the left, leading to an inflated variance of $\theta$ and consequently a decrease in average power, which is seen for all of the approaches. However, the average power for AFDR and the ABERs is relatively flat, and hence these approaches are more robust to the choice of the prior for $\xi$ than the BH-FDR method. Also, the influential data points have no practical impact on the average power for any of the procedures, because all of the influential data points are generated from the alternative population and are more likely to be rejected.

8 Discussion

A different Bayesian approach is taken for measuring the risks of false rejections and false non-rejections in multiple testing. Although we illustrated this approach by identifying differentially expressed genes from a microarray experiment, it can also be applied to other multiple testing situations.

In an attempt to come up with a Bayesian measure of Type I error rate, we have started with the proportion of Type I errors among the total number of rejections, i.e., the proportion of false discoveries, before averaging it with respect to the distributions of data and prior. There is, however, another way one can measure Type I error rate, which is to consider the proportion of Type I errors among the hypotheses that are true and average it over data and parameters. In other words, one might consider the following alternative measure of Type I error rate, which we call the Bayesian False Positive Rate (BFPR):

$$BFPR = E_\theta\left[ E_{D \mid \theta}\left( \frac{V}{k_0 \vee 1} \right) \right]. \tag{8.1}$$

It seems that a Bayesian would prefer (8.1) to the AFDR as a measure of Type I error rate, since it is based on the ratio measuring how many of the null hypotheses believed to be true are rejected by the data. Similarly, the Bayesian False Negative Rate (BFNR), defined as

$$BFNR = E_\theta\left[ E_{D \mid \theta}\left( \frac{T}{k_1 \vee 1} \right) \right], \tag{8.2}$$

seems to be a more appropriate measure of Type II error rate to a Bayesian than the AFNR. If we use the hierarchical mixture model considered in Section 2 with the same parameter settings (i.e., $\pi_0 = 0.9$ and $w = 0.9$), then a linear combination of the BFPR and BFNR is minimized at $c = 1.24$, resulting in 54 rejections.

Some issues surrounding the AFDR, AFNR and ABER, such as the robustness of the results, remain to be investigated. We have chosen the prior distributions arbitrarily, for convenience; other choices of priors could be considered, and the impact of the prior input on the resulting rejections and non-rejections could also be evaluated.

19 Acknowledgement We would like to thank A. Lawrence Gould of Merck Research Laboratories for practical suggestions on the simulation, the Associate Editor and the anonymous referee for constructive comments that have improved the presentation of this paper. References Afshari, C. A., Nuwaysir, E. F., and Barrett, J. C. (1999). Application of complementary dna microarray technology to carcinogen identification, toxicology, and drug safety evaluation. Cancer Research 59, Baldi, P. and Long, A. D. (2001). A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes. Bioinformatics 17, Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, Benjamini, Y. and Liu, W. (1999). A step-down multiple hypothesis testing procedure that controls the false discovery rate under independence. Journal of Statistical Planning and Inference 82, Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). New York: Springer-Verlag. Berger, J. O., Boukai, B., and Wang, Y. (1997). Unified frequentist and Bayesian testing of a precise hypothesis. Statistical Science 12, Berger, J. O. and Deely, J. (1988). A Bayesian approach to ranking and selection of related means with alternatives to analysis of variance methodology. 19

20 Journal of the American Statistical Association 83, Broët, P., Richardson, S., and Radvanyi, F. (2002). Bayesian hierarchical model for identifying changes in genes expression from microarray experiments. Journal of Computational Biology 9, Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P., and Rubin, E. M. (2000). Microarrays expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Research 10, Chen, Y. J., Kodell, R., Sistare, F., Thompson, K. L., Morris, S., and Chen, J. J. (2003). Normalization methods for analysis of microarray geneexpression data. Journal of Biopharmaceutical Statistics 13, Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science 18, Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, Finkelstein, D. B., Gollub, J., and Cherry, J. M. (2001). Normalization and systematic measurement error in cdna microarray data. In ASA Proceedings of the Joint Statistical Meetings. Ge, Y., Dudoit, S., and Speed, T. P. (2003). Resampling-based multiple testing for microarray data analysis. Test 12, Genovese, C. and Wasserman, L. (2002a). Bayesian and frequentist multiple testing. Technical Report, Carnegie Mellon University. Genovese, C. and Wasserman, L. (2002b). Operating characteristics and extentions of the false discovery rate procedure. Journal of the Royal Statistical Society B. 64,

Gupta, S. D. and Sarkar, S. K. (1984). On TP2 and log-concavity. In Y. L. Tong (Ed.), Inequalities in Statistics and Probability, pp. Institute of Mathematical Statistics.

Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O. P., Wilfond, B., Borg, A., and Trent, J. (2001). Gene-expression profiles in hereditary breast cancer. The New England Journal of Medicine 344,

Ibrahim, J. G., Chen, M. H., and Gray, R. J. (2002). Bayesian models for gene expression with DNA microarray data. Journal of the American Statistical Association 97,

Ishwaran, H. and Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. Journal of the American Statistical Association 98 (462),

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2003). Detecting differential gene expression with a semiparametric hierarchical mixture method. Technical Report #1074, Department of Statistics, University of Wisconsin-Madison.

Nuwaysir, E. F., Bittner, M., Trent, J., Barrett, J. C., and Afshari, C. A. (1999). Microarrays and toxicology: The advent of toxicogenomics. Molecular Carcinogenesis 24,

Reiner, A., Yekutieli, D., and Benjamini, Y. (2003). Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 19,

Sarkar, S. K. (2003). False discovery and false non-discovery rates in single-step multiple testing procedures. Technical report, Temple University.

Schervish, M. J. (1995). Theory of Statistics. New York: Springer-Verlag.

Shaffer, J. P. (1999). A semi-Bayesian study of Duncan's Bayesian multiple comparison procedures. Journal of Statistical Planning and Inference 82,

Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64,

Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Annals of Statistics 31,

Yang, Y. H., Dudoit, S., Luu, P., and Speed, T. P. (2001, January). Normalization for cDNA microarray data. Technical Report 589, Department of Statistics, UC Berkeley.

Figure 1: The AFDR and AFNR as a function of critical value c for the breast cancer data with π0 =

Figure 2: The ABER as a function of critical value c for the breast cancer data with π0 =

Figure 3: Detection of differentially expressed genes for the breast cancer data with π0 = 0.90 and w = 0.9. B: declared differentially expressed by both the ABER and BH-FDR methods; A: declared by the ABER method but not by the BH-FDR method; other points: not declared by either method. The two dashed horizontal lines represent the critical values c = ±1.48. (Axes: Gene; Mean Difference, d_i.)

Table 1: Outcomes of k Hypothesis Tests

True State               Accepted   Rejected   Total
Null Hypotheses          U          V          k0
Alternative Hypotheses   T          S          k1
Total                    A          R          k
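The counts in Table 1 determine the two error proportions that the rates in this paper average. As a minimal sketch, assuming the standard conventions that the FDR is the expectation of V/R (taken as 0 when R = 0) and the FNR is the expectation of T/A (taken as 0 when A = 0), the proportions for a single realization can be computed as follows. The class name `Outcomes` and the example counts are hypothetical and chosen only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Outcomes:
    """Counts from Table 1 for one realization of k hypothesis tests."""
    U: int  # true nulls accepted
    V: int  # true nulls rejected (false discoveries)
    T: int  # false nulls accepted (false non-discoveries)
    S: int  # false nulls rejected (correct discoveries)

    @property
    def R(self) -> int:
        """Total number of rejections."""
        return self.V + self.S

    @property
    def A(self) -> int:
        """Total number of acceptances."""
        return self.U + self.T

    def false_discovery_proportion(self) -> float:
        """V / R, defined as 0 when nothing is rejected."""
        return self.V / self.R if self.R > 0 else 0.0

    def false_nondiscovery_proportion(self) -> float:
        """T / A, defined as 0 when nothing is accepted."""
        return self.T / self.A if self.A > 0 else 0.0


# Hypothetical counts (not taken from the paper's data).
example = Outcomes(U=880, V=20, T=40, S=60)
print(example.R, example.A)
print(example.false_discovery_proportion())
print(example.false_nondiscovery_proportion())
```

Averaging these proportions over repeated data sets gives frequentist rates; averaging further over a prior on the parameters corresponds to the idea behind the AFDR and AFNR.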

Table 2: Critical value c at which the ABER is minimized and the corresponding estimated AFDR and AFNR as well as the number of rejections for breast cancer data with π0 = 0.9

w   c   AFDR   AFNR   ABER   No. rejections
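The search for a critical value c that minimizes the ABER can be sketched with a simple grid search. The sketch below is only an illustration under stated assumptions: it does not use the paper's AFDR and AFNR formulas, but substitutes the marginal FDR and FNR of a two-group normal model in which the statistic is N(0, sd0^2) under the null and N(0, sd1^2) under the alternative. The weighting ABER = w * AFDR + (1 - w) * AFNR, the function names, and the values of `sd0`, `sd1`, and the grid are all assumptions made for this example.

```python
import math


def two_sided_tail(c: float, sd: float) -> float:
    """P(|D| > c) for D ~ N(0, sd^2)."""
    return math.erfc(c / (sd * math.sqrt(2.0)))


def marginal_fdr(c: float, pi0: float, sd0: float, sd1: float) -> float:
    """Share of rejections |D| > c that come from true nulls (stand-in for AFDR)."""
    null_rej = pi0 * two_sided_tail(c, sd0)
    alt_rej = (1.0 - pi0) * two_sided_tail(c, sd1)
    total = null_rej + alt_rej
    return null_rej / total if total > 0 else 0.0


def marginal_fnr(c: float, pi0: float, sd0: float, sd1: float) -> float:
    """Share of acceptances |D| <= c that come from false nulls (stand-in for AFNR)."""
    null_acc = pi0 * (1.0 - two_sided_tail(c, sd0))
    alt_acc = (1.0 - pi0) * (1.0 - two_sided_tail(c, sd1))
    total = null_acc + alt_acc
    return alt_acc / total if total > 0 else 0.0


def minimize_aber(w, pi0=0.9, sd0=1.0, sd1=3.0, c_max=6.0, step=0.01):
    """Grid search for the c in (0, c_max] minimizing w*FDR + (1-w)*FNR."""
    best_c, best_val = None, math.inf
    for i in range(1, int(round(c_max / step)) + 1):
        c = i * step
        val = (w * marginal_fdr(c, pi0, sd0, sd1)
               + (1.0 - w) * marginal_fnr(c, pi0, sd0, sd1))
        if val < best_val:
            best_c, best_val = c, val
    return best_c, best_val


for w in (0.5, 0.7, 0.9):
    c_star, aber = minimize_aber(w)
    print(f"w = {w:.1f}: c = {c_star:.2f}, ABER = {aber:.4f}")
```

Under this assumed weighting, a larger w penalizes false discoveries more heavily, so the minimizing c grows with w. The tabulated critical values show the same direction of change, although the numbers produced by this stand-in model are not comparable to them.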

Table 3: Critical value c and the corresponding ABER, shown as c (ABER), for various combinations of π0 and σ̂_D^2

π0      σ̂_D^2   w = 0.5         w = 0.6         w = 0.7         w = 0.8         w =
                     (0.0158)   0.84 (0.0129)   0.87 (0.0099)   0.90 (0.0068)   0.95 (0.0035)
                     (0.0170)   0.92 (0.0139)   0.95 (0.0107)   0.99 (0.0073)   1.04 (0.0038)
                     (0.0181)   0.99 (0.0149)   1.03 (0.0114)   1.06 (0.0078)   1.12 (0.0041)
                     (0.0192)   1.06 (0.0157)   1.10 (0.0121)   1.14 (0.0083)   1.20 (0.0043)
                     (0.0201)   1.12 (0.0165)   1.16 (0.0127)   1.21 (0.0087)   1.27 (0.0045)
                     (0.0210)   1.19 (0.0172)   1.22 (0.0132)   1.27 (0.0090)   1.34 (0.0047)
                     (0.0218)   1.24 (0.0179)   1.29 (0.0137)   1.33 (0.0094)   1.41 (0.0049)
                     (0.0226)   1.30 (0.0185)   1.34 (0.0142)   1.40 (0.0097)   1.47 (0.0050)
                     (0.0233)   1.35 (0.0191)   1.40 (0.0146)   1.45 (0.0100)   1.53 (0.0052)
                     (0.0240)   1.40 (0.0196)   1.45 (0.0151)   1.51 (0.0103)   1.59 (0.0053)
                     (0.0247)   1.46 (0.0202)   1.51 (0.0155)   1.57 (0.0106)   1.65 (0.0055)

                     (0.0107)   0.90 (0.0087)   0.92 (0.0067)   0.95 (0.0046)   0.99 (0.0024)
                     (0.0116)   0.98 (0.0094)   1.01 (0.0072)   1.04 (0.0050)   1.09 (0.0026)
                     (0.0123)   1.06 (0.0101)   1.09 (0.0077)   1.12 (0.0053)   1.18 (0.0027)
                     (0.0131)   1.13 (0.0107)   1.16 (0.0082)   1.20 (0.0056)   1.26 (0.0029)
                     (0.0137)   1.20 (0.0112)   1.23 (0.0086)   1.27 (0.0059)   1.33 (0.0030)
                     (0.0143)   1.26 (0.0117)   1.30 (0.0090)   1.34 (0.0061)   1.41 (0.0032)
                     (0.0149)   1.32 (0.0122)   1.36 (0.0093)   1.41 (0.0064)   1.48 (0.0033)
                     (0.0154)   1.38 (0.0126)   1.43 (0.0096)   1.47 (0.0066)   1.55 (0.0034)
                     (0.0159)   1.44 (0.0130)   1.49 (0.0100)   1.54 (0.0068)   1.61 (0.0035)
                     (0.0164)   1.50 (0.0134)   1.54 (0.0103)   1.60 (0.0070)   1.67 (0.0036)
                     (0.0169)   1.55 (0.0138)   1.60 (0.0105)   1.66 (0.0072)   1.74 (0.0037)

                     (0.0056)   0.98 (0.0045)   1.00 (0.0035)   1.03 (0.0024)   1.07 (0.0012)
                     (0.0060)   1.07 (0.0049)   1.09 (0.0038)   1.12 (0.0026)   1.17 (0.0013)
                     (0.0065)   1.15 (0.0053)   1.18 (0.0040)   1.21 (0.0027)   1.26 (0.0014)
                     (0.0068)   1.23 (0.0056)   1.26 (0.0043)   1.30 (0.0029)   1.35 (0.0015)
                     (0.0072)   1.31 (0.0058)   1.34 (0.0045)   1.38 (0.0030)   1.43 (0.0016)
                     (0.0075)   1.38 (0.0061)   1.41 (0.0047)   1.45 (0.0032)   1.51 (0.0016)
                     (0.0078)   1.45 (0.0064)   1.48 (0.0049)   1.53 (0.0033)   1.59 (0.0017)
                     (0.0081)   1.51 (0.0066)   1.55 (0.0050)   1.60 (0.0034)   1.66 (0.0018)
                     (0.0084)   1.58 (0.0068)   1.62 (0.0052)   1.67 (0.0035)   1.73 (0.0018)
                     (0.0086)   1.64 (0.0070)   1.68 (0.0053)   1.73 (0.0036)   1.80 (0.0019)
                     (0.0088)   1.70 (0.0072)   1.74 (0.0055)   1.80 (0.0037)   1.87 (0.0019)

Figure 4: Average power (the proportion of the false null hypotheses that are correctly rejected) for the ABER with different weights w (= 0.5, 0.7, 0.9), the AFDR, and the BH-FDR at various combinations of π0 and σ_D^2. (Axes: Number of Hypotheses Tested; Average Power (%). Legend: AFDR; ABER, w = 0.5; ABER, w = 0.7; ABER, w = 0.9; BH-FDR.)
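For readers who want to reproduce a comparison of this kind, the BH-FDR baseline and the power measure can be sketched as follows. This is a minimal illustration of the Benjamini and Hochberg (1995) step-up procedure, which rejects the hypotheses with the i* smallest p-values, where i* is the largest rank i with p_(i) <= i q / k. The function names and the example p-values are hypothetical and are not taken from the paper's simulations.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the BH step-up procedure rejects."""
    k = len(p_values)
    # Indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(k), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: keep the LARGEST rank whose p-value is under its threshold.
        if p_values[idx] <= rank * q / k:
            cutoff_rank = rank
    rejected = [False] * k
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


def power(rejected, is_false_null):
    """Proportion of false null hypotheses that are correctly rejected."""
    n_false = sum(is_false_null)
    if n_false == 0:
        return 0.0
    hits = sum(1 for r, f in zip(rejected, is_false_null) if r and f)
    return hits / n_false


# Hypothetical p-values and truth labels, for illustration only.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
truth = [True, True, True, False, False, False]
decisions = benjamini_hochberg(p_vals, q=0.05)
print(decisions)
print(power(decisions, truth))
```

Averaging `power` over many simulated data sets gives an average power of the kind plotted in the figure.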

Figure 5: Average power for the ABER with different weights w (= 0.5, 0.7, 0.9), the AFDR, and the BH-FDR at various combinations of the choice of β, γ and c'. (Panels indexed by γ and c'. Axes: β; Average Power (%). Legend: AFDR; ABER, w = 0.5; ABER, w = 0.7; ABER, w = 0.9; BH-FDR.)
