

The Pennsylvania State University
The Graduate School

A BAYESIAN APPROACH TO FALSE DISCOVERY RATE FOR LARGE SCALE SIMULTANEOUS INFERENCE

A Thesis in Statistics
by Bing Han

© 2007 Bing Han

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

August 2007

The thesis of Bing Han was reviewed and approved by the following:

Steven F. Arnold, Professor of Statistics, Thesis Advisor
Naomi S. Altman, Associate Professor of Statistics, Chair of Committee
Bing Li, Professor of Statistics
Claude dePamphilis, Professor of Biology
Bruce G. Lindsay, Willaman Professor of Statistics, Department Head

Signatures are on file in the Graduate School.

Abstract

Microarray data and other applications have inspired many recent developments in the area of large scale simultaneous inference. For microarray data, the number of simultaneous tests of differential gene expression in a typical experiment ranges from 1,000 to 100,000. The traditional family-wise type I error rate (FWER), defined as the probability of at least one occurrence of a type I error among all the tests, is over-stringent in this context due to the large scale of simultaneity. More recently, the false discovery rate (FDR) was defined as the expected proportion of type I errors among the rejections. Controlling the less stringent FDR criterion loses less detection capability than controlling the FWER and hence is preferable for large scale multiple tests. From the Bayesian point of view, the posterior versions of the FDR and of the false nondiscovery rate (FNR) are easier to study. We study Bayesian decision rules to control the Bayes FDR and FNR. A hierarchical mixture model is developed to estimate the posterior probability of hypotheses. The posterior distribution can also be used to estimate the false discovery percentage (FDP), defined as the integrand of the FDR. The model in conjunction with Bayesian decision rules displays satisfying performance in simulations and in the analysis of the Affymetrix Latin Square HG-U133A spike-in data.

Table of Contents

List of Tables
List of Figures
Acknowledgments

Chapter 1. Introduction
  The false discovery rate (FDR) and relevant concepts
  False coverage-statement rate (FCR)
  A brief review of relevant statistical procedures
  The major application areas of large scale inference
    DNA microarray
    Other applications

Chapter 2. Bayesian decision rules based on BFDR and BFDP
  Notations and bases in Bayesian hypothesis tests
  Nonrandomized decision rules
  Further discussions on the rules
  A randomized decision rule
  Bayes false discovery percentage (BFDP)

Chapter 3. A Bayes model for 2-condition comparison
  P-value paradox and the probability of a null hypothesis
  Introduction to Bayesian finite mixture model
  Model for treatment/control comparison
  Exchangeable model on 2-condition comparison
    The exchangeable model
    Markov Chain Monte Carlo (MCMC) and posterior inference
  Extensions of model (3.14)
    Unequal variance model
    Model on structured covariance by random effects
    Model with G-side covariance matrix
    A parsimonious model
    Summary of the model variants

Chapter 4. Model Comparison
  Posterior model checking
  Predictive density and information criteria
    Bayesian information criterion (BIC)
    Deviance information criterion (DIC)
    Using information criteria

Chapter 5. Extension to comparisons among multiple conditions
  Direct modeling of multiple conditions by mixture models
  Approximate approaches: marginal modeling using finite mixture models
  Testing an interval type null hypothesis by Bayesian linear models

Chapter 6. Numerical examples
  Simulation study on 2-condition comparison
  Affymetrix latin square spike-in HG-U133 data
    The analysis
    Model Diagnosis
  Simulation study on multiple conditions without control

Chapter 7. Discussion

Bibliography

List of Tables

1.1 Common notation used in multiple testing and FDR control procedures. Cells are counts of true/false hypotheses being rejected/accepted.
- Two versions of predicted residuals in the prototype model.
- DIC comparisons in simulation scenarios between the Bayesian models K = 1 and K = 2. Scenario 1 has 5% +1.5 and 5% −1.5 difference in mean expressions. Scenario 2 has 5% +1.5 and 5% −1.2 difference in mean expressions. Scenario 3 has 10% +1.5 difference in mean expressions.
- Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is +1.5 for 1,250 genes and −1.5 for another 1,250 genes.
- Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is +1.5 for 1,250 genes and −1.2 for another 1,250 genes.
- Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is …
- Difference among 4 conditions in simulation.

List of Figures

2.1 Plot of BFDR and BFNR against rejections R, where the sorted H_(i), 1 ≤ i ≤ R, are rejected. Solid: BFDR; dashed: BFNR. 25,000 null hypotheses with 90% true. Notice that BFDR(R) → π̂_0 as R → n, BFNR(R) → 1 − π̂_0 as R → 0, and R·BFDR(R) + W·(1 − BFNR(R)) = n·π̂_0.
2.2 Histogram of posterior samples from the BFDP, i.e., f(V/R | Y), in a simulation study. The left histogram used the optimal rule applied with q = .005, and the observed FDP is … The right one, tuning the mode at .005, has 200 rejections, and the observed FDP is .005.
4.1 Plot of Bayes residuals versus genes and predicted values, when the errors are generated from the model assumption N(0, 1).
4.2 Plot of classical residuals versus genes and predicted values, when the errors are generated from the model assumption N(0, 1).
4.3 Plot of the absolute values of residuals versus predicted values with a loess fit, when the errors are generated from the model assumption N(0, 1).
4.4 Histograms of Bayes residuals by arrays, when the errors are generated from the model assumption N(0, 1).
4.5 Histograms of classical residuals by arrays, when the errors are generated from the model assumption N(0, 1).
4.6 Plot of Bayes residuals versus levels and predicted values, when the errors are proportional to the differential expression levels.
4.7 Plot of classical residuals versus levels and predicted values, when the errors are proportional to the differential expression levels.
4.8 Plot of absolute residuals versus predicted values with a loess fit, when the errors are proportional to the differential expression levels.
5.1 Example of disconnected multiple comparisons (DMP).
6.1 The raw differential expressions for the simulated data with true difference size 1.5 for the first 10% of genes.
6.2 Performance of the optimal BFDR rule in the third simulation scenario. (a) Estimated total mistakes versus rejections; (b) solid: observed FDP, dashed: estimated BFDR, versus rejections; (c) solid: observed FNP, dashed: estimated BFNR, versus rejections; (d) solid: observed total mistakes, dashed: estimated mistakes.
6.3 Plot of observed differential expressions Z_2 against gene ID. Small dots are null genes (true nondiscoveries); true discoveries, false discoveries (+), and false nondiscoveries are marked with distinct symbols. Detection used the optimal Rule 2.2 at q = … The bottom graph removes all the null genes.
6.4 Residual plots for the Latin Square spike-in experiment. (a) Residuals against predicted mean expressions; (b) residuals against gene IDs.
6.5 Observed FDR in 1,000 simulated datasets applying the K = 1 prototype model with nominal FDR level … From top to bottom and left to right, four pooling scenarios are tried: all pairwise comparisons, comparisons with control, disconnected pairwise comparisons, connected pairwise comparisons.

Acknowledgments

I am most grateful and indebted to my thesis advisors, Steven F. Arnold and Naomi S. Altman, for their guidance, encouragement, and assistance throughout the entire process of my study and dissertation research at Penn State. I would like to thank the many people who helped me make the completion of this dissertation possible. I am very grateful to my doctoral committee, Dr. Bing Li and Dr. Claude dePamphilis, for the encouragement and support that helped my journey through the statistics doctoral program. Finally, I wish to thank my wife, Dan Wei, a Ph.D. candidate in the Geography Department at Penn State University, for her invaluable help in supporting me and in reading my drafts.

Chapter 1. Introduction

Data intensive areas such as high throughput biology lead to applications in which thousands of simultaneous tests may be performed. For example, microarrays may measure the expression of 1,000 to over 100,000 gene fragments. Under typical levels for hypothesis-wise control of type I error, this leads to an unacceptable number of false detections, i.e., false rejections of the null hypothesis. By contrast, the traditional family-wise type I error rate (FWER), defined as the probability of at least one occurrence of a type I error among all the tests, is overly stringent in this context due to the large scale of simultaneity. Investigators are willing to tolerate a small percentage of false detections to increase the power to detect true differences. The false discovery rate (FDR) was defined as the expected proportion of type I errors among the rejections. Controlling the less stringent FDR criterion induces less loss in detection capability than controlling the FWER and hence is preferable for large scale multiple tests. In this introductory chapter, we present the fundamentals of the FDR and briefly review the relevant research in this area. In addition, we also review the background of the major application fields, primarily microarray data analysis and image analysis.

1.1 The false discovery rate (FDR) and relevant concepts

The false discovery rate (FDR) is defined as the expected proportion of type I errors among rejections. For simplicity of statement, Table 1.1 introduces the standard notation in large scale multiple tests.

Table 1.1. Common notation used in multiple testing and FDR control procedures. Cells are counts of true/false hypotheses being rejected/accepted.

    Hypothesis | D = 0 (accept) | D = 1 (reject) | Total
    H_0 true   | U              | V              | m_0 = U + V
    H_0 false  | T              | S              | m_1 = T + S
    Total      | W = U + T      | R = V + S      | m = W + R

The FDR is defined as (Benjamini and Hochberg, 1995)

    FDR = E(V/R | R > 0) if R > 0, and FDR = 0 if R = 0.    (1.1)

The definition of the FDR is from the classical statistics perspective: each hypothesis is either true or false for sure, but the outcome is unknown to researchers. Notice that in the more interesting case when R > 0, the FDR is defined as the expectation of a ratio, and the numerator V and the denominator R are dependent random variables. This causes much trouble in estimating the FDR.
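The ratio V/R in (1.1) is itself a random quantity (it is named the false discovery percentage below). The following simulation sketch, in a toy setting of our own choosing (all numbers and names are illustrative, not from the thesis), estimates the FDR as the average of the realized ratios and shows how widely individual realizations scatter around that average.

```python
# Toy illustration: V/R is a random variable; the FDR is its expectation,
# and single realizations can sit far from it.
import numpy as np

rng = np.random.default_rng(1)
n, n_alt = 1000, 100
fdps = []
for _ in range(2000):
    is_alt = np.arange(n) < n_alt                 # the first 100 nulls are false
    z = np.where(is_alt, rng.normal(2.5, 1, n), rng.normal(0.0, 1, n))
    reject = z > 2.0                              # a fixed, naive rejection threshold
    R = reject.sum()
    V = (reject & ~is_alt).sum()                  # false discoveries among R rejections
    fdps.append(V / R if R > 0 else 0.0)          # the ratio, set to 0 when R = 0
fdps = np.array(fdps)
print(f"estimated FDR = {fdps.mean():.3f}, sd of the ratio = {fdps.std():.3f}")
```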

The random variable V/R, namely the integrand of the FDR, is discrete and has finite support. It is also of research interest. To distinguish V/R from its expectation, we call the random variable V/R the false discovery percentage (FDP). We define FDP = 0 when R = 0. The FDR and FDP can be viewed as extensions of the classical type I error to large scale multiple tests. It is then natural to define the counterpart of the type II error, the false nondiscovery rate (FNR),

    FNR = E(T/W | W > 0) if W > 0, and FNR = 0 if W = 0,    (1.2)

and the false nondiscovery percentage (FNP) as the random variable T/W; when W = 0 define FNP = 0. Storey (2002, 2003) has suggested an updated criterion, the positive FDR (pFDR), defined as the conditional expectation of V/R given at least one rejection, i.e., E(V/R | R > 0), which can be used to ensure slightly more stringent control, especially when all null hypotheses are true.

We aim to adapt the classical concepts to Bayesian settings. Let θ denote all of the parameters involved and let Y denote the data. Frequentists consider the expectation over repeated sampling. However, from the Bayesian perspective it is preferable to study the posterior version FDR = E(V/R | R > 0, Y) conditioned on the data Y, with the randomness residing in the parameters. The posterior argument can greatly simplify the estimation of the FDR, especially for a nonrandomized test where the number

of rejections is a deterministic function of the data. We will have a clearer discussion of this point in Chapter 2. To distinguish the Bayesian from the classical criterion, we define the posterior version of the FDR as the BFDR throughout the thesis, where B stands for the Bayesian perspective:

    BFDR = E(V/R | R > 0, Y) if R > 0, and BFDR = 0 if R = 0.    (1.3)

Similarly, define the BFNR as the posterior version of the FNR:

    BFNR = E(T/W | W > 0, Y) if W > 0, and BFNR = 0 if W = 0.    (1.4)

We define the Bayes false discovery percentage, BFDP, as a random variable with the distribution f(V/R | Y) when R > 0 and degenerate at 0 otherwise. The BFNP is defined similarly as a random variable with the distribution f(T/W | Y) when W > 0 and degenerate at 0 otherwise. Many recent research papers have discussed or applied these relatively new concepts in large scale multiple tests. Since its introduction in Benjamini and Hochberg (1995), the FDR criterion has been used in hundreds of papers in statistics, bioinformatics, and other fields; for example, Benjamini and Hochberg (2000); Efron et al. (2001); Storey (2002, 2003); Efron (2004); Benjamini and Yekutieli (2005); Hu et al. (2005); Sarkar (2006), and many more. The FNR, which evaluates the type II error, is also

recognized and emphasized in conjunction with the FDR by many authors, for example, Ishwaran and Rao (2003), Delongchamp et al. (2004), and others. Among others, Genovese and Wasserman (2002, 2004) focused on the random variables FDP and FNP. Bayesian approaches have been developed more recently; ideas similar to our BFDR and BFNR are discussed in Cohen and Sackrowitz (2005) and Muller et al. (2006). In sum, this is a fruitful area that has attracted many researchers over the past few years.

1.2 False coverage-statement rate (FCR)

Traditional multiple interval estimation emphasizes the simultaneous confidence level. The Bayesian counterpart is the simultaneous credible level. Based on the posterior joint distribution simulated by Markov chain Monte Carlo (MCMC), the exact simultaneous credible region is always estimable (Held, 2004). On the other hand, classical simultaneous confidence regions are usually conservative in general cases due to the difficulty of deriving exact procedures. In a loose sense, if confidence and credibility are comparable, Bayesian simultaneous interval estimates are always at least as precise; refer to Held (2004) for low-dimensional examples. However, our simulation study has shown that the improvement from exactness quickly diminishes in high dimensional problems. Simultaneous credibility, like simultaneous confidence, is perhaps not the right criterion in large scale interval estimation problems. The false coverage-statement rate (FCR) (Benjamini and Yekutieli, 2005) in interval estimation is a concept analogous to the FDR in hypothesis tests. The FCR is defined as the expected proportion of parameters not covered by their CIs among

the selected parameters (Benjamini and Yekutieli, 2005). There are two major changes from simultaneity to the FCR: first, the proportion of non-coverage, instead of a single occurrence of non-coverage, is the target of interest; second, the proportion is with respect to the number of selected intervals instead of all intervals. The selection is usually done by an associated or independent multiple test procedure, such as an FDR-control procedure that gives a list of detected non-zero parameters for further interval estimation. The FCR then measures the proportion of individual coverage among the selected parameters. There are some concerns about the selection step. A selection bias could be introduced, and its possible negative effect is beyond the control of the FCR (Edwards, 2005). The interpretation of an FCR-control procedure is conditional confidence or coverage, which is quite different from the initial goal of interval estimation. This shift may not be favored from either a theoretical or an applied point of view (Shaffer, 2005; Westfall, 2005).

1.3 A brief review of relevant statistical procedures

Benjamini and Hochberg (1995) proposed the linear step-up procedure (referred to as the BH step-up procedure) for multiple tests. Let the ordered p-values from the multiple tests be p_(1) ≤ p_(2) ≤ ... ≤ p_(n), and let H_(1), H_(2), ..., H_(n) be the corresponding null hypotheses. Let k = max{i : p_(i) ≤ q i/n}. The step-up procedure rejects H_(1), ..., H_(k). For independent test statistics, or positively correlated test statistics of the true null hypotheses, the BH step-up procedure controls the FDR at the nominal level q conservatively. An adjustment of the critical value q can improve the performance

under the same nominal level by using the adjusted critical value q* = q/π_0, where π_0 is the proportion of true null hypotheses. In practice, a consistent point estimator π̂_0 can be used instead; one such estimator can be found in Benjamini and Hochberg (2000). The improved procedure is called the adaptive BH step-up procedure. There are many similar ideas for controlling the FDR by manipulating p-values, which can be seen as adjustments of classical p-values, e.g., Reiner et al. (2003); Lystiga (2003); Hu et al. (2005). The usual setting in large scale multiple testing uses a suitable summary statistic z_i for each null hypothesis H_i, i = 1, ..., n. For example, z_i could be the two-sided p-value from a permutation test or t-test. It is assumed that the z_i are i.i.d. from the mixture π_0 f_0 + (1 − π_0) f_1, where f_0 and f_1 are the conditional densities of z_i given, respectively, that the null or the alternative hypothesis is true. This setup was used, for example, in Benjamini and Hochberg (1995); Efron et al. (2001); Storey (2002, 2003); Cohen and Sackrowitz (2005). Efron et al. (2001) used inverse z-scores instead of p-values. By Bayes' theorem, the posterior probability of a null hypothesis, if defined from the Bayesian point of view, is then π_0 f_0(z)/[π_0 f_0(z) + (1 − π_0) f_1(z)]. Viewed as a function of z, the posterior probability of the null hypothesis is also called the local false discovery rate (local fdr) by Efron et al. (2001). The mixture density can be estimated without much trouble. When all the z_i's are independent, the null density f_0 is usually known: for example, when z_i is a p-value, f_0 is U(0, 1), and when z_i is an inverse z-score, f_0 is the standard normal. It is then enough to estimate the posterior probability of the hypothesis, that is, the local fdr. A sketch of the BH step-up procedure is given below.
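The BH step-up procedure and its adaptive variant are simple to state operationally. The sketch below is our own minimal transcription (the function name, the π_0 handling, and the example data are assumptions, not from the thesis): it rejects H_(1), ..., H_(k) with k = max{i : p_(i) ≤ (q/π_0) i/n}, and π_0 = 1 recovers the plain BH procedure.

```python
# Minimal sketch of the BH linear step-up procedure; pi0 = 1 gives the plain
# procedure, while passing an estimate pi0_hat gives the adaptive variant.
import numpy as np

def bh_stepup(pvals, q=0.05, pi0=1.0):
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)                           # indices sorting the p-values
    below = p[order] <= (q / pi0) * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0]) + 1        # k = max{i : p_(i) <= (q/pi0) i/n}
        reject[order[:k]] = True
    return reject

# Hypothetical example: two-sided p-values from normal test statistics.
from scipy.stats import norm
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(0, 1, 9000), rng.normal(3, 1, 1000)])
pvals = 2 * norm.sf(np.abs(z))
print("BH rejections:", int(bh_stepup(pvals, q=0.05).sum()))
```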

Efron (2004) has further suggested careful calibration of f_0 to account for possible correlations and other violations of the assumptions used in individual hypothesis tests. Another usual approach is to apply an ANOVA-like model in which the factors are the indices of the hypotheses and the conditions under comparison; hence it is a multiway design with at least two factors. For example, the ANOVA setting is used in Wu et al. (2003); Ishwaran and Rao (2003); Cho and Lee (2004). This type of approach is characterized by global modeling, unlike most of the p-value adjustment procedures, which treat the parameters of interest as if they arose from separate problems. In other words, a big model, probably quite sophisticated, is applied to all of the available data. Hypotheses are related inherently through the model framework, where complete independence is quite rare. Our work also adopts this idea. Approaches incorporating both the FDR and FNR are sometimes preferred to procedures controlling only the FDR. Genovese and Wasserman (2002) have shown that the popular BH step-up method cannot simultaneously control both criteria. Using p-values as the summary scores, Delongchamp et al. (2004) have developed a method that emphasizes controlling the FNR. Cohen and Sackrowitz (2005) have discussed optimal procedures controlling the total risk by assigning different loss functions to false discoveries and false nondiscoveries. Genovese and Wasserman (2004) have emphasized using the FDP and FNP rather than their classical expectations. In addition, it seems that the concepts of FDR and FDP are sometimes confused in practice. When the FDR is controlled by a procedure, it does not mean that the realized percentage of false discoveries among all discoveries is q, but rather that the classical expectation of that quantity is q. The percentage is a random variable

and is usually different from its classical expectation. In an extreme case when the FDP has a large variance, merely controlling the FDR will have dubious utility, because the actual FDP could be either much smaller or much bigger than the nominal level q. An extremely large FDP is not favorable at all. A full Bayesian approach to multiple tests can estimate the BFDP and the BFNP, i.e., the Bayesian versions of the FDP and FNP, through posterior inference. From the Bayesian perspective, the criteria BFDR and BFNR are based on posterior inference, and Bayesian procedures to control the BFDR and BFNR are easier to develop thanks to the convenience of the posterior argument. The crucial transition is from adjusting classical p-values to utilizing the (posterior) probability of hypotheses. Berger (1985) and Berger and Delampady (1987) have criticized the misleading role of p-values in hypothesis tests. Cohen and Sackrowitz (2005) and Muller et al. (2006) have discussed the idea of the BFDR from the decision theory point of view.

1.4 The major application areas of large scale inference

1.4.1 DNA microarray

According to Wikipedia, a DNA microarray is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip, forming an array for the purpose of expression profiling, i.e., monitoring expression levels for thousands of genes simultaneously. The original observed data from microarray experiments are usually processed before entering the second-stage statistical analysis. This step is called normalization.

Numerous sources of variation exist, such as array fabrication, sample preparation, hybridization, signal detection, and so on (Nguyen et al., 2002). Normalization can partially stabilize this variation. Efron et al. (2001) consider the normalization step a data reduction process. Nguyen et al. (2002) summarize three phases of microarray experiments where statisticians are needed: the design of the microarray experiment, the normalization process, and the analysis of the normalized data. Identification of differences in gene expression among biological conditions is one of the immediate goals, and it is the target of the current research. The task is to inspect hundreds or thousands of genes simultaneously and identify the differentially expressed ones based on the normalized data from microarray experiments.

1.4.2 Other applications

Following its introduction, astronomers quickly realized the importance of the FDR in source detection. Astronomical images are analyzed pixel by pixel to detect and measure the properties of objects such as a star or a galaxy. A threshold is defined such that concentrations of pixels above a specified level are indicative of real sources instead of noise, regardless of the wavelength at which an image has been made (Hopkins et al., 2002). Roughly speaking, the number of statistical tests equals the number of pixels in an image, so there are easily thousands of tests to be done even in a small image. FDR control provides more scientific justification than a rather arbitrary or empirical threshold in source detection.

Although many developments in the FDR literature are in the context of microarray analysis, the application of the FDR is clearly not limited to microarrays. Nowadays, with the trend of increasing dataset sizes, the potential of the FDR is promising. In a general online search, we have found that the concept has been used in variable selection (Ghosh et al., 2004), ecological studies (Garcia, 2003), gravitational wave surveys (Baggio and Prodi, 2005), and further areas in bioinformatics, such as quantitative trait loci mapping (Lee et al., 2002), detection of positively selected amino acid sites (Guidon et al., 2005), and more.

Chapter 2. Bayesian decision rules based on BFDR and BFDP

In this chapter, we investigate Bayesian decision rules to control the BFDR, assuming that the posterior probabilities of hypotheses are available.

2.1 Notations and bases in Bayesian hypothesis tests

For simplicity, throughout the paper we denote the null hypotheses with indicators H_1, ..., H_n; i.e., H_i = 1 means the ith null hypothesis is true, e.g., in microarray analysis the ith gene is not differentially expressed. Let D_i denote the decision made on the ith null hypothesis, where D_i = 0 means acceptance (nondiscovery, nondetection) and 1 means rejection (discovery, detection). D = (D_1, ..., D_n) is a random vector with finite support. Let d_i and d be realizations of the decisions, d = D(y), where y in lower case represents the observed data. In simultaneous inference it is possible that the entries of D(Y) are either stochastically dependent through the data Y or functionally dependent by the specific form of D(Y), or both. Given a decision D with Σ_i D_i > 0, its conditional FDR is

    E(V/R | D) = Σ_i P(H_i = 1 | D_i = 1) / Σ_i D_i.    (2.1)

From the Bayesian point of view, the BFDR is conditioned on all of the data Y. The decision vector D should be a function of Y. When the functional form does not have

randomness, it is called a nonrandomized decision; otherwise it is called a randomized decision. When the probability P(H_i = 1 | Y) is defined and estimable, we can easily estimate the BFDR given a decision vector D. To control the BFDR at the nominal level q, we simply need to find a decision D whose corresponding BFDR is less than q. We state the following propositions as the basis of the decision rules in this chapter. Proposition 2.1 indicates that for a nonrandomized decision rule the BFDR is the ratio of the sum of posterior probabilities of null hypotheses over the rejected indices to the total number of rejections. Similarly, the BFNR is the ratio of the sum of posterior probabilities of alternative hypotheses over the accepted indices to the total number of acceptances. Proposition 2.2 gives the analogous result for randomized decision rules.

Proposition 2.1. A nonrandomized rule D(Y) has

    BFDR = Σ_i D_i(Y) E(H_i | Y) / Σ_i D_i(Y)  if Σ_i D_i(Y) > 0,

and

    BFNR = Σ_i (1 − D_i(Y)) (1 − E(H_i | Y)) / Σ_i (1 − D_i(Y))  if Σ_i (1 − D_i(Y)) > 0.

Proof. The proposition is a direct result of Eqn (2.1). Notice that D(Y) has no randomness after conditioning on the data Y. The results are then easy to see. Also note that the posterior estimate of the mean number of false discoveries is E(V | Y) = Σ_i D_i E(H_i | Y), and the posterior estimate of the mean number of false nondiscoveries is E(T | Y) = Σ_i (1 − D_i)(1 − E(H_i | Y)). These two estimates do not require Σ_i D_i(Y) > 0 or Σ_i (1 − D_i(Y)) > 0.

Proposition 2.2. A randomized rule D(Y) has BFDR = E(Σ_i D_i H_i / Σ_i D_i | Y) if Σ_i D_i(Y) > 0, and BFNR = E(Σ_i (1 − D_i)(1 − H_i) / Σ_i (1 − D_i) | Y) if Σ_i (1 − D_i(Y)) > 0.

Proof. Again following Eqn (2.1).
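Proposition 2.1 translates directly into code. The helper below is our own sketch (the function name is an assumption); it computes the BFDR and BFNR of a nonrandomized decision vector from the posterior probabilities r_i = E(H_i | Y), however those were estimated.

```python
# Direct transcription of Proposition 2.1 for a nonrandomized rule.
import numpy as np

def bfdr_bfnr(D, r):
    """D: 0/1 decision vector; r: posterior null probabilities r_i = E(H_i | Y)."""
    D = np.asarray(D, dtype=float)
    r = np.asarray(r, dtype=float)
    R, W = D.sum(), (1 - D).sum()
    bfdr = (D * r).sum() / R if R > 0 else 0.0              # average r_i over rejections
    bfnr = ((1 - D) * (1 - r)).sum() / W if W > 0 else 0.0  # average 1 - r_i over acceptances
    return bfdr, bfnr
```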

2.2 Nonrandomized decision rules

We propose several nonrandomized decision rules based on Proposition 2.1. For simplicity of notation, let q denote the nominal BFDR level and let r_i = E(H_i | Y). The following decision rule is straightforward and is similar to the discussion of the local fdr in Efron et al. (2001).

Rule 2.1 (Control local fdr). Let D_i = 1 if r_i = E(H_i | Y) ≤ q and 0 otherwise. Then the BFDR is controlled at q and the BFNR is controlled at 1 − q.

By Proposition 2.1, it is easy to see that a decision based on Rule 2.1 has a BFDR less than or equal to q, and similarly it has a BFNR less than or equal to 1 − q. However, the actual BFDR is always smaller than the nominal level except in trivial cases. In other words, Rule 2.1 is not optimal in the sense of maximizing rejections subject to BFDR ≤ q. The next rule is the optimal nonrandomized rule: it attains the nominal BFDR level better and has more rejections than any other nonrandomized rule under the same or a smaller nominal level. We introduce some notation for ordering. Let r_(1) ≤ r_(2) ≤ ... ≤ r_(n) be the ordered posterior probabilities, and let H_(i) and D_(i) be the corresponding null hypotheses and decisions. Let q_i = Σ_{j=1}^{i} r_(j) / i be the BFDR when H_(1), ..., H_(i) are rejected.

Rule 2.2 (Optimal BFDR control). Let D_(i) = 1 for i ≤ R and 0 otherwise, where R = max{j : q_j ≤ q}. Then the BFDR is controlled at the nominal level q and the BFNR is controlled at 1 − r_(R).

In using Rule 2.2, the following observation says that the quantity q_i is nondecreasing in i. In conjunction with Proposition 2.1, a reasonable decision will always reject H_(1), H_(2), ..., H_(R), i.e., the R least probable hypotheses.

Obs 2.1. Let q_i = Σ_{j=1}^{i} r_(j) / i, i ≤ n. Then q_i is nondecreasing in i.

Proof.

    q_{i+1} − q_i = Σ_{j=1}^{i+1} r_(j) / (i+1) − Σ_{j=1}^{i} r_(j) / i
                  = [ i Σ_{j=1}^{i+1} r_(j) − (i+1) Σ_{j=1}^{i} r_(j) ] / [ i(i+1) ]
                  = [ i r_(i+1) − Σ_{j=1}^{i} r_(j) ] / [ i(i+1) ]    (2.2)
                  = Σ_{j=1}^{i} ( r_(i+1) − r_(j) ) / [ i(i+1) ]
                  ≥ 0.

Again, Rule 2.2 simply uses Proposition 2.1. By definition, it is obvious that any realization d of Rule 2.2 has a BFDR less than or equal to q. By optimality, any nonrandomized rule C with an actual BFDR level q_C ≤ q has no more detections, R_C ≤ R, where R is the number of rejections of the optimal decision by Rule 2.2. The following observation states the optimality property.

Obs 2.2. Rule 2.2 is the optimal decision in controlling the BFDR.

Proof. When R = 0, q also equals 0 by the definition of the BFDR, and any nonrandomized rule C with q_C ≤ 0 has to agree with the optimal decision D. When R > 0, we argue by contradiction. Suppose that for rule C there are R_C rejections and R_C > R, of which

V_C are false rejections, and q_C = E(V_C/R_C | Y). Then, by Proposition 2.1,

    0 ≥ q_C − q = Σ_{i∈C} r_i / R_C − q ≥ Σ_{i=1}^{R_C} r_(i) / R_C − q = q_{R_C} − q > 0,    (2.3)

where the last inequality holds because R_C > R and R = max{j : q_j ≤ q} imply q_{R_C} ≥ q_{R+1} > q. This contradiction completes the proof.

It can also be seen that the BFDR is actually the average posterior probability of the null hypotheses over the rejected indices. A problem with using the average is that some individual probabilities could be much larger than the average, i.e., the nominal level. Although Rule 2.2 has the optimal detection ability in theory, in practice it may lead to unacceptable rejections. Consider the case where n = 10, r_(10) = 1, and all other r_(i) are 0. Take the nominal level q = .1. Then the optimal decision is to reject all hypotheses with BFDR = q = .1; the absolutely sure null hypothesis H_(10) is "optimally" rejected. The absurdity occurs due to an inherent shortcoming of the original definition of the FDR, which has only a global concern over all of the decisions D_1, D_2, ..., D_n and lacks specific consideration of each individual D_i. A similar admissibility problem also appears for the step-up procedure; see Cohen and Sackrowitz (2005) and Muller et al. (2006) for more details. An adaptive rule based on Rule 2.2 uses a practical bound s: a hypothesis can be rejected only if r_i ≤ s. A natural choice is s = .5; it can be set either larger or smaller depending on the specific application. But an absurd case, such as r_i = .999, can be

excluded from the candidate list by the constraint. This is summarized as the following adaptive rule.

Rule 2.3 (Adaptive BFDR control). Let D_(i) = 1 for i ≤ min(R_D, S) and 0 otherwise, where R_D is the number of rejections that would be made by Rule 2.2, and S = max{j : r_(j) ≤ s}. Then the BFDR is controlled at the nominal level q and the BFNR is controlled at the nominal level 1 − r_(min(R_D, S)).

Exactly the same argument as in Proposition 2.1 shows that the BFNR of the decisions by Rules 2.2 and 2.3 is decreasing in R. Figure 2.1 shows the BFDR and BFNR versus the number of rejections by Rule 2.2. Obviously it is not always possible to control the BFDR and the BFNR at the same time for arbitrary nominal levels, unless the nominal level q for the BFDR and the nominal level q′ for the BFNR are well chosen. A toy example: let n = 10, r_1 = .1, and all other r_i = .5, and let q = q′ = .1. Then no nonrandomized decision rule can satisfy q and q′ together; but if we change q′ to .5, the decision with D_1 = 1 and all other D_i = 0 meets the requirements. In addition, the numerical value of the BFNR (and also of the FNR) is somewhat misleading. A not uncommon scenario in microarray studies is that only a small proportion of genes are differentially expressed, i.e., π_0 is big. Hence the FNR or BFNR will never be big, even if all null hypotheses are accepted. In Figure 2.1, it can be seen that the maximal BFNR is .1 and it decreases rather quickly. Instead of the BFNR, the number of type II errors, i.e., false nondiscoveries, is more informative. A code sketch of Rules 2.1-2.3 is given after Figure 2.1.

Fig. 2.1. Plot of BFDR and BFNR against rejections R, where the sorted H_(i), 1 ≤ i ≤ R, are rejected. Solid: BFDR; dashed: BFNR. 25,000 null hypotheses with 90% true. Notice that BFDR(R) → π̂_0 as R → n, BFNR(R) → 1 − π̂_0 as R → 0, and R·BFDR(R) + W·(1 − BFNR(R)) = n·π̂_0.
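Rules 2.1-2.3 require nothing beyond sorting the r_i. The sketch below uses our own function names and assumes r holds the posterior null probabilities; it relies on Obs 2.1, which makes the largest feasible cutoff in Rule 2.2 well defined.

```python
# Sketches of Rules 2.1-2.3. By Obs 2.1 the running average q_i of the sorted
# r_(i) is nondecreasing, so max{j : q_j <= q} is well defined.
import numpy as np

def rule_local_fdr(r, q):
    """Rule 2.1: reject exactly the hypotheses with r_i <= q."""
    return np.asarray(r) <= q

def rule_bfdr(r, q, s=None):
    """Rule 2.2 when s is None; the adaptive Rule 2.3 with a bound such as s = 0.5."""
    r = np.asarray(r, dtype=float)
    order = np.argsort(r)
    q_run = np.cumsum(r[order]) / np.arange(1, r.size + 1)  # q_1 <= q_2 <= ...
    feasible = np.nonzero(q_run <= q)[0]
    R = feasible[-1] + 1 if feasible.size else 0            # optimal rejection count
    if s is not None:
        R = min(R, int((r[order] <= s).sum()))              # never reject an r_(i) > s
    reject = np.zeros(r.size, dtype=bool)
    reject[order[:R]] = True
    return reject
```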

While the FDR and FNR, or their Bayesian counterparts the BFDR and the BFNR, cannot be controlled simultaneously, the total mistakes, i.e., the number of false discoveries plus false nondiscoveries, can be estimated. More generally, from the decision theory point of view, we can assign a loss L_1 to a false discovery and a loss L_2 to a false nondiscovery. The risk is then given by E(Σ_i [D_i H_i L_1 + (1 − D_i)(1 − H_i) L_2]). The choice of losses is rather arbitrary; Muller et al. (2006) discussed several special forms of loss functions. We would like to consider a decision rule taking the total mistakes into account. The idea of measuring total mistakes is similar to Ishwaran and Rao (2003) and is equivalent to letting L_1 and L_2 be the same constant. Therefore we have the following rule.

Rule 2.4 (Control total risk). Let D_(i) = 1 for i ≤ R, where R = arg min_j { Σ_{i=1}^{j} L_1 r_(i) + Σ_{i=j+1}^{n} L_2 (1 − r_(i)) }.

Loss functions are not necessarily constants. But when they are, Rule 2.4 essentially chooses a nominal level q for Rule 2.1, i.e., local fdr control. The following observation states the equivalence; a similar result appears in Muller et al. (2006).

Obs 2.3. Let c = L_1/L_2 and q = 1/(1 + c) = L_2/(L_1 + L_2). Then the decision based on Rule 2.4 using L_1 and L_2 is equivalent to the decision based on Rule 2.1 using the nominal level q.

Proof. Suppose that the decision d by Rule 2.4 rejects a hypothesis H_i with r_i > q. Consider another decision d′ that agrees with d except for accepting H_i. The difference in risk between d and d′ is

    r_i L_1 − (1 − r_i) L_2 > L_1 L_2 / (L_1 + L_2) − L_1 L_2 / (L_1 + L_2) = 0.

Namely, there is always a decrease in risk by accepting an H_i with r_i > q. The equivalence is checked numerically in the sketch below.
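The check below is a small numerical confirmation of Obs 2.3 on fabricated posterior probabilities (an assumption purely for illustration): minimizing the total risk with constant losses (L_1, L_2) rejects exactly as many hypotheses as Rule 2.1 at q = L_2/(L_1 + L_2).

```python
# Numerical check of Obs 2.3 with constant losses L1 and L2.
import numpy as np

rng = np.random.default_rng(2)
r = np.sort(rng.beta(0.5, 2.0, 5000))           # fabricated sorted posterior null probs
L1, L2 = 4.0, 1.0
# risk(j) = sum_{i<=j} L1 r_(i) + sum_{i>j} L2 (1 - r_(i)), for j = 0, ..., n
fd_loss = np.concatenate([[0.0], np.cumsum(L1 * r)])
fn_loss = np.concatenate([np.cumsum((L2 * (1 - r))[::-1])[::-1], [0.0]])
j_star = int(np.argmin(fd_loss + fn_loss))      # rejection count under Rule 2.4
print(j_star, int((r <= L2 / (L1 + L2)).sum())) # the two counts agree
```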

Since the nominal level of the FDR or BFDR has a different meaning than the nominal significance level in a classical statistical test, the choice of the FDR level is somewhat arbitrary and lacks justification. There is no automatic equivalence between the usual .05 significance level and a nominal FDR level q = .05. However, Rule 2.4 can help justify a nominal level q. It is also natural to set the loss L_1 as a monotone function of the effect size, i.e., the differential expression, such as a linear or quadratic loss. Given a loss function, it is not hard to calculate the risk over the finite (and actually much smaller) support of all possible decisions, based on the posterior distribution. In practice, the difficulty lies in the choice of the specific form of the loss functions.

2.3 Further discussions on the rules

This section compares the decision rules proposed in the previous section. Rule 2.1 (control local fdr) makes the most straightforward decision based on the data: a null hypothesis H_i will be rejected if it is unlikely to be true given the data, i.e., if r_i is small enough. Given the data, the decision about H_i is independent of all other decisions. It suffers little from ill-posed scenarios such as the one discussed in the last section, where both Rule 2.2 (optimal BFDR control) and the step-up procedure have trouble. However, from the perspective of the BFDR, Rule 2.1 is often too conservative and suboptimal in the sense of detection capability. Under Rule 2.2 it is possible that even a highly probable null hypothesis is rejected if the average r_i over the rejected indices is small.

Sacrificing a little of the optimality, the adaptive Rule 2.3 is more sensible in the extreme cases. A comprehensive method combining false discoveries and false nondiscoveries is appealing; however, the practical difficulty is the choice of the loss function. In the range of constant losses, it is usually believed that L_2 < L_1, i.e., a type I error is more severe than a type II error. It can also be seen that only the ratio L_2/L_1 influences the decision procedure. Unfortunately, the risk based decision rule is not robust to this ratio. We will stick to constant and equal losses, meant to evaluate the total mistakes of both types, in the rest of the thesis. Non-constant loss functions could be proportional to the effect size, denoted by µ_i; then the loss for a type I error, L_1(µ), should be a decreasing function of µ, and the loss for a type II error, L_2(µ), should be an increasing function of µ. The total risk is estimated by

    R̂ = Σ_i [ D_i L_1(µ̂_i) + (1 − D_i) L_2(µ̂_i) ].    (2.4)

None of the decision rules we discussed requires independence among the H_i's. This convenience is due to the definition of the BFDR, which is an expectation of a sum of random variables and is therefore unaffected by a possible dependence structure among the H_i's.

2.4 A randomized decision rule

In terms of controlling the BFDR, a randomized decision rule can have no fewer rejections, with probability 1, than the optimal nonrandomized rule under the same nominal level q. We restrict the discussion of randomized rules to those that can definitely reject

more than the optimal nonrandomized rule, i.e., Rule 2.2 (optimal BFDR control), under the same nominal level. Within this range, the improvement by a randomized rule results from the fact that the actual BFDR achieved by Rule 2.2 is seldom equal to the nominal level q but is almost always slightly smaller. The following randomized rule beats the optimal nonrandomized rule in terms of detection capability.

Rule 2.5. Suppose k hypotheses are rejected by Rule 2.2 under the nominal level q, and let q_k be the actual BFDR level. If q_k < q, then reject H_(k+1) with probability (q − q_k)/(q_{k+1} − q_k), where q_{k+1} is the actual BFDR if the additional hypothesis is rejected.

Before discussing the meaning of the randomized rule, we lay out the following observation.

Obs 2.4. Under the constraint that if H_(i) is rejected then H_(j) is also rejected for all j < i, the decision made under Rule 2.5 has the following properties: i) P(R > k) is maximized; ii) if there exists a convex function f(x) passing through all the points (i, q_i) in the plane, then E(R | Y) is also maximized.

Proof. For i > k, let π_i be the probability that the additional rejections are exactly H_(k+1), ..., H_(i), and let π_k = 1 − Σ_{i=k+1}^{n} π_i. In order to control the BFDR at q, taking conditional expectations on R = i gives Σ_{i=k}^{n} π_i q_i = q. A simple transformation gives Σ_{i=k+1}^{n} π_i q_i = q − (1 − Σ_{i=k+1}^{n} π_i) q_k. Let T = Σ_{i=k+1}^{n} π_i. Then q_{k+1} T ≤ q − q_k + q_k T, hence T ≤ (q − q_k)/(q_{k+1} − q_k) < 1. Notice that Rule 2.5 takes π_{k+1} equal to this upper bound of T, so i) is shown.

Next, since E(R | Y) = Σ_{i=k}^{n} π_i i = Σ_{i=k+1}^{n} π_i i + (1 − T) k, let S denote the expected number of rejections by Rule 2.5. Then S − E(R | Y) = (q − q_k)/(q_{k+1} − q_k) − Σ_{i=k+1}^{n} π_i i + T k, which is a linear

combination of the π_i. Recall that π is a probability vector. In conjunction with the linear constraint Σ_{i=k+1}^{n} π_i q_i + (1 − T) q_k = q, the vector (π_i, i > k) is enclosed within a hyper-triangle in the multidimensional space. The extremum can be attained only at a vertex, i.e., where exactly one π_i is nonzero or where all of them are 0; the latter case returns to the nonrandomized rule. Hence S − E(R | Y) ≥ (q − q_k)/(q_{k+1} − q_k) − π_i (i − k) for some i > k, where at such a vertex π_i = (q − q_k)/(q_i − q_k). It follows that the right-hand side is proportional to 1/(q_{k+1} − q_k) − (i − k)/(q_i − q_k), which by convexity is always greater than or equal to 0. Hence ii) follows.

In our experience, however, the points (i, q_i) do not necessarily form a convex function but rather may have several inflection points; hence ii) is not always true globally. Note that convexity is not rare when q_i is small (see Figure 2.1 on (0, 100)). Moreover, it is meaningless in practice to consider a randomized rule that rejects much more than a nonrandomized rule. Therefore, within a moderate neighborhood of k, Obs 2.4 still holds. Probably the only practical use of the randomized Rule 2.5 is this: provided that q_k is below the nominal level q and that q_{k+1}, and perhaps the next few values, are very close to q, it is rather safe to reject the extra hypotheses under the same nominal level q. The randomized rule, which seems not very useful by itself, conversely demonstrates the optimality of Rule 2.2: even with randomization, the best reasonable improvement in the sense of Obs 2.4 is about one extra rejection. Consequently, Rule 2.2 is nearly optimal within this slightly larger class of rules. In summary, the simplicity of Rule 2.2 means that a Bayesian has almost no trouble optimally controlling the BFDR if r_i, namely P(H_i | Y), is available. However, providing a satisfying estimate of the posterior probability, especially for a precise null hypothesis, is far from trivial.
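Operationally, Rule 2.5 is a single Bernoulli draw on top of Rule 2.2. The sketch below is ours and assumes q_run holds the nondecreasing running BFDR values q_i computed from the sorted r_(i), as in the Rule 2.2 sketch earlier.

```python
# Rule 2.5: after the k rejections of Rule 2.2, reject H_(k+1) with
# probability (q - q_k) / (q_{k+1} - q_k).
import numpy as np

def randomized_extra_rejection(q_run, q, k, rng):
    if k >= len(q_run):                  # nothing left to reject
        return False
    q_k = q_run[k - 1] if k > 0 else 0.0
    q_k1 = q_run[k]
    p = (q - q_k) / (q_k1 - q_k)         # in [0, 1) since q_k <= q < q_{k+1}
    return bool(rng.random() < p)
```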

2.5 Bayes false discovery percentage (BFDP)

As the integrand of the FDR or BFDR, the random variable FDP is also of research interest. Both the FDP and the FNP, i.e., T/W, are discrete random variables with finite support. Genovese and Wasserman (2004) have emphasized using the FDP and FNP rather than their classical expectations. In addition, it seems that the concepts of the FDR and the FDP are sometimes confused in practice. When the FDR is controlled by a procedure, it does not mean that the realized percentage of false discoveries among all discoveries is q, but rather that the classical expectation of that quantity is q. The percentage is a random variable and is usually different from its classical expectation. In an extreme case when the FDP has a large variance, merely controlling the FDR will have dubious utility, because the actual FDP could be either much smaller or much bigger than the nominal level q. An extremely large FDP is not favorable at all. A full Bayesian approach to multiple tests can estimate the BFDP and the BFNP, i.e., the Bayesian versions of the FDP and FNP, through posterior inference. Let H_i^(k), k = 1, ..., N, denote N independent samples of the hypothesis indicators drawn from the joint posterior distribution. Given a nonrandomized decision rule D that fixes the rejections R, we have N independent samples V^(k) = Σ_i D_i H_i^(k), so that V^(k)/R are draws from the distribution of the BFDP. It seems that the mode is a better summary than the posterior expectation, though the posterior expectation is easier to evaluate. If the BFDP is skewed to the right, by using the mode of the BFDP instead of the BFDR, we can detect more hypotheses with a comparable, if not better, control of false detections. On

the other hand, when the BFDP is skewed to the left, using the mode of the BFDP results in fewer detections, which is safer in the sense of false discoveries. Figure 2.2 shows a simulated example. By Rule 2.2 (optimal BFDR control) with the nominal level .005, 120 genes are detected. However, the mode of the BFDP is at 0 and the distribution is skewed to the right. When R = 200, the estimated BFDR is .0086, greater than the nominal level .005, but the mode shown in the right graph is exactly at .005. The observed FDPs in these two cases are both close to the nominal level; the mode adjustment has an even smaller FDP and 58.7% more detections. In general, the mode adjustment involves little loss of control over the BFDR, and can detect more differential genes when the posterior distribution is skewed to the right, which is quite common when the nominal level q is small. The posterior distribution can also provide a credible interval for the FDP, which usually includes the nominal level q. The interval estimate is also informative: if the credible interval is wide, one may want to use a smaller nominal level q, and vice versa.
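Estimating the BFDP distribution from MCMC output takes one line per posterior draw. The sketch below assumes H_samples is an (N, n) array of posterior draws of the null indicators (hypothetical output of the models developed in Chapter 3) and D a fixed 0/1 decision vector; the mode, histogram, and credible intervals discussed above are then read off the returned draws.

```python
# Draws from f(V/R | Y) for a fixed nonrandomized decision D.
import numpy as np

def bfdp_samples(H_samples, D):
    H = np.asarray(H_samples)            # shape (N, n): posterior indicator draws
    D = np.asarray(D, dtype=bool)
    R = int(D.sum())
    if R == 0:
        return np.zeros(H.shape[0])      # BFDP degenerate at 0 when R = 0
    V = H[:, D].sum(axis=1)              # V^(k) = sum_i D_i H_i^(k)
    return V / R
```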

Fig. 2.2. Histogram of posterior samples from the BFDP, i.e., f(V/R | Y), in a simulation study. The left histogram used the optimal rule applied with q = .005, and the observed FDP is … The right one, tuning the mode at .005, has 200 rejections, and the observed FDP is .005.

Chapter 3. A Bayes model for 2-condition comparison

In this chapter we introduce Bayesian models for 2-condition comparison. This addresses one of the main objectives of microarray studies: detecting differentially expressed genes between two biological conditions, e.g., cancer versus normal cells. We adopt the two-way ANOVA modeling strategy. The first factor, i.e., biological condition, has two levels, and the second factor, i.e., gene, has n levels. In Affymetrix data, a popular type of microarray nowadays, the sample size on each gene level is usually the same within each condition; hence the design is usually either balanced or proportional. In the older cDNA microarray type, the design is sometimes non-orthogonal due to missing data. Although we focus our study on the more recent Affymetrix data, our Bayesian models can easily be extended to the non-orthogonal situation, because orthogonality is not used in posterior inference and the only change is the calculation of the sufficient statistics.

3.1 P-value paradox and the probability of a null hypothesis

Most procedures controlling the classical FDR deal with p-values, defined as the tail area probability of a test statistic under the null hypothesis. Decision rules aimed at the BFDR and BFNR are based on the posterior probability r_i = P(H_i = 1 | Y) = E(H_i | Y), which in general does not equal the p-value for the same null hypothesis. The lack of correspondence between the p-value and the posterior probability of the null

hypothesis has a long history of debate. The discrepancy is widely known as Jeffreys' or Lindley's paradox (Berger, 1985). Berger and Delampady (1987) have criticized the misleading role of p-values in hypothesis tests. In addition, a p-value is often mistakenly interpreted in practice as an indirect measure of the posterior probability r_i. Although both p-values and posterior probabilities decrease as the test statistic moves toward the tails, p-values usually shrink much faster than the r_i; see Table 4.2 in Berger (1985) for one of the simplest examples. Sellke et al. (2001) have discussed calibrating p-values to estimate a lower bound on the posterior probability. Gomez-Villegas et al. (2002) have shown that p-values, when restricted to a small neighborhood of the origin, such as (0, .05), behave similarly to r_i when a very small prior probability is assigned to the null hypothesis. Instead of estimating the lower bound of r_i or adopting a prior highly disfavorable toward the null hypothesis, we would rather estimate the probability more accurately based on a more reasonable prior. In testing a precise null hypothesis, the point null must be given a nonzero prior probability. This is usually done through a mixture structure on the data, f(Y) = π_0 f_0(Y) + π_1 f_1(Y), where f_0 and f_1 are the marginal prior predictive densities. Efron (2004) has discussed the empirical choice of f_0. For large scale data, we prefer a hierarchical model for π_0, f_0, and f_1 to address possible dependence in the data and to avoid overfitting. Such a mixture is not the only approach to hypothesis tests. From our point of view, the key requirement of a Bayesian model is that the parameter space Θ has a nontrivial partition {H_i(Θ) = 0, 1} for each i = 1, ..., n, where n is the number of hypotheses. By nontrivial we mean that both {H_i = 1} and {H_i = 0} have nonzero probability a priori and a posteriori. Then the decision rules discussed in the previous chapter can be applied.

3.2 Introduction to Bayesian finite mixture model

All of the models proposed in this chapter are essentially Bayesian finite mixture models. In this section we briefly review some key concepts. Suppose that there are M distributions, m = 1, ..., M, where the mth distribution f_m(y | θ_m) depends on a parameter θ_m. Each distribution is called a component. Suppose that it is not known which component underlies each observation. Then the distribution of an observation y is called a (finite) mixture model. Besides the components, the other indispensable element of a mixture model is the mixing proportion, denoted by λ_m (or π_m), which is the proportion of the population coming from the mth component, with Σ_{m=1}^{M} λ_m = 1. Usually the components f_m are assumed to come from the same parametric family, e.g., two normal distributions with different means. The mixture distribution is

    f(y | θ, λ) = Σ_{m=1}^{M} λ_m f_m(y | θ_m).    (3.1)

Moreover, an auxiliary indicator ζ_im is introduced with

    ζ_im = 1 if the ith observation is taken from the mth component, and ζ_im = 0 otherwise.    (3.2)

Given λ, the unobserved indicators ζ_i = (ζ_i1, ..., ζ_iM) have a Multinomial(1; λ) distribution. The distribution of the ith observation can be written as

    f(y_i | ζ_i, θ) = Π_{m=1}^{M} f(y_i | θ_m)^{ζ_im},    (3.3)

and the joint distribution of the data and indicators is

    f(y_i, ζ_i | θ, λ) = f(y_i | ζ_i, θ) f(ζ_i | λ) = Π_{m=1}^{M} ( λ_m f(y_i | θ_m) )^{ζ_im}.    (3.4)

Gelman et al. (2004) suggest a general two-step strategy: first, given the indicators, sample θ_m within each individual component; second, given all the θ_m, sample the indicators from their multinomial fully conditional distribution. The second step assumes that for each component there is an available Gibbs sampler to sample from the posterior f_m(θ_m | y); a sketch of the strategy is given below. However, there are still computational difficulties with finite mixture models. The most prominent ones are label switching, local extrema of the total likelihood, and the choice of the number of components M. Based on the simulated posterior joint distribution of all parameters, we can estimate the parameters θ_m of each component. The indicators, which contain the useful information on how likely an observation is to have been drawn from each component, are also estimable by posterior simulation. We will utilize this important information to estimate the posterior probability of a hypothesis, i.e., r_i = P(H_i = 1 | Y).
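As a concrete instance of the two-step strategy, the sketch below uses a toy setup of our own: a K-component normal location mixture with unit variance, N(0, 10²) priors on the component means, and a flat Dirichlet prior on λ. It alternates between sampling the indicators and the component parameters.

```python
# Two-step Gibbs sketch for a K-component N(theta_m, 1) mixture.
import numpy as np

def gibbs_mixture(y, K=2, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = y.size
    theta = rng.normal(0.0, 1.0, K)              # component means
    lam = np.full(K, 1.0 / K)                    # mixing proportions
    for _ in range(iters):
        # Step 2: sample indicators from their multinomial full conditional.
        logp = np.log(lam) - 0.5 * (y[:, None] - theta[None, :]) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (p.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
        # Step 1: given the indicators, sample each component's mean.
        for m in range(K):
            ym = y[z == m]
            prec = ym.size + 1.0 / 100.0         # data precision + N(0, 10^2) prior precision
            theta[m] = rng.normal(ym.sum() / prec, prec ** -0.5)
        lam = rng.dirichlet(1.0 + np.bincount(z, minlength=K))
    return theta, lam, z
```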

3.3 Model for treatment/control comparison

When one condition is identified as the control and the other as the treatment, we can model the relative size of the difference in means. Let Y_{1,i,j(1)} correspond to the jth observation in the ith level, i.e., gene, of the control, and let Y_{2,i,j(2)} be the jth observation of the treatment. Notice that the index j for replication is nested within the index for condition; this is due to the nature of the experimental design of microarray studies. Then we have the following normal response model:

    Y_{1,i,j(1)} | (µ_i, σ²) ~ N(µ_i, σ²),  i = 1, ..., n,  j(1) = 1, ..., k_1,
    Y_{2,i,j(2)} | (µ_i, α_i, ζ_i, σ²) ~ N(µ_i + ζ_i α_i, σ²),  i = 1, ..., n,  j(2) = 1, ..., k_2,    (3.5)

where µ_i is the mean of the control condition at the ith level and ζ_i α_i models the size of the corresponding treatment effect. In total there are n levels. The sample size is assumed equal within each condition only for convenience; the assumption can be dropped without loss of generality. α_i has a continuous distribution for the differential effect size. ζ_i is a binary indicator, where ζ_i = 0 corresponds to the ith null hypothesis H_i and ζ_i = 1 corresponds to the ith alternative hypothesis H_i^c. The model is similar to the mixture model in the schizophrenia example in Rubin and Wu (1997) and also in Chapter 18 of Gelman et al. (2004). We complete the hierarchical structure by adding the priors in Eqn (3.6). The priors of σ², α_i, and µ_i are inverse gamma, normal, and normal, respectively. These choices of prior are partially conjugate, which is computationally convenient for programming the Gibbs sampler. By partial conjugacy we mean that the fully conditional distribution of a parameter or parameter vector, given all other parameters and the data, is conjugate to the prior. The Gibbs sampler only requires the fully conditional distributions to simulate the posterior distribution. Hence, using partially conjugate priors we can easily program

the Gibbs sampler; we only need to tune the parameters in the priors if necessary.

    σ² ~ IΓ(a_1, a_2), where a_1 is the shape and a_2 the scale parameter, with E σ² = a_2/(a_1 − 1) for a_1 > 1,
    α_i ~ i.i.d. N(0, τ²),
    τ² ~ IΓ(b_1, b_2),
    λ = (1 − λ_1, λ_1) ~ Dirichlet(α_0),  α_0 = (α_01, α_02),
    ζ_i ~ i.i.d. Bernoulli(λ_1),
    µ_i ~ N(µ_0, ν²),    (3.6)

where σ², the α_i, λ, and the ζ_i are independent in the prior. α_i has a further hierarchical level through its variance τ², which is also partially conjugate. The partial conjugacy for ζ_i is completed by the Bernoulli distribution, and the hyperprior on the mixing proportion λ is a Dirichlet, which is also partially conjugate. Model (3.5) with the priors in (3.6) is a Bayesian finite mixture model with 2 components, where ζ_i is the indicator and λ is the mixing proportion. Although the indicator ζ_i is an auxiliary variable in a mixture model, it is indispensable in our model. From the above discussion, it is straightforward to see that the ith null hypothesis is H_i = I(ζ_i = 0); hence the posterior probability r_i = E(H_i | Y) = P(ζ_i = 0 | Y). In addition, the mixing proportion λ carries useful information: the posterior distribution of λ estimates the proportion of true null hypotheses.

The constants in the priors are all known before the analysis. They can be empirical estimates, but nearly flat, proper priors are preferred. For example, for normalized Affymetrix data, a_1 = b_1 = 2.1, a_2 = b_2 = 5, ν² = 100, and α_0 = (9, 1) give a proper and nearly flat prior. Such a choice of prior is based on robust multi-array average (RMA) normalized data with a range roughly in (0, 20), and the variances in the priors are deliberately set larger to achieve flatness. The prior on λ reflects the prior impression that about 90% of the null hypotheses are true, a usual scenario in many microarray studies. Combining the priors with the model, we have the joint likelihood

    f(Y, µ, α, σ², ζ, τ², λ) = f(Y | µ, α, ζ, σ²) f(ζ | λ) f(σ²) f(µ) f(α | τ²) f(τ²) f(λ).    (3.7)

The closed form of Eqn (3.7) is available but complicated. Due to the partially conjugate family, all fully conditional distributions have regular forms; they are summarized in Eqn (3.8) to Eqn (3.13) below. It is then easy to program

the Gibbs sampler.

    σ² | · ~ IΓ( n(k_1 + k_2)/2 + a_1,  (1/2) Σ_i Σ_j (Y_{1ij} − µ_i)² + (1/2) Σ_i Σ_j (Y_{2ij} − µ_i − ζ_i α_i)² + a_2 ),    (3.8)

    µ_i | · ~ N( [σ^{-2} Σ_{j=1}^{k_1} Y_{1ij} + σ^{-2} Σ_{j=1}^{k_2} (Y_{2ij} − ζ_i α_i) + ν^{-2} µ_0] / [(k_1 + k_2) σ^{-2} + ν^{-2}],  1 / [(k_1 + k_2) σ^{-2} + ν^{-2}] ),    (3.9)

    α_i | · ~ N( ζ_i σ^{-2} Σ_{j=1}^{k_2} (Y_{2ij} − µ_i) / [k_2 ζ_i σ^{-2} + τ^{-2}],  1 / [k_2 ζ_i σ^{-2} + τ^{-2}] );  notice that when ζ_i = 0 this is just N(0, τ²),    (3.10)

    τ² | · ~ IΓ( n/2 + b_1,  (1/2) Σ_{i=1}^{n} α_i² + b_2 ),    (3.11)

    λ | · ~ Dirichlet( Σ_i I{ζ_i = 0} + α_01,  Σ_i I{ζ_i = 1} + α_02 ),    (3.12)

    ζ_i | · ~ Bernoulli(p_i),  where  p_i / (1 − p_i) = exp{ log λ_1 − log(1 − λ_1) + σ^{-2} α_i Σ_{j=1}^{k_2} (Y_{2ij} − µ_i) − k_2 α_i² / (2σ²) },    (3.13)

where Eqn (3.13) is used to sample the indicators and all the others are used to sample the component parameters and mixing proportions. From the testing perspective, posterior inference should focus on the probability P(ζ_i = 0 | Y). The effect size is estimated by f(α_i ζ_i | Y), and f(λ | Y) estimates the mixing proportion, i.e., the percentage of true null hypotheses, which is also useful information. A sketch of one Gibbs sweep implementing these updates is given below.
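To make the updates concrete, here is a sketch of one Gibbs sweep transcribing (3.8)-(3.13). Y1 (n × k_1) and Y2 (n × k_2) are assumed to hold the control and treatment expressions; the function name and the packing of the state and hyperparameters are our own conventions, not part of the thesis.

```python
# One Gibbs sweep for model (3.5)-(3.6); IG(a, b) is drawn as 1/Gamma(a, scale=1/b).
import numpy as np

def gibbs_sweep(Y1, Y2, state, hyper, rng):
    mu, alpha, zeta, sig2, tau2, lam1 = state
    a1, a2, b1, b2, mu0, nu2, a01, a02 = hyper
    n, k1 = Y1.shape
    k2 = Y2.shape[1]
    # (3.8) sigma^2 | rest
    rss = ((Y1 - mu[:, None]) ** 2).sum() + ((Y2 - (mu + zeta * alpha)[:, None]) ** 2).sum()
    sig2 = 1.0 / rng.gamma(n * (k1 + k2) / 2 + a1, 1.0 / (rss / 2 + a2))
    # (3.9) mu_i | rest
    prec = (k1 + k2) / sig2 + 1.0 / nu2
    mean = (Y1.sum(1) / sig2 + (Y2 - (zeta * alpha)[:, None]).sum(1) / sig2 + mu0 / nu2) / prec
    mu = rng.normal(mean, prec ** -0.5)
    # (3.10) alpha_i | rest; reduces to the N(0, tau^2) prior when zeta_i = 0
    prec_a = k2 * zeta / sig2 + 1.0 / tau2
    alpha = rng.normal(zeta * (Y2 - mu[:, None]).sum(1) / sig2 / prec_a, prec_a ** -0.5)
    # (3.11) tau^2 | rest
    tau2 = 1.0 / rng.gamma(n / 2 + b1, 1.0 / ((alpha ** 2).sum() / 2 + b2))
    # (3.12) lambda | rest
    lam1 = rng.dirichlet([(zeta == 0).sum() + a01, (zeta == 1).sum() + a02])[1]
    # (3.13) zeta_i | rest, via the posterior log odds of zeta_i = 1
    logodds = (np.log(lam1) - np.log(1 - lam1)
               + alpha * (Y2 - mu[:, None]).sum(1) / sig2
               - k2 * alpha ** 2 / (2 * sig2))
    zeta = (rng.random(n) < 1.0 / (1.0 + np.exp(-logodds))).astype(float)
    return mu, alpha, zeta, sig2, tau2, lam1
```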

Notice that the null hypotheses H_i = I(ζ_i = 0) are correlated both in the prior and in the posterior.

3.4 Exchangeable model on 2-condition comparison

3.4.1 The exchangeable model

When there is no clearly defined control condition, we prefer a model that is exchangeable between the two experimental conditions being compared: the interpretation of the parameters that model the absolute and differential expression levels is invariant to exchanging the order of the experimental conditions. A small change to model (3.5) makes it exchangeable. As before, let Y_{1,i,j(1)} be the jth observation on the ith gene for the first condition, and similarly for Y_{2,i,j(2)}. Furthermore, we want more flexibility in the distribution under the alternative hypotheses, so that the model is more robust. The model is

    Y_{1,i,j(1)} | Θ ~ N(µ_i − Σ_{k=1}^{K} ζ_ik α_ik, σ²),  i = 1, ..., n,  j(1) = 1, ..., k_1,
    Y_{2,i,j(2)} | Θ ~ N(µ_i + Σ_{k=1}^{K} ζ_ik α_ik, σ²),  i = 1, ..., n,  j(2) = 1, ..., k_2,    (3.14)

where Θ represents all parameters in the model, the subscript i stands for genes, the subscript j(l) stands for the jth array nested in the lth condition, n is the total number of genes, and k_1 and k_2 are the numbers of arrays for the first and second conditions. For example, Y_{1,2,3(1)} represents the expression of the second gene taken from the third array of the first condition. µ_i and the α_ik are continuous parameters modeling the sizes of the absolute and

μ_i and α_{ik} are continuous parameters modeling the absolute and differential expression levels, and the ζ_{ik} are binary parameters. For fixed i, the case where all ζ_{ik} = 0 corresponds to the ith null hypothesis, i.e., H_i = 1 is parametrized as no location change between the two conditions. Under the ith alternative hypothesis, some ζ_{ik} is nonzero.

Again, model (3.14) is essentially a mixture model with the indicators, i.e., the labels, used explicitly as parameters. It has been argued that the ζ should be treated not as parameters but as missing observations; the main concern probably lies in the fact that the dimension of the indicators is too large for asymptotic properties to hold (Marin et al., to appear). We are not similarly concerned here, because any model on microarray data already demands O(n) parameters even without indicators.

We assign proper priors to all of the parameters in model (3.14). Using IΓ(α, β) to denote the inverse gamma distribution with shape α and scale β,

    σ^2 ~ IΓ(a_1, a_2),
    α_{ik} ~ N(θ_k, τ_k^2),  k = 1 ... K,  i = 1 ... n,
    τ_k^2 ~ IΓ(b_{k1}, b_{k2}),  k = 1 ... K,
    θ_k ~ N(t_k, ν_1^2),  k = 1 ... K,                                                      (3.15)
    λ ~ Dirichlet(α_0),  α_0 = (α_{00}, α_{01}, ..., α_{0K}),
    (ζ_{i0}, ζ_{i1}, ..., ζ_{iK}) ~ Multinomial(1, λ),  i = 1 ... n,
    μ_i ~ N(μ_0, ν_2^2).

For simplicity we introduce an unused parameter ζ_{i0} to complete a multinomial distribution for ζ_i = (ζ_{i0}, ζ_{i1}, ..., ζ_{iK}). In addition, for the nonzero differential expression part, the K-component mixture (3.16) can model a complicated population distribution, similar to a kernel estimator built from normal mixture densities (Marron and Wand, 1992):

    α_i = Σ_{k=1}^K ζ_{ik} α_{ik} ~ Σ_{k=1}^K λ_k N(θ_k, τ_k^2).                             (3.16)

However, we would like to use few components and target each component at a specific group of genes. The following special cases are the most interesting for the K-component mixture on the alternative hypotheses: significant genes have differential effects skewed about the origin (asymmetry); significant genes cluster in their differential effects (clusters); significant genes have highly extreme differential effects (heavy tails). While asymmetry and heavy tails may need K = 2 components, clusters may require as many components as the number of clusters. Otherwise, a K = 1 model should be chosen for computational efficiency. Unless there are more than two apparent clusters, K = 2 is usually complex enough to model (possibly asymmetric and heavy-tailed) differential expression. A quick view of the raw differential expressions Z_{2i} = Ȳ_{2,i,·} − Ȳ_{1,i,·} versus genes can be helpful for choosing K, as the sketch below illustrates.
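A histogram of the Z_{2i} makes the shapes in question visible; the simulated data in the sketch below merely stands in for a real, normalized dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Y1, Y2: n-by-k1 and n-by-k2 arrays of normalized expressions; simulated here
rng = np.random.default_rng(1)
Y1 = rng.normal(7.0, 1.0, size=(25000, 5))
Y2 = Y1 + rng.normal(0.0, 1.0, size=Y1.shape)
Y2[:1250] += 1.5                         # a one-sided block of differential genes

Z2 = Y2.mean(axis=1) - Y1.mean(axis=1)   # raw differential expression per gene
plt.hist(Z2, bins=200, density=True)
plt.xlabel("Z_2i = mean(Y2_i) - mean(Y1_i)")
plt.ylabel("density")
plt.show()   # a secondary bump or a long one-sided tail suggests K = 2
```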

The priors of σ^2, α_{ik}, and μ_i are inverse gamma, normal, and normal, respectively. α_{ik} has a further hierarchical level in its variance τ_k^2, which also follows an inverse gamma distribution. The hierarchical structure on ζ_i is completed by a Multinomial(1, λ) distribution, and the hyperprior on the mixing probability λ is a Dirichlet. The choice of priors is semi-conjugate so that the Gibbs sampler can be programmed easily.

The key feature of the Bayesian mixture model is the use of the indicators, which provide the posterior probability of a null hypothesis, E(H_i | Y) = P(ζ_{i0} = 1 | Y). The joint posterior distribution of all ζ and the other parameters is simulated using the Markov chain Monte Carlo (MCMC) method.

In testing a precise null hypothesis, the choice of prior is more crucial than in other Bayesian problems. The mixture structure π_0 f_0 + π_1 f_1 essentially excludes a noninformative prior. There is always a positive probability that a component is empty, i.e., that no observation carries the corresponding label; when this happens, the Gibbs sampler has to sample from an improper density. Moreover, an extremely flat but still proper prior makes the marginal f_1 extremely flat as well, which pushes P(H_i = 1 | Y) arbitrarily close to 1 and hence renders it meaningless. The choice of priors should be made in light of the nature of microarray data and the preprocessing or normalization steps applied before modeling. For example, many normalization methods give differential expression levels within (−16, 16). These bounds give a rough but helpful guide: choose values that put most of the prior mass within this range while keeping the prior as flat as possible.

3.4.2 Markov chain Monte Carlo (MCMC) and posterior inference

Due to conjugacy, all fully conditional densities in model (3.14) have regular forms, so the Gibbs sampler is easy to program. The hyperparameter θ can also be sampled in a block using the marginal form (3.24) introduced below.

Writing ζ_i α_i for Σ_{k=1}^K ζ_{ik} α_{ik} and c = [(1/k_1 + 1/k_2) σ^2]^{−1}, the full conditionals are

    σ^2 | Y, μ, ζ, α ~ IΓ( n(k_1 + k_2)/2 + a_1,  (1/2) Σ_{l,i,j} ( Y_{l,i,j(l)} − μ_i + (2 I{l = 1} − 1) ζ_i α_i )^2 + a_2 ),   (3.17)

    μ_i | Z_1, σ ~ N( (2 Z_{1i} c + μ_0 ν_2^{−2}) / (4c + ν_2^{−2}),  1 / (4c + ν_2^{−2}) ),  where Z_{1i} = Ȳ_{1,i,·} + Ȳ_{2,i,·},   (3.18)

    τ_k^2 | α, θ ~ IΓ( n/2 + b_{k1},  (1/2) Σ_{i=1}^n (α_{ik} − θ_k)^2 + b_{k2} ),  k = 1 ... K,   (3.19)

    θ_k | α, τ_k ~ N( (Σ_i α_{ik} τ_k^{−2} + t_k ν_1^{−2}) / (n τ_k^{−2} + ν_1^{−2}),  1 / (n τ_k^{−2} + ν_1^{−2}) ),  k = 1 ... K,   (3.20)

    α_{ik} | Z_2, θ, τ, ζ, σ ~ N( (θ_k τ_k^{−2} + 2 ζ_{ik} Z_{2i} c) / (τ_k^{−2} + 4 ζ_{ik} c),  1 / (τ_k^{−2} + 4 ζ_{ik} c) ),  where Z_{2i} = Ȳ_{2,i,·} − Ȳ_{1,i,·},   (3.21)

    λ | ζ ~ Dirichlet( Σ_i I{ζ_{i0} = 1} + α_{00},  Σ_i I{ζ_{i1} = 1} + α_{01},  ...,  Σ_i I{ζ_{iK} = 1} + α_{0K} ),   (3.22)

    ζ_i | Z_2, λ ~ Multinom(1; p),  where  p = ( λ_0 f(Z_{2i} | ζ_{i0} = 1), ..., λ_K f(Z_{2i} | ζ_{iK} = 1) ) / Σ_k λ_k f(Z_{2i} | ζ_{ik} = 1).   (3.23)

With the indicators ζ, it is easy to apply the following algorithm, in which the Gibbs sampler alternates among three main steps.

Algorithm 1:
1) Sample μ, α, σ^2, θ, τ^2 sequentially from the corresponding fully conditional distributions, given the data and the other parameters.
2) Sample the indicators from the multinomial distribution given all other parameters.
3) Sample the mixing proportion given all the indicators.
4) Repeat until convergence.

In Algorithm 1, if all indicators for some component equal 0, the Gibbs sampler will repeatedly draw samples from f(α | θ, Y) = f(α | θ) and f(θ | α); in that case θ is trapped in its prior. Even if a component is not actually empty, the Gibbs sampler can still get trapped with some probability. We can partially avoid the problem thanks to the normal-normal structure on (Z_2, θ, α), which gives (Lindley and Smith, 1972)

    Z_{2i} ~ N( 2 Σ_{k=1}^K ζ_{ik} θ_k,  σ^2 (1/k_1 + 1/k_2) + 4 Σ_{k=1}^K ζ_{ik} τ_k^2 ),   (3.24)

where Z_{2i} = Ȳ_{2,i,·} − Ȳ_{1,i,·}. Similar to the optimization in Chib and Carlin (1999), we can therefore sample θ and α in a block, which partially reduces the correlation between θ and α in the posterior samples. The revised algorithm is

Algorithm 2:
1) Sample μ, σ^2, τ^2 sequentially from the fully conditional distributions.
2) Sample θ and α in a block: a) sample θ from (3.24); b) sample α from f(α | θ, Y).
3) Sample the indicators from the multinomial distribution given all other parameters.
4) Sample the mixing proportion given all the indicators.
5) Repeat until convergence.
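As an illustration of the blocking in step 2, the sketch below implements one blocked draw of (θ, α) for K = 1, with θ drawn from the marginal (3.24) and α then drawn from its full conditional (3.21); it is a minimal sketch, not the full sampler.

```python
import numpy as np
rng = np.random.default_rng(2)

def block_theta_alpha(Z2, zeta, sig2, tau2, k1, k2, t=0.0, nu1sq=100.0):
    """One blocked draw of (theta, alpha) for K = 1, per Algorithm 2."""
    c = 1.0 / ((1.0/k1 + 1.0/k2) * sig2)      # precision of Z_2i about its mean
    # a) theta from (3.24): Z_2i ~ N(2*zeta_i*theta, 1/c + 4*zeta_i*tau2)
    v = 1.0/c + 4.0*zeta*tau2
    prec = (4.0*zeta/v).sum() + 1.0/nu1sq
    mean = ((2.0*zeta*Z2/v).sum() + t/nu1sq) / prec
    theta = rng.normal(mean, np.sqrt(1.0/prec))
    # b) alpha_i from (3.21) given theta
    prec_a = 1.0/tau2 + 4.0*zeta*c
    mean_a = (theta/tau2 + 2.0*zeta*Z2*c) / prec_a
    alpha = rng.normal(mean_a, np.sqrt(1.0/prec_a))
    return theta, alpha
```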

In using these algorithms, an adequately long burn-in period should be discarded at the beginning of the MCMC run; for example, the first 50% of 10,000 random draws could be dropped. Another option is to monitor the random walk of scalar parameters, such as the individual mixing proportions, and drop all rounds before the walk becomes stable. Interestingly, the estimate of the mixing proportion of true null hypotheses is quite accurate for both Bayesian and classical estimators, so we expect that once convergence is achieved the sampler should only wander in a small neighborhood of the point estimate.

When K = 1, model (3.14) is essentially a two-component mixture whose components are distinct in the sense of their prior marginal likelihoods. With more components, we can always assign different priors to distinguish the components. In theory, therefore, label switching will never happen, because a permutation of the labels changes the joint likelihood. That does not, however, guarantee safe computation. Marin et al. (to appear) indicate that even in the simplest normal-normal model the likelihood surface has local maxima where a Gibbs sampler can be trapped. Moving from a local maximum to the global maximum may require an overwhelmingly long run, a serious practical problem for the Gibbs sampler: despite the seemingly concise algorithm, convergence of the MCMC may demand an intolerably long time. Moreover, when the Gibbs sampler appears to have converged, it is hard to tell whether it has reached a global maximum, especially when K is large and the meaning of each component other than the null case becomes vague. The difficulty with local maxima is less serious for our model, because each component is distinct and meaningful. For example, suppose K = 1 and we assign a relatively large prior probability, such as .9, to the mixing proportion of the null component. If the Gibbs sampler then shows the majority of genes falling in the second component, it

is clearly a local maximum rather than a global one. Scalar parameters, such as the mixing proportion, can be monitored during the MCMC. Once a suspicious local trap appears, we can incorporate a Metropolis-Hastings step called reclustering, which permutes the labels (Mosquin, 2000) and samples the parameters given the permuted indicators; the permutation is accepted or rejected with an ordinary Metropolis step.

3.5 Extensions of model (3.14)

A better model provides better estimates of the posterior probabilities of the hypotheses, namely r_i = E(H_i | Y). Given the optimal decision rules, the capability of detecting significance (the analogue of power in a single test) largely depends on how well r_i is estimated.

The preceding (K+1)-component exchangeable mixture model is a good prototype under convenient assumptions: errors are i.i.d. given the hyperparameters. In practice it may be necessary to improve the prototype to fit more complicated situations where one or more of the assumptions fail. On the other hand, the model can also be modified to fit simpler cases. The parametrization of the prototype model is nearly saturated, and typical microarray data have limited replicates per condition, i.e., k_1 and k_2 are small. It is therefore also worth considering a more parsimonious model that reduces the dimension of the parameter space while retaining the capability of simultaneous inference.

3.5.1 Unequal variance model

Assume the errors do not have equal variances across levels and lack any known specific structure. To model this general unequal, unstructured variance scenario, we make a small change to the parametrization of the prototype model (3.14): instead of a single model error σ^2, we allow every level its own error variance while keeping independence given the hyperparameters. Equation (3.14) is modified to

    Y_{ij(1)} | (·) ~ N( μ_i − Σ_{k=1}^K ζ_{ik} α_{ik}, σ_i^2 ),  i = 1 ... n,  j(1) = 1 ... k_1,
                                                                                            (3.25)
    Y_{ij(2)} | (·) ~ N( μ_i + Σ_{k=1}^K ζ_{ik} α_{ik}, σ_i^2 ),  i = 1 ... n,  j(2) = 1 ... k_2.

The only change is σ_i^2, with the subscript denoting a distinct error term for the ith level; this adds n − 1 parameters to the model. The benefit is that we no longer rely on the equal variance assumption, which is strong, rarely holds perfectly, and is not easily remedied. The cost of the n − 1 additional parameters is a reduced effective sample size and increased estimation error. Presumably, the unequal variance model has less detection capability than the prototype model; this conjecture is supported by our simulation studies, where a great loss of capability to detect differences is observed. We therefore prefer a preprocessing step on the raw data that attempts to equalize the errors. In microarray data analysis there is usually a preprocessing step called normalization that reduces noise in

the data. In our experience normalization partially stabilizes the errors, probably because many normalization methods use a logarithmic transformation. However, normalized data are not guaranteed to satisfy the equal variance assumption, and further adjustment may be needed as in Cui et al. (2003). The unequal variance model is reserved for modeling after all efforts to equalize the errors have failed.

The change of parametrization does not affect most of the priors, except that instead of a single error variance we now have exchangeable error variances

    σ_i^2 ~ IΓ(a_1, a_2).                                                                   (3.26)

The full conditional densities below correspond to (3.17) through (3.23). Write c_i = [(1/k_1 + 1/k_2) σ_i^2]^{−1}.

    σ_i^2 | (·) ~ IΓ( (k_1 + k_2)/2 + a_1,  (1/2)[ Σ_j (Y_{1ij} − μ_i + ζ_i α_i)^2 + Σ_j (Y_{2ij} − μ_i − ζ_i α_i)^2 ] + a_2 ),   (3.27)

    μ_i | (·) ~ N( (2 z_{1i} c_i + μ_0 τ_3^{−2}) / (4 c_i + τ_3^{−2}),  1 / (4 c_i + τ_3^{−2}) ),  where z_{1i} = ȳ_{1i} + ȳ_{2i},   (3.28)

    τ_k^2 | (·) and θ_k | (·) are unchanged from the prototype model,

    α_{ik} | (·) ~ N( (θ_k τ_k^{−2} + 2 ζ_{ik} z_{2i} c_i) / (τ_k^{−2} + 4 ζ_{ik} c_i),  1 / (τ_k^{−2} + 4 ζ_{ik} c_i) ),  where z_{2i} = ȳ_{2i} − ȳ_{1i},   (3.29)

    λ | (·) is unchanged from the prototype model,

    ζ_i | (·) ~ Multinomial(1; p),  where  p = ( λ_0 f(z_{2i} | ζ_{i0} = 1), ..., λ_K f(z_{2i} | ζ_{iK} = 1) ) / Σ_k λ_k f(z_{2i} | ζ_{ik} = 1).   (3.30)

Note that although the form of (3.30) is the same as that of (3.23), the explicit forms of the likelihoods in the two equations differ.

3.5.2 Model on structured covariance by random effects

Modeling spatial and longitudinal covariance is a classical topic in statistics. In the classical generalized linear model (GLM), longitudinal and spatial correlation can

be modeled in two ways: through the covariance of the random effects and through the covariance of the model errors. The first is sometimes called the R-side effect and the second, correspondingly, the G-side effect (the same as repeated measures in most software packages). Both types of effects can have independent blocks (for panel data), and there need be no relation between the two types. (For example, in the simplest one-way random effect normal response model, the G-side effect is simply diagonal(σ^2) and the R-side effect is diagonal(τ^2).) Combining the two covariance structures, we can model very complicated parametric longitudinal correlation structures. For general exponential family responses, the likelihood is hard to evaluate and usually requires approximations (Wolfinger and O'Connell, 1993). In the normal response case, if we want to use Bayesian statistics, the Markov chain Monte Carlo (MCMC) method is applicable in theory, since the full conditionals can always be derived. In practice, however, MCMC runs into difficulties. The full conditional distributions involve matrix inversion to estimate the covariance matrix, which is very computationally intensive when the covariance matrix is huge, and MCMC demands inverting matrices repeatedly across iterations. When one of the two types of effects is not a diagonal matrix, the full conditional distributions become very complicated. For example, the conditional densities (3.20) and (3.21), which are independent scalar densities, would become multivariate normal vectors of dimension n × 1. Given the covariance matrix, sampling from a high-dimensional multivariate normal is much slower than sampling the same number of independent normal distributions.

Compound symmetry is one of the solvable cases, both in theory and in practice. The compound symmetry structure is equivalent to having a random intercept effect.

Thus a non-diagonal G-side effect is converted to a diagonal R-side effect. Other solvable cases can be found in Chib and Carlin (1999); for example, random effects can fit a growth curve with an AR(n) autocorrelation structure. When a non-diagonal covariance is unavoidable, the inverse Wishart distribution is usually chosen as the prior for the sake of partial conjugacy in the covariance (Chib and Carlin, 1999; Olsen and Schafer, 2001). Besides that, Gelman et al. (2004) suggest modeling the correlation coefficients (for time series) or the elements of a spectral decomposition (panel and spatial data). However, there are obvious difficulties when the model is big. The joint prior distribution of the correlation coefficients is not easy to assign because a covariance matrix must be positive semidefinite, a restriction that is difficult to translate into individual correlation coefficients except in special cases such as compound symmetry. Modeling the spectral decomposition of the covariance matrix avoids this problem, because a quadratic form is always nonnegative, but it replaces it with the problem of placing a prior on an orthogonal matrix (of eigenvectors), which is even more abstract to imagine and understand.

In the rest of this section we focus on the simple diagonal R-side effect of the form ν^2 I. Each R-side effect can model a compound symmetry correlation structure within the corresponding blocks, and we can have multiple such effects in one model. Here we study only a simple case with one blocking factor: array (chip). Two observations are in the same block if they are in the same condition and the same batch; e.g., Y_{1,1,1(1)} and Y_{1,5,1(1)} are in the same block. Within each chip (array) there is a compound symmetry

59 48 model, based on the prototype model. The model is K Y N(µ ζ α + γ, σ 2 ), i = 1... n, j(1) = 1... k, 1,i,j(1) i ik ik j(1) 1 k=1 K Y N(µ + ζ α + γ, σ 2 ), i = 1... n, j(2) = 1... k. 2,i,j(2) i ik ik j(2) 2 k=1 (3.31) All the priors used in the prototype can be applied. The newly introduced random effect has a partially conjugate prior γ j(l) N(0, ν 2 ), l = 1, 2. Again since we use partially conjugate families we can derive all the fully conditional distributions easily. The design matrix of model (3.14) can be made more complicated to incorporate other types of correlation through random effects. Eqn (3.32) to (3.35) give the fully conditional distributions.

Writing c = [(1/k_1 + 1/k_2) σ^2]^{−1} and ζ_i α_i for Σ_k ζ_{ik} α_{ik},

    σ^2 | (·) ~ IΓ( n(k_1 + k_2)/2 + a_1,  (1/2) Σ_{i,j} [ (Y_{1ij(1)} − μ_i + ζ_i α_i − γ_{j(1)})^2 + (Y_{2ij(2)} − μ_i − ζ_i α_i − γ_{j(2)})^2 ] + a_2 ),   (3.32)

    μ_i | (·), τ_k^2 | (·), θ_k | (·) are unchanged from the prototype model,

    α_{ik} | (·) ~ N( (θ_k τ_k^{−2} + (2 ζ_{ik} z_{2i} − 2 ζ_{ik} γ̄_Δ) c) / (τ_k^{−2} + 4 ζ_{ik} c),  1 / (τ_k^{−2} + 4 ζ_{ik} c) ),   (3.33)

where z_{2i} = ȳ_{2i} − ȳ_{1i} and γ̄_Δ = γ̄_{(2)} − γ̄_{(1)} is the difference between the average array effects of the two conditions,

    λ | (·) is unchanged from the prototype model,

    ζ_i | (·) ~ Multinomial(1; p),  where  p = ( λ_0 f(z_{2i} | ζ_{i0} = 1), ..., λ_K f(z_{2i} | ζ_{iK} = 1) ) / Σ_k λ_k f(z_{2i} | ζ_{ik} = 1),   (3.34)

    γ_{j(l)} | (·) ~ N( ( Σ_i z_{2ij} − 2 Σ_i Σ_k ζ_{ik} α_{ik} ) c / ( ν^{−2} + 4 n c ),  1 / ( ν^{−2} + 4 n c ) ).   (3.35)

3.5.3 Model with G-side covariance matrix

This section describes a tentative derivation that uses a G-side covariance to model the correlation. The model is similar to the unequal variance model, but we need a new set of multivariate notation before laying out the full conditionals. Let μ = (μ_i)_i, let

α = (α_{ik})_{ik}, and let ζ = (ζ_{ik})_{ik}. Then the model can be described by

    Y_1 ~ N_n( μ − ζα1, Σ ),                                                                (3.36)
    Y_2 ~ N_n( μ + ζα1, Σ ),                                                                (3.37)

where

    Σ = [ σ_1^2   σ_{12}   ...   σ_{1n}
          σ_{21}  σ_2^2    ...   σ_{2n}
          ...
          σ_{n1}  σ_{n2}   ...   σ_n^2 ]  ~  f(Σ).                                          (3.38)

The most common choice of f(Σ) is the inverse Wishart; other priors, such as modeling the eigenvectors and eigenvalues, are also possible. Assuming that we can sample Σ from some conditional distribution, the other full conditionals for the covariance model are

    μ | (·) ~ N( A( τ_3^{−2} μ_0 + 2 [(1/k_1 + 1/k_2)]^{−1} Σ^{−1} Z_1 ),  A ),  where  A = ( 2 [(1/k_1 + 1/k_2)]^{−1} Σ^{−1} + τ_3^{−2} I )^{−1},   (3.39)

    τ_k | (·) and θ_k | (·) are unchanged,

    α_k | (·) ~ N( B( τ_k^{−2} θ_k + 2 [(1/k_1 + 1/k_2)]^{−1} Σ^{−1} ζ_k Z_2 ),  B ),  where  B = ( 2 [(1/k_1 + 1/k_2)]^{−1} Σ^{−1} + τ_k^{−2} I )^{−1},   (3.40)

    λ | (·) is unchanged, and ζ | (·) has the same form as before.

Although the apparent form of sampling ζ_{ij} does not change, the likelihood must be replaced with the multivariate normal, which is quite expensive to evaluate. For example, for the first gene, to sample ζ_1 we need to evaluate the density of Z given (ζ_1 = 1, ·) and given (ζ_1 = 0, ·), because through Σ all other genes contribute to the classification of the first gene. In sum, although the non-diagonal covariance structure (and likewise the R-side effect covariance) is derivable under the normal distribution, the more complex model greatly increases the difficulty of coding the Gibbs sampler.
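The computational burden is easy to see: each draw of μ or α_k under a dense Σ requires an O(n^3) factorization. The toy benchmark below merely times one dense multivariate normal draw against n independent normal draws; the compound symmetry matrix is an arbitrary stand-in.

```python
import time
import numpy as np

rng = np.random.default_rng(3)
n = 4000
# an arbitrary dense positive definite covariance (compound symmetry 0.3)
Sigma = np.full((n, n), 0.3) + 0.7*np.eye(n)

t0 = time.perf_counter()
L = np.linalg.cholesky(Sigma)          # O(n^3) factorization, needed per draw
x = L @ rng.standard_normal(n)         # one correlated draw
t1 = time.perf_counter()
y = rng.standard_normal(n)             # n independent normals: O(n)
t2 = time.perf_counter()
print(f"dense MVN draw: {t1 - t0:.3f}s, independent draws: {t2 - t1:.6f}s")
```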

3.5.4 A parsimonious model

The prototype model can be simplified to reduce the number of parameters. Consider the following model:

    Y_{1ij(1)} | (·) ~ N( μ − Σ_{k=1}^K ζ_{ik} α_k, σ^2 ),  i = 1 ... n,  j(1) = 1 ... k_1,
                                                                                            (3.41)
    Y_{2ij(2)} | (·) ~ N( μ + Σ_{k=1}^K ζ_{ik} α_k, σ^2 ),  i = 1 ... n,  j(2) = 1 ... k_2,

    α_k ~ N(θ_k, τ_k^2),  k = 1 ... K.

This parsimonious model may lose some accuracy in estimating individual gene expression levels, but its precision should improve through the reduction in parameters and the increase in effective sample size. The full conditionals are easy to derive and are similar to those of the prototype model. While assuming the same absolute expression level for all genes, the parsimonious model can still estimate the size of the differential effect: although all genes share the same α_k, the size of the differential effect is Σ_k ζ_{ik} α_k, which is distinct for each gene. The absolute expression level is rarely the same for all genes, but centering can easily be done within each gene.

3.6 Summary of the model variants

This chapter has discussed several candidate models. The exchangeable and treatment/control comparison models are almost equivalent in practice, though their parameters have different interpretations. The correlation models consider possible improvements in modeling the experimental error terms, and the parsimonious model addresses small effective sample sizes.

The assumption of equal variance in the error terms should be used whenever possible. In the simulations, the unequal variance model yields dramatically worse results; a similar finding appears in Kooperberg et al. (2005). Standardization and normalization prior to modeling and inference are important aids to stabilizing the error variances. One can also force the error variances to be equal by dividing by the standard deviation within each gene (Ishwaran and Rao, 2003). We note that a log-type transformation, which is incorporated in many normalization methods, stabilizes the variance rather well, because the variance often increases with the absolute expression level. A more sophisticated method of variance stabilization for cDNA arrays is discussed by Cui et al. (2003).
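As a minimal illustration of the within-gene adjustments mentioned above (centering each gene for the parsimonious model, and dividing by the within-gene standard deviation in the style of Ishwaran and Rao, 2003), assuming Y is an n-by-(k_1 + k_2) matrix of normalized expressions:

```python
import numpy as np

def center_and_scale(Y, scale=True, eps=1e-8):
    """Center each gene (row) at zero; optionally divide by the
    within-gene standard deviation to force equal error variances."""
    Yc = Y - Y.mean(axis=1, keepdims=True)
    if scale:
        Yc = Yc / (Y.std(axis=1, ddof=1, keepdims=True) + eps)
    return Yc
```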

Chapter 4

Model Comparison

This chapter investigates checking the important assumptions of the prototype model and comparisons among all of the candidate models. We prefer graphical model diagnostics to predictive test statistics: for a large dataset such as microarray data, a perfect match between reality and a parametric model can hardly be expected, so a simple test statistic or Bayesian p-value, defined as the tail probability in the posterior predictive density (Gelman et al., 2004), would be too stringent. Another important issue is the choice of the number of mixture components K. We use the deviance information criterion (DIC) for this comparison, and we also discuss the Bayesian information criterion (BIC) and its properties in hierarchical models.

4.1 Posterior model checking

All the Bayes models discussed in Chapter 3 are similar to normal response linear models in terms of the structure of the cell means and the assumptions on the error terms, apart from minor changes in parametrization and perhaps the choice of priors. It is straightforward to borrow the discussions of robustness and adaptiveness from linear models. For the prototype model, the error term is assumed i.i.d. N(0, σ^2) conditional on σ^2, as in classical linear models. This assumption is known to be crucial. In our models, however, independence is slightly loosened to exchangeability. More complicated

covariance, if it exists, can be modeled by random effects with appropriate blocking and covariates.

The residual plot is an important graphical posterior check on the error terms. The Bayesian predicted residual is r̂_B = y − E(y | θ), which plugs in a single random draw of θ from its posterior distribution. If instead a point estimate θ̂(y) is used, the result is more like a classical residual, namely r̂_C = y − E(y | θ̂), which ignores the uncertainty in θ and may therefore understate the uncertainty of the errors from the Bayesian point of view. The Bayesian residual r̂_B has the better interpretation but larger variation, which depends on the random draw from f(θ | y); ideally the Bayesian residual plot differs each time it is drawn and hence cannot be exactly replicated. The classical residual, on the other hand, is unique conditional on the same estimate. In addition, although the indicators ζ are redundant in mixture models (only the mixing proportions λ are essential), we keep ζ because the mixing proportion alone would generate larger dispersion in the residuals. Table 4.1 summarizes the two versions of predicted residuals.

Table 4.1. Two versions of predicted residuals in the prototype model

    Notation    Variation in parameters    Formula
    r̂_B         Yes                        y − (μ ± ζα)
    r̂_C         No                         y − (μ̂ ± ζα)
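Both residual versions are cheap to compute once draws are available. A sketch for the exchangeable parametrization, where passing one posterior draw gives r̂_B and passing posterior means gives r̂_C:

```python
import numpy as np

def residuals(Y1, Y2, mu, alpha, zeta):
    """Predicted residuals y - E(y | parameters) for the prototype model,
    given one posterior draw (Bayes residual) or posterior means (classical)."""
    eff = zeta * alpha                      # per-gene differential effect
    r1 = Y1 - (mu - eff)[:, None]           # condition 1: mean mu - zeta*alpha
    r2 = Y2 - (mu + eff)[:, None]           # condition 2: mean mu + zeta*alpha
    return np.concatenate([r1, r2], axis=1)
```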

We want to inspect the residuals from different perspectives with diagnostic plots; only when no plot shows a violation of crucial model assumptions can we continue with inference and model comparison. Residuals should be plotted against levels (i.e., genes) and against predicted values. Histograms for each array are also helpful for spotting a differential array effect. The unequal variance model does not assume equal variances, but we can still draw the residual plot to see whether a simplification to the prototype model is possible.

There are some issues in residual plots specific to large datasets. First, in a huge dataset it is hard to imagine the simple assumptions holding exactly and perfectly. Second, since the usual residual plot is very dense, details may be obscured; plots that display density by color or contours are preferable. While correlation might not be a severe problem (a hierarchical model already assumes at least some correlation in the data), the scale of the variance should be inspected carefully: it is crucial for the models that make the equal error variance assumption.

We show some example residual plots. First we simulate data under the prototype model assumption that the errors are i.i.d. N(0, 1), with n = 1,000 levels (genes) and k_1 = k_2 = 5 replicates per condition. The first 100 levels have differential effect sizes generated from N(0, 4); the rest show no real difference. Figures 4.1 and 4.2 show Bayes and classical residuals versus genes and versus predicted values, respectively.

It is rather easy to see whether residuals are centered at zero, but whether they have equal variance along the x-axis is sometimes not obvious to the naked eye. It can help to plot the absolute values of the residuals and fit a smooth curve. Figure 4.3

shows the absolute residuals with a loess fit, and Figures 4.4 and 4.5 show histograms of the residuals for each array.

An important thing to notice in Figures 4.1 and 4.2 is that even when the data satisfy the model assumptions perfectly, the plot of residuals versus predicted values shows a somewhat ball-like shape. The likely reason is the distribution of the predicted values, which is very dense in the middle and sparse in the tails. Suppose the predicted values and the residuals follow independent normal distributions: although the residuals then satisfy the model assumption exactly, what we see is a football shape. This simulation sheds some light on football-shaped residual-versus-predicted plots, which provide no evidence against the model assumptions.

Next we deliberately violate the equal variance assumption in the data generation, letting the error variance be proportional to the absolute value of the mean in each gene-by-condition combination, i.e., the cell means. Figures 4.6 and 4.7 show the Bayes and classical residuals for this scenario. Although the violation is not clear from the gene perspective, there is an obvious increasing trend as the absolute predicted values increase. The fitted curve in Figure 4.8 also shows a clear trend.
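The simulation behind Figures 4.1 to 4.3 is easy to reproduce. A sketch, plugging in the true parameters in place of posterior draws to keep it short, with lowess standing in for the loess fit:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(4)
n, k = 1000, 5
mu = rng.normal(7.0, 1.0, n)
alpha = np.zeros(n)
alpha[:100] = rng.normal(0.0, 2.0, 100)        # differential effects ~ N(0, 4)
Y1 = mu[:, None] - alpha[:, None] + rng.normal(0, 1, (n, k))
Y2 = mu[:, None] + alpha[:, None] + rng.normal(0, 1, (n, k))

fitted = np.concatenate([np.repeat(mu - alpha, k), np.repeat(mu + alpha, k)])
resid = np.concatenate([(Y1 - (mu - alpha)[:, None]).ravel(),
                        (Y2 - (mu + alpha)[:, None]).ravel()])
sm = lowess(np.abs(resid), fitted, frac=0.3)   # smooth of |residual| vs fitted
plt.plot(fitted, np.abs(resid), ".", ms=1)
plt.plot(sm[:, 0], sm[:, 1], lw=2)
plt.xlabel("predicted value"); plt.ylabel("|residual|")
plt.show()
```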

Figure 4.1. Bayes residuals versus genes and versus predicted values, errors generated from the model assumption N(0, 1).

Figure 4.2. Classical residuals versus genes and versus predicted values, errors generated from the model assumption N(0, 1).

Figure 4.3. Absolute residuals versus predicted values with a loess fit, errors generated from the model assumption N(0, 1).

Figure 4.4. Histograms of Bayes residuals by array, errors generated from the model assumption N(0, 1).

Figure 4.5. Histograms of classical residuals by array, errors generated from the model assumption N(0, 1).

Figure 4.6. Bayes residuals versus levels and versus predicted values, errors proportional to the differential expression levels.

Figure 4.7. Classical residuals versus levels and versus predicted values, errors proportional to the differential expression levels.

Figure 4.8. Absolute residuals versus predicted values with a loess fit, errors proportional to the differential expression levels.

4.2 Predictive density and information criteria

Inference about future observations is often called predictive inference, and the distribution used in prediction is called the predictive distribution, usually defined as a marginal distribution of the observations alone, not depending on unknown parameters. The first version of the marginal distribution of y is

    f(y) = ∫ f(y, θ) dθ = ∫ f(y | θ) f(θ) dθ,                                               (4.1)

where f(θ) is the prior of the parameter; the predictive density in Eqn (4.1) is therefore also called the prior predictive density. If instead we condition on the observed data in Eqn (4.1), we obtain another version, often called the posterior predictive density. To distinguish observed data from future observations, we denote the observed data by ŷ. The posterior predictive density is

    f(y | ŷ) = ∫ f(y, θ | ŷ) dθ = ∫ f(y | θ) f(θ | ŷ) dθ.                                   (4.2)

Gelman et al. (2004) argue that the posterior predictive density can be seen as an average of the conditional prediction over the posterior distribution of θ; the name reflects the fact that the density (4.2) conditions on the observed data.

The predictive density is often used to choose the best model among candidates via an information criterion. Ideally a good model has good predictive performance, so the (log) predictive likelihood of the observed data ŷ should be higher for a good

model than for a bad one. A rationale for choosing among candidate models is then to maximize the predictive likelihood, or to minimize the predictive risk. However, even from this same rationale, very different information criteria have been developed, with controversial results. Before discussing the technical details of the information criteria, we emphasize our aims in model comparison: 1) to choose an optimal number of mixture components K for the alternative hypotheses; 2) to choose a good model from the candidates.

4.2.1 Bayesian information criterion (BIC)

If we use the prior predictive density to calculate the predictive likelihood, Schwarz (1978) showed that it has a simple asymptotic form known as the Bayesian information criterion (BIC), or sometimes the Schwarz information criterion (SIC):

    BIC = M − (1/2) p log n,

where M is the global maximum of the log likelihood log f(y | θ), p is the dimension of the parameter space, and n is the number of observations. When the model likelihood comes from an exponential family, Schwarz (1978) showed that for fixed p, BIC has the same limit as the prior predictive likelihood as the sample size n goes to infinity.

When the prior contains hierarchical structure, i.e., hyperpriors, it has been argued that there can be ambiguity in counting the model complexity, i.e., the number of free parameters (Spiegelhalter et al., 2002). For example, let the model be

    f(y, θ, ψ) = f(y | θ) f(θ | ψ) f(ψ).                                                    (4.3)

Then the model can be treated as f(y | θ) with the prior

    f(θ) = ∫_Ψ f(θ | ψ) f(ψ) dψ.                                                            (4.4)

Alternatively, the model can be

    f(y | ψ) = ∫_Θ f(y | θ) f(θ | ψ) dθ                                                     (4.5)

with the prior f(ψ). A consequence is that neither the model likelihood nor the complexity is uniquely determined (Spiegelhalter et al., 2002), which creates a challenge for information criteria such as BIC that require the user to specify the complexity. In addition, different model complexities may result in different effective sample sizes; for example, the number of independent observations in f(y | ψ) can differ from that in f(y | θ).

No matter how we treat the internal hierarchical structure, the marginal likelihood of y, i.e., the prior predictive likelihood, is unchanged, since the joint likelihood (4.3) is unchanged. However, not all decompositions into model likelihood and prior are equally good for the asymptotic approximation by BIC. Since the order of the approximation error is known from Lemma 1 in Schwarz (1978), we can see clearly which decompositions are good. The rationale of the BIC approximation is the following: in general, for exponential family distributions and exchangeable data, the log predictive likelihood f(y) has the

form

    S(ȳ, n, j) = log [ α_j ∫ exp( (θ'ȳ − b(θ)) n ) dμ_j(θ) ],                               (4.6)

where α_j is the prior probability that the jth model is true, μ_j is the prior of θ given the jth model, n is the sample size, and ȳ is the sample mean. Schwarz (1978) established the approximation of the likelihood (4.6) by BIC, showing that for fixed parameter dimension p, the difference between S(ȳ, n, j) and the corresponding BIC remains bounded in n; in fact, the remainder has the same order as p. Therefore, BIC works well as long as n >> p and the other assumptions embedded in Eqn (4.6) hold, one of which is a fixed dimension p. If there are several ways to separate the model likelihood from the prior in a hierarchical model, yielding different model likelihoods and model complexities, we should check that BIC's assumptions are satisfied, because the real sample size may change. This is somewhat surprising, since usually the complexity p, not the sample size n, is thought to carry the ambiguity. The justification lies in the details of the proof of the first lemma in Schwarz (1978). The complexity p comes from the explicit evaluation of the integral with respect to dμ_j(θ), which uses the fact that the pdf of a multivariate normal distribution integrates to 1; p is unambiguously defined as the dimension of the support of θ. Hence, given a hierarchy, and given that all assumptions hold, p equals the dimension of the parameters involved in the model likelihood and has nothing to do with the hyperparameters in the prior.

Then what is the real sample size n? It arises from Eqn (4.6), where the model likelihood can be written as n times the model likelihood at the average; independent observations are necessary, and n refers to the number of independent observations. We illustrate with an example. Let

    y_i ~ i.i.d. N(μ, σ^2),  i = 1 ... n,                                                   (4.7)
    μ ~ N(θ, τ^2),                                                                          (4.8)
    θ ~ N(ψ, ν^2).                                                                          (4.9)

Usually we take Eqn (4.7) as the model likelihood, so the sample size is n. Suppose σ^2 is known; then the number of parameters involved in the likelihood is just p = 1. The second hierarchy is

    Y ~ N(θ1, Σ),                                                                           (4.10)

    Σ = [ σ^2 + τ^2   τ^2         τ^2   ...   τ^2
          τ^2         σ^2 + τ^2   τ^2   ...   τ^2
          ...
          τ^2         ...               σ^2 + τ^2 ].                                        (4.11)

The likelihood of the observation vector Y can still be written in the form of Eqn (4.6), but the observations are dependent through τ^2. The BIC cannot be expected to approximate well in this case.
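The failure is easy to demonstrate numerically. The sketch below (a toy check with all hyperparameters fixed by assumption) compares the exact log prior predictive likelihood, obtained by integrating θ out of (4.10), with the BIC built from the hierarchy-2 likelihood; because the information about θ stops growing while the penalty (1/2) log n does not, the gap widens with n.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(5)
sig2, tau2, nu2, psi = 1.0, 1.0, 1.0, 0.0

for n in [20, 100, 500]:
    theta = rng.normal(psi, np.sqrt(nu2))
    mu = rng.normal(theta, np.sqrt(tau2))
    y = rng.normal(mu, np.sqrt(sig2), size=n)
    J = np.ones((n, n))
    # exact prior predictive of the second hierarchy: theta integrated out of (4.10)
    exact = multivariate_normal(np.full(n, psi),
                                sig2*np.eye(n) + (tau2 + nu2)*J).logpdf(y)
    # BIC from the hierarchy-2 likelihood (4.10): p = 1, theta-hat = ybar
    M = multivariate_normal(np.full(n, y.mean()),
                            sig2*np.eye(n) + tau2*J).logpdf(y)
    bic = M - 0.5*np.log(n)
    print(n, "exact:", round(exact, 1), "BIC:", round(bic, 1),
          "gap:", round(bic - exact, 1))
```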

When a hierarchical model is specified, it is not always easy to derive the correct likelihood for use in BIC, because the integration in Eqn (4.5) is usually hard to carry out. In large scale inference, models such as those discussed in the previous chapter have a huge number of parameters and relatively few replicates per parameter; if likelihoods are calculated from the initial definitions, as in Eqns (3.5), (3.14), and (3.25), the remainder term of the BIC approximation is not ignorable.

An idea related to BIC is the Bayes factor. Let M_1 and M_2 denote two competing models. The Bayes factor is defined as

    B(M_1, M_2) = f(y | M_1) / f(y | M_2) = ∫ f(θ_1 | M_1) f(y | θ_1, M_1) dθ_1 / ∫ f(θ_2 | M_2) f(y | θ_2, M_2) dθ_2.   (4.12)

Let p(M_1) and p(M_2) denote the prior preferences for the models. Then the posterior ratio of the two models is

    p(M_1 | y) / p(M_2 | y) = [ p(M_1) / p(M_2) ] B(M_1, M_2).                              (4.13)

It can be seen from Schwarz (1978) that the ratio in Eqn (4.12) has the asymptotically equivalent form exp{BIC(M_1) − BIC(M_2)}, with the prior submerged by the large sample size.

Another important critique of BIC and the Bayes factor is that they are appropriate only when the true model is among the finitely many candidates. This concern probably arises because the studies that established the consistency of BIC and Bayes factors made such an assumption, e.g., Haughton (1988). A recent development in nonparametric Bayesian statistics has established a result mitigating this point. Under

weak conditions, the Bayes factor converges almost surely according to the difference in the Kullback-Leibler (δ) properties (Walker, 2004). The Kullback-Leibler (δ) property is defined as the minimal Kullback-Leibler distance between the true model and the family of densities specified a priori. In other words, we do not need the prior to contain the true model: the Bayes factor consistently picks the model minimizing δ. This result was established initially for infinite-dimensional distributions, but it also applies to problems with finite, fixed dimensions. Combining the results above, we conclude that in theory BIC and Bayes factors are usable for a somewhat wider range of problems than some critics assumed. However, the remainder term of the BIC approximation is not always small, especially when the dimension of the parameter space is high.

4.2.2 Deviance information criterion (DIC)

An alternative way to assess prediction accuracy is to use the posterior predictive distribution in Eqn (4.2). The weighted mean squared error T(y, θ) = n^{−1} Σ_i (y_i − E(y_i | θ))^2 / var(y_i | θ) measures model fit; even if a model does not fit the data, the extent of closeness can still be evaluated numerically, so a comparison among all false models is possible. It can be shown that the deviance in Eqn (4.14) is proportional to T(y, θ) as the sample size goes to infinity (Gelman et al., 2004):

    D(y, θ) = −2 log f(y | θ).                                                              (4.14)

Spiegelhalter et al. (2002) suggested using the posterior distribution of θ to evaluate the deviance. First, plugging in a posterior estimator θ̂(y), e.g., the posterior mean or median, we can estimate the deviance at θ̂ as

    D_θ̂(y) = −2 log f(y | θ̂(y)).                                                           (4.15)

Next, given a posterior sample of size L drawn from f(θ | y), we can estimate the mean deviance as

    D̂_avg(y) = (1/L) Σ_{l=1}^L [ −2 log f(y | θ_l) ].                                       (4.16)

Based on these two versions of the Bayesian deviance, Spiegelhalter et al. (2002) defined the effective number of parameters as

    p_D = D̂_avg(y) − D_θ̂(y),                                                               (4.17)

and the deviance information criterion (DIC) as

    DIC = 2 D̂_avg(y) − D_θ̂(y).                                                             (4.18)

A model that is more accurate in the prediction sense has a smaller DIC. An equivalent form of DIC is

    DIC = D_θ̂(y) + 2 p_D.                                                                   (4.19)

As the number of parameters increases, the deviance D_θ̂(y) decreases, and the effective number of parameters p_D compensates for the use of more parameters. The deviance and DIC are comprehensively summarized in Spiegelhalter et al. (2002).

DIC can be seen as a Bayesian version of AIC. The aim is to evaluate the posterior predictive likelihood, so DIC is expected to be closely related to the idea of cross validation, namely using the observed data to predict future observations. DIC suffers the same problem as AIC: with a fixed parameter space, as the sample size n grows, neither will consistently choose the true model (Spiegelhalter et al., 2002). Combined with the Bayesian asymptotic result in Walker (2004), it is quite possible that DIC will not choose the best model minimizing the Kullback-Leibler distance (δ) either, since p_D is usually nowhere near the penalty term in BIC, which involves the log sample size. The advantage of DIC, however, lies in the following problems: 1) general Bayesian models that do not come from an exponential family; 2) problems with a varying parameter space (usually with increasing dimension). For non-exponential-family distributions there is no simple BIC-like form for the prior predictive likelihood, so the Bayes factor is usually hard to compute in practice, especially when the priors are weak and Monte Carlo simulation is also difficult. For problems with increasing parameter spaces, there is no established result for either BIC or Bayes factors; DIC, on the other hand, can at least provide a result with some justification from the deviance point of view. Since DIC penalizes the likelihood only with the estimated number of parameters, it will intuitively be biased toward larger models compared with BIC.

In practical computation, DIC is easy to calculate from the posterior samples simulated by MCMC, and there is no ambiguity in counting the parameters or the true sample size, so it is convenient to apply; it is currently the default model fitting diagnostic criterion in WinBUGS.
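Computing DIC from MCMC output takes only a few lines. In the sketch below, loglik and the array of posterior draws are placeholders for whatever model is being fit:

```python
import numpy as np

def dic(y, draws, loglik, point_estimate=np.mean):
    """DIC from posterior draws, per Eqns (4.15)-(4.19).
    draws: posterior draws of theta (one row per draw);
    loglik(y, theta): log f(y | theta) for a single theta."""
    D_avg = np.mean([-2.0 * loglik(y, th) for th in draws])   # (4.16)
    theta_hat = point_estimate(draws, axis=0)                 # e.g. posterior mean
    D_hat = -2.0 * loglik(y, theta_hat)                       # (4.15)
    p_D = D_avg - D_hat                                       # (4.17)
    return D_hat + 2.0 * p_D, p_D                             # (4.19)
```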

4.3 Using information criteria

Information criteria are not meant for totally blind model searching or comparison. It is necessary to combine them with other measures, such as diagnostic graphs and the purpose of the study, for a comprehensive assessment of the candidate models. We consider the following preliminary steps necessary for choosing candidate models in microarray differential expression studies (a code sketch of these checks follows the list):

- plot the raw residuals Y_{ijk} − Ȳ_{ij} versus an index combining i, j to find any violation of the equal variance assumption;
- apply preprocessing steps to stabilize the variance if necessary;
- plot the roughly significant mean differences between conditions to build intuition about the choice of K, the number of components for the alternative hypotheses (e.g., multiple modes or cluster patterns, asymmetry, etc.);
- plot the raw estimated absolute expression levels (e.g., Ȳ_i) versus an index combining i, j to see whether centering within each gene is necessary for using the parsimonious model.
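A minimal sketch of these checks, assuming Y1 and Y2 hold the normalized expressions for the two conditions:

```python
import numpy as np
import matplotlib.pyplot as plt

def preliminary_checks(Y1, Y2):
    """Quick pre-modeling plots: raw residuals by gene, raw mean differences
    between conditions, and raw absolute expression levels."""
    r1 = Y1 - Y1.mean(axis=1, keepdims=True)       # Y_ijk - Ybar_ij per condition
    r2 = Y2 - Y2.mean(axis=1, keepdims=True)
    resid = np.concatenate([r1, r2], axis=1)
    fig, ax = plt.subplots(1, 3, figsize=(12, 3))
    idx = np.repeat(np.arange(resid.shape[0]), resid.shape[1])
    ax[0].plot(idx, resid.ravel(), ".", ms=1)      # equal variance across genes?
    ax[0].set_title("raw residuals vs gene")
    ax[1].hist(Y2.mean(axis=1) - Y1.mean(axis=1), bins=200)   # intuition for K
    ax[1].set_title("raw mean differences")
    ax[2].plot(np.concatenate([Y1, Y2], axis=1).mean(axis=1), ".", ms=1)
    ax[2].set_title("absolute expression vs gene")  # is per-gene centering needed?
    plt.tight_layout()
    plt.show()
```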

Chapter 5

Extension to comparisons among multiple conditions

This chapter discusses some tentative ideas for extending the 2-condition modeling work to more than two conditions. An exact extension of the mixture model is not easy when the number of conditions m is large. Since exact methods are difficult, we study some approximate methods and try to determine under what conditions the approximations perform satisfactorily.

5.1 Direct modeling of multiple conditions by mixture models

A microarray study can involve more than two biological conditions, i.e., more than one pairwise comparison. Let m denote the number of biological conditions throughout this chapter. There could be a single factor with more than two levels, e.g., different cancer types versus healthy tissue; alternatively, there could be more than one factor. We do not distinguish the two cases: a single index with m levels distinguishes the experimental conditions. Each pair of conditions forms a batch of comparisons over all of the genes, so in total there are at most m(m − 1)/2 pairwise batches. We use superscripts to distinguish the batches, i.e., the pairs, where the two superscripts are exchangeable. For example, H_1^{(13)} and H_1^{(31)} denote the same null hypothesis about the first gene, namely that the first gene is not differentially expressed between conditions 1 and 3.

We denote the batch of comparisons between conditions i and j by (i, j) or (j, i). In terms of simultaneous testing, the FDR or BFDR both within one batch and across all batches is of interest, although generally the simultaneity is emphasized more within one batch, i.e., within the comparisons between two conditions. After controlling FDR or BFDR within each batch, we are also interested in the FDR or BFDR of the whole experiment, where there are up to m(m − 1)/2 pairwise batches of comparisons.

In some special cases, the model for a single batch readily extends to multiple conditions. The first such special case is multiple comparisons with control (MCC). Let the first condition be the control and all the other conditions be treatments; we are interested only in the comparisons between each treatment condition and the control. A direct modeling approach applies the treatment/control model of Section 3.3 to each pair (1, i), 1 < i ≤ m. If we still want all conditions to share the same error term σ^2, we only need to pool all the data y_{ij}^{(1)}, y_{ij}^{(2)}, ..., y_{ij}^{(m)} to estimate the model error σ^2. For simplicity, one can plug in the empirical estimator MSE. The fully conditional distribution of σ^2 is also easy to derive:

    σ^2 | (·) ~ IΓ( (n Σ_{i=1}^m k_i)/2 + a_1,  (1/2)[ Σ_i Σ_j (Y_{ij}^{(1)} − μ_i)^2 + Σ_{k=2}^m Σ_i Σ_j (Y_{ij}^{(k)} − μ_i − ζ_i^{(k)} α_i^{(k)})^2 ] + a_2 ),   (5.1)

where the superscript k denotes the parameters used for treatment k.

The Gibbs sampler can be extended in two major steps: first pool all the data and parameters and sample σ^2, then sample the parameters of each pair separately given

σ^2. The total computation time increases roughly by a factor of m − 1, the number of 2-condition models.

Another special case is called disconnected multiple comparisons (DMP). If we view conditions as points and a pairwise comparison as a line segment, disconnected multiple comparisons consist of disconnected individual line segments.

Figure 5.1. Example of disconnected multiple comparisons (DMP)

A control condition is not necessary for DMP, so the exchangeable models discussed in Section 3.4 are preferred. Since no condition is involved in more than one pair, there is essentially no additional constraint on any pair, and we do not have to assume equal variance across batches.

In the special cases above, we simply apply a 2-condition model from Chapter 3 to each pairwise batch. For each pair, we can apply the decision rules of Chapter 2 and control the BFDR within each pairwise comparison. Interestingly, if we pool the decisions of all comparisons, the global BFDR is still under control; we discuss this result from a more general point of view in the next section.

When m = 3, it is also possible to model all pairwise batches of comparisons directly with a mixture model. Let μ_j always denote (1/2) E(Y_{1j} + Y_{2j}), let α_j always denote

(1/2) E(Y_{2j} − Y_{1j}), and let β_j always denote (1/2) E(Y_{3j} − Y_{1j}). There are in total 5 possible patterns, each represented by a mixture component:

    (Y_{1j}, Y_{2j}, Y_{3j})' ~ p_1 N( (μ_j, μ_j, μ_j)', σ^2 I )
                              + p_2 N( (μ_j − α_j, μ_j + α_j, μ_j − α_j)', σ^2 I )
                              + p_3 N( (μ_j − α_j, μ_j + α_j, μ_j + α_j)', σ^2 I )           (5.2)
                              + p_4 N( (μ_j, μ_j, μ_j + 2β_j)', σ^2 I )
                              + p_5 N( (μ_j − α_j, μ_j + α_j, μ_j − α_j + 2β_j)', σ^2 I ),

where the p_i are the mixing proportions. Similarly to before, we can introduce hidden indicators ζ_{ij}. One advantage of the mixture model (5.2) is that it already excludes incompatible conclusions among the pairwise comparisons. The posterior probability of a null hypothesis is then the expectation of the corresponding indicators; for example, P(H_j^{(12)} | Y) = E(ζ_{1j} + ζ_{4j} | Y). If we use partially conjugate prior densities, MCMC simulation applies easily.

However, it can be shown by induction that when m > 3 the number of components in the finite mixture approach of (5.2) grows too fast to be feasible in practice: each component is a way to partition the m condition means into groups of equal means, so there are 52 components when m = 5 (203 already at m = 6), and the number exceeds 4 million when m = 12.
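Assuming, as the m = 3 construction suggests, that the component count is the number of set partitions of the m condition means (the Bell number B_m), the explosion is easy to verify with the standard triangle recurrence:

```python
def bell(m):
    """Bell number B_m: the number of ways to partition m condition
    means into groups of equal means, via the Bell triangle."""
    row = [1]
    for _ in range(m - 1):
        nxt = [row[-1]]
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

print([bell(m) for m in range(2, 13)])
# [2, 5, 15, 52, 203, 877, 4140, 21147, 115975, 678570, 4213597]
```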

5.2 Approximate approaches: marginal modeling using finite mixture models

Only limited special cases admit a relatively easy direct modeling approach. In a general multiple comparison scenario, there can be many cumbersome constraints among the parameters of different pairs, making both derivation and computation quite difficult. We therefore propose approximate approaches that avoid directly modeling all experimental conditions at the same time: each batch of pairwise comparisons is done separately, with the BFDR within each pair controlled at some level q^{(ij)}. The pooled results from the collection of 2-condition models will still retain control of the BFDR in the sense that E( Σ_i V_i / Σ_i R_i | {Y_i}_{i=1}^m ) ≤ q, where q = f(q^{(12)}, q^{(13)}, ..., q^{(m−1,m)}). To avoid confusion we call E( Σ_i V_i / Σ_i R_i | {Y_i}_{i=1}^m ) the pooled BFDR. The following property yields the simplest bound for q.

Proposition 5.1. Suppose there are m Bayesian models F_i(U_i, θ_i), i = 1 ... m, where U_i is the observed data and θ_i ∈ Θ_i is the parameter. There are n_i null hypotheses for the ith model such that H_{ij} has nonzero prior probability for each 1 ≤ i ≤ m, 1 ≤ j ≤ n_i. A joint model F(U_1, ..., U_m, θ_1, ..., θ_m) exists whose marginal models are the F_i(U_i, θ_i), i = 1 ... m. Let q_i be the BFDR controlled by the decision rule D_i(U_i) in the ith model. Let r_{ij} = E(H_{ij} | U_i) be the posterior probability of H_{ij} under the ith marginal model, and let r̄_{ij} = E(H_{ij} | {U_i}_{i=1}^m) be the posterior probability of H_{ij} under the

joint model. Assume i) E(V_i | U_i, D_i(U_i)) ≤ q_i R_i; ii) r̄_{ij} ≤ r_{ij} for all i, j. Then the pooled BFDR satisfies E( Σ_i V_i / Σ_i R_i | {U_i, D_i(U_i)}_{i=1}^m ) ≤ max_i q_i.

Proof. The pooled BFDR equals [ Σ_i Σ_{j rej. by D_i(U_i)} r̄_{ij} ] / Σ_i R_i, which by ii) is bounded above by [ Σ_i Σ_{j rej. by D_i(U_i)} r_{ij} ] / Σ_i R_i. But Σ_{j rej. by D_i(U_i)} r_{ij} = E(V_i | U_i, D_i(U_i)), so by i) the pooled BFDR is at most Σ_i q_i R_i / Σ_i R_i, which, being a weighted average of the q_i, is at most max_i q_i.

Intuitively, the prerequisite for discussing the pooled BFDR is that a joint model for all data and parameters exists and reduces to the marginal models. Condition i) specifies that the marginal decisions are good enough, and condition ii) describes a weakly dependent relationship among the marginal models. Notice that i) in Prop. 5.1 always holds for the nonrandomized decision rules of Chapter 2.

Consider the following example: three tests are done to determine whether normally distributed samples are centered at zero. With a conjugate prior and no prior preference, P(H_i | Y) = [1 + a exp( −b z_i^2 / (2(c + σ^2)) )]^{−1} (Berger, 1985), where a, b, c are known constants and σ^2 is the variance. Assume the three samples have the same variance. The three sample variances are 1.45, 1.2, and 1.30, so the pooled sample variance is 1.32. P(H_i | Y) is increasing in σ^2. If the sample variance is used to estimate σ^2, then r̄_{1j} ≤ r_{1j} and r̄_{3j} ≤ r_{3j}, because the joint model uses the pooled variance estimate 1.32 in place of the first and third marginal models' estimates 1.45 and 1.30. The pooled BFDR of the first and third tests will then be smaller than the maximum of the two BFDRs estimated by the marginal models.

When the marginal models f(U_i, θ_i) are all independent, condition ii) in Prop. 5.1 is clearly satisfied; moreover, under independence r̄_{ij} = r_{ij}. The following corollary is the direct result.

Corollary 5.2 (independent marginal models). Suppose there are m independent Bayesian models F_i(U_i, Θ_i), i = 1 ... m, and all other conditions are as in Prop. 5.1. Then the pooled BFDR E( Σ_i V_i / Σ_i R_i | {U_i, D_i(U_i)}_{i=1}^m ) ≤ max_i q_i.

Proof. This is a special case of Prop. 5.1: we have r̄_{ij} = r_{ij}, and the joint model is unique by independence.

The next corollary loosens the independence assumption on the marginal models. When the marginal models are conditionally independent in such a way that the r_{ij} do not increase under the joint model, Prop. 5.1 also applies.

Corollary 5.3 (conditionally independent marginal models). Under the setting of Prop. 5.1, and still assuming i), a sufficient condition for ii) to hold is the following: iii) for each i, write U = (U_1, ..., U_m) = (U_i, K_{i1}(U_1), ..., K_{i,i−1}(U_{i−1}), K_{i,i+1}(U_{i+1}), ..., K_{im}(U_m)), where K_{ij}(U_j) = U_j \ U_i; conditional on U_i, H_{ij} is independent of K_{i1}(U_1), ..., K_{i,i−1}(U_{i−1}), K_{i,i+1}(U_{i+1}), ..., K_{im}(U_m).

Proof. r̄_{ij} = P(H_{ij} | U) = P(H_{ij} | U_i, K_{i1}(U_1), ..., K_{i,i−1}(U_{i−1}), K_{i,i+1}(U_{i+1}), ..., K_{im}(U_m)) = E(I(H_{ij}) | U_i) by iii), which equals r_{ij} for all i, j by definition.

Heuristically, the idea is that r̄_{ij} depends only on U_i, the data of the ith marginal model, and not on the additional data provided by the other marginal models; hence we still control the pooled BFDR E( Σ_i V_i / Σ_i R_i | {U_i, D_i(U_i)}_{i=1}^m ) ≤ max_i q_i by Prop. 5.1.

The following simple corollary gives additional help in approximating the pooled results.

Corollary 5.4 (classical pooled FDR). The pooled classical FDR satisfies E( Σ_i V_i / Σ_i R_i ) ≤ Σ_i q_i.

Proof. E( Σ_i V_i / Σ_i R_i ) = Σ_i E( V_i / Σ_j R_j ) ≤ Σ_i E( V_i / R_i ) ≤ Σ_i q_i.

In a general scenario such as all pairwise comparisons, one condition is involved in more than one comparison, which induces dependence among the marginal models involved. For example, when m = 3 the pairwise comparisons (1-2, 1-3, and 2-3) are dependent, because any two of them share a condition. The joint model for the 3 conditions allows two or more of the conditions to share the same mean, so unless all null hypotheses are false, observations from different conditions on the same gene are correlated through the possibly common parameters.

What if the strict relationship r̄_{ij} ≤ r_{ij} fails somewhere? If a joint model exists for all the marginal models F_i(U_i, θ_i), i = 1 ... m, the best estimator is certainly V̂ = Σ_i Σ_{j rej. by D_i(U_i)} r̄_{ij}. However, the estimator Ṽ = Σ_i Σ_{j rej. by D_i(U_i)} r_{ij} is also legitimate, though perhaps not the best one, because the additional information about H^{(ij)} contained in data other than U_i is not used in estimating r_{ij}.

The difficulty in handling multiple conditions is largely due to the mixture model, because it is difficult to build a solvable finite mixture model for multiple conditions.

Generally, detecting significant genes uses a two-sided test of a precise null hypothesis. The probability of a point set is always zero when the probability distribution is dominated by Lebesgue measure, and the mixture structure is introduced precisely to assign point mass to the null hypothesis. This approach works well when only a few points need point mass; for example, when m = 3 the model in (5.2) is still usable. For general m, the finite mixture model cannot be used due to its complexity, but we can instead use the approximations based on Corollaries 5.3 and 5.4. A simulation study in Chapter 6 gives an example of the approximation.

5.3 Testing an interval type null hypothesis by Bayesian linear models

In general, a Bayesian model has no trouble testing a composite hypothesis against another composite hypothesis if the prior puts nonzero probability on both hypotheses. If a two-sided hypothesis test can be treated as a composite-versus-composite problem, then there is no trouble evaluating r_i = E(H_i | Y). Let us begin with the simplest one-sample problem Y ~ N(μ, σ^2) with σ^2 known. Let f(μ) be the prior, and let the null hypothesis be represented by an interval (−b, b). Then the prior probability P(H_0) is

    P(H_0) = ∫_{−b}^{b} f(μ) dμ.                                                            (5.3)

Berger (1985) argued that it is natural to assign P(H_0) = 0.5 as the prior when a mixture structure is used. We take the same approach: P(H_0) is pre-chosen and reflects the preference toward the null hypothesis. To mimic the frequentist way of thinking, we could allow P(H_0) to be larger than 0.5. Suppose further that f(μ) = f_τ(μ), where τ is a scale

parameter of the prior. Let f_τ(μ) be N(0, τ^2). Then

    P(H_0) = 2Φ(b/τ) − 1,                                                                   (5.4)

where Φ is the cdf of the standard normal. Eqn (5.4) says we need only supply the parameter τ of the prior, or b, the half-length of H_0. Most researchers using microarrays agree that a 2-fold change in expression level is sound evidence of differential expression; this information is very useful in choosing b. In general one can allow b ∈ (1.4, 2.8), or (0.5, 1.5) on the log_2 scale. A smaller b means less conservative detection.

It is then straightforward to adopt the normal linear model likelihood

    Y_{ijk(i)} ~ N(μ_{ij}, σ^2),                                                            (5.5)

where i = 1 ... m denotes conditions, j = 1 ... n denotes genes, and k(i) = 1 ... K(i) denotes replicates. The null hypotheses H_j^{(pq)} are defined on the support of μ_j = (μ_{1j}, ..., μ_{mj}). We use a normal prior independent of the error variance σ^2, where σ^2 still has an inverse gamma prior:

    μ_j ~ ind. N(0, Σ),                                                                     (5.6)

where Σ = {Σ_{pq}} is an m × m covariance matrix.

Instead of using P(H_0) = 0.5, a more appealing approach is to use the proportion of true null hypotheses as the prior P(H_0). A convenient setting for many microarray studies is .9; an empirical estimate could be used as well. For example, the 2-condition prototype model can be fitted, and the estimated mixing proportion λ̂ gives an empirical

Let π^(pq) denote the prior probability of the null hypothesis H^(pq) between conditions p and q. Then we have

P( µ_p − µ_q ∈ (−b, b) ) = π^(pq),    (5.7)

or equivalently

2Φ( b / √(Σ_pp + Σ_qq − 2Σ_pq) ) − 1 = π^(pq).    (5.8)

The diagonal element Σ_pp can be empirically estimated by var(Y_pj). When b = 1, the covariance parameter Σ_pq is the root of

2Φ( 1 / √(var(Y_pj) + var(Y_qj) − 2Σ_pq) ) − 1 = π^(pq).    (5.9)

If we estimate each Σ_ii by var(Y_i), then Eqn (5.9) simplifies further to

2Φ( 1 / √(2 var(Y_i) − 2Σ_pq) ) − 1 = π^(pq).    (5.10)

In both (5.9) and (5.10), Σ_pq is monotonically increasing in π^(pq). This is reasonable: when a null hypothesis is less likely, the means are less likely to be equal, and the correlation is close to 0. On the other hand, when a null hypothesis is very probable, the means are about the same, and the correlation should be close to 1.
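Eqn (5.8) can be inverted in closed form, which gives a quick way to tune Σ_pq. The following sketch is ours (vp and vq stand for empirical estimates of Σ_pp and Σ_qq) and also displays the monotonicity just described.

# Eqns (5.8)-(5.9): choose Sigma_pq so that the prior probability of the
# interval null (-b, b) for mu_p - mu_q equals pi_pq.
sigma_pq <- function(pi_pq, vp, vq, b = 1) {
  z <- qnorm((1 + pi_pq) / 2)
  (vp + vq - (b / z)^2) / 2
}

# Sigma_pq increases with pi_pq; the implied correlation approaches 1
# as the null hypothesis becomes nearly certain:
sigma_pq(pi_pq = c(0.5, 0.9, 0.99), vp = 1, vq = 1)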

The model in Eqn (5.5) can also be fitted using MCMC. The full conditional densities are

µ_j ~ N( (σ^{−2} I + Σ^{−1})^{−1} σ^{−2} Y_j , (σ^{−2} I + Σ^{−1})^{−1} ),
σ² ~ IΓ( a_1 + n Σ_{i=1}^m K(i) / 2 , a_2 + (1/2) Σ_j (Y_j − µ_j)′(Y_j − µ_j) ),    (5.11)

where the vectors Y_j and µ_j are respectively the observations and means of the jth gene across the m conditions. Extensions similar to those in Chapter 3, such as unequal variances and covariances, can also be included.

An even faster approximation is to use the MSE as a quick estimate of σ². Then the posterior distributions of the µ_j are all available without MCMC. In that case, f(µ_j | Y) is a normal distribution, and

E(H^(pq) | Y) = Φ( (b − E(µ_p − µ_q | Y)) / s.e.(µ_p − µ_q | Y) ) − Φ( (−b − E(µ_p − µ_q | Y)) / s.e.(µ_p − µ_q | Y) ).    (5.12)

After calculating Eqn (5.12) for each hypothesis of interest, we can apply the decision rules in Chapter 2 to control the BFDR.

In summary, the procedures in this section provide quick and crude methods for multiple-condition comparisons. The only calculations are tuning the covariance in the prior and evaluating the cdf of some normal distributions. Not surprisingly, the procedures show less detection capability in simulations. However, this crude result is helpful. Firstly, it can help set up starting points for more delicate analysis. Secondly, it is

also useful for checking later results: very different results could suggest possible mistakes in a more sophisticated algorithm or model (Gelman et al., 2004).
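For reference, Eqn (5.12) and the subsequent thresholding amount to a few vectorized lines. The sketch below assumes hypothetical inputs est and se, the posterior means and standard errors of µ_p − µ_q across genes, with the MSE plugged in for σ².

# Eqn (5.12): posterior probability of the interval null (-b, b) for each
# gene under the normal approximation to f(mu_j | Y).
r_interval <- function(est, se, b = 1) {
  pnorm((b - est) / se) - pnorm((-b - est) / se)
}
# The resulting r values feed directly into the BFDR decision rules of
# Chapter 2, e.g., reject the genes with the smallest r values.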

Chapter 6

Numerical examples

In this chapter we summarize several simulation studies using the models and decision rules discussed previously. We also demonstrate the analysis of the spike-in human genome U133 data set, a benchmark provided by Affymetrix as a standard data set for methodology study.

6.1 Simulation study on 2-condition comparison

We assessed the Bayesian model and several decision rules on simulated data. We used a typical setting with 2 conditions where 10% of the genes are differentially expressed. The model error is σ² = 1 and the number of arrays is k_1 = k_2 = 5. The total number of genes is n = 25,000. We tried three scenarios: 1) 5% of the genes have +1.5 and another 5% have −1.5 difference in mean expression; 2) 5% of the genes have +1.5 and another 5% have −1.2 difference in mean expression; 3) all 10% of the genes have +1.5 difference in mean expression. The last scenario is exactly the setting of Ishwaran and Rao (2003). The methods compared are the four nonrandomized decision rules, the mode adjustment, the step-up method and the Bonferroni adjustment. We also applied an FDRmix procedure, which uses Bayesian p-values as the input for the step-up method. Bayesian p-values are defined similarly to classical p-values, as the tail-area probability under the null hypothesis, which corresponds to the posterior distribution

f(Z_2 | Y, ζ_0 = 1). Figures 2.1 and 2.2 are taken from the simulation of the third scenario.

All methods are compared at the same nominal level q = .05. Bonferroni adjustments use the nominal significance level .05. We adopted equal constant losses so as to minimize the total number of mistakes for Rule 2.4. We compare the Bayes models with K = 1 and K = 2. When K = 1 the hyperprior has t_1 = 0, and when K = 2 we set t_1 = .5 and t_2 = −.5 to avoid label switching; α = (9, 1) for K = 1 and (9, .5, .5) for K = 2. The remaining priors are the same for the two models: a_1 = a_2 = .1, b_{1k} = b_{2k} = .1, and τ_3² = 20. It is also observed that the posterior inference is little affected within a large range around the chosen priors.

We used the deviance information criterion (DIC) (Spiegelhalter et al., 2002; Gelman et al., 2004) to choose the better model in each scenario. Table 6.1 lists the DIC and p_D in the simulation scenarios. The complicated model (K = 2) beats the simpler model (K = 1) in the first two scenarios. In the third scenario, where the differential expressions are completely skewed, the simpler model K = 1 has the smaller DIC. The estimated number of effective parameters p_D, i.e., the model complexity, is larger for the complicated model K = 2 in all of the scenarios.

In the third scenario, the impression from the raw differential expressions Z_{2i} in Figure 6.1 is that the distribution of differential expressions is highly skewed to the right. Having an additional component by setting K = 2 may catch some negatively regulated genes (though there is no such gene in this scenario). However, there is no real difference in the decisions made by the two models, because the additional component is almost empty in the whole posterior simulation after the burn-in period of 2,000 iterations. This also suggests that K = 2 may have redundant parameters with little information in the data for estimation.

Table 6.1. DIC comparisons in the simulation scenarios between the Bayesian models K = 1 and K = 2. Scenario 1 has 5% +1.5 and 5% −1.5 differences in mean expression. Scenario 2 has 5% +1.5 and 5% −1.2 differences in mean expression. Scenario 3 has 10% +1.5 differences in mean expression. (Columns: Scenario, K, DIC, p_D, D̄(y), D(θ̂(y)); the numerical entries are not reproduced here.)

For example, in the last 50 rounds of MCMC, i.e., 1,250,000 sample points for the indicators corresponding to the additional component, there is not a single occurrence of 1.

Step-up 1 used p-values from the two-way ANOVA with the pooled variance estimator MSE, and step-up 2 used individual two-sample t-tests with a pooled variance within each test. Bonf 1 and Bonf 2 are the corresponding Bonferroni adjustments. In the tables below, the FDP column is the observed false discovery percentage and the BFDR column is the estimated Bayes FDR. We also provide the maximal local fdr, max(r_i), where the concept applies. Notice that the empirical bound for the adaptive BFDR control rule is deliberately chosen as .1 to show the difference from the optimal BFDR control rule.
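For concreteness, a minimal R sketch of scenario 1 and of the "step-up 2" competitor follows, under the design stated above; the step-up adjustment is taken to be Benjamini-Hochberg (method "BH" in p.adjust), and the observed FDP is computed from the known truth. All object names are illustrative.

# Scenario 1: n = 25,000 genes, k1 = k2 = 5 arrays, sigma^2 = 1,
# 5% of the genes at +1.5 and 5% at -1.5 mean difference.
set.seed(1)
n <- 25000; k <- 5
delta <- c(rep(1.5, 0.05 * n), rep(-1.5, 0.05 * n), rep(0, 0.90 * n))
Y1 <- matrix(rnorm(n * k), n, k)          # condition 1
Y2 <- matrix(rnorm(n * k), n, k) + delta  # condition 2 (row j shifted by delta[j])

# "Step-up 2": gene-wise two-sample t-tests with pooled variance,
# then the Benjamini-Hochberg step-up adjustment at q = .05.
pvals <- sapply(seq_len(n), function(j)
  t.test(Y2[j, ], Y1[j, ], var.equal = TRUE)$p.value)
rej <- p.adjust(pvals, method = "BH") <= 0.05
c(detected = sum(rej), FDP = mean(delta[rej] == 0))  # observed false discovery proportion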

Table 6.2. Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is +1.5 for 1,250 genes and −1.5 for another 1,250 genes. (Rows: local fdr, BFDR, adaptive, risk, mode, FDRmix, step-up 1, step-up 2, Bonf 1, Bonf 2. Columns: detected, false detected, false non-detected, total mistakes, BFDR, FDP, BFNR, FNR, max(r_i); the numerical entries are not reproduced here.)

Table 6.3. Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is +1.5 for 1,250 genes and −1.2 for another 1,250 genes. (Same rows and columns as Table 6.2; the numerical entries are not reproduced here.)

Table 6.4. Method comparisons on simulated data with 2,500 differentially expressed genes among n = 25,000. The differential expression size is +1.5 for all 2,500 genes. (Same rows and columns as Table 6.2; the numerical entries are not reproduced here.)

From the simulation results it can be seen that the rules aimed at controlling the BFDR, namely Rule 2.1 (local fdr control), Rule 2.2 (optimal BFDR control), Rule 2.3 (adaptive BFDR control) and the mode adjustment, have observed FDP at about the nominal level q. Rule 2.1 is quite conservative, and all the others produce more detections. Rule 2.4, which is meant to minimize the total number of mistakes, has the fewest mistakes among all competitors in each scenario. Again, since the q level has a rather vague meaning, Rule 2.4 actually suggests a good nominal level: for example, in the third scenario a good choice of the nominal level is around .2. Notice that in Ishwaran and Rao (2003) the claimed best method has an observed FDP of .189, which agrees with our finding here. Rules 2.2 and 2.3 and the mode adjustment show good agreement among the actual BFDR, the nominal level and the observed FDP. We can conclude that those rules, in conjunction with the prototype model, attain the nominal level well.
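The BFDR-controlling rules share a simple computational core once the posterior null probabilities r_i are available. The sketch below is a generic version of optimal-BFDR thresholding (in the spirit of Rule 2.2, not a verbatim transcription of Chapter 2): reject the hypotheses with the smallest r_i for as long as the running average of the rejected r_i, i.e., the estimated BFDR, stays at or below q.

# Generic BFDR thresholding given posterior null probabilities r.
bfdr_reject <- function(r, q = 0.05) {
  ord  <- order(r)                       # most significant genes first
  bfdr <- cumsum(r[ord]) / seq_along(r)  # estimated BFDR after 1, 2, ... rejections
  nrej <- max(c(0, which(bfdr <= q)))    # largest rejection set meeting level q
  rejected <- logical(length(r))
  rejected[ord[seq_len(nrej)]] <- TRUE
  rejected
}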

Fig. 6.1. The raw differential expressions for the simulated data with true difference size 1.5 for the first 10% of the genes.

The Bayesian decision rules always give results no worse than the step-up methods, partially because the step-up method is derived on a model-free basis and is known to be conservative. In the extreme case where the differential expressions are totally skewed, under the same nominal level q = .05 the optimal BFDR rule rejects 287 more hypotheses, and the mode adjustment 294 more hypotheses, than the step-up method based on p-values from the 2-way ANOVA. With asymmetric differential expressions, the conservativeness of the step-up method is comparable to that of Rule 2.1, the local fdr control rule, when the p-values come from a similar linear model. When the differential expressions are perfectly symmetric, the step-up method performs approximately the same as the optimal BFDR rule. It is also noticed that a less accurate model, i.e., individual two-sample t-tests, results in a big loss of detection capability.

Fig. 6.2. Performance of the optimal BFDR rule in the third simulation scenario. (a) Estimated total mistakes versus rejections; (b) observed FDP (solid) and estimated BFDR (dashed) versus rejections; (c) observed FNP (solid) and estimated BFNR (dashed) versus rejections; (d) observed total mistakes (solid) and estimated total mistakes (dashed).
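Curves of the kind displayed in Figure 6.2 can be traced directly from the posterior null probabilities. A sketch follows, with a placeholder vector r standing in for the model output.

r <- sort(runif(25000))^4  # placeholder for posterior null probabilities from the model
r_sorted <- sort(r)
est_fd   <- cumsum(r_sorted)                          # expected false discoveries
est_fnd  <- sum(1 - r_sorted) - cumsum(1 - r_sorted)  # expected false nondiscoveries
est_bfdr <- est_fd / seq_along(r_sorted)              # estimated BFDR, as in panel (b)
est_tot  <- est_fd + est_fnd                          # estimated mistakes, panels (a) and (d)
plot(est_tot, type = "l", xlab = "rejections", ylab = "estimated total mistakes")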

6.2 Affymetrix Latin Square spike-in HG-U133 data

6.2.1 The analysis

This data set is downloadable from the Affymetrix website (.../data/datasets.affx) for methodology research. It consists of 3 technical replicates of 14 separate samples, where 42 artificially spiked transcripts differ between each pair of hybridizations. The spike-in concentrations range from 0.125 pM to 512 pM; 30 spikes are isolated from a human cell line, 4 spikes are bacterial controls, and 8 spikes are artificially engineered sequences believed to differ from any sequence in the human genome. There are many studies on preprocessing and normalization using the spike-in data set; see Bolstad et al. (2003) and Cope et al. (2003) for a comprehensive comparison of existing methods. The latter paper refers to an up-to-date website maintained by the Department of Biostatistics, Johns Hopkins University. We applied the robust multi-array average (RMA) normalization (Irizarry et al., 2003) to the raw data, and fitted the normalized data using the Bayesian model (3.14).

The advantage of using the spike-in data instead of other experimental data is that the truly differentially expressed genes are known in advance; hence the observed FDP of a testing procedure is also available. The spike-in data are superior to simulated data since they have an error distribution similar to that of real microarrays. However, besides the 42 spiked-in probe sets, some nonspike genes that are not supposed to be differentially expressed also show very obvious differential expression. Refer to McGee and Chen (2006) for a more detailed discussion, where another 22 nonspike probe sets are identified as truly differentially expressed.

The list of the extra 22 genes can be found in Appendix C of McGee and Chen (2006). In evaluating the observed FDP, we adopt the result in McGee and Chen (2006) and treat all 64 probe sets as truly differentially expressed genes. The comparison we consider is between the first two experimental conditions, which correspond to concentrations 0 pM and 0.125 pM for concentration group 1 (consisting of probe sets ..._at, ..._at, and ..._s_at).

Since the asymmetric differential expression is so obvious, as shown in Figure 6.3, we used a K = 2 mixture to account for the heavy tail on the negative side. One component is assigned an informative prior to catch the weakly differential genes with differential expressions approximately within (−2, 2), and the other component has a flat prior to account for the heavy tail on the negative side. The highly asymmetric and heavy-tailed data shown in Figure 6.3 are rather rare in real studies. However, for the sake of methodological study, these data still serve as a good benchmark for verifying statistical models and multiple test procedures.

The differential expression is highly skewed, and almost every method can identify the extremely large differences. The differences among methods lie in the detection of weak differential expression. Using the nominal level .15, Rule 2.1 (local fdr) detects 47 genes with 1 false detection and an observed FDP = .021. Rule 2.2 (optimal BFDR) detects 57 genes with 6 false detections and an observed FDP = .105. With an adaptive bound of .9, Rule 2.3 (adaptive control) detects 50 genes with 2 false discoveries and an observed FDP = .04. The spike-in data have very few true differences (λ_0 = .997) and very few discoveries. Setting the constant losses so that a false nondetection costs 10 times as much as a false detection, to incline towards more detections, Rule 2.4 (total risk) suggested an optimal nominal level of around .13. The mode adjustment gives a result similar to that of the optimal BFDR rule.

Using the popular analysis package limma, which applies the empirical Bayes method of Smyth (2004) for variance adjustment together with the step-up method, we found the same 47 significant genes as the local fdr rule. The optimal BFDR rule thus provided roughly a 10% improvement in true detections and 20% in total detections.

6.2.2 Model diagnosis

The likelihood of the Bayes model uses the same assumption as a normal-response linear model, namely an i.i.d. normal distribution for the error term. The residuals are used to estimate the errors for each gene. Let (µ̂, α̂, ζ̂) be the Bayesian estimators of the parameters. Then the residual is defined as r = Y − Ê(Y) = Y − µ̂ − ζ̂α̂. This version of the residual is more similar to the classical residual than to a true Bayesian residual; the latter accounts for the variance in the parameters as well as the variance in the likelihood. However, since the discrete parameter ζ introduces a large variance into the posterior distribution, the true Bayesian residual does not show a satisfying pattern even when the data are generated from the model. Also, because the discrete parameter ζ induces a multimodal distribution for α, we prefer the posterior mode to the posterior mean for the estimator α̂.

We use diagnostic graphs of the residuals to check the important assumptions on the error terms in the spike-in data analysis. Figure 6.4 shows hexbin plots of the residuals against genes and against predicted expression levels (in log scale), where the gray scale of the hexagons indicates the intensity of the residuals. The gray-scale graph shows more detail than ordinary residual plots in the diagnosis of large-scale models. The hexbin plots are created in R using the hexbin package (version 1.6.0).

The residuals show a slightly longer tail than the normal distribution.

This could be explained by the additional noise introduced by estimating µ and α, whose posterior distributions are not necessarily nearly normal given the small sample size k_1 = k_2 = 3. There is also a small bump in the residuals when the expression levels are between 1.5 and 2. The slightly wider spread in this range could be accounted for by the agglomeration of genes, i.e., most of the genes have expression levels in this range.
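The diagnostics in Figure 6.4 can be reproduced along the following lines, where Y is the matrix of normalized log expressions and Yhat the fitted means µ̂ + ζ̂α̂; both object names are hypothetical.

library(hexbin)  # the CRAN package used for Figure 6.4
res  <- as.vector(Y - Yhat)                     # residuals, column by column
pred <- as.vector(Yhat)                         # predicted mean expressions (log scale)
gene <- rep(seq_len(nrow(Y)), times = ncol(Y))  # gene IDs matching as.vector() order
plot(hexbin(pred, res, xbins = 50),
     xlab = "predicted expression", ylab = "residual")  # panel (a)
plot(hexbin(gene, res, xbins = 50),
     xlab = "gene ID", ylab = "residual")               # panel (b)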

Fig. 6.3. Observed differential expressions Z_2 plotted against gene ID. Small dots are null genes (true nondiscoveries); true discoveries, false discoveries (+) and false nondiscoveries are marked by separate symbols. Detection used the optimal BFDR rule (Rule 2.2) at q = .15. The bottom graph removes all the null genes.

Fig. 6.4. Residual plots for the Latin Square spike-in experiment. (a) Residuals against predicted mean expressions; (b) residuals against gene IDs.
