Adaptive testing of conditional association through Bayesian recursive mixture modeling


Li Ma

February 12, 2013

Abstract

In many case-control studies, a central goal is to test for association or dependence between the predictors and the response. Relevant covariates must be conditioned on to avoid false positives and loss of power. Conditioning on covariates is easy in parametric frameworks such as logistic regression, where the covariates enter the model as additional variables. In contrast, nonparametric methods such as the Cochran-Mantel-Haenszel test accomplish conditioning by dividing the data into strata, one for each possible covariate value. In modern applications this often gives rise to numerous strata, most of which are sparse due to the multi-dimensionality of the covariate and/or predictor space, while in reality the covariate space often consists of just a small number of subsets with differential response-predictor dependence. We introduce a Bayesian approach to inferring such an effective stratification from the data and testing for association accordingly. The core of our framework is a recursive mixture model on the retrospective distribution of the predictors, whose mixing distribution is a prior on the partitions of the covariate space. Inference under the model can proceed efficiently in closed form through a sequence of recursions, striking a balance between model flexibility and computational tractability. A power study shows that our method substantially outperforms classical tests under various scenarios.

1 Introduction

The case-control or retrospective sampling scheme is a popular design for studying the dependence relationship between a set of predictors and a binary response variable. It is more cost efficient than prospective (or cohort) studies when one of the response categories is much less common than the other [1]. In the past 15 years, this design has been widely adopted in areas such as epidemiology, most notably in the so-called genetic association studies [9], which aim at finding the genetic factors that affect disease risks. In these studies, a central focus is to test for the association (or dependence) between the predictors (the gene markers) and the response (the disease).

The classical approach to analyzing case-control studies centers around inference on the odds ratio [2]. In particular, the most popular class of methods of this sort is logistic regression, which imposes linear modeling assumptions on the odds ratio on the log-transformed scale. In many applications such as genetic association studies, there is no particular reason to believe that genetic factors add to (or subtract from) the disease risk in a linear, additive fashion under a particular scale such as the logit. Indeed, despite the ever increasing sample sizes involved in such studies, the genetic factors discovered for most common complex diseases, such as heart disease and diabetes, contribute to only a small fraction of the heritability of these diseases [8]. One potential direction, among others, toward unraveling this mystery of missing heritability is to develop more flexible tests for association, in particular tests without any parametric modeling assumptions.

Several nonparametric methods for testing association have been adopted in applications such as genetic association studies. They are classical tests for independence on contingency tables, including Pearson's χ²-test [1, Sec. 3.2] as well as their Bayesian counterparts based on computing the appropriate Bayes factor (BF) [6, 10]. While these classical nonparametric tests afford more flexibility in capturing the different modes of predictor-response association than parametric methods, they have a few important shortcomings that have significantly

limited their usefulness. One of the most important limitations is the difficulty, in comparison to parametric methods, of incorporating covariates that should be properly conditioned on. In genetic association studies, for example, such covariates may include gender, race, diet, and smoking status. Without proper conditioning, phenomena such as Simpson's paradox may arise, leading to false positives and loss of power in detecting true signals.

Classical extensions to these tests, such as the generalized Cochran-Mantel-Haenszel (CMH) test [1], allow conditioning on covariates by dividing the marginal predictor-response table into conditional ones, the so-called strata, based on the covariate values. In particular, the CMH method treats the observations with each possible covariate value combination as a separate stratum, and computes a global test statistic essentially by adding up statistics summarizing the evidence from each stratum. In many modern applications, the strata can be very sparse, containing few data points, due to the multi-dimensionality of the covariates and/or predictors. In addition, when adding up the information across strata, the CMH test relies on a homogeneous effect assumption: that the dependence structure, as measured by the odds ratio, is the same (or in practice at least similar) across all the strata. When this assumption holds, the test is moderately robust to data sparsity, but the assumption is often unrealistic in modern applications with multiple covariates. Consequently, classical methods based on such brute-force conditioning often perform very poorly in modern applications such as genetic association studies and are not commonly adopted.

In many practical situations, the covariate space can often be divided into just a small number of homogeneous groups with respect to their relation to the predictor-response dependence. For example, in epidemiological studies, even though many covariates may be considered, one often expects only a small number of them to be relevant. In other words, the underlying actual number of strata necessary for appropriate conditioning is small, suggesting that the brute-force approach to conditioning can be extremely wasteful.

Following this reasoning, it is desirable to condition on the covariates based on a stratification

of the covariate space that is as parsimonious as possible. But of course how the covariate space should be partitioned to provide sufficient conditioning is typically not known a priori. A key motivation of this work is that such information is contained in the data and, with the appropriate method, can be inferred. In particular, we introduce a Bayesian framework for nonparametric testing of association that achieves such adaptive conditioning on the covariates. Inference under our framework proceeds entirely in a principled, probabilistic fashion that properly takes into account all sources of uncertainty, including that involved in inferring the stratification. The framework is based on a recursive mixture formulation for the retrospective distribution of the predictors, for which the mixing probabilities are defined by a recursively constructed probability distribution on the collection of partitions over the covariate space. This recursive mixture design allows inference under the framework to be carried out efficiently in closed form through a sequence of recursions, striking a balance between model flexibility and computational tractability.

The rest of the paper is organized as follows. In Section 2 we start by briefly reviewing existing Bayesian methods for testing association without any covariates. Then we introduce our method as an extension of the existing approach in the simple case with a binary covariate. Next we generalize our framework to cover the scenario with multiple predictors and covariates. In Section 3 we illustrate our method through numerical examples and carry out a power study to investigate its performance and compare it to those of two classical methods. In Section 4 we discuss choices of prior specification and investigate the robustness of our method to different choices. Finally we conclude with some discussion in Section 5.

While the framework we propose can be applied to both discrete and continuous predictors and covariates, we focus our presentation on the discrete case, partly due to our prime motivating example, genetic association studies, and partly for simplicity. In the following, we use G to denote predictors, E for covariates, and D for the response, standing for genes, environment, and disease respectively, as in a genetic association study.

2 Methods

2.1 Testing association without covariates

We start from the simple case in which there is a single discrete predictor G and no covariates. For illustrative purposes, let G be a single trinary predictor, such as a single-nucleotide polymorphism (SNP) marker commonly encountered in genetic association studies, so that G ∈ Ω_G = {0, 1, 2}. Also, let θ_d = (θ_{d0}, θ_{d1}, θ_{d2}) for d = 0 and 1 be the distribution of G given the response D = d, that is, P(G | D = d). In other words, θ_0 and θ_1 denote the control and case retrospective predictor distributions, respectively. Accordingly, we let n_{dg} for d = 0, 1 and g = 0, 1, 2 denote the number of observations with D = d and G = g, and let n_d = (n_{d0}, n_{d1}, n_{d2}) for d = 0, 1. Under the null hypothesis there is no association between G and D,

H_0 : θ_0 = θ_1,

and under the alternative there is association,

H_1 : θ_0 ≠ θ_1.

A simple Bayesian approach for testing H_0 against H_1 is to place Dirichlet priors on these probability vectors [10] and compute the corresponding marginal likelihood under each hypothesis. One can then calculate the Bayes factor (BF) [6] as a ratio of these marginal likelihoods and use it as an instrument for testing. More specifically, under H_0, we let θ_0 = θ_1 ∼ Dirichlet(α), where α = (α_0, α_1, α_2) are the prior pseudo-count parameters of the Dirichlet prior. A common non-informative choice is to let α_0 = α_1 = α_2 = 0.5. It is easy to show that given this prior, the marginal likelihood under H_0 is

M_0 = D(n_0 + n_1 + α) / D(α)

where the function D(·), when applied to a vector of length p, is defined as

D(t_1, t_2, ..., t_p) = ∏_{i=1}^p Γ(t_i) / Γ(∑_{i=1}^p t_i).

Similarly, under the alternative hypothesis H_1, we let θ_0 ∼ Dirichlet(α_0) and θ_1 ∼ Dirichlet(α_1), where α_d = (α_{d0}, α_{d1}, α_{d2}) for d = 0 or 1 are the pseudo-count parameters for the Dirichlet priors. The marginal likelihood under H_1 is then

M_1 = D(n_0 + α_0) D(n_1 + α_1) / [D(α_0) D(α_1)].

The BF for comparing H_1 against H_0 is then BF = M_1 / M_0. Stephens and Balding [10] called this BF the retrospective BF, or rBF, since it is based on modeling the retrospective distribution of G, P(G | D).

Besides computing the BF, one may further assign prior probabilities to the two hypotheses (or competing models) H_0 and H_1: π(H_0) = γ and π(H_1) = 1 − γ. One can think of this as modeling the retrospective distribution P(G | D) using a mixture model with two mixing components H_0 and H_1. The marginal likelihood under this mixture model is then M = γ M_0 + (1 − γ) M_1. By Bayes' theorem, the probability of H_0 given the data, that is, the posterior probability of no association between G and D, is

π(H_0 | data) = π(H_0) M_0 / M = γ M_0 / [γ M_0 + (1 − γ) M_1] = γ / [γ + (1 − γ) BF].    (1)
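
The computation above is simple enough to sketch in code. Below is a minimal illustration of the Dirichlet-multinomial marginal likelihoods and the resulting posterior probability in Eq. (1); the function names and the example counts are our own illustrations, not part of the paper, and a practical implementation would follow the same recipe.

```python
# A minimal sketch (not the paper's code) of the no-covariate test in Section 2.1.
import numpy as np
from scipy.special import gammaln

def log_dirichlet_ml(counts, alpha):
    """log[ D(counts + alpha) / D(alpha) ]: the multinomial-Dirichlet marginal likelihood."""
    counts, alpha = np.asarray(counts, float), np.asarray(alpha, float)
    def log_D(t):
        return np.sum(gammaln(t)) - gammaln(np.sum(t))
    return log_D(counts + alpha) - log_D(alpha)

def posterior_null_prob(n0, n1, alpha=(0.5, 0.5, 0.5), gamma=0.5):
    """Posterior probability of H_0 (no association), Eq. (1)."""
    log_M0 = log_dirichlet_ml(np.add(n0, n1), alpha)                      # pooled counts under H_0
    log_M1 = log_dirichlet_ml(n0, alpha) + log_dirichlet_ml(n1, alpha)    # separate theta's under H_1
    log_BF = log_M1 - log_M0                                              # retrospective Bayes factor
    return gamma / (gamma + (1 - gamma) * np.exp(log_BF))

# Example: counts of G = 0, 1, 2 among controls (D = 0) and cases (D = 1)
print(posterior_null_prob(n0=[40, 45, 15], n1=[25, 50, 25]))
```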

2.2 Testing conditional association with a single binary covariate

Next we consider the case with covariates, starting with the simplest such scenario in which there is a single binary covariate E ∈ Ω_E = {0, 1}. This covariate may be an environment variable (e.g., smoking or not) that may affect the joint distribution of the predictors and the response. There are two possibilities for the relationship between the covariate E and the distribution of G given D: either the distribution P(G | D, E) depends on the value of E or it does not. An appropriate model for the retrospective distribution should take into account both possibilities.

To present such a model, we introduce some new notation. Let θ_{ed} = (θ_{ed0}, θ_{ed1}, θ_{ed2}) for d = 0, 1 and e = 0, 1 be the probability distribution vector of G given D = d and E = e, that is, P(G | D = d, E = e). Similarly, let n_{edg} be the number of observations with D = d, E = e and G = g, and let n_{ed} = (n_{ed0}, n_{ed1}, n_{ed2}). For any non-empty A ⊆ Ω_E, that is, A = {0}, {1}, or {0, 1}, we define

H(A) : θ_{e0} ≡ θ_0(A) and θ_{e1} ≡ θ_1(A) for all e ∈ A.

In other words, H(A) is the hypothesis that A forms a homogeneous stratum: the retrospective distribution of G given D does not depend on E given E ∈ A, or P(G | D, E = e) = P(G | D, E ∈ A) for all e ∈ A. Note that when A = {0} or {1}, it has only one element, and so H(A) holds trivially.

The hypothesis H(A) can be further divided into two sub-hypotheses depending on whether P(G | D, E ∈ A) depends on D:

H_0(A) : H(A) is true, and θ_0(A) = θ_1(A)
H_1(A) : H(A) is true, but θ_0(A) ≠ θ_1(A).

In words, H_0(A) represents the scenario in which A forms a single stratum and, given E ∈ A, no association exists between G and D. In contrast, H_1(A) represents the case in which the retrospective distribution differs across the two response groups, so there is association between G and D conditional on E ∈ A.

To compare H_0(A) and H_1(A) in a Bayesian framework, we can again place Dirichlet

priors on the retrospective distribution of G given D for E ∈ A. Under H_0(A), we let

θ_0(A) = θ_1(A) ≡ θ(A) ∼ Dirichlet(α(A)),

where α(A) = (α_0(A), α_1(A), α_2(A)) are the pseudo-count parameters. Under H_1(A), we let

θ_0(A) ∼ Dirichlet(α_0(A)) and θ_1(A) ∼ Dirichlet(α_1(A)),

where α_d(A) = (α_{d0}(A), α_{d1}(A), α_{d2}(A)) for d = 0, 1 are the pseudo-count parameters. With this setup, the marginal likelihoods from the data with E ∈ A under H_0(A) and H_1(A) are

M_0(A) = D(n_0(A) + n_1(A) + α(A)) / D(α(A)) and M_1(A) = D(n_0(A) + α_0(A)) D(n_1(A) + α_1(A)) / [D(α_0(A)) D(α_1(A))],    (2)

where n_d(A) = (n_{d0}(A), n_{d1}(A), n_{d2}(A)) is the sample size vector with

n_{dg}(A) := ∑_{e ∈ A} n_{edg} = # of observations with D = d, G = g and E ∈ A.

The corresponding BF for the two components H_0(A) versus H_1(A) is BF(A) = M_1(A)/M_0(A), which measures the evidence against no association given E ∈ A under H(A).

Similar to before, we can place prior probabilities on H_0(A) and H_1(A) under H(A): π(H_0(A) | H(A)) = γ(A) = 1 − π(H_1(A) | H(A)), that is, we model the retrospective distribution of G given D and E ∈ A under H(A) using a mixture with components H_0(A) and H_1(A). The marginal likelihood under H(A) is then

M(A) = γ(A) M_0(A) + (1 − γ(A)) M_1(A).    (3)

By Bayes' theorem, we get the probability of H_0(A) given the data

γ_post(A) := π(H_0(A) | H(A), data) = γ(A) M_0(A) / M(A) = γ(A) / [γ(A) + (1 − γ(A)) BF(A)].    (4)
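
For concreteness, here is a small sketch of Eqs. (2)-(4) for a covariate subset A, reusing log_dirichlet_ml from the earlier sketch; the data layout (a dictionary mapping each covariate value e to a pair of count vectors) is our own assumption, not notation from the paper.

```python
# A hedged sketch of Eqs. (2)-(4) for a covariate subset A; not the paper's code.
# counts_by_e[e] = (n_e0, n_e1): count vectors of G for controls and cases with E = e.
import numpy as np

def stratum_quantities(counts_by_e, A, alpha=(0.5, 0.5, 0.5), gamma=0.5):
    # Pool counts over the covariate values in A to get n_0(A) and n_1(A)
    n0_A = np.sum([counts_by_e[e][0] for e in A], axis=0)
    n1_A = np.sum([counts_by_e[e][1] for e in A], axis=0)
    log_M0 = log_dirichlet_ml(n0_A + n1_A, alpha)                           # Eq. (2), under H_0(A)
    log_M1 = log_dirichlet_ml(n0_A, alpha) + log_dirichlet_ml(n1_A, alpha)  # Eq. (2), under H_1(A)
    M0, M1 = np.exp(log_M0), np.exp(log_M1)   # for larger samples, keep these on the log scale
    M = gamma * M0 + (1 - gamma) * M1                                       # Eq. (3)
    gamma_post = gamma * M0 / M                                             # Eq. (4)
    return M0, M1, M, gamma_post
```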

Figure 1: The two-layer mixture model for P(G | D, E) for Ω_E = {0, 1}. Red indicates the conditional mixing probabilities induced by the latent decisions.

This is completely analogous to what we have seen in Eq. (1) for the case without covariates. Either γ_post(A) or BF(A) can be used to test H_0(A) versus H_1(A). In particular, when A = {0, 1}, this allows us to test H_0({0, 1}) (no association between G and D) against H_1({0, 1}) (there is association between G and D) under H({0, 1}), that is, when the covariate does not affect the distribution of G given D at all. This alone is not satisfactory, as we must also take into account the case in which the dependence between G and D does depend on E, that is, when {0, 1} does not form a single stratum. A further extension is needed to produce a test that incorporates the latter possibility.

We address this need in the Bayesian way by introducing a further layer of mixture over the competing hypotheses H({0, 1}) versus H^c({0, 1}). We let π(H({0, 1})) = ρ({0, 1}) and π(H^c({0, 1})) = 1 − ρ({0, 1}) be the mixing (or prior) probabilities for the two hypotheses. The idea is that we will let the data speak as to whether {0, 1} forms a single stratum or not. Figure 1 illustrates the structure of this two-layer mixture model for P(G | D, E).

To see how we may test for association with this model, let Φ({0, 1}) be the overall

marginal likelihood (conditional on the covariate) under the two-layer mixture model. To find Φ({0, 1}), first note that given H({0, 1}), the marginal likelihood is M({0, 1}). Under H^c({0, 1}), on the other hand, the distribution of G given D is different for E = 0 versus E = 1, and so the likelihood in this case is the product of the likelihoods from the data in each of the covariate subsets {0} and {1}, or M({0}) M({1}). Accordingly, the overall marginal likelihood on {0, 1} under the two-layer mixture model is

Φ({0, 1}) = ρ({0, 1}) M({0, 1}) + (1 − ρ({0, 1})) M({0}) M({1}).    (5)

Eqs. (2), (3), and (5) provide a complete recipe for computing the overall marginal likelihood Φ({0, 1}) using the marginal likelihoods from subsets of the data, M({0}) and M({1}). Note that the BF for comparing H({0, 1}) versus its complement H^c({0, 1}) is M({0}) M({1}) / M({0, 1}), and the posterior probability of H({0, 1}) given the data is

ρ_post({0, 1}) := π(H({0, 1}) | data) = ρ({0, 1}) M({0, 1}) / Φ({0, 1}),    (6)

which can be used for testing whether the distribution P(G | D, E) depends on E.

Next let us find the posterior probability of association between G and D given E, or whether the distribution P(G | D, E) depends on the response D. By plugging (3) into (5), we get a seemingly tedious expression for the marginal likelihood:

Φ({0, 1}) = ρ({0, 1}) [γ({0, 1}) M_0({0, 1}) + (1 − γ({0, 1})) M_1({0, 1})]
          + (1 − ρ({0, 1})) [γ({0}) M_0({0}) + (1 − γ({0})) M_1({0})] [γ({1}) M_0({1}) + (1 − γ({1})) M_1({1})].

However, after rearranging terms we can write Φ({0, 1}) = Ψ({0, 1}) + ∆({0, 1}), where

Ψ({0, 1}) = ρ({0, 1}) γ({0, 1}) M_0({0, 1}) + (1 − ρ({0, 1})) γ({0}) γ({1}) M_0({0}) M_0({1})    (7)

and ∆({0, 1}) is defined to be Φ({0, 1}) − Ψ({0, 1}). What are the meanings of Ψ({0, 1}) and ∆({0, 1})? It turns out that Ψ({0, 1}) is the marginal probability of the data jointly with the event

H_0({0, 1}) ∪ (H^c({0, 1}) ∩ H_0({0}) ∩ H_0({1})),

or in words, {G and D are not associated conditional on E}, while ∆({0, 1}) = Φ({0, 1}) − Ψ({0, 1}) is the marginal probability of the data jointly with the event

H_1({0, 1}) ∪ (H^c({0, 1}) ∩ (H_1({0}) ∪ H_1({1}))),

or in words, {G and D are associated conditional on E}. By Bayes' theorem, the posterior probability of no association is thus

P(G and D are not associated conditional on E | data) = Ψ({0, 1}) / Φ({0, 1}).    (8)

Accordingly, the posterior probability of association is 1 − Ψ({0, 1})/Φ({0, 1}). This suggests a three-step recipe for testing the association between G and D conditional on E:

1. Compute M_0(A), M_1(A), and M(A) for A = {0}, {1}, and {0, 1} by Eqs. (2) and (3).

2. Compute Φ({0, 1}) by Eq. (5), and compute Ψ({0, 1}) by Eq. (7).

3. Compute Ψ({0, 1})/Φ({0, 1}) and use it as a test statistic for association.

This test is nonparametric, as our model places no distributional assumptions on P(G | D, E).
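
As a concrete illustration, the following sketch strings Eqs. (5), (7), and (8) together for a single binary covariate, reusing stratum_quantities from the sketch above; the prior probabilities are set to the default 0.5, and, as before, a production implementation would keep all quantities on the log scale.

```python
# A minimal sketch of the three-step recipe for a single binary covariate E.
# counts_by_e has keys 0 and 1; this mirrors Eqs. (5), (7), and (8), not the paper's code.
def no_association_posterior(counts_by_e, rho=0.5, gamma=0.5):
    # Step 1: M_0(A), M_1(A), M(A) for A = {0, 1}, {0}, and {1}
    M0_01, _, M_01, _ = stratum_quantities(counts_by_e, A=(0, 1), gamma=gamma)
    M0_0,  _, M_0,  _ = stratum_quantities(counts_by_e, A=(0,),   gamma=gamma)
    M0_1,  _, M_1,  _ = stratum_quantities(counts_by_e, A=(1,),   gamma=gamma)
    # Step 2: overall marginal likelihood, Eq. (5), and its no-association part, Eq. (7)
    Phi = rho * M_01 + (1 - rho) * M_0 * M_1
    Psi = rho * gamma * M0_01 + (1 - rho) * (gamma * M0_0) * (gamma * M0_1)
    # Step 3: posterior probability of no conditional association, Eq. (8)
    return Psi / Phi
```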

To carry out this testing procedure one must specify the prior mixing probabilities ρ({0, 1}), γ({0, 1}), γ({0}), and γ({1}). While in certain examples one may have a priori beliefs about these probabilities, most often such knowledge is lacking and there need to be some guidelines for choosing these parameters. To this end, a simple choice is to set all of the mixing probabilities to a non-extreme constant value. For example, one may set those probabilities to 0.5, that is, whenever there is a choice between two mixing components we assign 50/50 probability to either. This type of prior specification is made out of convenience, but our experience suggests that it generally leads to good performance and that the resulting inference is robust to this choice: choosing a constant value between 0.1 and 0.9 often leads to similar results. (We investigate the robustness of inference to such choices and justify this claim in Section 4 through a sensitivity analysis.) Additionally, one also needs to specify the prior pseudo-count vectors α(A), α_0(A), and α_1(A) for the Dirichlet priors. For these, we adopt Jeffreys' prior and set all the pseudo-counts to 0.5.

2.3 Testing conditional association with multivariate covariates

Next we consider the more general case with multiple covariates. Let E = (E_1, E_2, ..., E_h) be h covariates with joint sample space Ω_E = Ω_{E_1} × Ω_{E_2} × ··· × Ω_{E_h}. To illustrate ideas, let us consider the case with binary covariates, that is, Ω_E = {0, 1}^h. (Our framework is applicable to other discrete, continuous, or mixed spaces. The case with binary covariates greatly simplifies presentation.) Let A be any non-empty rectangular subset of Ω_E, that is, A = A_1 × A_2 × ··· × A_h, where each A_i for i = 1, 2, ..., h is a non-empty subset of Ω_{E_i} = {0, 1}. We define the hypotheses H(A), H_0(A), and H_1(A) in the same way as we did in Section 2.2, and again model the retrospective distribution of G given D and E ∈ A as a mixture of H(A) and H^c(A) with mixing probabilities ρ(A) and 1 − ρ(A). Similar to before, under H(A), that is, when A forms a homogeneous stratum, we model P(G | D, E) as a mixture of H_0(A) and H_1(A) with mixing probabilities γ(A) and 1 − γ(A). Accordingly, the expressions for the marginal likelihood terms M_0(A), M_1(A), and M(A) all stay the same as given in Eqs. (2) and (3).

The difference from the case with only one covariate arises under H^c(A), that is, when

the retrospective distribution of G given D does depend on E for E ∈ A. Earlier, for A containing more than one value, for example A = Ω_E = {0, 1}, under H^c(A) we modeled the retrospective distribution P(G | D, E) for E = 0 and for E = 1 separately. Similarly, here we want to model the retrospective distribution differently for varying E values, but it is undesirable to adopt a separate model for each possible value of E, as A may contain a large number of possible values. In particular, A = Ω_E contains 2^h values. In real problems, for a large Ω_E, many, if not all, of the E values will correspond to only a small fraction of the observations, while a gigantic model with a huge number of free parameters would be needed to model the retrospective distribution separately under all possible E values, costing power for detecting association.

An oracle, who knows exactly how the retrospective distribution depends on the covariates, would know the optimal way to divide Ω_E for modeling. For example, suppose a study has four covariates: gender (M/F), smoking (Y/N), exercise (Y/N) and diet (Vegetarian/Non-Vegetarian), and that the subjects who smoke and do not exercise have a different retrospective distribution of G given D from the others. The oracle would then divide Ω_E into two blocks Ω_E = Ω_{E,1} ∪ Ω_{E,2} according to the subjects' smoking and exercise status, and adopt two separate models for the retrospective distributions P(G | D, E ∈ Ω_{E,1}) and P(G | D, E ∈ Ω_{E,2}).

In practical situations, however, such oracular knowledge is unavailable, at least not perfectly. Good divisions of the covariate space are not known a priori. But such information is contained in the data and with appropriate methods can be inferred! From a statistical perspective, this is but a model comparison/selection problem, where the different models are indexed by the corresponding stratifications (i.e., partitions) of the covariate space. Indeed, we have already adopted this idea in the simple case of a single binary covariate. There were only two possible partitions of {0, 1}: either {0, 1} is undivided or it is partitioned into {0} and {1}, corresponding to H({0, 1}) and H^c({0, 1}) respectively. The mixing probabilities ρ({0, 1}) and 1 − ρ({0, 1}) form a prior on this (albeit small) model space, which

allows inference on the appropriate stratification through the standard Bayesian machinery.

For general covariate spaces, a natural generalization is to place a prior over the collection of partitions or stratifications of Ω_E, which represent different models for P(G | D, E). This will allow us to infer the appropriate division and thereby achieve adaptive conditioning on the covariates. In principle, we can choose any probability distribution over the collection of partitions of Ω_E. But in making this choice, one must balance the need for modeling flexibility against that for computational tractability. More specifically, we want the prior to cover a wide class of partitions while still allowing inference to be carried out efficiently. The latter is especially important in large-scale studies such as genome-wide studies, where the scientist often needs to test a large number of genomic locations. Having to run, say, a separate Markov chain Monte Carlo (MCMC) chain for each location, aside from the accumulation of Monte Carlo errors, is often impractical.

One way to construct such a prior distribution on the space of partitions that meets this dual goal is to randomly divide the covariate space recursively, in a sequential manner. Recursive partitioning has large support over the space of partitions. Moreover, as we will see, efficient posterior inference can be attained by utilizing the self-similar nature of the prior. Probability distributions over recursive partitions have been extensively studied in the literature; a notable example is Bayesian CART [3, 4, 12]. The prior we adopt here, which we call the optional tree (OT) distribution, is a constructive process recently introduced in the context of density estimation [11].

The key motivation for adopting the OT prior on stratifications is that it gives rise to a recursive mixture model for P(G | D, E) that naturally extends our earlier two-layer mixture in the single binary covariate case. As a result, inference can also proceed efficiently in a way that generalizes the recipe for the single binary covariate case. In the following, we provide the details of this recursive mixture framework and show how it can be used for adaptively testing conditional association. We start with a brief description of the OT distribution.

The optional tree (OT) distribution

The OT distribution can be described in terms of the following constructive procedure, termed the OT procedure, that generates a random partition of Ω_E recursively. Starting from the whole space A = Ω_E, draw a Bernoulli random variable S(A) ∼ Bernoulli(ρ(A)). If S(A) = 1, then we terminate the partitioning procedure on A and end up with a trivial partition of a single block. If instead S(A) = 0, then we proceed to divide A into smaller pieces as follows. Suppose there are a total of N(A) available ways to partition A. We randomly select one of the N(A) ways to divide A and let λ_j(A) denote the probability of choosing the jth way of partitioning. That is, we draw a random variable J(A) ∈ {1, 2, ..., N(A)} with P(J(A) = j) = λ_j(A) for j = 1, 2, ..., N(A) and ∑_{j=1}^{N(A)} λ_j(A) = 1. If J(A) = j, then we divide A in the jth possible way into K_j(A) children,

A = ∪_{i=1}^{K_j(A)} A_i^j,

where A_i^j denotes the ith child of A under the jth possible way of partitioning. For example, if we allow only dimension-wise dyadic cuts on A, that is, each A can be divided into two halves according to the value of one of the covariates, then for A = Ω_E = {0, 1}^h we have N(A) = h, K_j(A) = 2 for all j = 1, 2, ..., h, and the two children A_1^j and A_2^j are

A_1^j = {(e_1, e_2, ..., e_h) : e_j ∈ {0}, and e_i ∈ {0, 1} for i ≠ j, i = 1, 2, ..., h}
A_2^j = {(e_1, e_2, ..., e_h) : e_j ∈ {1}, and e_i ∈ {0, 1} for i ≠ j, i = 1, 2, ..., h}.

This completes the partition step for A = Ω_E, and we then restart the same procedure on each of the children A_i^j, beginning with the drawing of a Bernoulli stopping variable S(A_i^j). This recursive partitioning procedure continues, but it will eventually stop everywhere, thereby producing a random partition of Ω_E consisting of rectangular blocks on Ω_E.
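
To make the constructive procedure concrete, here is a small sketch that draws one random partition of {0, 1}^h under dimension-wise dyadic cuts with a constant stopping probability; the block representation (a tuple of per-coordinate value sets) and the uniform choice of cutting coordinate are our own illustrative choices.

```python
# A sketch of the OT procedure on Omega_E = {0,1}^h with dyadic cuts; not the paper's code.
import random

def sample_ot_partition(block, rho=0.5):
    """block: a tuple of per-coordinate value sets, e.g. ({0, 1},) * h for the whole space."""
    splittable = [j for j, vals in enumerate(block) if len(vals) > 1]
    # Atomic blocks cannot be divided further, so rho(A) = 1 there and we must stop.
    if not splittable or random.random() < rho:          # S(A) = 1: A becomes one block
        return [block]
    j = random.choice(splittable)                        # J(A): which coordinate to cut
    out = []
    for v in sorted(block[j]):                           # the two children under cut j
        child = block[:j] + ({v},) + block[j + 1:]
        out.extend(sample_ot_partition(child, rho))      # recurse, drawing S on each child
    return out

# Example: one random stratification of {0,1}^3
print(sample_ot_partition(({0, 1},) * 3))
```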

The process is specified by the hyperparameters ρ(A) and λ_j(A) for j = 1, 2, ..., N(A), for each set A ⊆ Ω_E that can potentially arise during the procedure. Note that the partitioning procedure will always stop, because the atomic subsets, those A's that contain only one possible value of E, such as A = {1} × {1} × {1} × {1}, cannot be further divided, so we must let ρ(A) = 1 for these sets. Hence if the partitioning procedure has reached all the way down to the atomic sets on some parts of Ω_E, then it is forced to stop there, as no further partitioning is possible. For h = 1, we have Ω_E = {0, 1}, and so A = {0} and A = {1} are the only two atomic subsets, as we have seen in Section 2.2. (In the more general case when some of the covariates are continuous, so that there are no atomic subsets, one can show that the partitioning procedure will stop almost everywhere on Ω_E with probability 1, provided that all prior stopping probabilities ρ(A) are bounded below by a common constant c > 0.)

A recursive mixture model for P(G | D, E)

Now that we have a prior distribution on the partitions of Ω_E, let us associate the random partitions arising from this process with the corresponding models for the distribution of G given D and E. If the OT procedure stops on a subset A, that is, S(A) = 1, we let A form a single stratum and model the retrospective distribution P(G | D, E) for E ∈ A with H(A), specified as a mixture model over H_0(A) and H_1(A) with mixing probabilities γ(A) and 1 − γ(A), as we did for the case with h = 1 in Section 2.2. If instead S(A) = 0 and J(A) = j, so that A is not a single stratum and is divided into its children A = A_1^j ∪ A_2^j, then we adopt H^c(A) for the retrospective distribution P(G | D, E) for E ∈ A and build separate models for the retrospective distribution for E in each A_i^j, in the same manner as we have just described for A. Again, since the partitioning will eventually stop, this recursive mapping from partitions to models is well-defined. An important feature of this recursive prior/model specification is its self-similarity: the prior and model specified on a rectangular subset A are in exactly the same form as those on its children. As we will see below, this feature leads

to an analytic approach to inference based on recursive computation. From now on, we shall refer to this self-similar prior/model specification on A as the local (recursive mixture) model on A. Figure 2 illustrates the self-similar design of each local model.

Figure 2: The local model for P(G | D, E) for E ∈ A when Ω_E = {0, 1}^h and dimension-wise dyadic cuts are allowed. Red indicates the conditional mixing probabilities induced by the latent decisions. For simplicity, we have suppressed notation and N(A) is denoted as N.

Adaptive conditional association test under the recursive mixture model

How do we test for predictor-response association conditional on the covariates under this recursive model? Also in a recursive manner! To see this, we again define Φ(A) to be the marginal likelihood (conditional on E) under the local model on A, computed from the observations with E ∈ A. From the design of the model, Φ(A) can be written recursively as

Φ(A) = ρ(A) M(A) + (1 − ρ(A)) ∑_{j=1}^{N(A)} λ_j(A) ∏_{i=1}^{K_j(A)} Φ(A_i^j)    (9)

where M(A) is the marginal likelihood under H(A) as in Eq. (3), and ∑_{j=1}^{N(A)} λ_j(A) ∏_{i=1}^{K_j(A)} Φ(A_i^j) is the marginal likelihood under H^c(A). Eq. (9) is a generalization of Eq. (5), with the main difference being the additional integration over the N(A) different ways to partition A.

In order to infer whether G is associated with D conditional on E, it is again useful to decompose Φ(A) into two pieces, Φ(A) = Ψ(A) + ∆(A), where

Ψ(A) = ρ(A) γ(A) M_0(A) + (1 − ρ(A)) ∑_{j=1}^{N(A)} λ_j(A) ∏_{i=1}^{K_j(A)} Ψ(A_i^j)    (10)

is the part of Φ(A) contributed by the models with no association between G and D given E. Eq. (10) shows that Ψ(A) also has a recursive representation. To understand Eq. (10), note that when H(A) holds, the marginal likelihood on A under no association is M_0(A) as defined in Eq. (2), while if H(A) does not hold and A is divided in the jth way, then the contribution to the marginal likelihood on A from the models of no association is the product of such contributions from A's children. Finally, when A is an atomic subset, ρ(A) = 1 and Ψ(A) = γ(A) M_0(A) by design, so this recursion will also eventually terminate.

Eqs. (9) and (10) provide a recipe for recursively computing Φ(A) and Ψ(A). But how do we use these terms? In very much the same way as we did in Section 2.2. More specifically, Ψ(Ω_E) is the probability of the data jointly with the event

W = {G and D are not associated conditional on E}.

By Bayes' theorem, the posterior probability of the above event is

Pr(W | data) = Ψ(Ω_E) / Φ(Ω_E).

Accordingly, the posterior probability of association is 1 − Ψ(Ω_E)/Φ(Ω_E).
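
The following sketch implements the recursions (9)-(10) for binary covariates with dimension-wise dyadic cuts, reusing stratum_quantities from the Section 2.2 sketch. The block representation (a tuple of per-coordinate frozensets), the dictionary counts_by_e keyed by covariate vectors, and the memoization are our own implementation choices; the memoization simply ensures that each rectangular subset is visited once.

```python
# A hedged sketch of Eqs. (9)-(10); not the paper's code.
# counts_by_e[e] = (n_e0, n_e1) for each covariate vector e in {0,1}^h.
from functools import lru_cache
from itertools import product

def make_phi_psi(counts_by_e, rho=0.5, gamma=0.5):
    @lru_cache(maxsize=None)
    def phi_psi(block):                                   # block: tuple of frozensets, one per covariate
        A = list(product(*[sorted(vals) for vals in block]))    # covariate vectors contained in A
        M0, _, M, _ = stratum_quantities(counts_by_e, A, gamma=gamma)
        splittable = [j for j, vals in enumerate(block) if len(vals) > 1]
        if not splittable:                                # atomic subset: rho(A) = 1
            return M, gamma * M0
        lam = 1.0 / len(splittable)                       # lambda_j(A) = 1 / N(A)
        phi_split = psi_split = 0.0
        for j in splittable:                              # sum over the N(A) dyadic cuts
            phi_prod = psi_prod = 1.0
            for v in sorted(block[j]):                    # product over the two children
                child_phi, child_psi = phi_psi(block[:j] + (frozenset({v}),) + block[j + 1:])
                phi_prod *= child_phi
                psi_prod *= child_psi
            phi_split += lam * phi_prod
            psi_split += lam * psi_prod
        return (rho * M + (1 - rho) * phi_split,           # Phi(A), Eq. (9)
                rho * gamma * M0 + (1 - rho) * psi_split)  # Psi(A), Eq. (10)

    return phi_psi

# Usage: Phi, Psi = make_phi_psi(counts)((frozenset({0, 1}),) * h); test statistic = Psi / Phi
```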

To summarize, under the proposed framework one may test the association between G and D conditional on E through the following three-step recipe (which is in complete analogy to the three-step recipe presented at the end of Section 2.2):

1. Compute M_0(A), M_1(A), and M(A) for each rectangular subset A of Ω_E.

2. Compute Φ(A) and Ψ(A) recursively by Eqs. (9) and (10) for the rectangular subsets.

3. Compute Ψ(Ω_E)/Φ(Ω_E) and use it as a statistic for testing association.

Again, to carry out this test one needs to specify the prior mixing probabilities ρ(A), γ(A), λ(A), and the pseudo-counts for each rectangular subset A of Ω_E. A simple choice that often leads to good performance in common situations is to set ρ(A) ≡ 0.5, λ_j(A) = 1/N(A) for all non-atomic A's, γ(A) ≡ 0.5 for all A, and all the pseudo-counts to 0.5. We defer a more careful study of the effect of prior specifications on inference to Section 4.

2.4 General Bayesian inference under the recursive mixture model

In the previous two subsections we have introduced a recursive mixture model for the distribution of G given D and E, and have provided recipes for testing association between G and D conditional on E based on this model. This is achieved through a sequence of recursive computations of the terms Φ(A) and Ψ(A). Next, we show that once the Φ(A)'s are computed, we can in fact derive the exact posterior of the recursive mixture model, and therefore inference can proceed through the general Bayesian paradigm. For example, one may take on the prediction task through Bayesian model averaging (BMA) [5].

The main result (Theorem 1) states that the recursive mixture model is conjugate. Here we must first make clear what we mean by the conjugacy of a model, as conjugacy typically refers to the relation between a prior-model pair. To this end, note that the random recursive procedure we have introduced in Section 2.3 is essentially a prior for the conditional retrospective distribution P(G | D, E). Conjugacy here refers to the fact that given

the data, the corresponding posterior for P(G | D, E) has exactly the same recursive mixture representation, with the hyperparameters updated to their posterior values.

Theorem 1. Suppose the conditional retrospective distribution of the predictors P(G | D, E) arises from the recursive mixture model described in the previous subsection. In other words, suppose P(G | D, E) arises from a prior that (1) randomly divides Ω_E in a recursive manner according to the hyperparameters ρ(A) and λ(A), (2) for each stopped set A, i.e., each stratum, randomly assigns a hypothesis H_0(A) or H_1(A) to P(G | D, E ∈ A) with probability γ(A) and 1 − γ(A) respectively, and (3) generates P(G | D, E ∈ A) by drawing from the corresponding Dirichlet distributions with hyperparameters α(A) under H_0(A), or α_0(A) and α_1(A) under H_1(A). Then given the data, the posterior distribution for P(G | D, E) can be represented by exactly the same recursive mixture model with the hyperparameters updated to the following values.

1. Posterior stopping probabilities:

ρ_post(A) = ρ(A) M(A) / Φ(A).

2. Posterior selection probabilities:

λ_j^post(A) = (1 − ρ(A)) λ_j(A) ∏_{i=1}^{K_j(A)} Φ(A_i^j) / [Φ(A) − ρ(A) M(A)], for j = 1, 2, ..., N(A).

3. Posterior mixing probability for H_0(A):

γ_post(A) = γ(A) M_0(A) / M(A).

4. Posterior Dirichlet pseudo-counts:

α_post(A) = α(A) + n_0(A) + n_1(A), α_0^post(A) = α_0(A) + n_0(A) and α_1^post(A) = α_1(A) + n_1(A).

Proof. See Online Supplementary Materials S1.

This theorem generalizes Eqs. (4) and (6) and shows that once we have computed the Φ(A) terms recursively, the posterior can also be computed exactly. One can therefore draw samples directly from the posterior through simulation of the recursive mixture model with the updated parameter values, and use the simulated posterior for inference.
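
For completeness, here is a small sketch of the updates in Theorem 1 for a single rectangular subset A, assuming M(A), M_0(A), Φ(A), and the children's Φ products have already been computed as in the earlier sketches; the function signature is our own.

```python
# A sketch of the posterior hyperparameter updates in Theorem 1 for one subset A.
def posterior_hyperparams(rho, gamma, lam, M, M0, Phi, child_phi_products):
    """lam[j] and child_phi_products[j] correspond to the j-th way of dividing A."""
    rho_post = rho * M / Phi
    lam_post = [(1 - rho) * lam[j] * child_phi_products[j] / (Phi - rho * M)
                for j in range(len(lam))]
    gamma_post = gamma * M0 / M
    return rho_post, lam_post, gamma_post
# The Dirichlet pseudo-counts simply add the observed counts:
# alpha_post(A) = alpha(A) + n_0(A) + n_1(A) and alpha_d_post(A) = alpha_d(A) + n_d(A).
```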

2.5 The case with multiple predictors and covariates

Up to this point, we have been assuming that G is a single trinary variable, which has allowed us to model the retrospective distribution of G given D using multinomial distributions together with Dirichlet priors. The multinomial-Dirichlet conjugacy provides simple closed-form expressions for the marginal likelihoods under the hypotheses H_0(A) and H_1(A), namely M_0(A) and M_1(A) as given in Eq. (2). It is these closed-form marginal likelihoods that in turn have allowed us to carry out the recursive computation of the Φ(A) and Ψ(A) terms exactly.

In many studies the predictors under investigation are more complicated. For example, in the context of genetic association studies, instead of testing the association of a single SNP at a time, one may want to test several neighboring SNPs jointly. In principle the multinomial-Dirichlet modeling strategy will still apply as long as the predictor space stays finite. However, when the predictor space Ω_G is large, containing many possible predictor combinations, modeling the retrospective distribution of the predictors with multinomial distributions becomes inefficient. Such models ignore the sparsity of the data points in the predictor sample space and incur more degrees of freedom than the data would allow reliable inference upon.

To overcome this difficulty, we can adopt the optional Pólya tree (OPT) model/prior introduced in [11] in place of the multinomial-Dirichlet conjugate pair. Instead of using a multinomial model on the sample space and placing a Dirichlet prior on that multinomial model, the OPT framework essentially adopts a mixture of multinomial models, each defined on a different partition of the predictor space, where the mixing distribution on the collection of partitions is also the OT distribution. (In fact, this was the context in which the OT distribution was first introduced.) The OPT prior can be considered a conjugate prior to this mixture-of-multinomials model, just as the Dirichlet is to a standard multinomial model. The idea is that some of the partitions being mixed over will capture the structure of the underlying distribution of G very well. In particular, these partitions will divide more finely the regions of the sample space where the underlying distribution changes more abruptly, while leaving undivided the parts of the space where the distribution is relatively flat. (See [11] for details.)

In the current context, replacing the multinomial-Dirichlet pair with the OPT, we can model P(G | D, E ∈ A) with a single OPT under H_0(A) and with two independent OPTs under H_1(A). But it turns out that this extension alone is still not enough to address the curse of dimensionality. The remaining challenge is that after conditioning on E, each response group (cases or controls) will often populate Ω_G so sparsely that if under H_1(A) we model P(G | D = 0, E ∈ A) and P(G | D = 1, E ∈ A) independently, either using two OPTs or two multinomial-Dirichlet pairs, we will not be able to draw reliable inference on the two distributions. Consequently, even when the two retrospective distributions are indeed different, the marginal likelihood under H_0(A), or M_0(A), will often tend to be higher, in fact often much higher, than that under H_1(A), or M_1(A), because under H_0(A) the two response groups are pooled together to infer a single retrospective distribution P(G | E ∈ A), thereby suffering less from data sparsity.

To address this difficulty, one needs to allow borrowing of information between the two response groups under H_1(A). To this end, a further extension utilizes the coupling optional Pólya tree (co-OPT) prior [7] to replace the OPT prior for P(G | D, E ∈ A) under H(A).

Specifically, under H(A), we place a co-OPT prior on the pair of retrospective distributions,

(θ_0(A), θ_1(A)) ∼ co-OPT(γ_A, λ_A, α_A, ρ^b_A, λ^b_A, α^b_A),

where γ_A, λ_A, α_A, ρ^b_A, λ^b_A, and α^b_A are the hyperparameters of the co-OPT prior. The subscript A indicates that we may have different prior specifications across different A's, although typically a specification shared across all A's will suffice. Given the constraint of space, here we just provide the intuition for why the co-OPT is powerful in the current context. More technical details, including a brief overview of the OPT and co-OPT priors, are provided in the Online Supplementary Materials S2. Readers interested in a full treatment of these priors may refer to [7].

The co-OPT is a prior for two distributions on the same sample space, which here is the predictor space Ω_G, and it generalizes our two-component mixture model for H(A). Under the co-OPT, the retrospective distribution is again modeled as a mixture of H_0(A) and H_1(A). Under H_0(A), the common retrospective distribution for the two response groups, P(G | E ∈ A), or θ(A) in our earlier notation, is modeled as a single OPT, whereas under H_1(A), the two retrospective distributions P(G | D = 0, E ∈ A), or θ_0(A), and P(G | D = 1, E ∈ A), or θ_1(A), are modeled as two dependent OPTs that can randomly couple and become a single OPT on subsets of Ω_G. The motivation for introducing such dependence is that in many situations, even when two distributions are different, they still share some common structure in at least some parts of the sample space. Accordingly, the possible coupling, or combining, of the two samples for inferring shared structures allows exactly this borrowing of information between the two response groups. As our numerical examples will illustrate, this strategy becomes especially beneficial in higher-dimensional situations where data are sparse and so effective sharing of information is crucial for overcoming the data sparsity.

An important feature of the co-OPT framework, in analogy to the multinomial-Dirichlet

conjugate pair, is that the corresponding marginal likelihood terms can also be computed exactly, through a sequence of recursive computations. As a result, we can still compute the corresponding marginal likelihoods such as M_0(A) and M(A) in closed form. Consequently, we can compute the recursive terms Φ(A) and Ψ(A) in the same way as before, and our three-step inferential procedure for testing association remains the same as before. (For details see Online Supplementary Materials S2.) Moreover, the posterior conjugacy of our proposed recursive mixture model (Theorem 1) also remains true, with the corresponding posterior update for the co-OPT distribution, according to Theorem 4 in [7], taking the place of the update for the Dirichlet pseudo-counts in Theorem 1.

3 A power study

In this section we illustrate the workings of our recursive mixture method through numerical examples and carry out a power study to investigate its performance for testing conditional association between the predictors G and the response D given the covariates E. In addition, we compare our method to two other nonparametric methods for testing association: one is the χ²-test [1], which aims at testing the marginal association between G and D ignoring the covariates E, and the other is the generalized CMH test [1].

To imitate the sampling of case-control studies, we first simulate the predictors G and the covariates E for a population, and then simulate the response D through a prospective disease model. We then retrospectively sample a case group and a control group, and apply the three methods to the simulated case-control data. We consider disease models that correspond to a variety of dependence relationships among G, D, and E. Moreover, we vary the dimensionality of G and E to study how increasing sparsity of the data influences the performance of the three methods.

More specifically, we simulate the data under four different scenarios, representing four different disease models motivated by common situations in genetic association studies.

Under each scenario, the predictors are all independent trinary variables taking values 0, 1, and 2 with probabilities 0.16, 0.48, and 0.36. (The reader may think of them as SNP genotype markers under Hardy-Weinberg equilibrium with a minor allele frequency of 0.4.) The covariates are all independent binary variables. One may imagine them being binary environmental variables such as gender, diet, and smoking status in a genetic association study. In all of the examples, we choose ρ(A) ≡ 0.5 and λ_j(A) = 1/N(A) for all non-atomic A's, and γ(A) ≡ 0.5 for all A's.

The first two scenarios (see below) correspond to the case when the predictors and covariates are independent of each other in the general population, that is, without conditioning on the response. In both scenarios the covariates are simulated as independent Bernoulli(0.5) variables. Under each scenario, we simulate a population of 100,000 individuals with k_g predictors G = (G_1, G_2, ..., G_{k_g}) and k_e covariates E = (E_1, E_2, ..., E_{k_e}), and then we retrospectively sample a case-control data set. We carry out 500 such simulations for each of the 16 combinations of (k_g, k_e) for k_g and k_e ranging from 2 to 5. So the number of cells in the corresponding contingency tables ranges from 3² × 2² = 36 to 3⁵ × 2⁵ = 7,776. For the larger tables, for instance for k_g = k_e = 5, the data are sparse, as the corresponding sample sizes we consider range from 1,200 to 2,500 per group while the table has 7,776 cells per group. Such sparsity is typical in genetic association studies when multiple markers are considered jointly.

For each simulated case-control data set, we calculate three statistics: (1) Ψ(Ω_E)/Φ(Ω_E), (2) the χ² statistic (with 3^{k_g} − 1 degrees of freedom) applied to the marginal table of G and D, and (3) the generalized CMH statistic applied to the stratified table of G and D with E defining the strata. Sample sizes for the case and control groups are chosen to make the comparison across the methods discriminative.

Scenario 1. In this case, the prospective model for the response is

D | G, E ∼ Bernoulli(0.35) if G_1 = 1 and G_2 = 0,
           Bernoulli(0.2) otherwise.

So the predictors are associated with the response, while the covariates play no role.

Scenario 2. In this case, the prospective model for the response is

D | G, E ∼ Bernoulli(0.4) if G_1 = 0, E_1 = 1, and E_2 = 1,
           Bernoulli(0.4) if G_2 = 0, E_1 = 0, and E_2 = 0,
           Bernoulli(0.2) otherwise.

So the predictors and the covariates interact with each other in affecting the response distribution. In this case, the homogeneous effect assumption, which the CMH test relies on, does not hold.

Figure 3 and Figure 4 present the ROC curves for the three test statistics under the two scenarios. (In order to construct the ROC curves, we also need to simulate the test statistics under a null hypothesis where there is no association between G and D. For this purpose we simulated from the null model D | G, E ∼ Bernoulli(0.2).) Note that the larger k_g is, the sparser the tables are, and so the harder it is to detect the association. Consequently we increase the sample sizes n_0 for the controls and n_1 for the cases together with k_g to keep the ROC curves informative (as opposed to lying along the 45 degree line). The specific sample sizes we used for each simulation are given in the figures as well.
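
As an illustration of how such data can be generated, here is a hedged sketch of the Scenario 1 simulation and retrospective sampling; the population size, genotype probabilities, and disease model follow the text, while the code organization, the random seed, and the default dimensions and sample sizes are our own choices.

```python
# A sketch of the Scenario 1 simulation and case-control sampling; not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def simulate_scenario1(k_g=3, k_e=3, n0=500, n1=500, pop=100_000):
    G = rng.choice([0, 1, 2], size=(pop, k_g), p=[0.16, 0.48, 0.36])   # trinary predictors
    E = rng.binomial(1, 0.5, size=(pop, k_e))                          # independent binary covariates
    risk = np.where((G[:, 0] == 1) & (G[:, 1] == 0), 0.35, 0.2)        # prospective model, Scenario 1
    D = rng.binomial(1, risk)                                          # disease status
    controls = rng.choice(np.flatnonzero(D == 0), size=n0, replace=False)
    cases = rng.choice(np.flatnonzero(D == 1), size=n1, replace=False)
    idx = np.concatenate([controls, cases])
    return G[idx], E[idx], D[idx]                                      # the retrospective sample

G_s, E_s, D_s = simulate_scenario1()
```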

Let us first look at Figure 3. Under Scenario 1, the covariates do not enter into the disease model P(D | G, E), and so it is in fact not necessary to condition on the covariates. Although our method takes E into account, it is still more sensitive than the other two statistics, due partly to the mixing component H(Ω_E) that leaves Ω_E undivided. The ROC curves of the three statistics are similar for small tables (k_g = 2), but the advantage of our method becomes apparent as k_g increases. This is expected, as the χ² test and the CMH test do not take into account the sparsity of the data when the predictor space expands. In contrast, our method, with the co-OPT prior as the model for the retrospective distribution under H(A), is robust to the sparsity of the data through the borrowing of information across the two response groups, as discussed in Section 2.5.

The results for Scenario 2 (Figure 4) are similar to those for Scenario 1, but now our method outperforms the other two even more. It is interesting that the χ² test performs slightly better than the CMH test, despite the fact that the former ignores the covariates. This is probably due to two reasons: first, the conditional dependence structure induces fairly strong marginal dependence, which the χ² test directly targets, and second, the homogeneous effect assumption, which the CMH test relies on, fails.

Next, we simulate from two additional scenarios that illustrate the importance of conditioning on relevant covariates to avoid false positives and improve power for true detections.

Scenario 3. In this case, one of the covariates, E_1, is no longer simulated as a Bernoulli(0.5) variable. Instead, its (prospective) distribution is determined by the predictor G_2:

E_1 | G ∼ Bernoulli(0.7) if G_2 ≥ 1,
          Bernoulli(0.5) otherwise.

In addition, the prospective model for the response is

D | G, E ∼ Bernoulli(0.4) if E_1 = 1,
           Bernoulli(0.2) otherwise.

Figure 3: ROC curves for the three test statistics under Scenario 1. Panels correspond to k_e, k_g = 2, ..., 5, with per-group sample sizes n_0 = n_1 = 300, 500, 800, and 1200 for k_g = 2, 3, 4, and 5 respectively.

Figure 4: ROC curves for the three test statistics under Scenario 2. Panels correspond to k_e, k_g = 2, ..., 5, with per-group sample sizes n_0 = n_1 = 400, 750, 1500, and 2000 for k_g = 2, 3, 4, and 5 respectively.


More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing.

Previous lecture. P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Previous lecture P-value based combination. Fixed vs random effects models. Meta vs. pooled- analysis. New random effects testing. Interaction Outline: Definition of interaction Additive versus multiplicative

More information

Bayesian Econometrics

Bayesian Econometrics Bayesian Econometrics Christopher A. Sims Princeton University sims@princeton.edu September 20, 2016 Outline I. The difference between Bayesian and non-bayesian inference. II. Confidence sets and confidence

More information

Bayesian model selection: methodology, computation and applications

Bayesian model selection: methodology, computation and applications Bayesian model selection: methodology, computation and applications David Nott Department of Statistics and Applied Probability National University of Singapore Statistical Genomics Summer School Program

More information

A Note on Lenk s Correction of the Harmonic Mean Estimator

A Note on Lenk s Correction of the Harmonic Mean Estimator Central European Journal of Economic Modelling and Econometrics Note on Lenk s Correction of the Harmonic Mean Estimator nna Pajor, Jacek Osiewalski Submitted: 5.2.203, ccepted: 30.0.204 bstract The paper

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE

DEPARTMENT OF COMPUTER SCIENCE Autumn Semester MACHINE LEARNING AND ADAPTIVE INTELLIGENCE Data Provided: None DEPARTMENT OF COMPUTER SCIENCE Autumn Semester 203 204 MACHINE LEARNING AND ADAPTIVE INTELLIGENCE 2 hours Answer THREE of the four questions. All questions carry equal weight. Figures

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Machine Learning Overview

Machine Learning Overview Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1 Outline 1. What is Machine Learning (ML)? 2. Types of Information Processing Problems Solved 1. Regression

More information

Pubh 8482: Sequential Analysis

Pubh 8482: Sequential Analysis Pubh 8482: Sequential Analysis Joseph S. Koopmeiners Division of Biostatistics University of Minnesota Week 10 Class Summary Last time... We began our discussion of adaptive clinical trials Specifically,

More information

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW

A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW A CONDITION TO OBTAIN THE SAME DECISION IN THE HOMOGENEITY TEST- ING PROBLEM FROM THE FREQUENTIST AND BAYESIAN POINT OF VIEW Miguel A Gómez-Villegas and Beatriz González-Pérez Departamento de Estadística

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2008 Paper 241 A Note on Risk Prediction for Case-Control Studies Sherri Rose Mark J. van der Laan Division

More information

Describing Contingency tables

Describing Contingency tables Today s topics: Describing Contingency tables 1. Probability structure for contingency tables (distributions, sensitivity/specificity, sampling schemes). 2. Comparing two proportions (relative risk, odds

More information

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q) Supplementary information S7 Testing for association at imputed SPs puted SPs Score tests A Score Test needs calculations of the observed data score and information matrix only under the null hypothesis,

More information

Bayesian model selection for computer model validation via mixture model estimation

Bayesian model selection for computer model validation via mixture model estimation Bayesian model selection for computer model validation via mixture model estimation Kaniav Kamary ATER, CNAM Joint work with É. Parent, P. Barbillon, M. Keller and N. Bousquet Outline Computer model validation

More information

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models

Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models Individualized Treatment Effects with Censored Data via Nonparametric Accelerated Failure Time Models Nicholas C. Henderson Thomas A. Louis Gary Rosner Ravi Varadhan Johns Hopkins University July 31, 2018

More information

Spatial Bayesian Nonparametrics for Natural Image Segmentation

Spatial Bayesian Nonparametrics for Natural Image Segmentation Spatial Bayesian Nonparametrics for Natural Image Segmentation Erik Sudderth Brown University Joint work with Michael Jordan University of California Soumya Ghosh Brown University Parsing Visual Scenes

More information

Nonparametric Bayes tensor factorizations for big data

Nonparametric Bayes tensor factorizations for big data Nonparametric Bayes tensor factorizations for big data David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & DARPA N66001-09-C-2082 Motivation Conditional

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Methods for Multivariate Categorical Data. Jon Forster (University of Southampton, UK)

Bayesian Methods for Multivariate Categorical Data. Jon Forster (University of Southampton, UK) Bayesian Methods for Multivariate Categorical Data Jon Forster (University of Southampton, UK) Table 1: Alcohol intake, hypertension and obesity (Knuiman and Speed, 1988) Alcohol intake (drinks/day) Obesity

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017

Lecture 7: Interaction Analysis. Summer Institute in Statistical Genetics 2017 Lecture 7: Interaction Analysis Timothy Thornton and Michael Wu Summer Institute in Statistical Genetics 2017 1 / 39 Lecture Outline Beyond main SNP effects Introduction to Concept of Statistical Interaction

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling

2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling 2016 SISG Module 17: Bayesian Statistics for Genetics Lecture 3: Binomial Sampling Jon Wakefield Departments of Statistics and Biostatistics University of Washington Outline Introduction and Motivating

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Ordered Designs and Bayesian Inference in Survey Sampling

Ordered Designs and Bayesian Inference in Survey Sampling Ordered Designs and Bayesian Inference in Survey Sampling Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu Siamak Noorbaloochi Center for Chronic Disease

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Part IV Statistics in Epidemiology

Part IV Statistics in Epidemiology Part IV Statistics in Epidemiology There are many good statistical textbooks on the market, and we refer readers to some of these textbooks when they need statistical techniques to analyze data or to interpret

More information

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm

Probabilistic Graphical Models Homework 2: Due February 24, 2014 at 4 pm Probabilistic Graphical Models 10-708 Homework 2: Due February 24, 2014 at 4 pm Directions. This homework assignment covers the material presented in Lectures 4-8. You must complete all four problems to

More information

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control Xiaoquan Wen Department of Biostatistics, University of Michigan A Model

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Markov Chain Monte Carlo, Numerical Integration

Markov Chain Monte Carlo, Numerical Integration Markov Chain Monte Carlo, Numerical Integration (See Statistics) Trevor Gallen Fall 2015 1 / 1 Agenda Numerical Integration: MCMC methods Estimating Markov Chains Estimating latent variables 2 / 1 Numerical

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 Suggested Projects: www.cs.ubc.ca/~arnaud/projects.html First assignement on the web: capture/recapture.

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies.

Theoretical and computational aspects of association tests: application in case-control genome-wide association studies. Theoretical and computational aspects of association tests: application in case-control genome-wide association studies Mathieu Emily November 18, 2014 Caen mathieu.emily@agrocampus-ouest.fr - Agrocampus

More information

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units

Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional

More information

STA 216, GLM, Lecture 16. October 29, 2007

STA 216, GLM, Lecture 16. October 29, 2007 STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P.

Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Mela. P. Frailty Modeling for Spatially Correlated Survival Data, with Application to Infant Mortality in Minnesota By: Sudipto Banerjee, Melanie M. Wall, Bradley P. Carlin November 24, 2014 Outlines of the talk

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Density Estimation. Seungjin Choi

Density Estimation. Seungjin Choi Density Estimation Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information

The STS Surgeon Composite Technical Appendix

The STS Surgeon Composite Technical Appendix The STS Surgeon Composite Technical Appendix Overview Surgeon-specific risk-adjusted operative operative mortality and major complication rates were estimated using a bivariate random-effects logistic

More information

Approximate Bayesian Computation: a simulation based approach to inference

Approximate Bayesian Computation: a simulation based approach to inference Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Quantifying the Price of Uncertainty in Bayesian Models

Quantifying the Price of Uncertainty in Bayesian Models Provided by the author(s) and NUI Galway in accordance with publisher policies. Please cite the published version when available. Title Quantifying the Price of Uncertainty in Bayesian Models Author(s)

More information

ASA Section on Survey Research Methods

ASA Section on Survey Research Methods REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Tests for Hardy Weinberg Equilibrium Karin S. Dorman Department of Statistics Iowa State University September 7, 2006 Statistical Hypothesis Testing Identify a hypothesis,

More information

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract

Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus. Abstract Bayesian analysis of a vector autoregressive model with multiple structural breaks Katsuhiro Sugita Faculty of Law and Letters, University of the Ryukyus Abstract This paper develops a Bayesian approach

More information

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21 Sections 2.3, 2.4 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 21 2.3 Partial association in stratified 2 2 tables In describing a relationship

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles Jeremy Gaskins Department of Bioinformatics & Biostatistics University of Louisville Joint work with Claudio Fuentes

More information

2.1.3 The Testing Problem and Neave s Step Method

2.1.3 The Testing Problem and Neave s Step Method we can guarantee (1) that the (unknown) true parameter vector θ t Θ is an interior point of Θ, and (2) that ρ θt (R) > 0 for any R 2 Q. These are two of Birch s regularity conditions that were critical

More information