

On the multivariate probit model for exchangeable binary data with covariates

Catalina Stefanescu (1) and Bruce W. Turnbull (2)

(1) London Business School, Regent's Park, London NW1 4SA, UK
(2) School of Operations Research, Cornell University, Ithaca, NY 14853, USA

Corresponding author: cstefanescu@london.edu, Phone: +44 (0) , Fax: +44 (0)

Summary

This paper considers the use of a multivariate binomial probit model for the analysis of correlated exchangeable binary data. The model can naturally accommodate both cluster and individual level covariates, while keeping a fairly flexible intracluster association structure. We discuss Bayesian estimation when a sample of independent clusters of varying sizes is available, and show how Gibbs sampling may be used to derive the posterior densities of parameters. The methodology is illustrated with two examples: the first involves epidemiological data from a study of familial disease aggregation; the second uses teratological data from a developmental toxicity application.

Key words: Exchangeable binary data, multivariate binomial probit, Gibbs sampling, hierarchical Bayesian modelling.

1 Introduction

Correlated binary data are a common feature in many applications, such as group randomized clinical trials, cluster sample surveys, or developmental toxicity experiments. In many such studies, it is reasonable to assume that the responses within a cluster are exchangeable. A sequence of binary random variables

$Y_1, Y_2, \ldots$ is exchangeable if $\Pr(Y_1 = y_1, \ldots, Y_r = y_r) = \Pr(Y_1 = y_{\pi(1)}, \ldots, Y_r = y_{\pi(r)})$ for any $r$, any $(y_1, \ldots, y_r)$ and any permutation $\pi$ of $1, 2, \ldots, r$.

There is a large literature devoted to the analysis of clustered binary data; an extensive review of modelling approaches is provided by Pendergast et al. (1996). However, fewer papers have focused specifically on the exchangeable case, and most of the research on this topic has been devoted to nonparametric models and likelihood estimation, e.g. Bowman and George (1995), Stefanescu and Turnbull (2003) and Xu and Prorok (2003). These papers investigate a saturated model in which interactions of all orders are allowed, but treat only homogeneous populations (or subpopulations). See also George and Kodell (1996).

In the heterogeneous case, however, it may be reasonable to assume that the clustered binary responses are exchangeable only after accounting for the presence of explanatory variables. This is what we will mean by exchangeable binary data with covariates. The inclusion of covariates is an important issue, because their omission may bias the results of the analysis. Also, the effects of covariates on the marginal means may themselves be of interest. Covariates may act either at cluster level or at individual level; a discussion of separating individual level and cluster level effects is given by Begg and Parides (2003).

George and Bowman (1995) propose inclusion of cluster level covariates in the saturated model by means of a folded logistic parametrization. As noted by Aerts et al. (2002, p.55), this parametrization is not invariant to reversal of the binary coding, it is valid only for a restricted range of the regressor function, and the model does not simplify to the binomial model under independence. In addition, the intracluster correlation is severely restricted for large cluster sizes.
The q-power model developed by Kuk (2004) is based on the same approach as George and Bowman (1995), but is more flexible in the choice of regressor function and does reduce to the binomial model under independence. However, the q-power model is still not coding invariant. This is an undesirable feature, although Kuk (2004, Sec.3.1) has argued that this asymmetry may be natural to consider in applications to quantitative risk assessment in developmental toxicology. More importantly, neither the folded logistic

nor the q-power family is able to accommodate individual level covariates. Such covariates are present in epidemiological studies where responses from members of the same family can depend on explanatory variables specific to each individual. Also, in teratological studies of risk assessment, the incidence of malformations in litter members is linked not only to the toxicant dose to which the mother has been exposed (a cluster level covariate), but also to individual characteristics such as birth weight (Ryan et al., 1991; Catalano and Ryan, 1992; Chen, 1993). Thus there is a need for richer models which allow individual level covariates, while keeping the intracluster association structure fairly flexible.

In this paper we examine in detail the use of a multivariate latent threshold model (Ashford and Sowden, 1970) for the analysis of exchangeable binary data with covariates. The binary responses are seen as indicators of the event that some unobserved latent variables exceed a threshold value of zero. A multivariate normal distribution is assumed for the unobserved variables, and exchangeability is ensured by appropriately parameterizing the variance-covariance matrix. This leads to an exchangeable multivariate probit model, whereby both cluster and individual level covariates may be naturally included by assuming a linear form of the multivariate normal mean. The model is flexible enough to allow a reasonably wide range of correlation values, and in particular negative correlation between the binary responses can also be modelled. This allows the presence of either cooperative or competitive effects. Another important advantage of this model is that its parameters retain the same interpretation irrespective of cluster size. This is an important consideration when clusters are of varying sizes; it is the interpretability assumption discussed by Stefanescu and Turnbull (2003).

Likelihood methods are an attractive option for inference (e.g.
Ochi and Prentice, 1984; Chan and Kuk, 1997; McCulloch, 1994), but they are computationally difficult due to the intractability of the expressions obtained by integrating out the latent variables (Chib and Greenberg, 1998). As an alternative, we propose the use of a Bayesian framework for estimation of the model parameters. Given the underlying hierarchical structure, this is a natural approach to estimation and may be implemented using data augmentation (Tanner

and Wong, 1987). Simulating from the posterior distributions of parameters can be accomplished using the Gibbs sampling algorithm, as described in Section 3.

For the Bayesian model, Albert and Chib (1993) developed a framework for estimation through data augmentation for independent binary data, of which the univariate probit is a special case. Chib and Greenberg (1998) and Edwards and Allenby (2003) investigate Bayesian estimation of the multivariate probit model with a general correlation matrix. Both of these papers focus on the case when clusters have equal sizes. However, this is often not the case in epidemiological or teratological applications where clusters are, for example, families or litters of varying sizes. It is then necessary to impose some structure on the correlation matrix. One way is to specify a variance components model; Sorensen et al. (1995) applied such a model to an animal breeding study of genetic heritability. The alternative that we consider here consists in specifying an exchangeable structure (after adjustments for covariates), which is a very natural one in many applications.

The main contributions of this paper consist in investigating the use of multivariate threshold latent models for the analysis of exchangeable binary data with individual level covariates, and developing a Bayesian framework for estimation of these models with samples of varying cluster sizes.

The paper is structured as follows: Section 2 reviews the multivariate probit model for clustered binary variables with covariates. The Bayesian framework for estimation is developed in Section 3. Two applications are presented in Section 4: a study of familial disease aggregation and a teratology experiment concerning malformations in mouse litters. Several directions for future research are outlined in a concluding section.
2 Threshold latent variable models for exchangeable binary data

Let $(Y_1, \ldots, Y_r)$ be a cluster of $r$ binary random variables, where $Y_i = 0$ ("failure") or $1$ ("success"), $i = 1, \ldots, r$, and let $x_1, \ldots, x_r \in \mathbb{R}^p$ denote corresponding vectors of covariates. Some of these covariates may be cluster level, in which case the corresponding components of $x_1, \ldots, x_r$ are all equal. In particular, the first component of each $x$ will typically be one, indicating the presence of an intercept term.
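As a small illustration of this covariate layout, the following sketch assembles the covariate matrix for one cluster from an intercept, cluster level covariates and individual level covariates. The helper name `cluster_design` and the numerical values are hypothetical and are not taken from the paper.

```python
import numpy as np

def cluster_design(cluster_covariates, individual_covariates):
    """Build the r x p covariate matrix for one cluster (hypothetical helper).

    cluster_covariates: values shared by every member of the cluster.
    individual_covariates: array of shape (r, q), one row per member.
    """
    indiv = np.atleast_2d(np.asarray(individual_covariates, dtype=float))
    r = indiv.shape[0]
    # Cluster level covariates are repeated on every row, so those columns are constant.
    shared = np.tile(np.asarray(cluster_covariates, dtype=float), (r, 1))
    # First column of ones is the intercept term.
    return np.hstack([np.ones((r, 1)), shared, indiv])

# Illustrative cluster of three members: one shared covariate, one individual covariate.
X = cluster_design([30.0], [[0.9], [1.1], [1.0]])
```

Each row of `X` is one covariate vector $x_i$; the intercept and cluster level columns are identical across rows, while the last column varies by individual.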

A flexible class of binary data models may be obtained by assuming that the response $Y_i$ ($1 \le i \le r$) is an indicator of the event that some unobserved continuous variable, $Z_i$ say, exceeds a threshold which can be taken to be zero without loss of generality. Specifically, let $Z_1, \ldots, Z_r$ be latent continuous variables and assume that

$$Y_i = I(Z_i > 0), \quad i = 1, \ldots, r, \qquad Z = X\beta + \varepsilon, \qquad \varepsilon \sim N_r(0, \Sigma_r). \tag{1}$$

Here $X$ is the $r \times p$ matrix of covariates with rows $x_1', \ldots, x_r'$, the vector of regression parameters is $\beta \in \mathbb{R}^p$, and $\varepsilon$ is the vector of error terms assumed to have an exchangeable multivariate normal distribution with mean $0$ and covariance matrix $\Sigma_r$, given by

$$\Sigma_r = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}, \qquad \rho > -\frac{1}{r-1}. \tag{2}$$

For identifiability reasons, we can take the common variance of $\varepsilon$ to be one; see, for example, Edwards and Allenby (2003). Note that, while the normality assumption for $\varepsilon$ is motivated by mathematical convenience, other choices for the error distribution are also possible. For example, $\varepsilon$ may have a multivariate $t$ or logistic distribution.

The specification (1) leads to a flexible model with the property that the binary responses for individuals with common covariate values are exchangeable. The marginal probability of success with covariate vector $x$ is given by

$$p(x) = E(Y; x) = \Pr(Y = 1; x) = \Pr(Z > 0; x) = \Pr(x'\beta + \varepsilon > 0) = \Pr(\varepsilon > -x'\beta) = 1 - \Phi(-x'\beta), \tag{3}$$

where $\Phi(\cdot)$ is the standard normal cdf. Hence

$$\mathrm{Var}(Y; x) = p(x)\{1 - p(x)\} = \{1 - \Phi(-x'\beta)\}\,\Phi(-x'\beta) \tag{4}$$

and

$$E(Y_i Y_j; x_i, x_j) = \Pr(Y_i = 1, Y_j = 1; x_i, x_j) = \Pr(Z_i > 0, Z_j > 0; x_i, x_j) = \Pr(\varepsilon_i > -x_i'\beta,\ \varepsilon_j > -x_j'\beta) = L(-x_i'\beta, -x_j'\beta, \rho), \tag{5}$$

say, using the standard notation $L(h, k, \rho)$ for the upper quadrant probability of a standard bivariate normal; see e.g. Kotz et al. (2000, p.264). The function $L(h, k, \rho)$ has been extensively tabulated, e.g. U.S. National Bureau of Standards (1959). Alternatively, it can be computed from

$$L(h, k, \rho) = \int_h^{\infty} \left\{ 1 - \Phi\!\left( \frac{k - \rho x}{\sqrt{1 - \rho^2}} \right) \right\} \phi(x)\, dx, \tag{6}$$

where $\phi(\cdot)$ is the standard normal pdf (Kotz et al., 2000, Eqn (46.46)).

Following Bahadur (1961), the correlation of order $l$ ($l = 2, \ldots, r$) is defined by

$$\rho_l(x_1, \ldots, x_l) = \frac{E\left(\prod_{i=1}^{l} Y_i\right) - \prod_{i=1}^{l} p(x_i)}{\sqrt{\prod_{i=1}^{l} p(x_i)\{1 - p(x_i)\}}}.$$

Here $p(x)$ is given by (3) and $E(\prod_{i=1}^{l} Y_i)$ is the upper orthant probability of an $l$-variate normal distribution, the obvious generalization of (5). This can be evaluated using the methods of Schervish (1984) or Genz (1992). Of course, by exchangeability, $\rho_l(x_1, \ldots, x_l) = \rho_l(x_{i_1}, \ldots, x_{i_l})$ for any choice of $l$ distinct subscripts from $\{1, \ldots, r\}$. In particular for $l = 2$, the correlation $\mathrm{Corr}(Y_i, Y_j; x_i, x_j)$ can be computed directly from (5), and the model can accommodate some limited negative intracluster dependence as well.

In general, as expected with binary variates, their correlation $\mathrm{Corr}(Y_i, Y_j; x_i, x_j)$ depends on the correlation $\rho$ of the latent variables $\{Z_i\}$ but also on their means $(x_i'\beta, x_j'\beta)$. Table 1 shows how this correlation varies as a function of $\rho$ and $x'\beta$ for a cluster level covariate $x_i = x_j = x$, say. Note that the correlation depends on $x'\beta$ through its absolute value. Both positive and negative correlation can be exhibited, although only $\rho$ values exceeding $-(r-1)^{-1}$ are feasible for clusters of size $r$. Note also that, in the absence of covariates, $\mathrm{Corr}(Y_i, Y_j; 0, 0) = \frac{2}{\pi} \arcsin \rho$, which can take values in $\left[\frac{2}{\pi} \arcsin\left(-\frac{1}{r-1}\right), 1\right]$.
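The quantities in (1) to (6) can be sketched numerically as follows. This is a minimal illustration, assuming NumPy and SciPy are available; the function names and the parameter values are hypothetical and not taken from the paper. The upper quadrant probability is obtained by one-dimensional numerical integration of (6).

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def exchangeable_corr(r, rho):
    """Matrix Sigma_r of (2): ones on the diagonal, rho elsewhere."""
    if r > 1 and rho <= -1.0 / (r - 1):
        raise ValueError("rho must exceed -1/(r-1)")
    return (1.0 - rho) * np.eye(r) + rho * np.ones((r, r))

def marginal_prob(eta):
    """p(x) = 1 - Phi(-x'beta) of (3), with eta = x'beta."""
    return 1.0 - norm.cdf(-eta)

def upper_quadrant(h, k, rho):
    """L(h, k, rho) by numerical integration of (6)."""
    scale = np.sqrt(1.0 - rho ** 2)
    integrand = lambda x: norm.sf((k - rho * x) / scale) * norm.pdf(x)
    value, _ = quad(integrand, h, np.inf)
    return value

def pair_corr(eta_i, eta_j, rho):
    """Corr(Y_i, Y_j) built from (3), (4) and (5)."""
    p_i, p_j = marginal_prob(eta_i), marginal_prob(eta_j)
    p_ij = upper_quadrant(-eta_i, -eta_j, rho)
    return (p_ij - p_i * p_j) / np.sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

def simulate_cluster(X, beta, rho, rng):
    """Draw one cluster of binary responses from model (1)."""
    r = X.shape[0]
    eps = rng.multivariate_normal(np.zeros(r), exchangeable_corr(r, rho))
    return (X @ beta + eps > 0).astype(int)

# Illustrative values only.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(4), [0.0, 1.0, 0.0, 1.0]])
beta = np.array([-0.5, 0.8])
y = simulate_cluster(X, beta, 0.3, rng)

# With no covariate effect the result should agree with (2/pi) * arcsin(rho).
corr_no_covariates = pair_corr(0.0, 0.0, 0.5)
```

Evaluating `pair_corr` at linear predictors of opposite sign but equal magnitude gives the same value, which is the dependence on the absolute value of $x'\beta$ noted in the text.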

Another measure of association between the pair $(Y_i, Y_j)$ is the odds ratio:

$$\Psi(x_i, x_j) = \frac{\Pr(Y_i = 1, Y_j = 1; x_i, x_j)\, \Pr(Y_i = 0, Y_j = 0; x_i, x_j)}{\Pr(Y_i = 1, Y_j = 0; x_i, x_j)\, \Pr(Y_i = 0, Y_j = 1; x_i, x_j)}. \tag{7}$$

This can be evaluated using (5) and the expressions:

$$\Pr(Y_i = 0, Y_j = 0; x_i, x_j) = L(-x_i'\beta, -x_j'\beta, \rho) + \Phi(-x_i'\beta) + \Phi(-x_j'\beta) - 1,$$

$$\Pr(Y_i = 1, Y_j = 0; x_i, x_j) = 1 - \Phi(-x_i'\beta) - L(-x_i'\beta, -x_j'\beta, \rho).$$

The dependence of the odds ratio on the correlation $\rho$ and mean $x'\beta$ is also displayed in Table 1.

[Table 1]

3 Bayesian estimation

Suppose that $K$ independent clusters of varying sizes are available for inference. Let $Z_{k1}, \ldots, Z_{kr_k}$ denote the latent data and let $(Y_{k1}, x_{k1}), \ldots, (Y_{kr_k}, x_{kr_k})$ be the responses and covariates in the $k$th cluster. Denote the maximum cluster size by $R = \max\{r_k;\ 1 \le k \le K\}$. We impose the constraint $\rho > -1/(R-1)$ in order to ensure that $\Sigma_r$ is positive definite for all $1 \le r \le R$. The likelihood of the parameters $\beta$ and $\rho$ given the observed and latent data $\{y_k = (y_{k1}, \ldots, y_{kr_k}),\ Z_k = (Z_{k1}, \ldots, Z_{kr_k})\}$ is a product of $K$ cluster likelihoods:

$$L(\beta, \rho \mid \{y_k\}, \{x_k\}, \{Z_k\}) = \prod_{k=1}^{K} \phi_{r_k}(Z_k; X_k\beta, \rho) \prod_{k=1}^{K} \prod_{j=1}^{r_k} \{ I(y_{kj} = 0)\, I(Z_{kj} \le 0) + I(y_{kj} = 1)\, I(Z_{kj} > 0) \}, \tag{8}$$

where $\phi_r(\cdot\,; \mu, \rho)$ is the density function of the $r$-variate normal distribution with mean $\mu$ and covariance matrix $\Sigma_r$ given by (2). The likelihood given the observed data is now obtained by integrating out the latent variables, and its expression involves high dimensional integrals which are not numerically tractable. To alleviate the computational difficulties of the likelihood approach, Chan and Kuk (1997) use the EM algorithm for estimation with non-exchangeable data.
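To make the structure of the augmented likelihood (8) concrete, the sketch below evaluates its logarithm for a list of clusters. It is only an illustration under stated assumptions: the function name `augmented_loglik`, the data layout (a list of `(y, X, Z)` triples) and all numerical values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def augmented_loglik(beta, rho, clusters):
    """Log of the augmented likelihood (8).

    clusters: list of (y, X, Z) triples; y and Z have length r_k, X is r_k x p.
    """
    R = max(len(y) for y, _, _ in clusters)
    # Constraint -1/(R-1) < rho < 1 keeps Sigma_r positive definite for all r <= R.
    if rho >= 1.0 or (R > 1 and rho <= -1.0 / (R - 1)):
        return -np.inf
    total = 0.0
    for y, X, Z in clusters:
        # Indicator product in (8): Z_kj > 0 exactly when y_kj = 1.
        if np.any((Z > 0) != (y == 1)):
            return -np.inf
        r = len(y)
        sigma = (1.0 - rho) * np.eye(r) + rho * np.ones((r, r))
        total += multivariate_normal.logpdf(Z, mean=X @ beta, cov=sigma)
    return total

# Two illustrative clusters of sizes 2 and 3 (made-up numbers).
clusters = [
    (np.array([1, 0]),
     np.array([[1.0, 0.0], [1.0, 1.0]]),
     np.array([0.4, -0.7])),
    (np.array([0, 0, 1]),
     np.array([[1.0, 1.0], [1.0, 0.0], [1.0, 1.0]]),
     np.array([-0.2, -1.1, 0.9])),
]
beta = np.array([0.1, -0.3])
ll = augmented_loglik(beta, 0.25, clusters)
```

The function returns minus infinity whenever a latent value lies on the wrong side of zero for its response, or when $\rho$ violates the positive definiteness constraint for the largest cluster.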

We propose instead the use of data augmentation as advocated by Tanner and Wong (1987) and Albert and Chib (1993). Independent prior densities $p(\beta)$ and $p(\rho)$ are specified for the parameters; the joint posterior density $p(\beta, \rho, \{Z_k\} \mid \{y_k\})$ of $\beta$, $\rho$ and $\{Z_k\}$ is then proportional to the product of the prior and the augmented likelihood (8), i.e.:

$$p(\beta)\, p(\rho) \left\{ \prod_{k=1}^{K} \phi_{r_k}(Z_k; X_k\beta, \rho) \right\} \prod_{k=1}^{K} \prod_{j=1}^{r_k} \{ I(y_{kj} = 0)\, I(Z_{kj} \le 0) + I(y_{kj} = 1)\, I(Z_{kj} > 0) \}. \tag{9}$$

It is difficult to sample directly from (9), but the Gibbs sampling algorithm can be used to compute the marginal posterior distributions of $\beta$ and $\rho$. The Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) is an iterative algorithm for generation of samples from a multivariate distribution. It proceeds by successively updating each variable by sampling from its conditional distribution given current values of all other variables. After a sufficiently large number of iterations, under mild conditions it can be proven that the values of the updated variables so obtained form a sample from the joint distribution; see, for example, Robert and Casella (1999).

The Gibbs sampler requires all the posterior conditional distributions, and for model (1) these can be derived based on the joint posterior (9):

The random variables $Z_1, \ldots, Z_K$ are independent, and $Z_k \mid y_k, x_k, \beta, \rho$ has a truncated multivariate normal distribution $N_{r_k}(X_k\beta, \Sigma_{r_k})$, where $Z_{ki}$ is truncated at the left by 0 if $y_{ki} = 1$, and it is truncated at the right by 0 if $y_{ki} = 0$, $i = 1, \ldots, r_k$.

The posterior distribution of $\beta$ given $\{y_k\}$, $\{x_k\}$, $\{Z_k\}$ and $\rho$, under the assumption of a diffuse prior $p(\beta)$, is multivariate normal $N_p(\hat{\beta}_{Z,\Sigma}, (X'\Sigma^{-1}X)^{-1})$, where $Z = (Z_1', \ldots, Z_K')'$, $X = (X_1', \ldots, X_K')'$, $\hat{\beta}_{Z,\Sigma} = (X'\Sigma^{-1}X)^{-1} X'\Sigma^{-1} Z$, and $\Sigma$ is a block diagonal matrix with blocks $\Sigma_{r_k}$ given by (2), $k = 1, \ldots, K$.
This follows from standard linear model results (Lindley and Smith, 1972).
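One sweep of such a sampler might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data and the grid are hypothetical; the latent step uses coordinate-wise inverse-cdf draws from univariate truncated normals; and the update for $\rho$ is a simple discrete-grid approximation which assumes a flat prior and uses the fact, read off from (9), that the conditional of $\rho$ is proportional to $p(\rho) \prod_k \phi_{r_k}(Z_k; X_k\beta, \rho)$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def exch(r, rho):
    """Exchangeable correlation matrix Sigma_r from (2)."""
    return (1.0 - rho) * np.eye(r) + rho * np.ones((r, r))

def draw_latent(y, mu, Z, rho, rng):
    """Refresh Z_k coordinate by coordinate from univariate truncated normals."""
    Z, r = Z.copy(), len(y)
    w = rho / (1.0 + (r - 2) * rho)        # weight on the other residuals
    s = np.sqrt(1.0 - (r - 1) * rho * w)   # conditional standard deviation
    for i in range(r):
        resid = Z - mu
        m = mu[i] + w * (resid.sum() - resid[i])
        p0 = norm.cdf(-m / s)              # Pr(Z_i <= 0 | other coordinates)
        # Inverse-cdf draw on (0, inf) if y_i = 1, on (-inf, 0] if y_i = 0.
        u = rng.uniform(p0, 1.0) if y[i] == 1 else rng.uniform(0.0, p0)
        Z[i] = m + s * norm.ppf(u)
    return Z

def draw_beta(Xs, Zs, rho, rng):
    """Draw beta from its normal conditional under a flat prior."""
    p = Xs[0].shape[1]
    A, b = np.zeros((p, p)), np.zeros(p)
    for X, Z in zip(Xs, Zs):
        S_inv = np.linalg.inv(exch(len(Z), rho))
        A += X.T @ S_inv @ X               # block diagonal Sigma: sum over clusters
        b += X.T @ S_inv @ Z
    cov = np.linalg.inv(A)
    cov = (cov + cov.T) / 2.0              # remove rounding asymmetry
    return rng.multivariate_normal(cov @ b, cov)

def draw_rho(Xs, Zs, beta, grid, rng):
    """Grid approximation to the conditional of rho (flat prior assumed)."""
    logp = np.array([
        sum(multivariate_normal.logpdf(Z, mean=X @ beta, cov=exch(len(Z), g))
            for X, Z in zip(Xs, Zs))
        for g in grid
    ])
    w = np.exp(logp - logp.max())
    return rng.choice(grid, p=w / w.sum())

# Toy data: three clusters of sizes 2, 3 and 4 (made-up values).
rng = np.random.default_rng(1)
xs = [[0, 1], [1, 0, 1], [0, 1, 0, 1]]
ys = [np.array([1, 0]), np.array([0, 0, 1]), np.array([1, 1, 0, 1])]
Xs = [np.column_stack([np.ones(len(x)), x]) for x in xs]
Zs = [np.where(y == 1, 0.5, -0.5) for y in ys]   # start consistent with y
beta, rho = np.zeros(2), 0.2
grid = np.linspace(-0.3, 0.95, 60)               # inside (-1/(R-1), 1), R = 4

for _ in range(20):
    Zs = [draw_latent(y, X @ beta, Z, rho, rng) for y, X, Z in zip(ys, Xs, Zs)]
    beta = draw_beta(Xs, Zs, rho, rng)
    rho = draw_rho(Xs, Zs, beta, grid, rng)
```

Because $\Sigma$ is block diagonal, the sums defining $X'\Sigma^{-1}X$ and $X'\Sigma^{-1}Z$ are accumulated cluster by cluster, so clusters of different sizes require no special handling.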

The posterior distribution of $\rho$ conditional on $\{y_k\}$, $\{x_k\}$, $\{Z_k\}$, $\beta$ has density proportional to

$$p(\rho)\, (1 - \rho)^{-\frac{1}{2} \sum_{k=1}^{K} (r_k - 1)} \prod_{k=1}^{K} \{1 + (r_k - 1)\rho\}^{-\frac{1}{2}} \exp\left[ -\sum_{k=1}^{K} \frac{\{1 + (r_k - 1)\rho\}\, S_{k2} - \rho\, S_{k1}^2}{2(1 - \rho)\{1 + (r_k - 1)\rho\}} \right], \tag{10}$$

where $S_{k1} = \sum_{j=1}^{r_k} \{Z_{kj} - (X_k\beta)_j\}$ and $S_{k2} = \sum_{j=1}^{r_k} \{Z_{kj} - (X_k\beta)_j\}^2$ for $k = 1, \ldots, K$. To see this, note that it follows from (9) that the posterior conditional distribution of $\rho$ is proportional to

$$p(\rho) \left\{ \prod_{k=1}^{K} \phi_{r_k}(Z_k; X_k\beta, \rho) \right\} \propto p(\rho) \prod_{k=1}^{K} \frac{1}{(2\pi)^{r_k/2}\, |\Sigma_{r_k}|^{1/2}} \exp\left\{ -\frac{1}{2} (Z_k - X_k\beta)'\, \Sigma_{r_k}^{-1}\, (Z_k - X_k\beta) \right\}. \tag{11}$$

But $|\Sigma_r| = (1 - \rho)^{r-1} \{1 + (r - 1)\rho\}$, and

$$\Sigma_r^{-1} = \frac{1}{(1 - \rho)\{1 + (r - 1)\rho\}} \left[ \{1 + (r - 1)\rho\}\, I_r - \rho\, 1_r \right],$$

where $I_r$ is the identity matrix of order $r$ and $1_r$ is the $r \times r$ matrix of ones. Hence (10) may now be derived from (11).

Note that it is easy to sample from the conditional posterior distributions of $\beta$ and $\{Z_k\}$. For example, sampling from the truncated normal distribution of $Z_k$ may be accomplished through a series of univariate draws using an inverse cdf method (Edwards and Allenby, 2003). The conditional posterior distribution of $\rho$ is more difficult to simulate; however, sampling may still be implemented using a griddy Gibbs approach (Ritter and Tanner, 1992).

4 Applications

4.1 Familial aggregation of liver cancer

The aggregation of a specific disease within families may indicate potential factors contributing to disease etiology (Khoury, Beaty and Cohen, 1993), and it is therefore of major interest to genetic epidemiologists.

Studies of familial aggregation of disease commonly use the case-control design, and a cluster is formed by members of the same family. The degree of familial association of risk of disease may be quantified either by the correlation of responses within a cluster or by the cluster odds ratio. The threshold model (1) finds a natural application in studies of familial disease aggregation, whenever it is reasonable to assume exchangeable responses within a family after adjusting for individual factors. This is often the assumption underlying case-control studies. Using the methods of Section 3, the intracluster correlation $\rho_Y(x_i, x_j)$ can be estimated together with the marginal probability of becoming diseased $p(x) = \Pr(Y_i = 1; x)$ and with the odds ratio $\Psi(x_i, x_j)$.

As an example, we analyze the data from a genetic epidemiologic study on liver cancer in Shanghai reported by Liang and Beaty (1991). The data set comprised 347 relatives of 138 cases of primary hepatic carcinoma. A cluster in this analysis is a family, and Table 2 presents the frequency of liver cancer among families of varying size. This table is a slightly corrected version of Table II in Liang and Beaty (1991) (K.-Y. Liang, private communication). The binary response indicates whether each relative of a case had contracted the disease. The covariates available for analysis are gender (coded 1 for males and 0 for females), age (coded 1 if under 40 years and 0 otherwise), presence of corn in the usual diet (1 = yes, 0 = no), the source of drinking water (1 if river, 0 otherwise), and presence of antibodies to the hepatitis B virus (HBV, coded 1 = yes, 0 = no). Some of these covariates act at individual level, others at cluster level, and all are binary. Previous analysis of these data (Liang and Beaty, 1991) used a logistic regression model to account for individual effects on the risk of liver cancer.
We fit two models of the form (1) to the liver data using the methods from Section 3. The first model includes all five covariates, the second model only takes into account the gender and age covariates. We have implemented the Gibbs sampling algorithm using the WinBUGS software (Spiegelhalter et al., 2003). [Table 2]

Diffuse but proper priors were specified for all parameters; however, other priors are also possible (Turner, Omar and Thompson, 2001). We chose a uniform prior $U(-1/5, 1)$ for the correlation $\rho_Z$, and $N(0, \sigma^2)$ priors for the covariate effects $\beta_i$. To investigate the impact of the choice of prior variance, we carried out a sensitivity analysis by running the chains with different values for $\sigma^2$ ranging between $10^3$ and . The results did not change significantly, hence we report the summary statistics based on the runs with $\sigma^2 =$ . Three parallel chains were started from different sets of initial values, and the Gibbs samplers were run for iterations with the first 5000 iterations discarded as burn-in period. Gelman and Rubin's diagnostic (Cowles and Carlin, 1996) indicated satisfactory convergence of all chains. From the pooled sample of three chains we saved the values with a thinning interval of 10, which resulted in 7500 sampling values available for inference for each parameter.

[Table 3]

The results of our analysis are summarized in Table 3, which reports the mean, standard deviation and median of the marginal posterior distributions of the parameters, as well as the 95% credible intervals. In the first model where all covariates were included in the analysis, it appears that the HBV, diet and water source variables are not significant. Hence we fitted the second model with only the intercept, age and gender variables retained in the analysis.

[Table 4]

For this reduced second model, the latent correlation $\rho$ is estimated at 0.77 with a 95% credible interval (0.605, 0.890), providing strong evidence of clustering within family. Table 4 reports the medians of the posterior distributions of $p(x)$, $\Psi(x_i, x_j)$ and $\rho_Y(x_i, x_j)$. The 95% credible intervals for these estimates have been obtained from the percentiles of the posterior distributions.
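The post-processing steps mentioned here (burn-in removal, thinning, a Gelman and Rubin type diagnostic, and percentile-based credible intervals) can be sketched generically as below. The draws are synthetic stand-ins and the chain length, burn-in and function names are hypothetical; none of the numbers correspond to the analysis reported in the paper.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (m chains, n draws)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()       # mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    pooled = (n - 1) / n * within + between / n
    return np.sqrt(pooled / within)

def summarize(chains, burn_in, thin):
    """Pool the chains after burn-in and thinning, then summarize the draws."""
    kept = np.asarray(chains, dtype=float)[:, burn_in::thin].ravel()
    lo, med, hi = np.percentile(kept, [2.5, 50.0, 97.5])
    return {"n": kept.size, "mean": kept.mean(), "sd": kept.std(ddof=1),
            "median": med, "ci95": (lo, hi)}

# Synthetic stand-in for three parallel chains of one parameter.
rng = np.random.default_rng(0)
draws = rng.normal(0.0, 1.0, size=(3, 2000))

rhat = gelman_rubin(draws[:, 500:])              # diagnostic on post burn-in draws
summary = summarize(draws, burn_in=500, thin=10)
```

Values of the diagnostic close to one suggest that the parallel chains are sampling from the same distribution; the interval in `summary["ci95"]` is read directly from the 2.5th and 97.5th percentiles of the retained draws.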
The estimated odds ratios range between and for this data set suggesting that the risk of a family member is greatly increased if his/her relative is diseased rather than disease free. The estimates of the gender effect β 1 and age effect β 2 are negative and significantly different from zero. The analysis therefore provides evidence that there is

a smaller risk of becoming diseased for males, and is consistent with the previous findings that liver cancer tends to occur in older age. These results are also in agreement with the conclusions of Liang and Beaty, who found a strong degree of familial aggregation and negative regression coefficients for the gender and age covariates in all the logistic regression models considered.

4.2 Developmental toxicology data

A typical developmental toxicology study involves randomly assigning pregnant animals to groups which receive different doses of a potentially toxic substance. The animals are then sacrificed near term and the fetuses in each litter are examined. One aim of these studies is to assess the relationship between dose level exposure and the incidence of malformations in litters, recorded as clustered binary data.

To illustrate the application of the multivariate probit model, we analyze a data set from a study of exposure to the herbicide 2,4,5-trichlorophenoxyacetic acid conducted at the National Center for Toxicological Research. The data are described in Table 1 of Ahn and Chen (1997). There are 389 litters ranging in size from 1 to 13, which have been exposed to one of seven toxicant doses (0, 15, 20, 25, 30, 45, and 60 mg/kg/day). Besides dose, the data recorded include the incidence of cleft palate malformations and the fetal weight for each individual in the litter. In fact, Ahn and Chen (1997) only report fetal weight averaged over each litter, since this is all they used in their fitting of a tree-structured logistic model to these data. However, Dr. Chen kindly made available to us the full data set which included the weights of the individual fetuses.

We fit the multivariate probit model (1) to the teratology data using weight as an individual level covariate, and dose and the indicator that dose is greater than zero as cluster level covariates.
The relationship between fetal weight and malformation incidence has been documented empirically (Ryan et al., 1991; Chen, 1993). The inclusion of the indicator variable was prompted by Ahn and Chen's analysis of this data set, which suggests that the control group behaves quite differently from the other dose groups, and also by the findings of the analysis of a similar data set in Kuk (2004).

We chose diffuse and proper priors for all parameters: a uniform prior $U(-1/12, 1)$ was specified for the correlation $\rho_Z$, and $N(0, \sigma^2)$ priors for the covariate effects $\beta_i$. The summary statistics reported in Table 5 are based on prior variance $\sigma^2 = 10^3$; however, the results are very similar for different choices of prior variances ranging between $10^3$ and . The chain converged rapidly and the sample autocorrelations were relatively low. The Gibbs sampler was run for 6000 iterations with the first 1000 iterations discarded as burn-in period, resulting in 5000 sampling values available for inference for each parameter.

The estimates from fitting model (1) with the methods from Section 3 are summarized in Table 5. All covariates have a significant impact on the individual probability of malformation which, as expected, increases with dose and decreases with weight. The estimated frailty correlation is $\rho = 0.369$, suggesting a significant clustering effect of malformation incidence within litters.

[Table 5]

Ahn and Chen (1997) fit several logistic regression tree models to this data set, using dose as a cluster level covariate. Fetal weight is also included in their analysis. However, instead of considering individual weights as covariates, Ahn and Chen use the average cluster weight (a cluster level measure), which is regressed on dose level. The residuals from this linear regression are then used as predictors in a logistic regression model for the probability of malformation.

Table 6 gives the estimated malformation frequencies and rates, using the multivariate probit model (1) and a logistic regression tree model investigated by Ahn and Chen (1997). It can be seen that the predicted rates of the tree-based model are closer to those observed for the lower dose groups, while the reverse is true for the higher doses. This equivocal finding may seem surprising given that the probit model can take advantage of the individual weights information.
However, it should be realized that the tree-structured model is somewhat semiparametric, and so it should not be surprising that the splitting procedure can produce predicted rates close to those observed.

[Table 6]

An alternative approach is to view malformation incidence and weight as a bivariate response, and this is also possible via an extension of the multivariate latent variable model. We will not discuss this further here; see Regan and Catalano (1999) and Geys et al. (2001), who consider likelihood and GEE approaches respectively, and the comment in the next section.

5 Conclusions

In this paper we have investigated the exchangeable multivariate probit model and proposed a Bayesian framework for estimation. The model belongs to the class of cluster-specific approaches for modelling correlated data, as opposed to population-average approaches, of which the most common examples are the GEE type methods (Zeger, Liang and Albert, 1988; Liang, Zeger and Qaqish, 1992). This latter class of models focuses on the marginal expectation of the response, treats the existence of any random parameters as a nuisance, and thus is useful when interest lies only in the fixed parameters. By contrast, cluster-specific approaches such as the multivariate probit model focus on the conditional expectation given the cluster-specific random effect.

The exchangeable multivariate probit model has several attractive features which make it particularly suitable for the analysis of correlated binary data. First, the model is a generalization of the ordinary probit model. As a way of relating stimulus and response, the probit model is a natural choice in situations where such an interpretation for a threshold approach is readily available; examples include attitude measurement, assigning pass/fail gradings for an examination based on a mark cut-off, and categorization of illness severity based on an underlying continuous scale (Goldstein, 2003, p.107).
Second, the connection to the Gaussian distribution in the specification of the multivariate probit model allows for flexible modelling of the association structure and straightforward interpretation of the parameters. For example, the model is particularly attractive in marketing research of consumer choice, because the latent correlations capture the cross dependencies in latent utilities across different items. Also, within the class of cluster specific approaches, the exchangeable multivariate probit model is more flexible than other fully specified

models (such as the beta-binomial) which use compound distributions to account for overdispersion in the data. This is due to the fact that both underdispersion and overdispersion can be accommodated in the multivariate probit model through the flexible underlying covariance structure. Finally, due to the underlying threshold approach, the multivariate probit model extends to analysis of clustered mixed binary and continuous data, or of multivariate binary data (Regan and Catalano, 1999; Geys et al., 2001). An extension to multiple thresholds leads to a model for clustered ordinal data (Sorensen et al., 1995).

The Bayesian framework is a natural approach for estimation of the exchangeable multivariate probit model. Generic prior distributions may be used to incorporate prior information when this is available; for example, it is often the case in clinical studies that some valuable prior information is available on the degree of intracluster association (Turner et al., 2001). The Markov chain Monte Carlo methods used to implement the Bayesian approach are particularly useful in models where some structure is imposed on the covariance matrix, because the Gibbs sampler simulates conditional distributions sequentially and so does not have to adjust to the structure of the model. By contrast, likelihood methods usually involve constructing large constrained variance matrices for these models. Finally, Markov chain Monte Carlo methods yield inferences based upon samples from the full posterior distribution; they can therefore give accurate interval estimates for non-Gaussian error distributions as well, and allow exact inference in cases where the likelihood based methods yield approximations.

Acknowledgements

The authors are very grateful to Professor K-Y Liang for providing the liver cancer data used in Section 4.1, and to Dr. James Chen for the teratology data in Section 4.2. This research was supported in part by grant R01 CA66218 from the U.S.
National Institutes of Health and by an RAMD grant from London Business School.

References

Aerts, M., Geys, H., Molenberghs, G. and Ryan, L., eds. (2002). Topics in Modelling of Clustered Data. Chapman and Hall, London.

Ahn, H. and Chen, J.J. (1997). Tree structured logistic model for over dispersed binomial data with applications to modeling developmental effects. Biometrics 53,

Albert, J.H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88,

Ashford, J.R. and Sowden, R.R. (1970). Multi variate probit analysis. Biometrics 26,

Bahadur, R.R. (1961). A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction, H. Solomon (ed), Stanford University Press, California.

Begg, M.D. and Parides, M.K. (2003). Separation of individual level and cluster level covariate effects in regression analysis of correlated data. Statistics in Medicine 22,

Bowman, D. and George, E.O. (1995). A saturated model for analyzing exchangeable binary data: Applications to clinical and developmental toxicity studies. Journal of the American Statistical Association 90,

Catalano, P.J. and Ryan, L.M. (1992). Bivariate latent variable models for clustered discrete and continuous outcomes. Journal of the American Statistical Association 87,

Chan, J.S.K. and Kuk, A.Y.C. (1997). Maximum likelihood estimation for probit linear mixed models with correlated random effects. Biometrics 53,

Chen, J. (1993). A malformation incidence dose response model incorporating fetal weight and/or litter size as covariates. Risk Analysis 13,

Chib, S. and Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika 85,

17 17 Cowles, M.K. and Carlin, B.P. (1996). Markov chain Monte Carlo convergence diagnostics: A comparative review. Journal of the American Statistical Association 91, Edwards, Y.D. and Allenby, G.M. (2003). Multivariate analysis of multiple response data. Journal of Marketing Research 40, Gelfand, A.E. and Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association 85, Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, George, E.O. and Bowman, D. (1995). A full likelihood procedure for analysing exchangeable binary data. Biometrics 51, George, E.O. and Kodell R.L. (1996). Tests of independence, treatment heterogeneity, and dose related trend with exchangeable binary data. Journal of the American Statistical Society 91, Genz, A. (1992). Numerical computation of the multivariate normal probabilities. Journal of Computational and Graphical Statistics 1, Geys, H., Regan, M.M., Catalano, P.J. and Molenberghs, G. (2001). Two latent variable risk assessment approaches for mixed continuous and discrete outcomes from developmental toxicity data Journal of Agricultural, Biological and Environmental Statistics, 6, Goldstein, H. (2003). Multilevel Statistical Models. 3rd Edition, Arnold, London. Khoury, M.J., Beaty, T.H. and Cohen, B.H. (1993) Fundamentals of Genetic Epidemiology. Oxford University Press, New York. Kotz, S., Balakrishnan, N. and Johnson, N.L. (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications, 2nd Edition. Wiley, New York. Kuk, A.Y.C. (2004). A litter based approach to risk assessment in developmental toxicity studies via a power family of completely monotone functions. Applied Statistics 53,

18 18 C. Stefanescu and B. Turnbull: Multivariate Probit Model Liang, K. Y. and Beaty, T. H. (1991). Measuring familial aggregation by using odds ratio regression models. Genetic Epidemiology 8, Liang, K.Y., Zeger, S.L. and Qaqish, B. (1992). Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, Series B 54, Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B 135, McCulloch, C.E. (1994). Maximum likelihood variance components estimation for binary data. Journal of the American Statistical Association 89, Ochi, Y. and Prentice, R.L. (1984). Likelihood inference in a correlated probit regression model. Biometrika 71, Pendergast, J. F., Gange, S. J., Newton, M. A., Lindstrom, M. J., Palta, M. and Fisher, M. R. (1996). A survey of methods for analyzing clustered binary response data. International Statistical Review 64, Regan, M.M. and Catalano, P.J. (1999). Likelihood models for clustered binary and continuous outcomes: application to developmental toxicology. Biometrics 55, Ritter, C. and Tanner, M.A. (1992). The Gibbs stopper and the griddy Gibbs sampler. Journal of the American Statistical Association 87, Robert, C.P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York. Ryan, L.M., Catalano, P.J., Kimmel, C.A. and Kimmel, G.L. (1991). Relationship between fetal weight and malformation in developmental toxicity studies. Teratology 44, Schervish, M.J. (1984). Algorithm AS 195: Multivariate normal probabilities with error bound. Applied Statistics 33, (Correction 34 (1985), ) Sorensen, D.A., Andersen,S,, Gianola, D. and Korsgaard, I. (1995). Bayesian inference in threshold models using Gibbs sampling. Genetics, Selection, Evolution 27,

19 19 Spiegelhalter, D.J., Thomas, A., Best, N.G. and Lunn, D. (2003). WinBUGS Version 1.4 User Manual. MRC Biostatistics Unit, Cambridge. Stefanescu, C. and Turnbull, B. W. (2003). Likelihood inference for exchangeable binary data with varying cluster sizes. Biometrics 59, Tanner, T.A. and Wong, W.H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82, Turner, R.M., Omar, R.Z. and Thompson, S.G. (2001). Bayesian methods of analysis for cluster randomized trials with binary outcome data. Statistics in Medicine 20, U.S. National Bureau of Standards (1959). Tables of the Bivariate Normal Distribution Function and Related Functions. U.S. Government Printing Office, Washington DC. Xu, J.L. and Prorok, P.C. (2003). Modelling and analysing exchangeable binary data with random cluster sizes. Statistics in Medicine 22, Zeger, S.L., Liang, K.Y. and Albert, P.S. (1988). Models for longitudinal data: A generalized estimating equations approach. Biometrics 44,

Table 1 Values of correlation and odds ratio of binary variates in a multivariate probit model, for selected values of the absolute values of the means xβ and correlation ρ of the latent variables.
Panels: Corr(Y_i, Y_j; x, x) and Ψ(x, x), each tabulated against xβ and ρ.
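
As a rough illustration of the quantities that Table 1 reports, the following Python sketch computes the correlation and odds ratio of two binary variates that share a common latent mean. It assumes the convention Y = 1 when a unit-variance latent normal variable with mean xβ exceeds zero, and a nonnegative latent correlation ρ, so that the pair can be written with a shared random effect. The function names, the integration grid and the use of Simpson's rule are illustrative choices and are not taken from the paper.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def joint_success_prob(mu, rho, n_steps=2000, width=8.0):
    """P(Y_i = 1, Y_j = 1) for latent mean mu and latent correlation rho.

    Assumed representation (valid for 0 <= rho < 1):
        Z_i = mu + sqrt(rho) * W + sqrt(1 - rho) * e_i,  Y_i = 1 if Z_i > 0,
    with W, e_i independent standard normals. Given W = w the two responses
    are independent, so the joint probability is a one-dimensional integral,
    evaluated here with Simpson's rule on [-width, width].
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("this sketch requires 0 <= rho < 1")
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")
    h = 2.0 * width / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        w = -width + k * h
        weight = 1 if k in (0, n_steps) else (4 if k % 2 else 2)
        cond = norm_cdf((mu + math.sqrt(rho) * w) / math.sqrt(1.0 - rho))
        total += weight * norm_pdf(w) * cond * cond
    return total * h / 3.0


def binary_association(mu, rho):
    """Return (correlation, odds ratio) of Y_i and Y_j with equal latent means."""
    p = norm_cdf(mu)                  # marginal P(Y = 1)
    p11 = joint_success_prob(mu, rho)
    p10 = p - p11                     # P(Y_i = 1, Y_j = 0) = P(Y_i = 0, Y_j = 1)
    p00 = 1.0 - 2.0 * p + p11
    corr = (p11 - p * p) / (p * (1.0 - p))
    odds_ratio = p11 * p00 / (p10 * p10)
    return corr, odds_ratio


if __name__ == "__main__":
    # Illustrative grid only; these are not the values used in the paper.
    for mu in (0.0, 1.0, 2.0):
        for rho in (0.2, 0.5, 0.8):
            corr, psi = binary_association(mu, rho)
            print(f"xb={mu:.1f} rho={rho:.1f} corr={corr:.4f} odds_ratio={psi:.3f}")
```

With xβ = 0 the output can be checked against the closed form Corr(Y_i, Y_j) = 2 arcsin(ρ)/π, which gives a quick sanity test of the numerical integration.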

Table 2 Frequency of liver cancer by size of the family (Liang and Beaty, 1991).
Rows: family size, with a Total row. Columns: number of cases of liver cancer, with a Total column.

Table 3 Bayesian estimates for liver cancer data.
Columns: explanatory variable, mean, median, standard error, 95% credible interval.

Model 1
Intercept (β0)       ( 1.307, 0.253)
Gender (β1)          ( 0.670, 0.066)
Age (β2)             ( 1.018, 0.292)
HBV (β3)             ( 0.076, 1.100)
Diet (β4)            ( 0.738, 0.163)
Water source (β5)    ( 0.567, 0.377)
ρ                    (0.607, 0.897)

Model 2
Intercept (β0)       ( 0.882, 0.416)
Gender (β1)          ( 0.676, 0.076)
Age (β2)             ( 1.002, 0.293)
ρ                    (0.605, 0.890)

Table 4 Estimates of p(x), Ψ(x_i, x_j) and ρ_Y(x_i, x_j) with 95% credible intervals (Model 2).

                 female          female          male            male
                 over 40 yrs     under 40 yrs    over 40 yrs     under 40 yrs
x                (1 0 0)         (1 0 1)         (1 1 0)         (1 1 1)
p(x)             (0.189, 0.339)  (0.046, 0.178)  (0.091, 0.237)  (0.018, 0.107)

Estimates of Ψ(x_i, x_j)
(1 0 0)          (6.50, 36.97)   (8.35, 97.71)   (7.30, 52.25)   (10.68, )
(1 0 1)                          (8.98, 56.72)   (8.56, 65.92)   (10.84, 85.49)
(1 1 0)                                          (7.63, 45.61)   (10.36, )
(1 1 1)                                                          (12.15, 83.81)

Estimates of ρ_Y(x_i, x_j)
(1 0 0)          (0.39, 0.68)    (0.30, 0.58)    (0.35, 0.64)    (0.20, 0.49)
(1 0 1)                          (0.31, 0.65)    (0.31, 0.63)    (0.27, 0.60)
(1 1 0)                                          (0.35, 0.66)    (0.25, 0.55)
(1 1 1)                                                          0.44 (0.26, 0.62)

Table 5 Bayesian estimates for teratology data.
Columns: explanatory variable, mean, median, standard error, 95% credible interval.

Intercept (β0)       (1.034, 2.005)
Dose (β1)            (0.0445, )
I(dose > 0) (β2)     ( 1.032, 0.374)
Weight (β3)          ( , )
ρ                    (0.294, 0.441)

Table 6 Comparison of multivariate probit model and tree-structured model with regard to predicted number of malformations and malformation rates at each dose level for teratology data. Tree model is model 2 from Ahn and Chen (1997, Table 4, last column).
Columns: dose; number of malformations (observed, probit, tree model); malformation rates (observed, probit, tree model).
