Multivariate clustered data analysis in developmental toxicity studies


Statistica Neerlandica (2001) Vol. 55, nr. 3, pp. 319-345

G. Molenberghs and H. Geys
Biostatistics, Center for Statistics, Limburgs Universitair Centrum, Universitaire Campus, B3590 Diepenbeek, Belgium
geert.molenberghs@luc.ac.be, helena.geys@luc.ac.be

Published by Blackwell Publishers, 108 Cowley Road, Oxford OX4 1JF, UK and 350 Main Street, Malden, MA 02148, USA.

In this paper we review statistical methods for analyzing developmental toxicity data. Such data raise a number of challenges. Models that try to accommodate the complex data generating mechanism of a developmental toxicity study should take into account the litter effect and the number of viable fetuses, malformation indicators, weight and clustering, as a function of exposure. Further, the size of the litter may be related to outcomes among live fetuses. Scientific interest may lie in inference about the dose effect, in the implications of model misspecification, in the assessment of model fit, and in the calculation of derived quantities such as safe limits. We describe the relative merits of conditional, marginal and random-effects models for multivariate clustered binary data and present joint models for both continuous and discrete data.

Key Words and Phrases: dose-response models, generalized estimating equations, conditional model, marginal model, maximum likelihood, pseudo-likelihood.

1 Introduction

Lately, society has been increasingly concerned about problems related to fertility and pregnancy, birth defects, and developmental abnormalities. Consequently, regulatory agencies such as the U.S. Environmental Protection Agency (EPA) and the Food and Drug Administration (FDA) have given increased priority to protection against drugs, harmful chemical compounds and other environmental hazards. As epidemiological evidence of adverse effects on fetal development may not be available for specific chemical compounds present in the environment, laboratory experiments in small mammalian species provide an alternative source of evidence essential for identifying potential developmental toxicants.

For ethical reasons, animal studies afford a greater level of control than epidemiological studies. Moreover, they can be conducted in advance of human exposure. Laboratory studies have been used for several decades, but methods for extrapolating the results to humans are still being developed and refined. Since these laboratory studies involve considerable amounts of time and money, as well as huge numbers of animals, it is essential that the most

appropriate and efficient statistical models are used (WILLIAMS and RYAN, 1996).

Three standard procedures (Segments I, II and III) have been established to assess specific types of effects. Segment I or fertility studies are designed to assess male and female fertility and general reproductive ability. Such studies are typically conducted in one species of animals and involve exposing males for 60 days and females for 14 days prior to mating. Segment II studies are also referred to as teratology studies, since historically the primary goal was to study malformations. Segment III tests are focused on effects later in gestation and involve exposing pregnant animals from the 15th day of gestation through lactation. In this paper, we concentrate on Segment II studies.

A Segment II experiment involves exposing timed-pregnant animals (rats, mice and occasionally rabbits, for which the time of fertilization can be calculated) during major organogenesis (days 6 to 15 for mice and rats) and structural development. Administration of the exposure is generally by the clinical or environmental routes most relevant for human exposure. Dose levels consist of a control group and 3 or 4 dose groups, each with 20 to 30 pregnant dams. The dams are sacrificed just prior to normal delivery, at which time the uterus is removed and thoroughly examined.

An interesting aspect of Segment II designs is the hierarchical structure of the developmental outcomes. Figure 1 illustrates the data structure. An implant may be resorbed at different stages during gestation (very early death that is detectable at the time of maternal sacrifice as a small mark on the uterine wall). If the implant survives being resorbed, the developing fetus is at risk of fetal death. Adding the number of resorptions and fetal deaths yields the number of non-viable fetuses. If the fetus survives the entire gestation period, growth reduction such as low birth weight may occur.
The fetus may also exhibit one or more types of malformation. These are commonly classified into three broad categories: (i) external malformations are those visible to the naked eye, for instance missing limbs; (ii) skeletal malformations might include missing or malformed bones; and (iii) visceral malformations affect internal organs such as the heart, the brain, the lungs, etc. Each specific malformation is typically recorded as a dichotomous variable (present or absent). Adding the number of resorptions, the number of fetal deaths and the number of viable fetuses yields the total number of implantations. Since exposure to the test agent takes place after implantation, the number of implants, a random variable, is not expected to be dose-related. While such a hierarchical approach naturally deals with a structure which could also be considered to be incomplete, classical missing data methods are very uncommon in this field.

The analysis of developmental toxicity data as described above raises a number of challenges (MOLENBERGHS et al., 1998; ZHU and FUNG, 1996), summarized below.

Because of genetic similarity and the same treatment conditions, offspring of the same mother behave more alike than those of different mothers. This has been termed the litter effect. As a result, responses on different fetuses within a cluster are likely to be correlated, inducing extra variation in the data relative to that associated with the common binomial or multinomial distribution. This extra variation must be

taken into account in statistical analyses (CHEN and KODELL, 1989; KUPPER et al., 1986).

Fig. 1. Data structure of developmental toxicity studies.

Since deleterious events can occur at several points in development, an interesting aspect lies in the staging or hierarchy of possible adverse fetal outcomes (WILLIAMS and RYAN, 1996). Ultimately, a model should take into account this hierarchical structure in the data: (i) a toxic insult early in gestation may result in a resorbed fetus; (ii) thereafter an implant is at risk of fetal death; (iii) fetuses that survive the entire gestation period are threatened by low birth weight and/or several types of malformation.

While some attempts have been made at the joint analysis of prenatal death and malformation (CHEN et al., 1991; RYAN, 1992), the analysis of developmental toxicity data has usually been conducted on the number of viable fetuses alone. An appropriate statistical model should then account for possible correlations among the different fetal endpoints.

As the number of viable fetuses can

sometimes affect the chance of an adverse effect (in a large litter a larger number of animals have to compete for the same maternal resources and therefore the probability of malformation may be larger), a model should also be flexible enough to allow litter size to affect response probabilities.

Finally, one may have to deal with outcomes of a mixed continuous (low birth weight)/discrete (malformation indicator) nature.

A unique type of developmental toxicity study was originally developed by BROWN and FABRO (1981) to assess the impact of heat stress on embryonic development. Subsequent adaptations by KIMMEL et al. (1994) allow the investigation of effects related to both temperature and duration of exposure. These heatshock experiments are described in detail in Section 2.3.

2 Motivating studies

In this section we introduce two developmental toxicity studies conducted by the Research Triangle Institute under contract to the National Toxicology Program (NTP). The studies concerned the effects, in mice and rats respectively, of di(2-ethylhexyl)-phthalate (DEHP) (TYL et al., 1988) and ethylene glycol (EG) (PRICE et al., 1987). Further, we also describe the heatshock studies introduced by BROWN and FABRO (1981).

2.1 DEHP study in mice

The use of phthalic acid esters as plasticizers for numerous plastic devices is widespread. The most commonly used ester is di(2-ethylhexyl)-phthalate (DEHP), which may constitute as much as 40% by weight of the finished products, in order to provide them with a desirable flexibility and clarity. It has been well documented that small quantities of phthalic acid esters may leak out of plastic containers in the presence of food, milk, blood, or various solvents. Owing to their ubiquitous distribution and presence in human and animal tissues, the possible toxic effects of the phthalic acid esters have been the subject of considerable concern. In particular, the developmental toxicity study described by TYL et al.
(1988) has attracted much interest in the toxicity of DEHP. The doses selected for the study were 0, 0.025, 0.05, 0.10 and 0.15% DEHP with 25, 26, 26, 17 and 9 timed-pregnant mice assigned to each of these dose groups, respectively. Females were observed daily during treatment, but no maternal deaths or distinctive clinical signs were observed. The dams were sacrificed slightly prior to normal delivery and the status of uterine implantation sites was recorded. A total of 1082 live fetuses were dissected from the uterus, anaesthetised, and examined for external, visceral and skeletal malformations.

Table 1 shows, for each dose group, the number of pregnant dams, the mean litter size for live fetuses and the rate of malformation (number of live fetuses affected × 100 / number of live fetuses) for three different classes: external malformations, visceral malformations and skeletal malformations. The table suggests clear dose-related trends in the malformation rates. The average litter size (number of viable

animals) decreases with increased levels of exposure to DEHP, a finding that is attributable to the dose-related increase in fetal deaths.

Table 1. Summary data from a DEHP experiment in mice. Columns: Dose; Dams; Litter Size (mean); Malformations (%): Ext., Visc., Skel.

2.2 EG study in rats

PRICE et al. (1985) describe a developmental toxicity experiment investigating the effect of EG in rats. The doses selected for the present teratology study were 0, 1.25, 2.50 and 5.0 g/kg/day. A total of 1368 live rat fetuses were examined for low birth weight (continuous) or defects (binary). This joint occurrence of continuous and binary outcomes will provide additional challenges in model development.

Table 2 summarizes the malformation and fetal weight data from this experiment. The data show clear dose-related trends for both outcomes. The rate of malformation increases with dose, ranging from 1.3% in the control group to 68.6% in the highest dose group. The mean fetal weight decreases monotonically with increasing dose, ranging from 3.40 g to 2.48 g in control and highest dose group, respectively. The fetal weight variances, however, do not change monotonically with dose. In the lower dose groups, the variances remain approximately constant. However, in the highest dose group, the fetal weight variance is elevated. Further, it can be observed that simple Pearson correlation coefficients ($\rho$) between weight and malformation tend to strengthen with increasing doses. As doses increase, the correlation becomes more negative, because the probability of malformation is increasing and fetal weight is decreasing. This is illustrated in Figure 2, which shows the observed malformation rates for all clusters (litters), the averaged malformation rates for each dose group (with 95% confidence limits), the average weight outcomes for all clusters and the average weight outcomes (with 95% confidence limits) for each dose group.

Table 2. Summary data from an EG experiment in rats. Columns: Dose (g/kg/day); Dams; Litter Size (mean); Malf. (Nr., %); Weight (Mean, SD); Pearson Corr. ($\rho$).

Fig. 2. EG (rats) study. Top panel: observed malformation rates for all clusters (the size of the symbol reflects the multiplicity of the rate) and the observed malformation rates for all dose groups. Bottom panel: average weights for all clusters and average weights for all dose groups.

2.3 Heatshock studies

Heatshock studies have been described by BROWN and FABRO (1981) and KIMMEL et al. (1994). In these studies, embryos are removed from the uterus of the maternal dam and cultured in vitro. Next, each embryo is exposed to a short period of heat stress by placing the culture vial into a warm water bath, involving an increase over body temperature of 4 to 5°C for a duration of 5 to 45 minutes. The embryos are examined 24 hours later for impaired or accelerated development.

This type of developmental test system has several advantages over the standard Segment II design. First of all, the exposure is directly administered to the embryo, so that controversial issues regarding the unknown relationship between the exposure level to the maternal dam and that which is actually received by the embryo need not be taken into account. Secondly, the exposure pattern can be easily controlled, since target temperature levels in the warm water baths can be achieved within 2 minutes. Further, information regarding the effects of exposure is quickly obtained, in contrast to the Segment II study, which requires 8 to 12 days after exposure to assess impact. Finally, this animal test system provides a convenient mechanism for examining the joint effects of both duration of exposure and exposure levels.

The studies collect measurements on 13 morphological variables. We will focus our attention on three of these: the olfactory system (OLF), connected with the sense of smell; the optic system (OPT), related to vision; and the midbrain (MBN), the middle of the three primary divisions of the brain. We assess the effects of both duration and level of exposure on each morphological endpoint, coded as affected (1) versus normal (0).
The study design for the set of experiments conducted by KIMMEL et al. (1994) is shown in Table 3, which indicates the number of embryos cultured in each temperature-duration combination. A total of 375 embryos, arising from 71 initial dams, survived the heat exposure and were used for analysis.

Table 3. Heatshock studies: Number of (viable) embryos exposed to each combination of duration and temperature. Rows: Temperature; columns: Duration of Exposure; with row and column totals.

The distribution of cluster sizes, ranging between 2 and 11, is given in Table 4. The mean cluster size is 5. Since only surviving fetuses were included, cluster sizes are smaller than those observed in most other developmental toxicity studies and do not reflect the true original litter size.

3 Accounting for litter effects

Failure to account for the clustering in the data can lead to serious underestimation of the variances of dose effect parameters and, hence, inflated test statistics. The need for methods that appropriately account for the heterogeneity among litters, especially with regard to binary outcomes, has long been recognized. HASEMAN and KUPPER (1979) provide an excellent survey of likelihood generalizations of standard distributions to account for clustering. At a more complicated level, HOUWING-DUISTERMAAT and VAN HOUWELINGEN (1998) incorporate a familial association structure in logistic regression models.

In general, models for multivariate correlated binary data can be grouped into the following classes: conditionally specified models, marginal models, or cluster-specific models (DIGGLE, LIANG and ZEGER, 1994). The answer to the question `Which model family is to be preferred?' depends principally on the research question(s) to be answered. In conditionally specified models the probability of a positive response for one member of the cluster is modelled conditionally on other outcomes for the same cluster, while marginal models relate the covariates directly to the marginal probabilities. Cluster-specific (CS) models differ from the two previous models by the inclusion of parameters which are specific to the cluster.

Which method is used to fit the model depends, to a large extent, on the assumptions the investigator is willing to make. If one is willing to specify fully the joint probabilities, maximum likelihood methods can be adopted.
Yet, if only a partial description in terms of marginal or conditional probabilities is given, one has to rely on non-likelihood methods such as generalized estimating equations or pseudo-likelihood methods.

Table 4. Heatshock studies: distribution of cluster sizes. Rows: cluster size $n_i$; number of clusters of size $n_i$.

3.1 Conditional modelling

Owing to the popularity of marginal (especially generalized estimating equations) and random-effects models for correlated binary data, conditional models have received relatively little attention, especially in the context of multivariate clustered data. DIGGLE, LIANG and ZEGER (1994, pp. 147-148) criticized the conditional approach because the interpretation of the dose effect on the risk of one outcome is

conditional on the responses of other outcomes for the same individual, outcomes of other individuals and the litter size. MOLENBERGHS and RYAN (1999), henceforth abbreviated as MR, discuss the advantages of conditional models and how, with appropriate care, the disadvantages can be overcome. They constructed the joint distribution for clustered multivariate binary outcomes, based on a multivariate exponential family model (COX, 1972). Defining $z_{ij}$ as the number of individuals from cluster $i$ positive on outcome $j$, and $z_{ijj'}$ as the number positive on both outcomes $j$ and $j'$, the model is expressed as:

$$
f_{Y_i}(y_i; \Theta_i) = \exp\left\{ \sum_{j=1}^{M} \theta_{ij} z_{ij}^{(1)} + \sum_{j=1}^{M} \delta_{ij} z_{ij}^{(2)} + \sum_{j<j'} \omega_{ijj'} z_{ijj'}^{(3)} + \sum_{j<j'} \gamma_{ijj'} z_{ijj'}^{(4)} - A(\Theta_i) \right\}, \qquad (1)
$$

where $z_{ij}^{(1)} = z_{ij}$, $z_{ij}^{(2)} = -z_{ij}(n_i - z_{ij})$, $z_{ijj'}^{(3)} = 2 z_{ijj'} - z_{ij} - z_{ij'}$, and $z_{ijj'}^{(4)} = -z_{ij}(n_i - z_{ij'}) - z_{ij'}(n_i - z_{ij}) - z_{ijj'}^{(3)}$. $A(\Theta_i)$ is the normalizing constant, resulting from summing the previous model over all $2^{M n_i}$ possible outcomes.

The advantage of this model is the flexibility with which both main effects and associations can be modelled, and the absence of constraints on the parameter space, which eases interpretability. Further, the fact that the probability model depends explicitly and implicitly on the cluster size is seen as an advantage, since it is in line with the observation that litter size itself depends on the level of exposure.

The flexibility of the MR model partly relies on the exponential family framework. However, the classical advantages of exponential families can be lost, especially in multivariate settings, where the normalizing constant poses excessive computational requirements. Several suggestions have been made to overcome this problem, such as Monte Carlo integration (TANNER, 1991).
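To make the role of the normalizing constant in model (1) concrete, the following minimal sketch considers only the special case of a single outcome ($M = 1$) with parameters $\theta$ and $\delta$, and assumes the sign convention $z^{(2)} = -z(n - z)$, so that a positive $\delta$ favours litters whose members respond alike. The function names and the numerical values are illustrative assumptions, not taken from the paper. The brute-force sum runs over all $2^{n}$ binary vectors, while the collapsed sum groups them by the count $z$ and needs only $n + 1$ terms; this shortcut relies on exchangeability within a single outcome.

```python
import math
from itertools import product


def log_norm_const_bruteforce(theta, delta, n):
    """ln of the sum of exp(theta*z - delta*z*(n-z)) over all 2**n binary vectors."""
    total = 0.0
    for y in product((0, 1), repeat=n):
        z = sum(y)  # number of affected fetuses in this configuration
        total += math.exp(theta * z - delta * z * (n - z))
    return math.log(total)


def log_norm_const_collapsed(theta, delta, n):
    """Same constant, grouping the 2**n vectors by their count z (n + 1 terms)."""
    total = sum(
        math.comb(n, z) * math.exp(theta * z - delta * z * (n - z))
        for z in range(n + 1)
    )
    return math.log(total)


def count_pmf(theta, delta, n):
    """Distribution of the number of affected fetuses Z in a litter of size n."""
    a = log_norm_const_collapsed(theta, delta, n)
    return [
        math.comb(n, z) * math.exp(theta * z - delta * z * (n - z) - a)
        for z in range(n + 1)
    ]


if __name__ == "__main__":
    theta, delta, n = -2.0, 0.2, 8  # illustrative values, not estimates from the paper
    print(log_norm_const_bruteforce(theta, delta, n))
    print(log_norm_const_collapsed(theta, delta, n))
    # Number of configurations for M = 3 outcomes in a litter of 12 fetuses
    print(2 ** (3 * 12))
```

With $\delta = 0$ the count distribution reduces to an ordinary binomial with success probability $e^{\theta}/(1 + e^{\theta})$. For several outcomes the collapse to counts is less simple, which is where the computational burden discussed above arises.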
GEYER and THOMPSON (1992), for example, use Markov chain Monte Carlo simulations to construct a Monte Carlo approximation to an analytically intractable likelihood. ARNOLD and STRAUSS (1991) and ARNOLD, CASTILLO and SARABIA (1992) propose the use of a so-called pseudo-likelihood (PL). Pseudo-likelihood (or pseudo-maximum-likelihood) methods are alternatives to maximum likelihood estimation that retain the methodology and properties while trying to eliminate some of the difficulties, such as strong distributional assumptions or intensive computations. The idea is that a parametric family of models is specified, to which likelihood methodology is applied; the method is denoted `pseudo', as there is no assumption that this family is the true distribution generating the data.

GEYS, MOLENBERGHS and RYAN (1997, 1999) implemented a pseudo-likelihood method for the MR model that replaces the joint distribution of the responses, a multivariate exponential-family model, by a product of conditional densities that do not necessarily multiply to the joint distribution. In this approach, the normalizing constant cancels, thus greatly simplifying computations, especially when litter sizes are large

and variable (since the normalizing constant depends on litter size).

Let $(y_1, \ldots, y_N)$ be a set of $M$-dimensional observation vectors. Define $S$ as the set of all $2^M - 1$ vectors of length $M$, consisting solely of zeros and ones, with each vector having at least one nonzero entry. Denote by $y_k^{(s)}$ the subvector of $y_k$ corresponding to the components of $s \in S$ that are nonzero. The associated joint density is written as $f_s(y_k^{(s)}; \Theta_k)$. Specify a set $\delta = \{\delta_s \mid s \in S\}$ of real numbers, with at least one nonzero component, and define the log pseudo-likelihood as:

$$
pl = \sum_{i=1}^{N} \sum_{s \in S} \delta_s \ln f_s(y_i^{(s)}; \Theta_i), \qquad (2)
$$

where some (though not all) of the $\delta_s$ may be negative, but $\exp(pl)$ corresponds to a product of marginal and conditional densities. ARNOLD and STRAUSS (1991) established consistency and asymptotic normality.

For clustered, multivariate binary data a convenient PL function is found by replacing the joint density (1) by the product of $M n_i$ univariate conditional densities describing outcome $j$ for the $k$th individual in a cluster, given all other outcomes in the cluster:

$$
PL^{(1)} = \prod_{i=1}^{N} \prod_{j=1}^{M} \prod_{k=1}^{n_i} f(y_{ijk} \mid y_{ij'k'},\ j' \neq j \text{ or } k' \neq k;\ \Theta_i). \qquad (3)
$$

Equation (3) is but one definition of the PL function. GEYS, MOLENBERGHS and RYAN (1997) considered an alternative specification and further showed that only a very small efficiency loss was paid for the enormous computational gains of pseudo-likelihood over maximum likelihood. Moreover, they proposed score and likelihood ratio tests for the pseudo-likelihood framework. These are easy to calculate, behave very satisfactorily and provide the necessary tools for model selection. AERTS and CLAESKENS (1997) showed how bootstrap approximations can be used as interesting alternatives to the classical asymptotic chi-squared distributions of these test statistics.
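The cancellation of the normalizing constant in (3) can be checked numerically. The sketch below is a minimal illustration for a single outcome ($M = 1$) only, assuming the joint weight $\exp\{\theta z - \delta z (n - z)\}$ for a binary vector with $z$ successes; under that assumption the full conditional of one fetus given the other $n - 1$ is logistic with linear predictor $\theta - \delta(n - 1 - 2 z_{\text{others}})$. The function names and the example litters are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_weight(theta, delta, n, z):
    """Unnormalized probability of one specific binary vector with z successes."""
    return math.exp(theta * z - delta * z * (n - z))


def cond_prob_from_joint(theta, delta, n, z_others):
    """P(y_k = 1 | others) as a ratio of joint weights; the constant A cancels."""
    w1 = joint_weight(theta, delta, n, z_others + 1)
    w0 = joint_weight(theta, delta, n, z_others)
    return w1 / (w1 + w0)


def cond_prob_closed_form(theta, delta, n, z_others):
    """Same conditional probability written as a logistic expression."""
    return expit(theta - delta * (n - 1 - 2 * z_others))


def log_pl_cluster(theta, delta, n, z):
    """Log pseudo-likelihood of one litter: n full conditionals, z of them successes."""
    lp = 0.0
    if z > 0:  # each success sees z - 1 successes among its littermates
        lp += z * math.log(cond_prob_closed_form(theta, delta, n, z - 1))
    if z < n:  # each failure sees z successes among its littermates
        lp += (n - z) * math.log(1.0 - cond_prob_closed_form(theta, delta, n, z))
    return lp


def log_pl(theta, delta, clusters):
    """Sum over litters; clusters is a list of (litter size, number affected)."""
    return sum(log_pl_cluster(theta, delta, n, z) for n, z in clusters)


if __name__ == "__main__":
    litters = [(10, 0), (12, 3), (8, 8)]  # made-up data for illustration
    print(log_pl(-2.0, 0.2, litters))
```

Because every term is a logistic probability, the function can be maximized with standard logistic-regression machinery, which is one way to read the computational gains reported above.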
Table 5. Maximum likelihood (model based standard errors; empirically corrected standard errors) and pseudo-likelihood (standard errors) estimates for the conditional model

Method  Par.        External            Visceral            Skeletal            Collapsed
ML      $\beta_0$        (0.58;0.52)    2.39 (0.50;0.52)    2.79 (0.58;0.77)    2.04 (0.35;0.42)
        $\beta_d$   3.07 (0.65;0.62)    2.45 (0.55;0.60)    2.91 (0.63;0.82)    2.98 (0.51;0.66)
        $\beta_a$   0.18 (0.04;0.04)    0.18 (0.04;0.04)    0.17 (0.04;0.05)    0.16 (0.03;0.03)
PL      $\beta_0$        (0.53)         2.30 (0.50)         2.41 (0.73)         1.80 (0.35)
        $\beta_d$   3.24 (0.60)         2.55 (0.53)         2.52 (0.81)         2.95 (0.56)
        $\beta_a$   0.18 (0.04)         0.20 (0.04)         0.21 (0.05)         0.20 (0.03)

As an illustration, Table 5 shows the maximum likelihood and pseudo-likelihood estimates of the DEHP study for the univariate conditional model, applied to each of the outcomes: external, visceral and skeletal malformation, as well as a collapsed

outcome, defined to be 1 if any malformation occurred and 0 otherwise. The natural parameters were modelled as follows: $\theta_i = \beta_0 + \beta_d d_i$, where $d_i$ is the dose level applied to the $i$th cluster, and $\delta_i = \beta_a$, i.e. a constant association model. The ML and PL parameter estimates agree fairly closely. In all cases, we find a highly significant dose trend and a significant clustering parameter. Since the pseudo-likelihood still reflects the underlying likelihood, it can be useful for dose-response modelling.

3.2 Marginal modelling

In marginal models, the parameters characterize the marginal probabilities of a subset of the outcomes, without conditioning on the other outcomes. BAHADUR (1961) proposed a marginal model, accounting for the association via marginal correlations. This model has also been studied by COX (1972), KUPPER and HASEMAN (1978) and ALTHAM (1978). Assuming exchangeability, in the sense that each fetus within a litter has the same malformation probability, and in addition setting all the three- and higher-way correlations equal to zero, Bahadur's representation can be simplified to give the following marginal distribution of $Z_i$, the number of malformations in cluster $i$:

$$
f(z_i \mid \pi_i, \rho_i) = \binom{n_i}{z_i} \pi_i^{z_i} (1 - \pi_i)^{n_i - z_i} \left[ 1 + \rho_i \left\{ \binom{z_i}{2} \frac{1 - \pi_i}{\pi_i} + \binom{n_i - z_i}{2} \frac{\pi_i}{1 - \pi_i} - z_i (n_i - z_i) \right\} \right],
$$

where $\pi_i$ denotes the malformation probability in the $i$th cluster, $\rho_i$ the pairwise correlation between any two malformation outcomes and $n_i$ the litter size.

A drawback of this approach is the fact that the correlation $\rho_i$ is highly constrained when the higher order correlations have been removed. Even when higher order parameters are included, the parameter space of marginal parameters and correlations often has a very peculiar shape. BAHADUR (1961) discusses restrictions on the parameter space in the case of a second order approximation.
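The constraint on the correlation can be made tangible with a short numerical sketch of the simplified Bahadur distribution given above. Since the bracketed correction is linear in $\rho$, requiring every probability to be non-negative yields an interval of admissible correlations for a given litter size and malformation probability. The helper names (`bahadur_pmf`, `rho_bounds`) and the chosen inputs are assumptions made for illustration only.

```python
import math


def correction_coef(z, n, pi):
    """Coefficient of rho inside the square brackets of the Bahadur distribution."""
    return (
        math.comb(z, 2) * (1.0 - pi) / pi
        + math.comb(n - z, 2) * pi / (1.0 - pi)
        - z * (n - z)
    )


def bahadur_pmf(z, n, pi, rho):
    """Second-order exchangeable Bahadur probability of z malformations out of n."""
    binom = math.comb(n, z) * pi ** z * (1.0 - pi) ** (n - z)
    return binom * (1.0 + rho * correction_coef(z, n, pi))


def is_valid(n, pi, rho):
    """True when every probability f(z), z = 0..n, is non-negative."""
    return all(bahadur_pmf(z, n, pi, rho) >= 0.0 for z in range(n + 1))


def rho_bounds(n, pi):
    """Interval of rho for which 1 + rho * c_z >= 0 holds for every count z."""
    coefs = [correction_coef(z, n, pi) for z in range(n + 1)]
    lower = max((-1.0 / c for c in coefs if c > 0), default=-math.inf)
    upper = min((-1.0 / c for c in coefs if c < 0), default=math.inf)
    return lower, upper


if __name__ == "__main__":
    print(rho_bounds(2, 0.5))   # litter of size two
    print(rho_bounds(12, 0.5))  # a more realistic litter size
    print(is_valid(12, 0.5, 0.1), is_valid(12, 0.5, 0.2))
```

The probabilities sum to one for any $\rho$, so an inadmissible $\rho$ shows up only as negative values of $f(z)$, which is why an explicit check such as `is_valid` is useful when fitting the model.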
From these restrictions, it can be deduced that the lower bound approaches zero as the cluster size increases. However, it is important to note that the upper bound for $\rho_i$ is constrained as well. Indeed, even though it is one for clusters of size two, the upper bound is in the range $[1/(n_i - 1), 2/(n_i - 1)]$ for larger clusters. Taking a (realistic) cluster of size 12, the upper bound is in the range (0.09; 0.18). KUPPER and HASEMAN (1978) present numerical values for the constraints on $\rho_i$ for choices of $\pi_i$ and $n_i$. Restrictions for a specific version, where a third order association parameter is included as well, have been studied by PRENTICE (1988). A more general situation is discussed in DECLERCK, AERTS and MOLENBERGHS (1998). They have shown that the range of second order associations is markedly enlarged in a four-way Bahadur model. But

fitting higher order Bahadur models is difficult, due to the increasingly complicated nature of the restrictions on the parameter space.

MOLENBERGHS and LESAFFRE (1994) and LANG and AGRESTI (1994) have proposed models which parameterize the association in terms of marginal odds ratios. DALE (1986) defined the bivariate global odds ratio model, based on a bivariate Plackett distribution (PLACKETT, 1965). MOLENBERGHS and LESAFFRE (1994) extended this model to multivariate ordinal outcomes. They generalize the bivariate Plackett distribution in order to establish the multivariate cell probabilities. Their method involves solving polynomials of high degree and computing their derivatives. LANG and AGRESTI (1994) exploit the equivalence between direct modelling and imposing restrictions on the multinomial probabilities, using undetermined Lagrange multipliers. Alternatively, the cell probabilities can be fitted using a Newton iteration scheme, as suggested by GLONEK and MCCULLAGH (1995).

However, even though a variety of flexible models exist, maximum likelihood can be unattractive due to excessive computational requirements, especially when high dimensional vectors of correlated data arise. As a consequence, alternative methods have been in demand. LIANG and ZEGER (1986) proposed so-called generalized estimating equations (GEE1), which require only the correct specification of the univariate marginal distributions, provided one is willing to adopt `working' assumptions about the association structure. They estimate the parameters associated with the expected value of an individual's vector of binary responses and phrase the working assumptions about the association between pairs of outcomes in terms of marginal correlations. PRENTICE (1988) extended their results to allow joint estimation of probabilities and pairwise correlations.
LIPSITZ, LAIRD and HARRINGTON (1991) modified the estimating equations of PRENTICE (1988) to allow modelling of the association through marginal odds ratios rather than marginal correlations. When adopting GEE1, one does not use information about the association structure to estimate the main effect parameters. As a result, it can be shown that GEE1 yields consistent main effect estimators, even when the association structure is misspecified. However, severe misspecification may seriously affect the efficiency of the GEE1 estimators. In addition, GEE1 should be avoided when some scientific interest is placed on the association parameters. A second order extension of these estimating equations (GEE2), which includes the marginal pairwise association as well, has been studied by LIANG, ZEGER and QAQISH (1992). They note that GEE2 is nearly fully efficient, though bias may occur in the estimation of the main effect parameters when the association structure is misspecified.

The results of applying the maximum likelihood and GEE2 methods for the Bahadur model to the DEHP data are given in Table 6. They are not directly comparable with the parameters of the conditional model. The intercepts correspond to a low baseline malformation rate. The dose parameters show a significant dose trend in all cases.

LE CESSIE and VAN HOUWELINGEN (1994) suggested approximating the true likelihood by means of a pseudo-likelihood function that is easier to evaluate and to

maximize. They replace the likelihood contribution $f(y_{i1}, \ldots, y_{in_i})$ by the product of all pairwise contributions $f(y_{ij}, y_{ik})$ $(1 \le j < k \le n_i)$.

Table 6. Parameter estimates for the Bahadur model

Par.        External       Visceral       Skeletal       Collapsed
Maximum likelihood estimates (standard errors)
$\beta_0$        (0.39)    4.42 (0.33)    4.67 (0.39)    3.83 (0.27)
$\beta_d$   5.15 (0.56)    4.38 (0.49)    4.68 (0.56)    5.38 (0.47)
$\beta_a$   0.11 (0.03)    0.11 (0.02)    0.13 (0.03)    0.12 (0.03)
GEE2 estimates (standard errors)
$\beta_0$        (0.37)    4.49 (0.36)    5.23 (0.40)    5.23 (0.40)
$\beta_d$   5.29 (0.55)    4.52 (0.59)    5.35 (0.60)    5.35 (0.60)
$\beta_a$   0.15 (0.05)    0.15 (0.06)    0.18 (0.02)    0.18 (0.02)

Grouping the outcomes for subject $i$ into a vector $Y_i$, the contribution of the $i$th cluster to the log pseudo-likelihood is $pl_i = \sum_{j<k} \ln f(y_{ij}, y_{ik})$ if it contains more than one observation, and an ordinary logistic regression contribution otherwise. For binary data, and taking the exchangeability assumption into account, the log pseudo-likelihood contribution $pl_i$ can be formulated as:

$$
pl_i = \binom{z_i}{2} \ln \pi_{i11} + \binom{n_i - z_i}{2} \ln (1 - 2\pi_{i10} + \pi_{i11}) + z_i (n_i - z_i) \ln (\pi_{i10} - \pi_{i11}), \qquad (4)
$$

where $\pi_{i11}$ denotes the bivariate probability of observing two successes and $\pi_{i10}$ is the marginal probability of observing one success. A non-equivalent specification of the pseudo-likelihood contribution for the $i$th cluster is $pl_i^* = pl_i / (n_i - 1)$. The factor $1/(n_i - 1)$ corrects for the fact that each response $Y_{ij}$ occurs $n_i - 1$ times in the $i$th contribution to the PL, and it ensures that the PL reduces to the full likelihood under independence.

GEYS, MOLENBERGHS and LIPSITZ (1998) explore the connection between these pseudo-likelihoods and generalized estimating equations for marginally specified odds ratio models. They show under which conditions both PL approaches coincide and study the general differences. The relative merits of both methods in terms of computational ease and relative efficiency are assessed.
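A minimal numerical sketch of the pairwise contribution (4) and of its rescaled version follows. It assumes that the pairwise association is expressed through a marginal odds ratio $\psi$ and that the joint success probability is obtained from the standard bivariate Plackett formula; the function names and numerical inputs are illustrative assumptions rather than the authors' implementation.

```python
import math


def plackett_pi11(pi, psi):
    """Joint success probability of two exchangeable binary outcomes with
    common marginal probability pi and marginal odds ratio psi."""
    if abs(psi - 1.0) < 1e-12:
        return pi * pi  # independence
    a = 1.0 + 2.0 * pi * (psi - 1.0)
    disc = a * a + 4.0 * psi * (1.0 - psi) * pi * pi
    return (a - math.sqrt(disc)) / (2.0 * (psi - 1.0))


def pl_cluster(z, n, pi10, pi11):
    """Pairwise log pseudo-likelihood contribution (4) for z successes out of n."""
    if n == 1:  # single observation: ordinary logistic regression contribution
        return z * math.log(pi10) + (1 - z) * math.log(1.0 - pi10)
    total = 0.0
    if z >= 2:  # pairs with two successes
        total += math.comb(z, 2) * math.log(pi11)
    if n - z >= 2:  # pairs with two failures
        total += math.comb(n - z, 2) * math.log(1.0 - 2.0 * pi10 + pi11)
    if 0 < z < n:  # discordant pairs
        total += z * (n - z) * math.log(pi10 - pi11)
    return total


def pl_star_cluster(z, n, pi10, pi11):
    """Rescaled contribution pl_i / (n_i - 1); unchanged for clusters of size one."""
    if n == 1:
        return pl_cluster(z, n, pi10, pi11)
    return pl_cluster(z, n, pi10, pi11) / (n - 1)


if __name__ == "__main__":
    pi, psi = 0.2, 3.0  # illustrative marginal probability and odds ratio
    pi11 = plackett_pi11(pi, psi)
    print(pi11, pl_cluster(3, 10, pi, pi11), pl_star_cluster(3, 10, pi, pi11))
```

Setting $\psi = 1$ gives $\pi_{i11} = \pi_{i10}^2$, and the rescaled contribution then equals the binomial log-likelihood kernel $z_i \ln \pi_{i10} + (n_i - z_i) \ln (1 - \pi_{i10})$, in line with the remark that the factor $1/(n_i - 1)$ recovers the full likelihood under independence.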
Table 7 shows the parameter estimates obtained by fitting a marginal odds ratio model to the DEHP data (collapsed outcome only), using pseudo-likelihood as well as GEE methods. The estimates obtained by either the pseudo-likelihood or the generalized estimating equations approach are comparable. Because main interest is focused on the dose effect, $pl^*$ was used rather than $pl$. Dose effects and association parameters are always significant, except for the GEE1 association

14 332 G. Molenberghs and H. Geys Table 7. Pseudo-likelihood, GEE2 and GEE1 estimates (standard errors) for the marginal odds ratio model (collapsed outcome) Method â 0 â d â a PL 3.98 (0.30) 5.57 (0.61) 1.11 (0.27) GEE (0.25) 5.06 (0.51) 0.97 (0.23) GEE (0.31) 5.79 (0.62) 0.41 (0.34) estimates. The GEE1 standard error for â a is much larger than for its PL and GEE2 counterparts; the GEE2 standard error is the smallest. 3.3 Marginalized random-effects model In random-effects models, the intracluster correlation is assumed to arise from natural heterogeneity in the parameters across litters. SKELLAM (1948) assumes the random success probability P i in cluster i to follow a beta distribution with mean ð i and, given P i, the outcomes within the ith cluster follow a binomial distribution. This results in the beta-binomial model with marginal distribution B(ði (r 1 1) z i,(1 ð i )(r 1 1) (n i z i )) f (z i jð i, r) ˆ B(ð i (r 1 1), (1 ð i )(r 1 1)) ni z i where B(:, :) denotes the beta function. Note that the beta-binomial model and the Bahadur model have the same rst and second order moments and hence they both feature the intraclass correlation coef cient r as a measure of association. Table 8 gives parameter estimates for the beta-binomial model, applied to the DEHP study. Bahadur (ML and GEE2) and beta-binomial parameters have the same interpretation. The intercepts â 0 and dose effect â d parameters have similar numerical values but the situation is slightly different for â a. The beta-binomial estimate for â a is typically about double the corresponsing Bahadur maximum likelihood estimate. This is due to range restrictions on â a in the Bahadur model. AERTS, DECLERCK, and MOLENBERGHS (1997) and MOLENBERGHS, DECLERCK, and AERTS (1998) compared the conditional model, the Bahadur model, and the betabinomial model for parameter estimation, hypothesis testing, and safe dose determination. 
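The beta-binomial marginal distribution in terms of the mean $\pi_i$ and the intraclass correlation $\rho$ can be sketched numerically as follows. This is only an illustration under the stated parameterization; the function names are assumptions introduced here, and the log-gamma function is used to evaluate the beta function stably.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(z, n, pi, rho):
    """P(Z = z) for a cluster of size n with mean pi and intraclass correlation rho.

    Uses the beta parameters a = pi * (1/rho - 1) and b = (1 - pi) * (1/rho - 1).
    Requires 0 < pi < 1 and 0 < rho < 1.
    """
    s = 1.0 / rho - 1.0
    a, b = pi * s, (1.0 - pi) * s
    log_choose = math.lgamma(n + 1) - math.lgamma(z + 1) - math.lgamma(n - z + 1)
    return math.exp(log_choose + log_beta(a + z, b + n - z) - log_beta(a, b))


# Illustrative values only (not estimates from the DEHP study).
n, pi, rho = 10, 0.15, 0.2
probs = [beta_binomial_pmf(z, n, pi, rho) for z in range(n + 1)]
mean = sum(z * p for z, p in enumerate(probs))
var = sum((z - mean) ** 2 * p for z, p in enumerate(probs))
```

Under this parameterization the mean is $n_i \pi_i$ and the variance is $n_i \pi_i (1 - \pi_i)\{1 + (n_i - 1)\rho\}$, so $\rho$ acts as the overdispersion relative to the binomial model.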
These authors concluded that the conditional model is computationally faster and more stable, while the beta-binomial model has readily interpretable parameters. In both cases, the likelihood ratio test for no dose effect has satisfactory behaviour. The Bahadur model is hard to use, both from the computational viewpoint and because of parameter space restrictions (DECLERCK, AERTS, and MOLENBERGHS, 1998).

Table 8. Maximum likelihood estimates (standard errors) for the beta-binomial model

Par.         External       Visceral       Skeletal       Collapsed
$\beta_0$    [...] (0.42)   4.38 (0.36)    4.88 (0.44)    3.83 (0.31)
$\beta_d$    5.20 (0.59)    4.42 (0.54)    4.92 (0.63)    5.59 (0.56)
$\beta_a$    0.21 (0.09)    0.22 (0.09)    0.27 (0.11)    0.32 (0.10)

3.4 Cluster-specific modelling

Population-averaged (PA) models are commonly used in standard teratology studies with only cluster-level covariates. Nevertheless, the effects of individual-level exposures can also be estimated, but their interpretations are then based on the overall population. In contrast, in cluster-specific models, explicit regression adjustments are made for cluster effects and hence all parameters are interpreted as within-cluster effects. Clearly, the choice between one or the other modelling approach depends primarily on the scientific question that needs to be answered. If interest lies in overall effects of exposure on response, population-averaged models are most appropriate. In contrast, if interest lies in within-cluster comparisons, cluster-specific approaches are most appropriate. Population-averaged models do not explicitly control for cluster effects and therefore within-litter differences may be confounded by within-litter variation due to unmeasured genetic and environmental factors (TEN HAVE and HARTZEL, 1995).

Within the class of cluster-specific models, one can study a mixed-effects logistic model as an alternative way of accounting for intra-litter heterogeneity, as well as a conditional likelihood method. In the mixed-effects logistic procedure, cluster effects are accommodated by assuming that they are realizations of a random variable and integrating over their distribution. With conditional likelihood, one conditions on the sufficient statistics for the cluster-specific effects (TEN HAVE, LANDIS and WEAVER, 1995; CONAWAY, 1989). One should however bear in mind that it is not always appropriate to compare the results obtained with both approaches.
NEUHAUS and KALBFLEISCH (1998) show that conditional likelihood methods estimate purely within-cluster covariate effects, whereas mixture model approaches estimate a weighted average of between- and within-cluster covariate effects. Therefore, in practice, mixed-effects logistic models and conditional logistic models may estimate different types of effects and are then not comparable. Only when within- and between-cluster covariate effects are the same do both approaches estimate the same effects, with improved efficiency of the mixed-effects approach over the conditional logistic approach.

Let us first consider the mixed-effects logistic models, where the intercept terms $b_i$ are allowed to vary from cluster to cluster, according to a normal distribution:

$$\mathrm{logit}\, P(Y_{ik} = 1 \mid b_i, x_{ik}) = x_{ik}\beta + b_i. \qquad (6)$$

In this formulation, $x_{ik}$ denotes the $k$th row of the design matrix $X_i$. The regression parameters ($\beta$) in this CS mixed-effects logistic model measure the change in the conditional logit of the probability of response with a unit increase in the corresponding covariates for individuals at the same random-effects level (e.g. within a cluster with only individual-level covariates). The association between littermates is induced by the random intercept. Because cluster sizes for developmental toxicology studies are relatively small, more complex random-effect structures can seldom be addressed from a practical perspective. GEYS, MOLENBERGHS and WILLIAMS (2001a) used a direct maximum likelihood method using numerical integration, such as implemented in, for example, the MIXOR software package (HEDEKER and GIBBONS, 1993).

Table 9 shows the parameter estimates, with p-values, for the mixed-effects logistic model (MIXLOG) and the compound symmetry model (CSYM), the latter belonging to the class of PA models. Notice the observed `shrinkage effect', which is in agreement with findings of NEUHAUS and JEWELL (1993). One exception is formed by the OPT outcome, for which the correlation parameter was estimated negative. For all outcomes there is evidence of a significant effect of the cumulative exposure ($dt$) and a significant effect of duration of exposure at temperatures above normal body temperature ($t$). Furthermore, the parameter estimate for $t$ is negative, indicating that shorter durations of the same cumulative exposure cause more developmental damage than longer ones.

Table 9 also shows the conditional logistic regression (CONDLOG) parameter estimates. All cluster-level effects are conditioned out. Therefore we cannot obtain parameter estimates for the intercepts. Clearly, there is a large discrepancy between the MIXLOG and CONDLOG parameter estimates, especially for the OPT and OLF responses. NEUHAUS and KALBFLEISCH (1998) note that a covariate has both a between-cluster component, which may be summarized in terms of $\bar{x}_i$, the cluster mean, and a within-cluster component $x_{ik} - \bar{x}_i$. The CONDLOG approach estimates the pure within-cluster covariate effect of $x_{ik} - \bar{x}_i$. However, the MIXLOG approach estimates the effect of $x_{ik}$. Therefore, the results of CONDLOG and MIXLOG are comparable only under the assumption of common between- and within-cluster covariate effects, in which case the MIXLOG approach is more efficient.
Otherwise, both procedures yield discrepant results and comparison of standard errors or statistical significance is not relevant.

Table 9. Heatshock study: parameter estimates (standard errors; p-values) for the mixed effects logistic (MIXLOG), compound symmetry (CSYM) and conditional logistic (CONDLOG) models

Outcome  Par.          MIXLOG               CSYM                 CONDLOG
MBN      $\beta_0$     [...] (0.23;0.00)    1.82 (0.21;0.00)
         $\beta_t$     4.23 (1.52;0.01)     3.97 (1.66;0.02)     4.64 (2.55;0.07)
         $\beta_{dt}$  6.38 (1.60;0.00)     5.99 (1.69;0.00)     6.84 (2.63;0.01)
OPT      $\beta_0$     [...] (0.29;0.00)    2.47 (0.24;0.00)
         $\beta_t$     3.68 (1.47;0.04)     3.73 (1.67;0.03)     1.46 (3.04;0.63)
         $\beta_{dt}$  5.60 (1.36;0.00)     5.65 (1.66;0.00)     3.96 (3.01;0.19)
OLF      $\beta_0$     [...] (0.32;0.00)    1.56 (0.22;0.00)
         $\beta_t$     5.70 (1.93;0.01)     4.71 (1.74;0.01)     3.40 (2.96;0.25)
         $\beta_{dt}$  8.06 (1.95;0.00)     6.55 (1.77;0.00)     6.30 (3.04;0.04)
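To make the numerical integration behind the random-intercept logistic model (6) concrete, the sketch below integrates the conditional likelihood of one cluster over a normal random intercept. It uses a plain trapezoidal rule on a finite grid as a stand-in; this is an assumption made for illustration and is not the algorithm of the MIXOR package. The function names, the grid settings and the linear predictors passed in are hypothetical.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def cluster_marginal_likelihood(y, eta, sigma, n_grid=401, width=8.0):
    """Marginal likelihood of one cluster under a random-intercept logistic model.

    y     : list of 0/1 outcomes for the littermates
    eta   : list of fixed-effect linear predictors x_ik * beta
    sigma : standard deviation of the normal random intercept b_i
    The integral over b_i is approximated by the trapezoidal rule
    on [-width * sigma, width * sigma] with n_grid points.
    """
    lo = -width * sigma
    h = 2.0 * width * sigma / (n_grid - 1)
    total = 0.0
    for g in range(n_grid):
        b = lo + g * h
        density = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        conditional = 1.0
        for y_k, eta_k in zip(y, eta):
            p = expit(eta_k + b)
            conditional *= p if y_k == 1 else 1.0 - p
        weight = 0.5 if g in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * conditional * density * h
    return total


# Marginal (population-averaged) success probability for eta = -2, sigma = 1.
marginal_p = cluster_marginal_likelihood([1], [-2.0], 1.0)
```

In this illustration `marginal_p` exceeds `expit(-2.0)`, i.e. the marginal probability is pulled towards 0.5 relative to the conditional probability at $b_i = 0$, which is the mechanism behind population-averaged effects being attenuated relative to cluster-specific ones.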

For the heatshock studies the assumption of common between- and within-cluster covariate effects was satisfied for the MBN response. That explains the similarity in MIXLOG and CONDLOG parameter estimates for that response and the increased efficiency of MIXLOG as opposed to the CONDLOG method. Where we found strongly significant effects for $c$ and $h$ by the MIXLOG and PA approaches, we now observe a reduction in statistical significance of the CONDLOG estimates. This is in agreement with the results of NEUHAUS and LESPERANCE (1996), summarized in Section 3. The cumulative exposure and duration of exposure at `positive increases' of temperature are highly correlated (correlation coefficient = 97%) and moreover the cluster sizes in the heatshock study are relatively small (mean cluster size is 5).

In contrast, for OPT and OLF, the above assumption was not satisfied, explaining the large discrepancy between MIXLOG and CONDLOG estimates. A comparison of standard errors or statistical significance is thus not appropriate here, unless we fit a mixed-effects logistic model with separate parameters for the between- and within-cluster covariate components. The within-cluster covariate effect estimates thus obtained for OPT are [...] (s.e. = 2.647) and 1.255 (s.e. = 2.779) for cumulative exposure $c$ and high temperature $h$ respectively. Similarly, we found [...] (s.e. = 3.044) and 2.839 (s.e. = 2.968) for the within-cluster covariate effects of $c$ and $h$ on OLF. Clearly, these estimates are again similar to the CONDLOG estimates, but not more efficient.

4 Risk assessment

Risk assessment can be defined as (ROBERTS and ABERNATHY, 1996) `the use of available information to evaluate and estimate exposure to a substance and its consequent adverse health effects'. The ultimate goal in the risk assessment process is to determine a safe level of exposure.
Traditionally, quantitative risk assessment in developmental toxicology has been based on the NOAEL, or No Observable Adverse Effect Level, which is the dose immediately below that deemed statistically or biologically significant when compared with controls. The NOAEL, however, has been criticized for its poor statistical properties (see for example, WILLIAMS and RYAN, 1996), so that attention has turned to more formal dose-response models.

The standard approach requires the specification of an adverse event, along with $r(d)$ representing the probability that this event occurs at dose level $d$. For developmental toxicity studies where offspring are clustered within litters, there are several ways to define the concept of an adverse effect. First, one can state that an adverse effect has occurred if a particular offspring is abnormal (fetus based). Alternatively, one might conclude that an adverse effect has occurred if at least one offspring from the litter is affected (litter based). Based on this probability, a common measure for the excess risk over background is defined as

$$r^*(d) = r(d) - r(0) \qquad (7)$$

or as

$$r^*(d) = \frac{r(d) - r(0)}{1 - r(0)}, \qquad (8)$$

where definition (8) puts greater weight on outcomes with large background risks. The benchmark dose (BMD$_q$) is then defined as the dose satisfying $r^*(d) = q$, where $q$ corresponds to the pre-specified level of increased response and is typically specified as 0.01, 1, 5 or 10%.

In practice, calculation of the BMD follows several steps. After choosing and fitting an appropriate dose-response model, the excess risk function is solved for the dose, $d$, that yields $r^*(d) = q$. Since the dose-response curve is estimated from data and has inherent variability, the BMD is itself only an estimate of the true dose that would result in this level of excess risk. The final step therefore consists of acknowledging this sampling uncertainty for the model on which the BMD$_q$ is based, by replacing the BMD$_q$ by its lower confidence limit (WILLIAMS and RYAN, 1996).

Several approaches have been proposed. The conventional approach was to use a Wald based method:

$$\widehat{\mathrm{BMDL}}_q = \widehat{\mathrm{BMD}}_q - 1.645 \sqrt{\widehat{\mathrm{var}}(\widehat{\mathrm{BMD}}_q)}.$$

However, it turned out that this approach suffers from severe drawbacks: it may yield negative lower limits, can yield unstable estimates, etc. (CRUMP and HOWE, 1983; KREWSKI and VAN RYZIN, 1981; CATALANO, RYAN and SCHARFSTEIN, 1994). Alternatively, an upper limit for the risk function can be computed, and thus the dose that corresponds to a $q$% increased response above background is determined from this upper limit curve by solving

$$\hat{r}^*(d) + 1.645 \sqrt{\widehat{\mathrm{var}}(\hat{r}^*(d))} = q,$$

where the variance of the estimated increased risk function $\hat{r}^*(d)$ is estimated as

$$\widehat{\mathrm{var}}(\hat{r}^*(d)) = \left.\left(\frac{\partial r^*(d)}{\partial \beta}\right)^{T} \widehat{\mathrm{cov}}(\hat{\beta}) \left(\frac{\partial r^*(d)}{\partial \beta}\right)\right|_{\beta = \hat{\beta}}$$

and where $\widehat{\mathrm{cov}}(\hat{\beta})$ is the estimated covariance matrix of $\hat{\beta}$. The resulting dose level is referred to as the lower effective dose (LED$_q$) (KIMMEL and GAYLOR, 1988). CRUMP and HOWE (1983) recommend using the asymptotic distribution of the likelihood ratio (if available).
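The point-estimation step of the benchmark dose calculation, solving $r^*(d) = q$ for $d$, can be sketched as follows. The sketch assumes a simple fetus-based logistic dose-response curve $r(d) = \mathrm{expit}(\beta_0 + \beta_d d)$ with $\beta_d > 0$ and uses the excess risk definition (8); the parameter values, function names and the bisection solver are illustrative assumptions, not the models fitted in this paper, and no confidence limit is computed.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Logit transform."""
    return math.log(p / (1.0 - p))


def extra_risk(d, beta0, beta_d):
    """Excess risk (r(d) - r(0)) / (1 - r(0)) under a logistic dose-response curve."""
    r0 = expit(beta0)
    rd = expit(beta0 + beta_d * d)
    return (rd - r0) / (1.0 - r0)


def benchmark_dose(q, beta0, beta_d, d_max=1.0, tol=1e-12):
    """Solve extra_risk(d) = q on [0, d_max] by bisection (requires beta_d > 0)."""
    if extra_risk(d_max, beta0, beta_d) < q:
        raise ValueError("excess risk q is not reached on [0, d_max]")
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if extra_risk(mid, beta0, beta_d) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical parameter values, with dose rescaled to [0, 1].
beta0, beta_d, q = -4.0, 5.0, 0.01
bmd = benchmark_dose(q, beta0, beta_d)

# For this particular model the equation can also be inverted in closed form.
r0 = expit(beta0)
bmd_closed = (logit(r0 + q * (1.0 - r0)) - beta0) / beta_d
```

Bisection is used here only because it carries over unchanged to dose-response models without a closed-form inverse, such as litter-based risks derived from the beta-binomial or conditional models.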
According to this likelihood ratio method, an approximate $100(1 - \alpha)\%$ lower limit for the BMD, denoted by BMD(1), corresponding to an excess risk of $q$ is defined as

$$\min\{d(\beta) : r(d; \beta) = q \ \text{over all } \beta \text{ such that } 2(l(\hat{\beta}) - l(\beta)) \le \chi^2_p(1 - \alpha)\},$$

where $l$ denotes the log-likelihood and $p$ is the number of model parameters. A second approach, denoted BMD(2), is based on the profile likelihood method (MORGAN 1992, Section 2.7.3). First, construct a profile likelihood based confidence interval for the dose effect parameter $\beta_d$. Secondly, transform this interval into an interval for $d$ and check that the transformation is monotonic. AERTS, DECLERCK and MOLENBERGHS (1997) compare the different lower limits for the BMD and show that, in general, BMD(1) yields lower results than BMD(2). Furthermore, they note that for conditionally specified models, the transformation is not monotonic, and hence the BMD(2) should not be applied to such models.

In Table 10, BMD(1) and BMD(2), labelled VSD(1) and VSD(2) there, are applied to the DEHP data. In general, VSD(1) yields lower results than VSD(2), and the values obtained with the conditional model are somewhat higher than for both other models.

A variation on this theme, suggested by many authors (CHEN and KODELL, 1989; RYAN, 1992), first determines a lower confidence limit, e.g. corresponding to an excess risk of 1%, and then linearly extrapolates it to a BMD. The main advantage quoted for this procedure is that the determination of a BMD is less model dependent.

5 Goodness-of-fit for likelihood based models with clustered binary data

In order to evaluate how effective models are in describing the outcome variable, we need to assess the quality of their fit. LE CESSIE and VAN HOUWELINGEN (1995) considered a goodness-of-fit test for generalized linear models with canonical link function and known dispersion parameter, based on the score test for extra variation in a random effects model. LIPSITZ, FITZMAURICE and MOLENBERGHS (1996) note that for the special case of a binary response, several methods for assessing the goodness-of-fit of binary logistic regression models have been proposed. All these methods are based on the notion of partitioning the covariate space into groups or regions. TSIATIS (1980) proposed a goodness-of-fit statistic for the logistic regression model for a given partition of the covariate space, but he did not provide a method for partitioning the covariate space into suitable regions.
HOSMER and LEMESHOW (1989) proposed the partition of subjects into groups or regions on the basis of the percentiles of the predicted probabilities from the fitted logistic regression model.

Table 10. Effective doses and lower confidence limits for DEHP study. Entirely model based computation. All quantities shown should be divided by $10^4$. Columns: External, Visceral, Skeletal, Collapsed. Rows: Bahadur (ED, VSD(1), VSD(2)); BB (ED, VSD(1), VSD(2)); Cond. (ED, VSD(1)).

To construct a goodness-of-fit measure for clustered binary data, we adapted the methods proposed by HOSMER and LEMESHOW (1989) and TSIATIS (1980). Following these authors, groups are constructed according to deciles of the predicted malformation probabilities in each temperature-duration combination. Given this partition, the goodness-of-fit statistic is formulated by defining $G - 1$ group indicators (in our example, $G = 10$):

$$I^g_{ik} = \begin{cases} 1 & \text{if } \hat{\pi}_{ik} \text{ is in region } g \ (g = 1, \ldots, G - 1), \\ 0 & \text{otherwise,} \end{cases}$$

where $\hat{\pi}_{ik}$ is the estimated malformation probability of the $k$th individual within the $i$th cluster, calculated from the model that takes into account the clustering between the individuals. For example, in the context of the heatshock studies, the following model could be considered:

$$\ln\left(\frac{\pi_{ik}}{1 - \pi_{ik}}\right) = \beta_0 + \beta_t t_{ik} + \beta_{dt}\, dt_{ik} + \sum_{g=1}^{G-1} I^g_{ik} \gamma_g.$$

The association is modelled similarly as in the model for which the goodness-of-fit is assessed. If the mean structure in the original model is correctly specified, then $\gamma_1 = \cdots = \gamma_{G-1} = 0$. MOORE and SPRUILL (1975) note that, even though $I^g_{ik}$ is based on random quantities $\hat{\pi}_{ik}$, the partition can be treated asymptotically as if it were based on the true $\pi_{ik}$. To test the goodness-of-fit of the model, one can use either a likelihood ratio, Wald or score statistic to test $H_0 : \gamma_1 = \cdots = \gamma_{G-1} = 0$. For large samples, each of these statistics has approximately a $\chi^2$ distribution with $G - 1$ degrees of freedom, if the model under the null hypothesis is correctly specified. GEYS, MOLENBERGHS and WILLIAMS (2001a) suggest the use of the likelihood ratio statistic, since it is simple to calculate and is fairly powerful.

For large samples, all estimated expected frequencies should typically be greater than 1 and at least 80% should be greater than 5. Otherwise, one can collapse some frequencies, reducing the number of groups $G$ (LIPSITZ, FITZMAURICE and MOLENBERGHS, 1996). HOSMER and LEMESHOW (1989) noted that $G = 6$ should be a minimum, since a test statistic calculated from fewer than six groups will usually have low power.
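The construction of the $G - 1$ group indicators from predicted probabilities can be sketched as below. This is a simplified illustration: it forms the groups by rank over all observations at once rather than within each temperature-duration combination, it breaks ties between equal predicted probabilities by their original order, and the function name is an assumption introduced here. The resulting indicator columns would then be added to the linear predictor of the model under assessment.

```python
def group_indicators(pred, n_groups=10):
    """Return, for each observation, G - 1 indicators of its predicted-probability group.

    pred     : list of predicted probabilities (one per observation)
    n_groups : number of groups G (deciles for G = 10)
    Observations in the last (highest) group form the reference category
    and receive all-zero indicator rows.
    """
    n = len(pred)
    order = sorted(range(n), key=lambda i: pred[i])  # indices by increasing prediction
    group = [0] * n
    for rank, i in enumerate(order):
        group[i] = rank * n_groups // n  # group label in 0, ..., G - 1
    return [[1 if group[i] == g else 0 for g in range(n_groups - 1)]
            for i in range(n)]


# Toy predicted probabilities, for illustration only.
toy_pred = [i / 100 for i in range(100)]
toy_ind = group_indicators(toy_pred)
```

With 100 distinct predictions and $G = 10$, each indicator column flags exactly ten observations and the ten largest predictions act as the reference group.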
Note that in the goodness-of-fit assessment described above, correlation is essentially treated as a nuisance parameter and interest is focused on the relationship between the covariates and the probability of response. Recent work uncovered deficiencies of the goodness-of-fit tests based on the ones proposed by HOSMER and LEMESHOW (HOSMER, HOSMER, LEMESHOW and LE CESSIE, 1997). Decisions on model fit may depend more on choice of cutpoints than on lack-of-fit and their test statistic may have relatively low power with small sample sizes. Developing improved goodness-of-fit test statistics for likelihood based models for clustered binary data is a topic of further research.

6 Joint modelling of continuous and discrete outcomes

Developmental toxicity studies may seek to determine the effects of dose on fetal weight (continuous) and malformation incidence (binary) simultaneously, as both
