Mixture Cure Model with an Application to Interval Mapping of Quantitative Trait Loci

Abstract. When censored time-to-event data are used to map quantitative trait loci (QTL), the existence of nonsusceptible subjects entails extra challenges. If the heterogeneous susceptibility is ignored or inappropriately handled, we may either fail to detect the responsible genetic factors or find spuriously significant locations. In this article, an interval mapping method based on parametric mixture cure models is proposed, which takes nonsusceptible subjects into consideration. The proposed model can be used to detect QTL that are responsible for differential susceptibility and/or the time-to-event trait distribution. In particular, we propose a likelihood-based testing procedure with genome-wide significance levels calculated using a resampling method. The performance of the proposed method and the importance of considering the heterogeneous susceptibility are demonstrated by simulation studies and an application to survival data from an experiment on mice infected with Listeria monocytogenes.

Keywords: EM algorithm; Parametric proportional hazards model; QTL mapping; Time-to-event data.

1. Introduction

Mapping genes that underlie complex traits is of great interest (Lander and Schork, 1994; Glazier et al., 2002). In standard interval mapping of quantitative trait loci (QTL), the trait distribution is often modeled as a mixture of two (or more) normal components corresponding to two (or more) different genotypes at the putative QTL (Lander and Botstein, 1989; Zeng, 1993). Time-to-event data as quantitative trait values (e.g. age-at-onset of cancer, time-to-recurrence of tumor) have been used to identify various disease genes (Claus et al., 1990; Carter et al., 1992; Miki et al., 1994; Boyartchuk et al., 2001; Symons et al., 2002; among others). To study time-to-event traits, however, classical survival models may be more natural than the mixture of normals.
Indeed, the standard interval mapping methods for normally distributed and fully observed quantitative traits have been extended successfully to the time-to-event traits subject to random censoring

(e.g. Li and Thompson, 1997; Diao et al., 2004; Diao and Lin, 2005, among others). A further challenge that has not received adequate attention in the study of time-to-event traits is the issue of latent heterogeneous susceptibility. In a population consisting of susceptible and nonsusceptible individuals, all susceptible subjects would eventually experience the event of interest in the absence of censoring, while the nonsusceptible ones can be regarded as cured, i.e. not at risk of developing the particular event. For example, Boyartchuk et al. (2001) considered a data set consisting of the survival times of 116 female intercross mice after infection with Listeria monocytogenes. About 30% of the mice survived longer than 240 hours, to the end of the study, and might be considered cured or nonsusceptible from a biological point of view. In addition to good scientific evidence for the existence of a nonsusceptible subpopulation, data with heterogeneous susceptibility usually show heavy censoring at the end of the study: the corresponding Kaplan-Meier curve has a long non-zero tail, or the histogram shows a spike at the right end (Broman, 2003). When analyzing such genetic data, failure to account for the latent heterogeneous susceptibility might result in significant power loss in detecting the responsible genetic factor and/or lead to spuriously significant results (Farewell, 1982; Hodge and Elston, 1994; Hodge et al., 2001). Therefore, statistical methods that incorporate the mixed susceptibilities into the modeling and analysis are needed. For the aforementioned mouse survival data, Broman (2003) proposed a two-part model: a normal distribution for the log survival times of mice observed to fail and a point mass at the end of the study for observed survivors. This two-part model may be useful to detect QTL when there is one common administrative censoring time, as is the case for this particular data set.
Under general random censorship, such a two-part separation of subjects may not always be reasonable. Therefore, it is of great practical interest to develop general statistical methods applicable to QTL mapping for randomly censored time-to-event traits from a population of latent heterogeneous susceptibility. In this paper, we propose a parametric mixture cure model for QTL mapping when the primary trait is a randomly censored time-to-event trait from a population of mixed susceptibility. In the context of survival analysis, cure models have been developed to handle heterogeneous susceptibilities and are applicable under general

random censoring (Kuk and Chen, 1992; Sy and Taylor, 2000; Peng and Dear, 2000; Lu and Ying, 2004, among others). Cure models may also be viewed as a special case of the competing risks models with unobserved cured events (Fine, 1999). To adapt these cure models for mapping QTL, we need to overcome a few challenges. In particular, we must account for the missing covariates due to the fact that the genotypes of the putative QTL are unknown. Furthermore, we need to identify appropriate genome-wide critical values for the proposed test statistics at certain nominal levels. The interval mapping tests are usually carried out at multiple locations along the chromosomes and the test statistics are typically not independent. Thus obtaining appropriate genome-wide critical values is crucial in the context of genome-wide QTL mapping.

The paper is organized as follows. In Section 2, we introduce the parametric mixture cure model and propose an EM-based likelihood ratio test (LRT). When the log-normal model is used for event times of susceptible subjects, the proposed cure model generalizes the two-part model of Broman (2003) to allow for latent susceptibility status under general random right censoring. When the parametric proportional hazards model is used for the time-to-event trait of susceptible subjects, the proposed cure model extends the parametric proportional hazards (PPH) model of Diao et al. (2004) to deal with heterogeneous susceptibility. Since the proposed cure model characterizes the QTL effects on susceptibility and/or the survival distribution of susceptible subjects, it can be used to test such effects separately or simultaneously. It can also be used to account for potential effects of other risk factors by incorporating covariates into the regression model in a natural way. The issue of genome-wide significance level is also addressed in Section 2. Recently, Diao et al. (2004), Zou et al.
(2004), and Lin (2005) introduced an efficient resampling method for assessing the genome-wide significance level. The resampling method is computationally less intensive and applicable to many complex genetic models (Zou et al., 2004). Therefore, we adopt this resampling method to obtain the genome-wide thresholds for the proposed likelihood ratio tests at certain nominal levels. In Section 3, the performance of our proposed methods and the importance of considering the heterogeneous susceptibility are demonstrated by simulation studies and an application to survival times of intercrossed mice following infection with Listeria

monocytogenes (Boyartchuk et al., 2001; Broman, 2003). Some concluding remarks are given in Section 4.

2. Methods

In this section, we first propose the general parametric mixture cure model. We then formulate the likelihood-based genome-wide tests and discuss the determination of genome-wide thresholds.

2.1. Notation and Models

Consider a sample of $n$ individuals with mixed susceptibility. Let $T_i$ denote the potential time-to-event trait for individual $i$, $i = 1, \ldots, n$. In addition, $T_i^* < \infty$ stands for the failure time of a susceptible subject and $\eta_i$ takes value 1 or 0, indicating whether the $i$th subject is susceptible or not. Thus $T_i$ has the following decomposition:

$$T_i = \eta_i T_i^* + (1 - \eta_i)\,\infty, \tag{1}$$

where the multiplication of 0 and $\infty$ is defined to be 0. The observation on the trait value of the $i$th individual consists of two components: the observed event time $Y_i = \min(T_i, C_i)$ and the censoring indicator $\delta_i = I(T_i \le C_i)$, where $C_i$ denotes a random censoring time and is assumed to be noninformative (Kalbfleisch and Prentice, 2002). Note that $\delta_i = 1$ implies $\eta_i = 1$, but $\eta_i$ is unobservable when $\delta_i = 0$. Thus, the susceptibility status is uncertain for censored subjects.

Suppose we have data on trait values and a set of genetic markers. Let $M_i$ denote the multiple marker genotype information of the $i$th subject. We consider a putative QTL with two alleles $Q$ and $q$ and denote its unknown QTL genotype by $G_i$. Here $G_i$ is coded as a dummy variable for all possible combinations of genotypes. For example, in an $F_2$ intercross design, $G_i$ can be recorded using a two-dimensional vector with three possible values, $(1, 0)$, $(0, 1)$ and $(0, 0)$, according to the genotypes $QQ$, $Qq$ and $qq$, respectively. In addition, let $Z_i$ denote other observed covariates of interest, such as environmental exposures, which are assumed to be independent of $G_i$. For a susceptible subject, i.e.
$\eta_i = 1$, its failure time $T_i^*$ is assumed to follow a parametric distribution $f(t \mid Z_i, G_i; \beta, \beta_g, \theta)$, where the parameter $\beta_g$ depicts the effects of the QTL on the

time-to-event trait distribution for susceptible subjects, $\beta$ indicates the corresponding effects of covariates and $\theta$ is the inherent distribution parameter of the parametric distribution $f$. More specifically, a linear regression model for QTL and covariate effects leads to a simple model $f(t \mid Z_i, G_i; \beta^T Z_i + \beta_g^T G_i, \theta)$. For the binary outcome of the susceptibility indicator $\eta_i$, it is natural to consider a logistic regression model

$$\mathrm{pr}(\eta_i = 1 \mid G_i, Z_i) = \frac{\exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)}, \tag{2}$$

where $\tilde{Z}_i = (1, Z_i^T)^T$ so that $\gamma$ contains an intercept term, and the regression parameter $\gamma_g$ represents the QTL effects on susceptibility.

Note that, at any putative QTL position other than a marker, the genotype information $G_i$ is unknown. A natural idea is to treat $G_i$ as missing data, which can be handled by an EM algorithm. The conditional probability of $G_i$ given the marker information $M_i$ at a specific location $d$ is denoted by $\mathrm{pr}(G_i \mid M_i; d)$. Under the assumption of no crossover interference and no genotyping errors, $\mathrm{pr}(G_i \mid M_i; d)$ is determined by the two flanking markers and the position of the QTL in the interval. For many experimental cross studies, explicit formulas for $\mathrm{pr}(G_i \mid M_i; d)$ are available in several books and papers, for example in Lynch and Walsh (1998, p. 435, equation (15.2)).

The complete data consist of $n$ independent copies of $C = \{Y, \delta, \eta, G, M, Z\}$, while the observed data consist of $n$ independent copies of $O = \{Y, \delta, M, Z\}$. Let $\mu = (\gamma, \gamma_g, \beta, \beta_g, \theta)$. The likelihood function of the complete data is constructed as follows:

$$L_C(\mu; d) = \prod_{i=1}^{n} \left\{ \frac{\exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)} \right\}^{\eta_i} \left\{ \frac{1}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)} \right\}^{1 - \eta_i} \times \prod_{i=1}^{n} \left[ \{ f(Y_i \mid Z_i, G_i; \beta, \beta_g, \theta) \}^{\delta_i} \{ 1 - F(Y_i \mid Z_i, G_i; \beta, \beta_g, \theta) \}^{1 - \delta_i} \right]^{\eta_i} \times \prod_{i=1}^{n} \mathrm{pr}(G_i \mid M_i; d), \tag{3}$$

where $F(t \mid Z_i, G_i; \beta, \beta_g, \theta) = \int_0^t f(s \mid Z_i, G_i; \beta, \beta_g, \theta)\,ds$ denotes the cumulative distribution function of the survival time for susceptible subjects.
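As an illustration of how the conditional genotype probabilities $\mathrm{pr}(G_i \mid M_i; d)$ depend on the two flanking markers, the following minimal sketch treats the simplest case. It makes assumptions that go beyond the text: it uses a backcross with two possible genotypes (coded 0/1) rather than the $F_2$ design, and Haldane's map function, which is consistent with the no-interference assumption. The function names `haldane_rf` and `backcross_qtl_prob` are hypothetical.

```python
import math


def haldane_rf(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def backcross_qtl_prob(left, right, pos_cm, left_cm, right_cm):
    """P(QTL genotype = 1 | flanking marker genotypes) in a backcross.

    left, right: marker genotypes coded 0/1; positions are in cM with
    left_cm <= pos_cm <= right_cm. Assumes no genotyping errors.
    """
    r1 = haldane_rf(pos_cm - left_cm)   # left marker to QTL
    r2 = haldane_rf(right_cm - pos_cm)  # QTL to right marker

    def joint(g):
        # P(QTL = g, right marker | left marker): a chain of two
        # independent recombination events under no interference.
        p_qtl = (1.0 - r1) if g == left else r1
        p_right = (1.0 - r2) if right == g else r2
        return p_qtl * p_right

    return joint(1) / (joint(0) + joint(1))
```

For example, with markers at 0 and 20 cM, `backcross_qtl_prob(1, 1, 10.0, 0.0, 20.0)` is close to 1 because a double recombinant is unlikely, while discordant flanking markers give probability 0.5 at the midpoint. The $F_2$ case follows the same logic with three genotypes; the formulas cited in the text should be used for it.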
It is straightforward to verify that, using the complete data likelihood (3), the observed data likelihood is a mixture of several components corresponding to different

genotypes and susceptibilities:

$$L(\mu; d) = \prod_{i=1}^{n} \left\{ \sum_{j=1}^{K} p_i(j) \left[ \{ f(Y_i \mid Z_i, G_j; \beta, \beta_g, \theta) \}^{\delta_i} \{ 1 - F(Y_i \mid Z_i, G_j; \beta, \beta_g, \theta) \}^{1 - \delta_i} \frac{\exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_j)}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_j)} + \frac{1 - \delta_i}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_j)} \right] \right\}, \tag{4}$$

where $p_i(j) = \mathrm{pr}(G_i = G_j \mid M_i; d)$, $K$ denotes the number of possible genotypes of the putative QTL, and $\{G_j\}$ denote the coded values of the genotypes. For example, for an $F_2$ intercross population, we may use $\{G_1, G_2, G_3\} = \{(1, 0), (0, 1), (0, 0)\}$.

Standard interval mapping methods (Lander and Botstein, 1989; Zeng, 1993) examine the existence of a QTL along the chromosome at a specified spacing, e.g. 1 or 2 centimorgans (cM), using a likelihood ratio test (LRT). In all our numerical studies, we evaluate the LRT at a spacing of 1 cM. To construct such a profile of the LRT over the regions of the chromosome, the maximum likelihood estimate (MLE) $\hat{\mu}$ under the alternative model and the restricted MLE $\tilde{\mu}$ under the null hypothesis need to be calculated at each given position $d$.

2.2. Hypotheses and LRT

Under the proposed parametric mixture cure model, the QTL has two types of effects on the trait distribution: $\gamma_g$ is the long-term effect on susceptibility and $\beta_g$ is the short-term effect on the survival of the susceptible subjects. Therefore, the proposed mixture cure model can be used to test the following hypotheses:

No overall QTL effects, $H_0: \gamma_g = 0$ and $\beta_g = 0$ vs. $H_1: \gamma_g \neq 0$ or $\beta_g \neq 0$;
No QTL effects on susceptibility, $H_{0\gamma}: \gamma_g = 0$ vs. $H_{1\gamma}: \gamma_g \neq 0$;
No QTL effects on the survival of susceptible subjects, $H_{0\beta}: \beta_g = 0$ vs. $H_{1\beta}: \beta_g \neq 0$.

To test the above hypotheses, the LRT statistic $LR(d) = 2 \ln\{L(\hat{\mu}; d)/L(\tilde{\mu}; d)\}$ is calculated at each location $d$. Under the null hypothesis $H_0$, the MLE $\tilde{\mu}$ does not depend on the testing location $d$, so $\tilde{\mu}$ needs to be calculated only once for each data set. But $\hat{\mu}$, $LR(d)$ and $\tilde{\mu}$ under the other null hypotheses do depend on the location $d$ since $\mathrm{pr}(G_i \mid M_i; d)$ varies along $d$.
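To make the mixture structure of the observed-data likelihood (4) concrete, here is a minimal sketch of one subject's log-likelihood contribution. It rests on several assumptions that are not fixed at this point in the text: a Weibull proportional hazards density for susceptible subjects with hypothetical parameters `theta1` and `theta2`, no covariates other than an intercept `gamma0`, the $F_2$ genotype coding given above, and genotype probabilities `p` supplied by the caller. The function names are illustrative only.

```python
import math

# F2 coding of the QTL genotypes QQ, Qq, qq, as in the text.
GENOTYPES = [(1, 0), (0, 1), (0, 0)]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loglik_contribution(y, delta, p, gamma0, gamma_g, beta_g, theta1, theta2):
    """Log of one subject's term in the observed-data likelihood (4).

    y: observed time; delta: 1 if the event was observed, 0 if censored;
    p: conditional genotype probabilities p_i(j), one per entry of GENOTYPES.
    Assumed Weibull baseline: hazard theta1*theta2*t**(theta2-1),
    cumulative hazard theta1*t**theta2.
    """
    base_hazard = theta1 * theta2 * y ** (theta2 - 1.0)
    base_cum_hazard = theta1 * y ** theta2
    total = 0.0
    for p_j, g in zip(p, GENOTYPES):
        pi_j = 1.0 / (1.0 + math.exp(-(gamma0 + _dot(gamma_g, g))))  # eq. (2)
        rel_risk = math.exp(_dot(beta_g, g))
        surv = math.exp(-base_cum_hazard * rel_risk)  # 1 - F for susceptibles
        if delta == 1:
            # An observed event implies the subject is susceptible.
            term = base_hazard * rel_risk * surv * pi_j
        else:
            # Censored: susceptible and still event-free, or nonsusceptible.
            term = surv * pi_j + (1.0 - pi_j)
        total += p_j * term
    return math.log(total)


def lod_score(loglik_alt, loglik_null):
    """LOD = log10 likelihood ratio = LR / (2 ln 10)."""
    return (loglik_alt - loglik_null) / math.log(10.0)
```

Summing `loglik_contribution` over subjects gives the log of (4) at one location; maximizing it with and without the constraints of a null hypothesis yields the two log-likelihoods passed to `lod_score`.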
We employ the EM algorithm (Dempster et al., 1977) to obtain the parameter estimates. In the EM algorithm, we need to calculate the conditional expectation of

$l_C(\mu; d) = \log L_C(\mu; d)$ in (3) with respect to the unobserved quantities $\{\eta_i, G_i\}$ given the current estimated parameter values and the observed data $O_i = \{Y_i, \delta_i, M_i, Z_i\}$. For example, consider the parametric proportional hazards mixture cure model in which the hazard function for the survival time of a susceptible subject is specified as

$$\lambda(t \mid G_i, Z_i) = \lambda_0(t; \theta) \exp(\beta^T Z_i + \beta_g^T G_i). \tag{5}$$

Denote the cumulative hazard function by $\Lambda_0(t; \theta) = \int_0^t \lambda_0(s; \theta)\,ds$. Then substituting the density and distribution functions in (3) by their counterparts specified by the proportional hazards model (5), simple algebraic manipulation yields

$$L_C(\mu; d) = \prod_{i=1}^{n} \left\{ \frac{\exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)} \right\}^{\eta_i} \left\{ \frac{1}{1 + \exp(\gamma^T \tilde{Z}_i + \gamma_g^T G_i)} \right\}^{1 - \eta_i} \times \prod_{i=1}^{n} \left\{ \lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_i) \right\}^{\delta_i \eta_i} e^{-\eta_i \Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_i)} \times \prod_{i=1}^{n} \mathrm{pr}(G_i \mid M_i; d). \tag{6}$$

Note that the conditional expectation of the complete data log-likelihood based on (6) can be written as a function of the conditional expectations of $\{\eta_i, G_i, \eta_i G_i\}$. Thus, in each E-step of the EM iteration, it suffices to compute the conditional expectations of these quantities given the current parameter values and the observed data.

In the $k$th step, the corresponding conditional expectations of $\{\eta_i, G_i, \eta_i G_i\}$, denoted by $\{E(\eta_i \mid O_i, \mu^{(k)}), E(G_i = G_j \mid O_i, \mu^{(k)}), E(\eta_i G_i = \eta_i G_j \mid O_i, \mu^{(k)})\}$, can be derived explicitly. To simplify the notation, the superscript $(k)$ of the parameters is suppressed in the following formulas for these conditional moments.

$$E(\eta_i \mid O_i, \mu^{(k)}) = \begin{cases} 1, & \delta_i = 1, \\ D_{i0}^{-1} \sum_{j=1}^{K} e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j)\, p_i(j), & \delta_i = 0, \end{cases}$$

$$E(G_i = G_j \mid O_i, \mu^{(k)}) = \begin{cases} D_{i1}^{-1}\, e^{\beta_g^T G_j} e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j)\, p_i(j), & \delta_i = 1, \\ D_{i0}^{-1} \left[ e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j) + \{1 - \pi_i(G_j)\} \right] p_i(j), & \delta_i = 0, \end{cases}$$

$$E(\eta_i G_i = \eta_i G_j \mid O_i, \mu^{(k)}) = \begin{cases} D_{i1}^{-1}\, e^{\beta_g^T G_j} e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j)\, p_i(j), & \delta_i = 1, \\ D_{i0}^{-1}\, e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j)\, p_i(j), & \delta_i = 0, \end{cases}$$

where $p_i(j) = \mathrm{pr}(G_i = G_j \mid M_i)$, $\pi_i(G_j) = \mathrm{pr}(\eta_i = 1 \mid G_i = G_j, \tilde{Z}_i)$ as defined in equation (2), and

$$D_{i1} = \sum_{j=1}^{K} e^{\beta_g^T G_j} e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j)\, p_i(j),$$

$$D_{i0} = \sum_{j=1}^{K} \left\{ e^{-\Lambda_0(Y_i; \theta) \exp(\beta^T Z_i + \beta_g^T G_j)} \pi_i(G_j) + 1 - \pi_i(G_j) \right\} p_i(j).$$

In the M-step, we obtain the arguments that maximize the expected log-likelihood. The EM algorithm then iterates until it converges.

The LOD score is defined as $\log_{10}\{L(\hat{\mu}; d)/L(\tilde{\mu}; d)\} = LR(d)/(2 \ln 10)$. Evaluation of the LOD score at each location yields a LOD profile over the chromosome. The location with the largest LOD score can be used as an estimate of the QTL location provided that this largest value exceeds the threshold of a certain significance level.

2.3. Genome-wide Threshold

Assessing the genome-wide significance level is challenging when the QTL is searched over the whole genome because the tests are performed at multiple locations and the test statistics are not independent. The point-wise significance level based on the $\chi^2$ approximation without the multiplicity correction is no longer appropriate. The Bonferroni correction becomes too conservative when the number of tests is large. Recently, Diao et al. (2004), Zou et al. (2004), and Lin (2005) proposed a novel numerical method for searching the genome-wide threshold using a resampling approach. The method is computationally feasible and is applicable to many genetic models. Zou et al. (2004) and Lin (2005) gave detailed discussions of the performance of this resampling method and comparisons with other competing methods (e.g. Rebai et al., 1995; Dupuis and Siegmund, 1999). We employ the resampling method to assess the genome-wide significance level. Using the well known asymptotic equivalence between the likelihood ratio test and the score test (Cox and Hinkley, 1974), the resampling approach computes the empirical threshold for the LRT by generating a large number of randomly perturbed score test statistics.
At each location d, the score test statistic is a sum of independent and identically distributed (i.i.d.) terms with mean zero, thus it is convenient to perturb

each term with an independent standard Gaussian random variable as discussed in Lin (2005). More specifically, let

$$U(\mu; d) = \sum_{i=1}^{n} U_i(\mu; d) \doteq \sum_{i=1}^{n} \partial l_i(\mu; d)/\partial \mu \tag{7}$$

denote the score function at a location $d$. To test different null hypotheses, the score functions corresponding to the parameters to be tested will be used to construct the test statistic. For example, to test $H_0: \gamma_g = 0$ and $\beta_g = 0$, the corresponding score functions are

$$U_{\gamma_g \beta_g}(\mu; d) \doteq \begin{pmatrix} \sum_{i=1}^{n} \partial l_i(\mu; d)/\partial \gamma_g \\ \sum_{i=1}^{n} \partial l_i(\mu; d)/\partial \beta_g \end{pmatrix}.$$

The score test statistic is defined as

$$W(d) = \frac{1}{n}\, \hat{U}_{\gamma_g \beta_g}^{T}(\tilde{\mu}; d)\, \hat{V}^{-1}(d)\, \hat{U}_{\gamma_g \beta_g}(\tilde{\mu}; d), \tag{8}$$

where $\hat{U}_{\gamma_g \beta_g}$ is a consistent estimator of $U_{\gamma_g \beta_g}$ under the null hypothesis and its formulation is derived in the Appendix. The estimated covariance matrix $\hat{V}(d)$ can be obtained by $n^{-1} \sum_{i=1}^{n} \hat{U}_{\gamma_g \beta_g, i}(d)\, \hat{U}_{\gamma_g \beta_g, i}^{T}(d)$. To approximate the distribution of $W(d)$, we generate a large number, say $R = 10000$, of

$$\tilde{W}(d) = \frac{1}{n}\, \tilde{U}_{\gamma_g \beta_g}^{T}(\tilde{\mu}; d)\, \hat{V}^{-1}(d)\, \tilde{U}_{\gamma_g \beta_g}(\tilde{\mu}; d) \tag{9}$$

using randomly perturbed score functions $\tilde{U}_{\gamma_g \beta_g}(d) = \sum_{i=1}^{n} \hat{U}_{\gamma_g \beta_g, i}(d) X_i$, $X_i \sim N(0, 1)$, while fixing the observed data $(Y, \delta, M, Z)$. From each set of $(X_1, \ldots, X_n)$, we can calculate $\sup_d \tilde{W}(d)$. Then the threshold for the genome-wide significance level $\alpha$ is determined by the $100(1 - \alpha)$th percentile of the $R$ simulated values of $\sup_d \tilde{W}(d)$. It is clear that the resampling method only involves generating standard normal random variables and some straightforward calculations.

Remark: In the presence of missing marker genotypes, the conditional probability of the putative QTL genotype can be calculated based on the two closest observed flanking markers of the specific location. The proposed approach can easily deal with missing and/or dominant marker situations since only the conditional probabilities $\{p_i(j)\}$ need modified formulas (Jiang and Zeng, 1997; Zou et al., 2004). We have considered both types of complications in the following real data application.
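The resampling procedure of this subsection can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the estimated per-subject score contributions $\hat{U}_{\gamma_g \beta_g, i}(d)$ have already been computed and stored in an array `scores` of shape (locations, subjects, parameters), and it uses NumPy. The function name `resampling_threshold` and its arguments are hypothetical.

```python
import numpy as np


def resampling_threshold(scores, alpha=0.05, n_resamples=10000, seed=0):
    """Genome-wide threshold for sup_d W(d) via Gaussian score perturbation.

    scores[d, i, :] holds the estimated score contribution of subject i at
    location d (assumed precomputed under the null hypothesis).
    Returns the 100(1 - alpha)th percentile of the resampled sup_d W~(d).
    """
    rng = np.random.default_rng(seed)
    n_loc, n, q = scores.shape
    # V_hat(d) = n^{-1} sum_i U_i(d) U_i(d)^T, one q x q matrix per location.
    v_hat = np.einsum("dip,diq->dpq", scores, scores) / n
    v_inv = np.linalg.inv(v_hat)
    # One row of standard normal multipliers (X_1, ..., X_n) per resample;
    # the same multipliers are shared by all locations.
    x = rng.standard_normal((n_resamples, n))
    # U~(d) = sum_i U_i(d) X_i for every resample and location.
    u_tilde = np.einsum("ri,dip->rdp", x, scores)
    # W~(d) = n^{-1} U~(d)^T V_hat(d)^{-1} U~(d), as in (9).
    w_tilde = np.einsum("rdp,dpq,rdq->rd", u_tilde, v_inv, u_tilde) / n
    sup_w = w_tilde.max(axis=1)
    return float(np.quantile(sup_w, 1.0 - alpha))
```

Sharing the multipliers across locations is what preserves the dependence between test statistics at nearby positions. The returned value is on the scale of $W(d)$; dividing by $2 \ln 10$ converts it to the LOD scale, given the asymptotic equivalence of the score and likelihood ratio tests noted above.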

3. Numerical Results

In this section we first demonstrate the proposed mixture cure models with an application to a real data set. Then we report numerical results of simulations conducted to assess the performance of the proposed methods under various settings motivated by the real data example.

3.1. Real Data Example

To illustrate our methods, we considered the data from the study on the survival of 116 female mice from an intercross experiment between the BALB/cByJ and C57BL/6ByJ strains after infection with Listeria monocytogenes (Boyartchuk et al., 2001; Broman, 2003). The mice were genotyped at 133 markers over 20 chromosomes, including 2 on the X chromosome. In this specific data set, the only censoring occurred at the end of the study. From the biological point of view, the mice surviving more than 240 hours may have recovered from the infection. We employed our proposed mixture cure model to detect the QTL effects on several related chromosomes. More specifically, we consider the parametric proportional hazards mixture cure model (6) and use the two-parameter Weibull hazard function, $\lambda_0(t; \theta) = \theta_1 \theta_2 t^{\theta_2 - 1}$. For the intercrossed mice data, the genotypes $G_i$ are recorded using a two-dimensional vector with three possible values, $(1, 0)$, $(0, 1)$ and $(0, 0)$, according to the genotypes $QQ$, $Qq$ and $qq$, respectively. The EM algorithm and LRTs were conducted as described in the last section. The threshold at the 5% genome-wide significance level obtained from the resampling approach is 3.83. The corresponding estimates of regression parameters at the locations with the largest LOD scores for each chromosome are presented in Table 1. Recall that the regression parameters were estimated with the putative QTL location fixed, for the purpose of constructing the likelihood ratio test statistics at each $d$.
(INSERT TABLE 1 HERE)

We also carried out LOD score analysis using the PPH model on the same set of data to gain insight into the importance of taking into account possible heterogeneous susceptibility. The profiles of both the $\mathrm{LOD}(\gamma_g, \beta_g)$ for the cure model and $\mathrm{LOD}(\beta_g)$

for the PPH model for testing $H_0$ are shown in the top plot of Figure 1. These two methods show some discrepancies, especially in chromosomes 1 and 5, which will be further discussed later.

(INSERT FIGURE 1 HERE)

To test more specific effects of the QTL, we proceed with testing hypotheses $H_{0\gamma}$ and $H_{0\beta}$ using the proposed mixture cure model. The corresponding LOD($d$) profiles are shown in the middle and bottom plots of Figure 1. It is clear from the plots that only in chromosome 13 do all three peaks of the LOD scores of the cure models exceed the genome-wide thresholds. This indicates that the QTL in chromosome 13 has significant joint and separate effects on susceptibility and survival times of susceptible mice. Additional interesting patterns were found in chromosomes 1 and 5. In chromosome 1, there is no significant effect on susceptibility ($H_{0\gamma}: \gamma_g = 0$, Figure 1: middle plot), but there is a significant effect on the survival distribution among the susceptible mice ($H_{0\beta}: \beta_g = 0$, Figure 1: bottom plot). On the other hand, the QTL in chromosome 5 significantly affects only the susceptibility of mice (Figure 1: middle plot) but not the survival distribution of the susceptible mice (Figure 1: bottom plot).

Next we examine chromosomes 1 and 5 more closely to properly interpret the differences between our results and the findings using the PPH method. To be specific, we first examine the survival distribution based on marker D1M355, the marker closest to the estimated QTL position (81 cM) on chromosome 1. Various survival distribution plots are presented in Figure 2.

(INSERT FIGURE 2 HERE)

The Kaplan-Meier plots of censored survival times grouped by marker D1M355 genotypes are displayed on the upper left; the Kaplan-Meier plots for the observed failure times only are displayed on the upper right. The lower two plots are the estimated survival distributions for the corresponding upper plots using our proposed mixture cure model and the resulting estimates of the parameters in Table 1.
The upper left Kaplan-Meier plot shows that the three survival curves approach nearly the same level at the end of the study, which indicates that the cure proportions of the three groups are

close. This tail similarity obscures the vertical differences among survival curves, while such differences are much more obvious when considering only the survival distribution for susceptible subjects, as shown in the upper and lower right plots. Thus, the PPH model, which essentially assumes no cured fraction and looks for differences in the upper left plot, does not yield significance. But the proposed parametric mixture cure model, which considers both the upper left and right plots, yields significance on chromosome 1. These findings are consistent with the results reported in Broman (2003) and Diao et al. (2004).

Additionally, based on our testing results, chromosome 5 seems to be a typical case where the QTL only affects susceptibility. To see this, Figure 3 displays four survival distribution plots on marker D5M357 presented in the same layout as described above for Figure 2. It is clear from the two plots on the left that the cure fractions of the three groups are very different. The estimated cure fractions are 0.64, 0.29 and 0.03 for the three genotypes AA, Aa, and aa, respectively. But the two plots on the right show that the survival distributions of susceptible subjects are very similar among these three groups. Because the tails are well separated, the overall survival curves can still be distinguished from each other even without considering cure effects. Hence the QTL effect in chromosome 5 can be detected using both the cure model and the PPH model.

(INSERT FIGURE 3 HERE)

3.1.1 Model Diagnostic

The proportional hazards mixture cure model with a two-parameter Weibull baseline hazard function is used in the analysis of the Listeria data. Based on Figures 2 and 3, the estimated survival curves (lower plots) seem very similar to the observed Kaplan-Meier curves (upper plots). Model diagnostics and goodness-of-fit analysis are critical in practical applications of parametric models.
In this subsection, we provide some formal examination of the goodness of fit of the proposed parametric mixture cure model to the Listeria data. The Listeria data have no other covariates besides the genotype groups. We first assess the Weibull baseline hazard assumption for the survival distribution of the susceptible subjects by examining each marker genotype group separately. More specifically,

at a specific marker and for each genotype group, an overall survival distribution $S(t)$ was estimated by the nonparametric Kaplan-Meier estimate, which is denoted by $\hat{S}(t)$. Under the assumption that a cure proportion exists, the overall survival distribution function can be written as $S(t) = p S_s(t) + (1 - p)$, where $p$ stands for the susceptible probability and $S_s(t)$ denotes the survival distribution for the susceptible subjects in this genotype group. Therefore, we can represent $S_s(t)$ as $\{S(t) - (1 - p)\}/p$. A consistent nonparametric estimate of $S_s(t)$ is thus $\hat{S}_s(t) = \{\hat{S}(t) - \hat{S}(\infty)\}/\{1 - \hat{S}(\infty)\}$. To check the shared Weibull hazard assumption for the survival distribution of susceptible subjects, we plot $\log\{-\log \hat{S}_s(t)\}$ against $\log t$, which should be linear under the assumption of a Weibull distribution. The plots are approximately linear, supporting a Weibull baseline assumption. The plots on markers D1M355 and D5M357 are presented in Figure 4. We note that there appear to be some deviations from the straight line at the beginning of the study. One possible reason is that the animals were not immediately at risk of death after infection.

(INSERT FIGURE 4 HERE)

We next examine the overall fit of the proposed mixture cure model to the Listeria data graphically and numerically. The population survival distribution $S(t)$ was estimated nonparametrically using the Kaplan-Meier estimate $\hat{S}(t)$. Under the proposed parametric mixture cure model, the survival distribution $S(t; \mu)$ is estimated by fixing the parameters at the estimated values $\hat{\mu}$ (from Table 1). We plot $\hat{S}(t)$ against $S(t; \hat{\mu})$, and the resulting P-P plot can be used to assess the closeness to the diagonal line. The results on markers D1M355 and D5M357 are shown in Figure 5. Two-sample Kolmogorov-Smirnov tests were employed to evaluate the goodness of fit between $\hat{S}(t)$ and $S(t; \hat{\mu})$ at these two markers and yielded p-values of 0.34 and 0.18, respectively.
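The Weibull check described in this subsection, rescaling the Kaplan-Meier curve by its plateau and taking the log-log transform, can be sketched as follows. This is a minimal illustration, assuming that $\hat{S}(\infty)$ is taken as the Kaplan-Meier value after the last observed event, that all times are positive, and that at least one event is observed; the function names are hypothetical.

```python
import math


def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (event time, S_hat) pairs."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group tied observations at the same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def susceptible_loglog(times, events):
    """Points (log t, log(-log S_s_hat(t))) for the Weibull linearity check.

    S_s_hat(t) = {S_hat(t) - S_hat(inf)} / {1 - S_hat(inf)}, with S_hat(inf)
    taken as the Kaplan-Meier plateau after the last observed event.
    """
    curve = kaplan_meier(times, events)
    plateau = curve[-1][1]
    points = []
    for t, s in curve:
        s_s = (s - plateau) / (1.0 - plateau)
        if 0.0 < s_s < 1.0:  # the transform is undefined at 0 and 1
            points.append((math.log(t), math.log(-math.log(s_s))))
    return plateau, points
```

Plotting the returned points for each genotype group and checking for approximately parallel straight lines corresponds to the graphical assessment described above.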
Therefore, these results and plots indicate that there is no serious violation in using the proposed proportional hazards mixture cure model with the Weibull baseline hazard function.

(INSERT FIGURE 5 HERE)

3.2. Simulations

A series of simulation studies were conducted to assess the performance of the proposed cure model under practical settings. For illustration, we also present the results from the PPH model assuming no cured fraction. The survival times were generated from an $F_2$ population of mixed susceptibility that mimics the settings of the real data example presented above. No covariates other than genetic factors are included. To be specific, we consider the parametric proportional hazards mixture cure model with the following specifications,

$$\mathrm{Pr}(\eta_i = 1 \mid G_i) = \frac{\exp(\gamma_0 + \gamma_g^T G_i)}{1 + \exp(\gamma_0 + \gamma_g^T G_i)}, \quad \text{and} \quad \lambda(t \mid G_i, \eta_i = 1) = \lambda_0(t; \theta)\, e^{\beta_g^T G_i},$$

where $G_i$ is defined for the $F_2$ population as before and $\lambda_0(t; \theta)$ is the Weibull baseline hazard function. The independent censoring times $C_i$ are generated from the uniform distribution $U(0, 10)$. Under this general random censoring, it is no longer appropriate to simply attribute all censored individuals to the category of cured. One chromosome of total length 100 cM is simulated, and the markers are generated from a Markov chain at evenly spaced positions with distances of 10 and 20 cM. Under the alternative hypotheses, the QTL is posited at 35 cM with different combinations of the long-term and short-term effects. Under each scenario, we perform 1000 runs with sample size $n = 300$. In each run, we resample 10000 times.

To examine the performance of the LOD score tests using the cure model and the PPH model, the null hypothesis $H_0: \gamma_g = 0$ and $\beta_g = 0$ is tested under various proportions of nonsusceptible subjects. The genome-wide threshold is obtained by the resampling method described in the preceding section, and the estimated threshold values are presented in Table 2.

(INSERT TABLE 2 HERE)

The empirical type I errors and powers of the two methods using the cure model and the PPH model are summarized in Table 3.
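The data-generating mechanism for the trait values can be sketched as below. This is a simplified illustration under stated assumptions: the QTL genotype is drawn directly with the $F_2$ frequencies 1/4, 1/2, 1/4 instead of being generated jointly with markers from a Markov chain, failure times of susceptible subjects are drawn by inverting the Weibull proportional hazards survival function, and the Weibull parameter values passed by the caller are hypothetical because the text does not report them.

```python
import math
import random

# F2 coding of the QTL genotypes QQ, Qq, qq, as in the text.
GENOTYPES = [(1, 0), (0, 1), (0, 0)]


def simulate_subject(rng, gamma0, gamma_g, beta_g, theta1, theta2, c_max=10.0):
    """Draw (genotype, eta, Y, delta) for one subject of a mixed population."""
    g = rng.choices(GENOTYPES, weights=[0.25, 0.5, 0.25])[0]
    lin_susc = gamma0 + sum(a * b for a, b in zip(gamma_g, g))
    p_susc = 1.0 / (1.0 + math.exp(-lin_susc))  # logistic susceptibility model
    eta = 1 if rng.random() < p_susc else 0
    c = rng.uniform(0.0, c_max)  # independent censoring time, U(0, c_max)
    if eta == 1:
        rel_risk = math.exp(sum(a * b for a, b in zip(beta_g, g)))
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        # Solve exp(-theta1 * t**theta2 * rel_risk) = u for t.
        t = (-math.log(u) / (theta1 * rel_risk)) ** (1.0 / theta2)
    else:
        t = math.inf  # nonsusceptible subjects never experience the event
    y = min(t, c)
    delta = 1 if t <= c else 0
    return g, eta, y, delta
```

With `gamma0 = 0.5` and no QTL effect, the susceptible probability is $e^{0.5}/(1 + e^{0.5}) \approx 0.62$, so about 38% of simulated subjects are nonsusceptible; censored subjects are a mixture of these and of susceptible subjects whose failure time exceeds the censoring time.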
The simulation results indicate that the proposed cure model produces reasonable type I errors and good powers to detect the QTL effects under all types of alternative models that we simulated. On the other hand, the PPH method inflates the type I errors and may lose power to detect the QTL under some cure model alternatives. More specifically, under the null hypothesis of no

overall QTL effects ($\gamma_g = \beta_g = 0$), the cure fraction is 38% ($\gamma_0 = 0.5$), and it is clear that the PPH method yields inflated type I errors. This demonstrates that, without taking into account the existence of nonsusceptible individuals, the PPH method tends to find a spurious QTL.

(INSERT TABLE 3 HERE)

The powers are calculated under three different alternatives, which mimic the different situations presented in the real data. For simplicity, we directly compare the powers of our proposed cure method with the PPH method without adjustment for type I error inflation. However, since the PPH method produces inflated type I errors, adjustments are needed if it is applied in practice in such a context. For the first alternative (Table 2), we consider the case $\beta_g = (0.5, 0.5)$ and $\gamma_g = 0$, which mimics the situation of chromosome 1 (Figure 2), i.e. the QTL only affects the survival distribution of susceptible subjects but not the susceptible probability. With 6 markers, at the 5% nominal level, the proposed cure method has 85% power to detect the QTL, but the PPH method attains less than 1/3 of the power of the cure method. Similar results are observed with the 10-marker setting and other nominal levels. This agrees with the performance of the PPH model on chromosome 1 in the analysis of the real example. To mimic the situation in chromosome 5 (Figure 3), we consider the second alternative $\beta_g = 0$ and $\gamma_g = (0.75, 0.75)$. Both the cure method and the PPH method retain good power to detect the QTL in this situation. The PPH method has slightly higher power than the cure model method, but note that the PPH method is not adjusted for its inflated type I errors here. Such observations provide some evidence in support of the results of the real data analysis in chromosome 5, where the PPH method has larger LOD scores than the cure model method, but both methods can find the QTL.
To mimic chromosome 13 in the real data, we consider the third alternative, in which the QTL has joint effects: β_g = (0.5, 0.5) and γ_g = (0.25, 0.25). It is evident from Table 3 that the proposed cure model method has almost twice the power of the PPH method to detect the presence of the QTL in this case.

We conclude this simulation section with a few comments. During the search for genome-wide thresholds, for each scenario, we performed 1000 runs with sample size

n = 300. We thereby obtained a large number of {W(d), d}, as defined in (7), directly from the data drawn under each null hypothesis, which enables us to evaluate the empirical distribution of the genome-wide LOD score thresholds at the given significance levels. For the LOD scores based on the mixture cure model, the estimated thresholds (Table 2) obtained using the resampling approach (using {W(d), d} in (8)) were close to the corresponding percentiles of the empirical thresholds from the {W(d), d}. This indicates that our method maintains proper type I error rates at the given sample sizes.

In addition, we performed simulations to study the situation in which the underlying population is actually homogeneous, i.e. there are no non-susceptible subjects in the survival model. In such situations, the LOD score based on the traditional survival model provides a valid test. Under the mixture cure model, the estimates become questionable because the long-term cure effects γ_g are not identifiable. However, for testing various hypotheses about QTL using the LRT (LOD score), the cure model approach is still valid for homogeneous populations (Hodge and Elston, 1994; Liu and Shao, 2003) in terms of maintaining correct significance levels, though it often has a small to moderate loss of power (results not shown) compared with the LOD scores based on the correct homogeneous model.

4. Discussion

In this article, we propose a mixture cure model for interval mapping of QTL using a time-to-event trait from a population of mixed susceptibility. The method is applicable when the time-to-event trait is subject to random censoring. It provides a natural tool for detecting QTL that affect susceptibility and/or the survival distribution of the susceptible population. Genome-wide significance levels for the LRT can be obtained using the resampling method. Goodness-of-fit of the parametric mixture cure model is also discussed.
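As a rough illustration of how a genome-wide LOD threshold can be read off from resampled score statistics, the Python sketch below perturbs per-subject score contributions with independent standard normal multipliers, records the maximum statistic over all locations, and takes an upper percentile. It is a simplified sketch, not the exact procedure of this paper: it assumes a single scalar score contribution per subject at each location (the statistics used here are based on vector-valued efficient scores), and the function name `resampled_lod_threshold` and its arguments are hypothetical. The conversion LOD = LRT / (2 ln 10) is the usual relation between the two scales.

```python
import math
import random


def resampled_lod_threshold(scores, alpha=0.05, n_resample=1000, seed=0):
    """Approximate a genome-wide LOD threshold by multiplier resampling.

    scores[d][i] is the (hypothetical, scalar) score contribution of
    subject i at location d, evaluated under the null hypothesis.
    Each location is assumed to have a non-zero sum of squared scores.
    """
    rng = random.Random(seed)
    n = len(scores[0])
    # Empirical variance of the score sum at each location.
    variances = [sum(u * u for u in row) for row in scores]
    maxima = []
    for _ in range(n_resample):
        # One set of N(0, 1) multipliers is shared by all locations,
        # which preserves the correlation between nearby locations.
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        w_max = max(
            sum(u * gi for u, gi in zip(row, g)) ** 2 / v
            for row, v in zip(scores, variances)
        )
        maxima.append(w_max / (2.0 * math.log(10.0)))  # LRT scale to LOD scale
    maxima.sort()
    k = min(n_resample - 1, math.ceil((1.0 - alpha) * n_resample) - 1)
    return maxima[k]


# Toy usage with made-up score contributions for 10 locations and 50 subjects.
toy_rng = random.Random(1)
toy_scores = [[toy_rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(10)]
print(resampled_lod_threshold(toy_scores, alpha=0.05, n_resample=500))
```

Sharing the same multipliers across locations is the design choice that lets the maximum over locations reflect the dependence among tests along the genome, which is why a single resampling run yields a genome-wide rather than a pointwise threshold.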
The proposed method can be generalized to composite QTL models along the lines of Zeng (1993, 1994), as discussed in Diao et al. (2004). More recently, Diao and Lin (2005) developed a semiparametric proportional hazards model for mapping QTL using time-to-event traits. A resampling method was also implemented through the efficient scores to obtain genome-wide thresholds. A similar strategy may also be extended to

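the semiparametric proportional hazards cure model, which seems to be a worthwhile topic for further research. The challenge is that the semiparametric efficient scores for the regression parameters are much more complicated in the mixture cure model, which requires further investigation.

We have also demonstrated through simulation studies that, if the underlying population is really a mixture of susceptible and nonsusceptible subjects, the LOD score based on the proposed mixture cure model can be used to test for QTL effects. On the other hand, as indicated by our simulation results, methods that ignore the latent heterogeneous susceptibility, such as the simple PPH model, may fail to detect the true QTL and may also produce spurious QTL in the presence of heterogeneous susceptibility. Thus the proposed mixture cure model is useful for mapping QTL based on time-to-event data whenever biological reasons and/or a sufficiently long follow-up time indicate the existence of a latent non-susceptible sub-population.

Appendix

Appendix A. Score functions of the observed likelihood in the case of the proportional hazards mixture cure model for an F2 intercross family.

In this appendix, we present the formulas of the score functions based on the observed likelihood for the parametric mixture cure model. The scores are required in the resampling stage and are defined as the partial derivatives of the logarithm of the observed likelihood in (4) with respect to the parameters,

$$U(\mu; d) = \frac{\partial l(\mu; d)}{\partial \mu} = \left( \frac{\partial l(\mu; d)}{\partial \gamma}, \frac{\partial l(\mu; d)}{\partial \gamma_g}, \frac{\partial l(\mu; d)}{\partial \beta}, \frac{\partial l(\mu; d)}{\partial \beta_g}, \frac{\partial l(\mu; d)}{\partial \theta} \right),$$

where

$$\frac{\partial l(\mu; d)}{\partial \gamma} = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{j=1,2,3} \left[ p_i(j)\, \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j} \left( A_i - 1 + \delta_i \right)}{\left(1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}\right)^2} \right] Z_i;$$

$$\frac{\partial l(\mu; d)}{\partial \gamma_g} = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{j=1,2,3} \left[ p_i(j)\, \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j} \left( A_i - 1 + \delta_i \right)}{\left(1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}\right)^2} \right] G_j;$$

$$\frac{\partial l(\mu; d)}{\partial \beta} = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{j=1,2,3} \left[ p_i(j)\, \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j} A_i \left\{ \delta_i - \Lambda_0(Y_i)\, e^{\beta^\top Z_i + \beta_g^\top G_j} \right\}}{1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}} \right] Z_i;$$

$$\frac{\partial l(\mu; d)}{\partial \beta_g} = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{j=1,2,3} \left[ p_i(j)\, \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j} A_i \left\{ \delta_i - \Lambda_0(Y_i)\, e^{\beta^\top Z_i + \beta_g^\top G_j} \right\}}{1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}} \right] G_j;$$

$$\frac{\partial l(\mu; d)}{\partial \theta} = \sum_{i=1}^{n} \frac{1}{D_i} \sum_{j=1,2,3} \left[ p_i(j)\, A_i \left\{ \delta_i \frac{\partial \log \lambda_0(Y_i)}{\partial \theta} - \frac{\partial \Lambda_0(Y_i)}{\partial \theta}\, e^{\beta^\top Z_i + \beta_g^\top G_j} \right\} \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j}}{1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}} \right],$$

where, for each genotype j,

$$A_i = \left( \lambda_0(Y_i)\, e^{\beta^\top Z_i + \beta_g^\top G_j} \right)^{\delta_i} \exp\left( -\Lambda_0(Y_i)\, e^{\beta^\top Z_i + \beta_g^\top G_j} \right)$$

and

$$D_i = \sum_{j=1,2,3} \left[ p_i(j) \left\{ A_i\, \frac{e^{\gamma^\top Z_i + \gamma_g^\top G_j}}{1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}} + \frac{1 - \delta_i}{1 + e^{\gamma^\top Z_i + \gamma_g^\top G_j}} \right\} \right].$$

The above score functions are employed to approximate the genome-wide threshold using the resampling method of Diao et al. (2004), with $\mu = \tilde{\mu}$ evaluated at the restricted MLE under the null hypothesis.

Appendix B. The observed information matrix.

Let $I(\mu)$ denote the observed information matrix, which consists of the second derivatives of the observed log-likelihood with respect to the parameters. The direct calculation of the second derivatives can be complicated, hence we prefer to use the formula of Louis (1982) to calculate it based on the complete likelihood, i.e.

$$I(\mu; d) = E_\mu\left( I_C(\mu; d) \mid O \right) - E_\mu\left( U_C(\mu; d)\, U_C^\top(\mu; d) \mid O \right) + E_\mu\left( U_C(\mu; d) \mid O \right) E_\mu^\top\left( U_C(\mu; d) \mid O \right).$$

We divide the above observed information matrix into several blocks with respect to the different parameters (the matrix is symmetric, so only the upper triangle is shown):

$$I(\mu; d) = \begin{pmatrix} I_{\gamma}(\mu; d) & I_{\gamma\gamma_g} & I_{\gamma\beta} & I_{\gamma\beta_g} & I_{\gamma\theta} \\ & I_{\gamma_g}(\mu; d) & I_{\gamma_g\beta} & I_{\gamma_g\beta_g} & I_{\gamma_g\theta} \\ & & I_{\beta}(\mu; d) & I_{\beta\beta_g} & I_{\beta\theta} \\ & & & I_{\beta_g}(\mu; d) & I_{\beta_g\theta} \\ & & & & I_{\theta}(\mu; d) \end{pmatrix}.$$

To test the null hypothesis $H_0$, the observed information matrix is evaluated at the restricted MLE $\tilde{\mu}$ under $H_0$. Then the likelihood ratio test statistics constructed at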

each location d are asymptotically equivalent to the following estimated score test statistics at the same location:

$$\hat{U}_{\gamma_g \beta_g}(\tilde{\mu}; d) = \begin{pmatrix} U_{\gamma_g}(\tilde{\mu}; d) \\ U_{\beta_g}(\tilde{\mu}; d) \end{pmatrix} - \begin{pmatrix} I_{\gamma_g\gamma} & I_{\gamma_g\beta} & I_{\gamma_g\theta} \\ I_{\beta_g\gamma} & I_{\beta_g\beta} & I_{\beta_g\theta} \end{pmatrix} \begin{pmatrix} I_{\gamma} & I_{\gamma\beta} & I_{\gamma\theta} \\ I_{\gamma\beta}^\top & I_{\beta} & I_{\beta\theta} \\ I_{\gamma\theta}^\top & I_{\beta\theta}^\top & I_{\theta} \end{pmatrix}^{-1} \begin{pmatrix} U_{\gamma}(\tilde{\mu}; d) \\ U_{\beta}(\tilde{\mu}; d) \\ U_{\theta}(\tilde{\mu}; d) \end{pmatrix}.$$

References

V. L. Boyartchuk, K. W. Broman, R. E. Mosher et al., Multigenic control of Listeria monocytogenes susceptibility in mice, Nat. Genet., vol. 27, pp. 259-260, 2001.
K. W. Broman, Mapping quantitative trait loci in the case of a spike in the phenotype distribution, Genetics, vol. 163, pp. 1169-1175, 2003.
B. S. Carter, T. H. Beaty, G. D. Steinberg, B. Childs and P. C. Walsh, Mendelian inheritance of familial prostate cancer, Proc. Natl. Acad. Sci. USA, vol. 89, pp. 3367-3371, 1992.
E. B. Claus, N. J. Risch and W. D. Thompson, Using age of onset to distinguish between subforms of breast cancer, Annals of Human Genetics, vol. 54, pp. 169-177, 1990.
D. R. Cox, Regression models and life tables (with discussion), J. R. Stat. Soc. B, vol. 34, pp. 187-220, 1972.
D. R. Cox and D. V. Hinkley, Theoretical Statistics. Chapman & Hall, London, 1974.
A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. B, vol. 39, pp. 1-38, 1977.
G. Diao, D. Y. Lin and F. Zou, Mapping quantitative trait loci with censored observations, Genetics, vol. 168, pp. 1689-1698, 2004.
G. Diao and D. Y. Lin, Semiparametric methods for mapping quantitative trait loci with censored data, Biometrics, vol. 61, pp. 789-798, 2005.

J. Dupuis and D. Siegmund, Statistical methods for mapping quantitative trait loci from a dense set of markers, Genetics, vol. 151, pp. 373-386, 1999.
V. T. Farewell, The use of mixture models for the analysis of survival data with long-term survivors, Biometrics, vol. 38, pp. 1041-1046, 1982.
J. P. Fine, Analysing competing risks data with transformation models, J. R. Stat. Soc. B, vol. 61, pp. 817-830, 1999.
A. M. Glazier, J. H. Nadeau and T. J. Aitman, Finding genes that underlie complex traits, Science, vol. 298, pp. 2345-2349, 2002.
S. E. Hodge and R. C. Elston, Lods, Wrods, and Mods: the interpretation of Lod scores calculated under different models, Genet. Epidemiol., vol. 11, pp. 329-342, 1994.
S. E. Hodge, V. Vieland and D. A. Greenberg, HLODs remain powerful tools for detection of linkage in the presence of genetic heterogeneity, Am. J. Hum. Genet., vol. 70, pp. 556-558, 2001.
C. J. Jiang and Z-B. Zeng, Mapping quantitative trait loci with dominant and missing markers in various crosses from two inbred lines, Genetica, vol. 101, pp. 47-58, 1997.
J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, Ed. 2. Wiley, NJ, 2002.
A. Y. C. Kuk and C. H. Chen, A mixture model combining logistic regression with proportional hazards regression, Biometrika, vol. 79, pp. 531-541, 1992.
E. S. Lander and D. Botstein, Mapping mendelian factors underlying quantitative traits using RFLP linkage maps, Genetics, vol. 121, pp. 185-199, 1989.
E. S. Lander and N. J. Schork, Genetic dissection of complex traits, Science, vol. 265, pp. 2037-2048, 1994.
H. Li and E. A. Thompson, Semiparametric estimation of major gene and random familial effects for age of onset, Biometrics, vol. 53, pp. 282-293, 1997.
D. Y. Lin, An efficient Monte Carlo approach to assessing statistical significance in genomic studies, Bioinformatics, vol. 21, pp. 781-787, 2005.
X. Liu and Y. Shao, Asymptotics of likelihood ratio test under loss of identifiability, Ann. Statist., vol. 31, pp. 807-832, 2003.
T. A. Louis, Finding the observed information matrix when using the EM algorithm, J. R. Stat. Soc. B, vol. 44, pp. 226-233, 1982.
W. Lu and Z. Ying, On semiparametric transformation cure models, Biometrika, vol. 91, pp. 331-343, 2004.
M. Lynch and B. Walsh, Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA, 1998.
Y. Miki, J. Swensen, D. Shattuck-Eidens, P. A. Futreal, et al., A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1, Science, vol. 266, pp. 66-71, 1994.
Y. Peng and K. B. G. Dear, A nonparametric mixture model for cure rate estimation, Biometrics, vol. 56, pp. 237-243, 2000.
A. Rebai, B. Goffinet and B. Mangin, Comparing power of different methods for QTL detection, Biometrics, vol. 51, pp. 87-99, 1995.
J. P. Sy and J. M. G. Taylor, Estimation in a Cox proportional hazards cure model, Biometrics, vol. 56, pp. 227-236, 2000.
R. C. Symons, M. J. Daly, J. Fridlyand, T. P. Speed, W. D. Cook, et al., Multiple genetic loci modify susceptibility to plasmacytoma-related morbidity in Eµ-v-abl transgenic mice, Proc. Natl. Acad. Sci. USA, vol. 99, pp. 11299-11304, 2002.
Z. B. Zeng, Theoretical basis for separation of multiple linked gene effects in mapping QTL, Proc. Natl. Acad. Sci. USA, vol. 90, pp. 10972-10976, 1993.
Z. B. Zeng, Precision mapping of quantitative trait loci, Genetics, vol. 136, pp. 1457-1468, 1994.
F. Zou, J. P. Fine, J. Hu and D. Y. Lin, An efficient resampling method for assessing genome-wide statistical significance in mapping quantitative trait loci, Genetics, vol. 168, pp. 2307-2316, 2004.

Table 1: Estimated QTL positions and effects from the Listeria data

| Chromosome | Pos (cM) | LOD | γ_0 | γ_g1 | γ_g2 | β_g1 | β_g2 |
|---|---|---|---|---|---|---|---|
| 1 | 81 | 8.434 | 0.408 | 0.936 | 0.469 | 2.257 | 1.010 |
| 5 | 28 | 6.389 | 3.435 | -4.017 | -2.547 | -0.405 | -0.065 |
| 6 | 11 | 4.663 | -0.164 | 2.108 | 1.09 | -1.02 | -1.141 |
| 13 | 26 | 7.632 | 0.319 | 1.975 | -0.051 | 0.106 | -0.951 |
| 15 | 16 | 3.955 | 1.797 | -0.436 | -1.617 | -0.795 | -0.202 |

Table 2: LOD score thresholds at significance level α obtained from resampling (with the sample standard errors in parentheses)

| Hypothesis | No. markers | Cure model, α = 5% | Cure model, α = 1% | PPH model, α = 5% | PPH model, α = 1% |
|---|---|---|---|---|---|
| H_0: γ_g1 = 0, γ_g2 = 0 and β_g1 = 0, β_g2 = 0 | 6 | 2.93 (0.03) | 3.81 (0.06) | 2.13 (0.03) | 2.90 (0.052) |
| | 11 | 3.07 (0.03) | 3.95 (0.06) | 2.26 (0.03) | 3.04 (0.05) |
| H_1a: γ_g1 = 0, γ_g2 = 0 and β_g1 = 0.5, β_g2 = 0.5 | 6 | 2.95 (0.03) | 3.82 (0.06) | 2.12 (0.03) | 2.89 (0.05) |
| | 11 | 3.09 (0.03) | 3.97 (0.05) | 2.26 (0.03) | 3.03 (0.05) |
| H_2a: γ_g1 = 0.75, γ_g2 = 0.75 and β_g1 = 0.0, β_g2 = 0.0 | 6 | 2.93 (0.03) | 3.80 (0.05) | 2.13 (0.03) | 2.90 (0.05) |
| | 11 | 3.07 (0.03) | 3.95 (0.06) | 2.27 (0.03) | 3.04 (0.05) |
| H_3a: γ_g1 = 0.25, γ_g2 = 0.25 and β_g1 = 0.5, β_g2 = 0.5 | 6 | 2.95 (0.03) | 3.83 (0.06) | 2.12 (0.03) | 2.89 (0.05) |
| | 11 | 3.09 (0.03) | 3.97 (0.06) | 2.26 (0.03) | 3.04 (0.05) |

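Table 3: Simulated type I errors and powers (in %) using the setup of Table 2

| No. markers | Model | H_0, α = 5% | H_0, α = 1% | H_1a, α = 5% | H_1a, α = 1% | H_2a, α = 5% | H_2a, α = 1% | H_3a, α = 5% | H_3a, α = 1% |
|---|---|---|---|---|---|---|---|---|---|
| 6 | Cure | 6.5 | 1.0 | 85.0 | 67.7 | 82.2 | 62.8 | 91.7 | 79.8 |
| 6 | PPH | 18.5 | 6.7 | 22.4 | 10.4 | 95.6 | 90.2 | 53.5 | 35.2 |
| 11 | Cure | 6.3 | 1.4 | 87.4 | 72.5 | 86.9 | 71.3 | 92.9 | 84.3 |
| 11 | PPH | 16.3 | 4.7 | 26.2 | 10.7 | 95.1 | 89.2 | 56.3 | 37.5 |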

[Figure 1 image: three stacked panels of LOD score (vertical axis) against position in cM (horizontal axis) along chromosomes 1, 5, 6, 13 and 15. Top panel: Cure LOD and PPH LOD. Middle panel: Cure LOD of Susceptibility. Bottom panel: Cure LOD of Survival.]

Figure 1: Top: Testing H_0 of no overall QTL effects, the LOD scores from two QTL mapping methods: Cure mixture model and PPH survival model on the Listeria data. The threshold (dotted horizontal line) is the 5% genome-wide significance level based on the resampling method using cure mixture model. Middle: Testing H_0γ of no QTL effects on susceptibility, the LOD score profile from cure mixture model and the threshold (dotted horizontal line) based on resampling method. Bottom: Testing H_0β of no QTL effects on survival distribution of the susceptible population, the LOD score profile from cure mixture model and the threshold based on resampling method.

[Figure 2 image: four panels plotted against time in hours, with one curve per genotype (AA, Aa, aa). Panel titles: Survival Probability; Survival in Susceptible Subjects; Estimated Overall Survival Probability; Estimated Survival in Susceptible Subjects.]

Figure 2: On Marker D1M355 in chromosome 1. Upper left is the Kaplan-Meier curves of survival times of all mice after infection of Listeria; Upper right is the Kaplan-Meier curves using only survival times of the mice with observed death; Lower left is the estimated overall survival distribution using mixture cure model; Lower right is the estimated survival distribution of susceptible population using mixture cure model.

[Figure 3 image: four panels plotted against time in hours, with one curve per genotype (AA, Aa, aa). Panel titles: Survival Probability; Survival in Susceptible Subjects; Estimated Overall Survival Probability; Estimated Survival in Susceptible Subjects.]

Figure 3: On Marker D5M357 in chromosome 5. Upper left is the Kaplan-Meier curves of survival times of all mice after infection of Listeria; Upper right is the Kaplan-Meier curves using only survival times of the mice with observed death; Lower left is the estimated overall survival distribution using mixture cure model; Lower right is the estimated survival distribution of susceptible population using mixture cure model.

[Figure 4 image: two rows of three panels (AA group, Aa group, aa group), each plotting log(-log(S(t))) on the vertical axis against log t on the horizontal axis.]

Figure 4: For each genotype group at a marker, log(-log Ŝ_s(t)) was plotted against log(t) and the straight line was fitted using the least squares method. Top row: Marker D1M355; Bottom row: Marker D5M357.

[Figure 5 image: P-P plot with the proposed parametric estimated survival distribution on the vertical axis and the Kaplan-Meier estimate on the horizontal axis, both on the scale 0 to 1. Series: Chr1:D1M355 (pv = 0.34) and Chr5:D5M357 (pv = 0.18).]

Figure 5: P-P plots of nonparametric Kaplan-Meier estimate of survival distribution against parametric estimates by proportional hazards mixture cure model with a Weibull baseline function.