QUANTITATIVE trait analysis plays an important observed. These assumptions are likely to be false when

Size: px

Start display at page:

Download "QUANTITATIVE trait analysis plays an important observed. These assumptions are likely to be false when"

Meghan Allen
5 years ago
Views:

1 Copyright 2004 y the Genetics Society of America DOI: /genetics Mapping Quantitative Trait Loci With Censored Oservations Guoqing Diao, D. Y. Lin 1 and Fei Zou Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina Manuscript received Octoer 31, 2003 Accepted for pulication June 2, 2004 ABSTRACT The existing statistical methods for mapping quantitative trait loci (QTL) assume that the phenotype follows a normal distriution and is fully oserved. These assumptions may not e satisfied when the phenotype pertains to the survival time or failure time, which has a skewed distriution and is usually suject to censoring due to random loss of follow-up or limited duration of the experiment. In this article, we propose an interval-mapping approach for censored failure time phenotypes. We formulate the effects of QTL on the failure time through parametric proportional hazards models and develop efficient likelihood-ased inference procedures. In addition, we show how to assess genome-wide statistical significance. The performance of the proposed methods is evaluated through extensive simulation studies. An application to a mouse cross is provided. QUANTITATIVE trait analysis plays an important oserved. These assumptions are likely to e false when role in the understanding of genetic variations the phenotype pertains to the survival time or failure in plants and animals. Mapping quantitative trait loci time. The Weiull distriution and other skewed distri- (QTL) can lead to improvements in economic traits, utions with long right tails are more appropriate than such as yield and quality in crop plants and milk producis the normal distriution. Furthermore, the failure time tion in cows. QTL mapping in animals can also provide often suject to censoring so that the trait value is valuale insights into the genetic etiologies of complex known only to e eyond the censoring time. An exam- human diseases (Hilert et al. 1991; Jaco et al. 1991; ple of failure time in plant experiments is the flowering Shepel et al. 1998). time, which may e censored due to limited duration Much of the modern statistical methodology for QTL of the experiment; see Ferreira et al. (1995). In animal mapping in experimental crosses originates from the studies, the failure times of interest include time to seminal work of Lander and Botstein (1989). The tumor and time to death (i.e., survival time), which Lander-Botstein interval-mapping method postulates may e suject to censoring ecause of limited study that QTL occur at a series of positions within a set of duration or death due to unrelated causes. One particu- adjacent marker intervals and that the trait value dein lar example is a mice cross presented y Broman (2003), pends on the QTL genotype through a linear regression which the trait of interest is time to death after a model. The distance etween each pair of genetic markstill acterial infection and in which 30% of the mice are ers is assumed known. The method steps through the alive at the end of the study period. Symons et al. genome in specified increments, say every 1 or 2 cm, (2002) presented another interesting study, in which and calculates the likelihood-ratio statistic for testing the phenotype is the time until terminal illness due to no QTL present at each position. The position with the tumor for E-v-al transgenic mice. largest value of the likelihood-ratio statistic is declared The incompleteness of the trait values presents major to e the candidate QTL location provided that the challenges in the application of the interval-mapping value exceeds a certain threshold level. It is widely recogwhich approach. Broman (2003) considered a cure model in nized that the interval-mapping method has higher power the mice that are alive at the end of the study and requires fewer progenies than the single-marker are regarded as cured and in which the survival times analysis (Lander and Botstein 1989; Haley and Knott among the deaths follow a log-normal distriution. This 1992; Zeng 1994). Doerge et al. (1997) descried in is a specialized model, which can deal only with the greater detail this method and various extensions. situations in which the potential censoring times are Most of the existing QTL-mapping methods require equal among all study sujects. Symons et al. (2002) that the phenotype e normally distriuted and fully utilized a variant of the expectation-maximization (EM) algorithm (Lipsitz and Irahim 1998) to map QTL with censored oservations. This method is computationally intensive and its properties have not een inves- 1 Corresponding author: Department of Biostatistics, University of North Carolina, 3101E McGavran-Greenerg Hall, CB 7420, Chapel tigated. Hill, NC lin@ios.unc.edu In this article, we provide a simple and rigorous exten- Genetics 168: (Novemer 2004)

2 1690 G. Diao, D. Lin and F. Zou sion of the interval-mapping approach of Lander and interference and no genotyping errors, these proailities Botstein (1989) to censored quantitative traits. Specifically, are determined y the genotypes of the two flanking we formulate the effects of QTL on the failure markers and the location of the QTL; see Equation 15.2 time through proportional hazards models (Kalfleish of Lynch and Walsh (1998). and Prentice 2002, Sect. 2.3). We then develop effi- We assume that censoring is noninformative (Kalfleish cient likelihood-ased methods for locating QTL and and Prentice 2002, p. 195) and that the censorcient estimating the effects of QTL. In addition, we show how ing time is independent of the failure time and QTL to assess genome-wide statistical significance y extending genotype. The likelihood for the vector of parameters the analytical results of Lander and Botstein (1989) ased on the complete data (Y i, i, G i, M i )(i1,..., and Dupuis and Siegmund (1999) and y developing n) is proportional to an accurate and efficient Monte Carlo procedure. We conduct extensive simulation studies to evaluate the n i (Yi g)e Y 0 i (t g)dt i,g I(G ig), (2) performance of the proposed methods. Finally, we provide an application to the mice data of Broman (2003). i1 g1,0,1 while the likelihood ased on the oserved data (Y i, i, M i )(i 1,...,n) is proportional to METHODS Interval mapping: In this section, we develop an interval-mapping method for potentially censored failure- time traits in an F 2 intercross population y modeling a single QTL. Expanding the results to other crosses is straightforward. Consider n progenies from an intercross etween two inred strains. Let T i denote the quantitative trait for the ith suject, which pertains to a failure time that can potentially e censored and thus incompletely oserved. Let C i e the censoring time for the ith suject. The oservation on the trait value of the ith suject consists of two components: Y i min(t i, C i ) and i I(T i C i ), where I() is the indicator function for event. The failure time T i is fully oserved only when it is uncensored, i.e., i 1. Suppose that we have data on a set of genetic markers with a known genetic map. Let M i denote the multipoint marker genotype data for the ith suject. We consider a putative QTL locus d in the genome with two possile alleles q and Q from the two inred parents and define G i 1, 0, or 1 according to whether the ith suject has genotype qq, Qq, orqq, respectively, at the QTL. We specify a proportional hazards model for the effects of the QTL genotype on the failure time such that, conditional on the QTL genotype G i, the hazard function of T i takes the form (t G i ) 0 (t)e 1 G i 2 (1 G i ), i 1,...,n, (1) n i1 g1,0,1 i (Yi g)e 0 Y i (t g)dt i,g. (3) To otain the maximum-likelihood estimator (MLE) of, we may maximize the oserved-data likelihood (3) directly. An alternative approach is to apply the EM algorithm (Dempster et al. 1977) to (2). The expected value of the complete-data log-likelihood given the o- served data can e shown to e, up to a constant, where p i,g () and n i1g1,0,1 p i,g () i ln (Y i g) Y i p i,g() v1,0,1 p i,v(), 0 (t g)dt, (4) g 1, 0, 1; i 1,...,n, p i,g() i,g exp i ( 1 g 2 (1 g )) Y i 0 (t g)dt. In the E-step, we evaluate p i,g () at the current estimate of. The M-step can proceed in a similar manner to the case of complete data since expression (4), with in p i,g () fixed, takes the same form as the complete- data log-likelihood. We egin the EM algorithm y as- signing an initial value to and iterate until conver- gence. The initial value for is set to 0 and that of to some value in the parameter space of. The resulting MLE is denoted y ˆ. See appendix a for further detail. We test the null hypothesis of no QTL effects, i.e., H 0 : 0, y the likelihood-ratio statistic where 1 and 2 pertain to the additive and dominant effects of the QTL, and 0 ( ) is an unknown aseline hazard function (Kalfleish and Prentice 2002, Sect. 2.3). In this article, we assume a parametric model for LR 2ln L(ˆ) L( ), 0. In particular, we consider a Weiull hazard function 0 (t) 1 2 t 21, 1 0, 2 0(Kalfleish and Pren- where L( ) is the oserved-data likelihood, and (0, tice 2002, p. 33). ) with eing the restricted MLE of under H 0. The Write (, ), where ( 1, 2 ) and ( 1, LOD score is LR/(2 ln 10). Under H 0, LR is asymptotically 2 2 ). At each locus, we may calculate i,g Pr(G i g M i ) -distriuted with 2 d.f. (appendix ). Note that (g 1, 0, 1; i 1,...,n), which are the conditional p i,g (), ˆ, L(ˆ), LR, and LOD all depend on the locus proailities of the QTL genotypes given the oserved d through the dependence of i,g on d. In the sequel, marker data. Under the assumptions of no crossover we include d in the expressions to emphasize their de-

3 QTL Mapping With Censored Data 1691 pendence on d if amiguity arises. Note also that and W(d) n 1 U T 1 ( ; d)vˆ (d)u ( ; d), L( ) do not depend on d. Thus, as in the case of stanwhere Vˆ (d) n 1 i1û n i (d)û T i (d) is a consistent estimadard interval mapping, the likelihood under H 0 is calcutor of the limiting covariance matrix V(d). It can e lated once while the likelihood under the alternative is shown that W(d) is asymptotically equivalent to LR(d) evaluated at each location in the genome to produce a (Cox and Hinkley 1974, Sect. 9.3). LOD curve for each chromosome. The position with In general, the limiting distriution of sup d W(d) is the largest value of the LOD score is declared to e the not analytically tractale. We use a resampling approach QTL location provided that the value exceeds a certain similar to that of Lin et al. (1993) to approximate the threshold level. We show how to determine the threshnull distriutions of n 1/2 U ( ; d) and sup d W(d). From old level in the following section. appendix, it is easy to see that n 1/2 U ( ; d) converges Thresholds: When searching the entire chromosome to a zero-mean Gaussian process with covariance funcor whole genome for QTL, one should select a threshold tion (d 1, d 2 ) that is the limit of n 1 i1ũ n i (; d 1 )Ũ T i (; level for the LOD score such that the proaility (under d 2 )at(d 1, d 2 ) and that (d 1, d 2 ) can e consistently the null hypothesis) that LOD or some other test statistic estimated y n 1 i1û n i (d 1 )Û T i (d 2 ). Define Û(d) i1 exceeds this level anywhere in the genome equals the n Û i (d)z i, where Z i (i 1,...,n) are independent desired false-positive rate. The pointwise significance standard normal random variales that are independent level ased on the 2 -approximation is inadequate eof the oserved data. Conditional on the oserved data, cause of the multiple tests while the Bonferroni correcn Û(d) is a Gaussian process with mean 0 and covarition is too conservative ecause of the dependence of ance function n 1 i1û n i (d 1 )Û T i (d 2 )at(d 1, d 2 ), which conthe test statistics at different points in the genome. In verges to (d 1, d 2 ). Thus, the conditional distriution appendix c, we show that the likelihood-ratio statistic of n 1/2 Û(d) given the oserved data converges to the LR(d) can e partitioned into the sum of the squares limiting distriution of n 1/2 U ( ; d). As a result, the distriof two asymptotically independent Ornstein-Uhleneck ution of W(d) can e approximated y that of processes. This result is analogous to those of Lander and Botstein (1989) and Dupuis and Siegmund (1999) Ŵ(d) n 1 Û T 1 (d)vˆ (d)û(d). and implies that the analytical approximations of thresholds for normal traits can e applied to the case of To approximate the distriution of sup d W(d), we gen- censored failure time oservations. These analytical renumer of times while holding the oserved data fixed; erate the normal random sample (Z 1,...,Z n ) a large sults assume that the markers are dense or equally spaced with no missing data and thus may not work well for each sample, we calculate Ŵ(d) and sup d Ŵ(d). The in practice. Using results of Davies (1977, 1987), Reai 100(1 )th percentile of the simulated sup d Ŵ(d) is et al. (1994) provided approximate thresholds in acklevel of. This resampling approach is computationally the threshold value for the genome-wide significance cross (BC) and F 2, which are applicale in the intermedimuch more efficient than the use of permutation (Churate map density case. The calculations are formidale, even for F 2, and do not accommodate missing marker chill and Doerge 1994) and other simulation methods data. ecause it involves only simulation of normal random To overcome the limitations of the analytical approxilated data sets. variales and does not entail repeated analysis of simumations, we propose a novel resampling approach to determining the thresholds for genome-wide statistical significance. This approach allows aritrary distriutions of the markers as well as aritrary test positions. It SIMULATIONS also accommodates missing marker data and dominant To investigate the operating characteristics of the proposed markers. methods in practical situations, we performed For evaluating the correlations of the test statistics extensive simulation studies. We generated the failure among different locations, it is more convenient to work times from the Weiull distriutions with aseline haz- with the score statistic than the likelihood-ratio statistic. ard function 0 (t) 1 2 t 21, where and 2 Let U ( ; d) e the score function (ased on the o- 2. We reparameterized 1 and 2 according to k served-data likelihood) for at location d, which can e k (k 1, 2) so as to ensure that the estimates of 1 e approximated y the sum of n independent zero- and 2 are positive. The censoring times were generated mean random vectors i1ũ n i ( 0 ; d), where 0 (0, 0 ) from the uniform (0, ) distriution, where was chosen is the true parameter value under H 0 ; see appendix. to yield 30% censored oservations. Assuming no Thus, n 1/2 U ( ; d) is asymptotically zero-mean normal crossover interference, we generated the marker data with covariance matrix V(d) that is the limit of n 1 from the Markov chain. The interval-mapping step size i1ũ n i ( 0 ; d)ũ T i ( 0 ; d). We replace the unknown quantities was set at 1 cm. in Ũ i ( 0 ; d) y their sample estimators to yield We considered a chromosome with a total length of Û i (d) shown in appendix. The score statistic for testing 100 cm. Genetic maps with different numers of equally H 0 : 0 against H 1 : 0 takes the form spaced markers were simulated. Under H 1, one QTL

4 1692 G. Diao, D. Lin and F. Zou TABLE 1 Summary statistics for the estimator of the additive QTL effect at the true QTL location H 0 : 1 0, 2 0 H 1 : , No. of markers Mean a SE SEE c CP d Mean a SE SEE c CP d a Mean is the mean of the parameter estimator. SE is the standard error of the parameter estimator. c SEE is the mean of the standard error estimator. d CP is the coverage proaility of the 95% confidence interval. located at 35 cm was simulated with and We generated 10,000 replicates of 300 oservations from an F 2 population. We evaluated the finite-sample of unevenly spaced markers, we placed m markers at the following locations, properties of the MLEs of the QTL effects at the true LOC j 50(j 1)/(m 1), j 1,...,[m/2], 100(j 1)/(m 1), otherwise, QTL location. The results for the estimator of the additive QTL effect are summarized in Tale 1. The pro- where LOC j is the jth marker location and [m/2] is the posed estimator appears to e virtually uniased. The largest integer that is less than or equal to m/2. In these standard error estimator reflects accurately the true vari- settings, the first half of the markers is denser than ation. The confidence intervals have proper coverage the second half of the markers. We generated 10,000 proailities. We otained similar results for the estima- replicates of 300 oservations from an F 2 population. tor of the dominant QTL effect (data not shown). We The dense-map and sparse-map approximations were also examined the performance of the proposed inter- otained from Equation C1 in appendix c. The threshval-mapping methods for locating the QTL and estimat- olds for the resampling method were ased on 10,000 ing the QTL effects at the location with the maximum normal samples. The results are summarized in Tales LOD. The results are summarized in Tale 2. There is 3 and 4. little ias for the estimator of the QTL location or the The thresholds ased on the resampling method are estimators of the QTL effects. The mapping is more close to the empirical values, whether the data are generprecise for denser marker maps. ated under H 0 or H 1 ; consequently, the LR tests ased We conducted additional simulation studies to evalu- on these thresholds have proper type I error and power. ate the performance of the analytical and resampling This is true of any genetic map, with or without missing methods for determining genome-wide statistical sig- marker genotypes and dominant markers. The densenificance. We generated oth equally and unequally map approximations are too conservative and thus respaced markers. We also simulated data with missing sult in power loss, while the sparse-map approximations marker genotypes and dominant markers, which are tend to e too lieral. We also assessed the approximamore comparale with real data. We considered one tions y Reai et al. (1994), which turn out to e conservative chromosome with a total length of 100 cm. For the cases when the genetic map is dense. For example, in TABLE 2 Sampling means of the estimators for the QTL location and for the QTL effects at the estimated QTL location in the simulation studies H 0 : 1 0, 2 0 H 1 : , QTL effects QTL effects No. of QTL QTL markers location (cm) 1 2 location (cm) (34.2) (0.141) (0.235) 36.6 (11.3) (0.107) (0.173) (32.8) (0.143) (0.233) 35.8 (10.9) (0.101) (0.162) (31.8) (0.148) (0.251) 35.7 (9.3) (0.098) (0.154) (31.4) (0.150) (0.255) 35.5 (8.9) (0.100) (0.152) Standard errors are shown in parentheses.

5 QTL Mapping With Censored Data 1693 TABLE 3 Analytical and resampling-ased thresholds at the targeted genome-wide significance level of Resampling c Analytical Empirical d H 0 H 1 Dense-map Sparse-map No. of Marker markers pattern a 5% 1% 5% 1% 5% 1% 5% 1% 5% 1% (0.16) (0.25) 9.81 (0.16) (0.25) (0.17) (0.25) (0.16) (0.25) (0.19) (0.27) (0.18) (0.27) (0.16) (0.25) 9.64 (0.17) (0.25) (0.17) (0.25) (0.17) (0.26) (0.19) (0.27) (0.18) (0.26) a Under pattern 1, markers are evenly spaced with no missing marker genotypes or dominant markers; under pattern 2, markers are unevenly spaced with 20% missing marker genotypes and 5% dominant markers. Percentiles of the test statistic ased on 10,000 simulated data sets. c Average thresholds from 10,000 simulated data sets. The values in parentheses are the standard errors of the thresholds. d QTL is located at 35 cm with and the case of 51 markers with 0.05, the sizes are 2.66 and 2.26% for marker patterns 1 and 2, respectively. APPLICATION In this approach, the censored oservations are treated as the true failure times and an average rank is assigned to those oservations. In the two-part approach, Broman considered a cure model in which the mice that are alive at the end of the study are regarded as cured while the survival times among the deaths follow a log-normal distriution. We applied the proposed methods to these data, as- suming a Weiull aseline hazard, and the results are shown in Tale 5 and Figures 1 and 2. The threshold for the LOD score at the 5% genome-wide significance level ased on the resampling approach is 3.43, which is close to 3.27, the threshold otained y permutation for the NP approach of Broman (2003). Our results are fairly consistent with those of Broman (2003). We detect almost the same QTL on chromosomes 5, 13, and 15 except that we detect an additional QTL on chromosome 6 rather than on chromosome 1. The QTL To illustrate our methods, we consider the mice data previously analyzed y Broman (2003). A total of 116 female mice from an intercross etween the BALB/cByJ and C57BL/6ByJ strains were genotyped at 133 markers, including 2 on the X chromosome. The phenotype of interest is the time to death following infection with Listeria monocytogenes. Approximately 30% of the survival times are censored. Broman (2003) proposed a nonparametric (NP) ap- proach and a two-part model. The NP approach is an extension of the Kruskal-Wallis statistic (Lehmann 1975, Sect. 5.2) y assigning a prior weight ( i,g ) to the rank of the ith oservation for each QTL genotype group g. TABLE 4 Sizes/powers (%) according to the analytical and resampling-ased thresholds Analytical Resampling Dense-map Sparse-map No. of Marker H 0 H 1 H 0 H 1 H 0 H 1 markers pattern a 5% 1% 5% 1% 5% 1% 5% 1% 5% 1% 5% 1% a Under pattern 1, markers are evenly spaced with no missing marker genotypes or dominant markers; under pattern 2, markers are unevenly spaced with 20% missing marker genotypes and 5% dominant markers. QTL is located at 35 cm with and

6 1694 G. Diao, D. Lin and F. Zou TABLE 5 Estimates of the QTL positions and QTL effects along with the maximum LOD scores for the data on survival time following infection with Listeria monocytogenes in 116 intercross mice Proposed method Nonparametric Two-part model Chromosome Pos (cm) LOD 1 2 Pos (cm) LOD Pos (cm) LOD Pos, position. on chromosome 5 appears to have a strong additive locations ecause there are two more free parameters in effect and the hazard ratio of the survival time with the two-part model than in the other two methods. This genotype QQ vs. qq is Genotypes qq and Qq at will decrease the power to detect QTL since a larger the QTL on chromosome 6 seem to have similar effects. threshold (i.e., 4.91) is required. To evaluate different The QTL on chromosome 13 appears to have oth methods on a common scale, we converted the LOD additive and dominant effects. The QTL on chromosome curves to the estimated pointwise P-values. Figure 2 dis- 15 appears to have a strong dominant effect. At plays the values of log 10 P for chromosomes 1, 5, 6, 13, most detected QTL locations, our LOD scores are larger and 15. Comparisons with the nonparametric method than those of Broman s NP approach. This suggests that and two-part model reveal that the proposed method our approach may e more efficient in detecting QTL. yields more significant results on the aove chromo- Figure 1 shows the LOD curves from the three methods: somes except chromosome 1. Incidentally, the resampling proposed method, nonparametric method, and two-part method is 100 times faster than the permutation model. The LOD scores from the two-part model are method in this application. larger than those of the other two methods at some QTL To get some ideas aout the adequacy of the Weiull Figure 1. The LOD scores from three QTL mapping methods for the data on survival time following infection with Listeria monocytogenes in 116 intercross mice. The threshold pertains to the 5% genome-wide significance level under the resampling method.

7 QTL Mapping With Censored Data 1695 Figure 2. Plot of the log 10 P-values for three QTL mapping methods for the data on survival time following infection with Listeria monocytogenes in 116 intercross mice. The P-values for the proposed method are ased on 100,000 normal samples. In the region etween 26 and 30 cm on chromosome 5, the P-values are 10 5 and thus are not displayed. The P-values for the NP and two-part models are ased on 11,000 permutation replicates. distriution in descriing the Listeria data, we fitted oth the semiparametric model and the Weiull model at marker D5M357, which is close to the peak of the LOD score on chromosome 5. The estimated QTL effects from the two models are very similar. It would e worthwhile to develop formal goodness-of-fit methods for assessing the adequacy of the parametric survival model at the true QTL location. EXTENSIONS In this section, we extend the single-qtl model to multiple QTL. The approach of interval mapping (IM) considers one putative QTL at a time. The QTL located elsewhere on the genome can have interfering effects, so that the estimators for the locations and effects of QTL may e iased and the power of detecting QTL may e compromised (Lander and Botstein 1989; Haley and Knott 1992; Zeng 1994). Boer et al. (2002) showed that the IM method fails to detect three inter- acting QTL with no main effects through simulation studies. A variety of approaches have een proposed for mapping multiple QTL. These methods can increase the power to detect QTL and reduce iases in the estimators of the QTL effects and locations. In this section, we consider mainly composite-interval mapping (CIM; Jansen 1993; Zeng 1993, 1994) and multiple-interval mapping (MIM; Kao et al. 1999) for censored traits. Composite-interval mapping: The idea of CIM is to comine IM with multiple regression analysis in mapping QTL y conditioning on markers outside a region of interest to account for the effects of other QTL. To extend the original CIM model to censored traits, we consider the following proportional hazards model, (t G i ) 0 (t)exp 1 G i 2 (1 G i ) ( 1k M ik 2k (1 M ik )) kj,j1, (5) where j and j 1 conform to two flanking markers ordering the putative QTL, and M ik is the indicator variale for the marker genotype, which takes values 1, 0, and 1 for genotypes aa, Aa, and AA, respectively. We may further enhance the model y considering the interaction effects etween the putative QTL and controlling markers. Replacing (t G i ) in (2) and (3) with (5), we otain the complete-data and oserved-data likelihood functions, respectively. As in the case of standard interval mapping, we can maximize the oserved-data likelihood directly or apply the EM algorithm to otain the MLEs. We can test H 0 : 0 at any position in the genome. The CIM approach requires that the sample size e large relative to the numer of markers included in the model. In practice, the sample size is generally not very large. Thus, Zeng (1994) suggested including in the model only those markers that are more or less evenly

8 1696 G. Diao, D. Lin and F. Zou spaced in the genome or those preidentified markers plete-data partial-likelihood score function given the that explain most of the genetic variation in the genome. oserved data. The solution to this conditional expecta- This suggestion also applies to our setting. tion is not a maximum partial-likelihood estimator, so Multiple-interval mapping: The MIM approach pro- that the method is not statistically efficient. The Monte posed y Kao et al. (1999) uses multiple marker intervals Carlo simulation is time-consuming. There is no formal simultaneously to fit multiple putative QTL directly in justification for the method of Lipsitz and Irahim the model. Consider K QTL, Q 1,...,Q K, located at d 1, (1998) or that of Symons et al. (2002). Our method is...,d K in the genome. There are 3 K possile QTL computationally much simpler than Monte Carlo simugenotypes. Some of the K QTL may exhiit epistasis. lation. It is ased on the maximum-likelihood estimator We formulate the effects of the K QTL on the failure and is thus statistically efficient. Furthermore, we have time through a proportional hazards model, such that, estalished the theoretical properties of our method conditional on the joint genotype G i (G 1i,...,G Ki ), and assessed its empirical performance through simulathe hazard function of T i takes the form tion studies. The method of Symons et al. (2002) has the advantage that the aseline hazard function is unspecified. (t G i ) 0 (t)exp K j x ij K jk (x T ij B jk x ik ), (6) j1 jk The proposed resampling approach to determining genome-wide significance is applicale to aritrary gewhere x ij (G ij,1 G ij ) T, jk is an indicator variale netic maps and accommodates missing marker data and for epistasis etween Q j and Q k, and j and B jk pertain dominant markers. For the CIM and MIM models, no to the main effects and epistatic effects, respectively. analytical thresholds are availale and the permutation The variale jk indicates, y the values 1 vs. 0, whether method is extremely time-consuming. The proposed or not Q j and Q k interact. Given the marker data M i for resampling approach can e applied to the CIM and the ith suject and assuming no crossover interference, MIM models since all the relevant formulas have een we may calculate i,g (g 1,...,3 K ), the conditional presented for an aritrary likelihood. For the MIM proailities of the 3 K possile genotypes of the K QTL. method, the resampling approach produces appro- The complete-data likelihood takes the same form as priate thresholds for testing each putative QTL given Equation 2 except that the summation of g is now over the others, with or without adjustment for the fact that (1,...,3 K ). To otain the MLEs and LOD scores, we multiple QTL are tested simultaneously. can again apply the EM algorithm. We have written a computer program in C to imple- Since the true numer and locations of the QTL are ment the proposed method. This program is availale unknown, model selection is a critical issue in the MIM from the authors upon request. approach. Kao et al. (1999) suggested stepwise and chunkwise selection with the likelihood-ratio test statistic as a selection criterion to identify QTL, to separate linked QTL, and to analyze epistasis etween QTL. Broman LITERATURE CITED and Speed (2002) developed a modified Bayesian Boer, M. P., C. J. Braak and R. C. Jansen, 2002 A penalized likeli- information criterion (Schwarz 1978) for model selecone-dimensional hood method for mapping epistatic quantitative trait loci with genome searches. Genetics 162: tion. When many QTL are included in the MIM model, Broman, K. W., 2003 Mapping quantitative trait loci in the case of the computation can e very intensive. Sen and Chur- a spike in the phenotype distriution. Genetics 163: chill (2001) reduced the computation urden of MIM Broman, K. W., and T. P. Speed, 2002 A model selection approach for the identification of quantitative trait loci in experimental y replacing the EM algorithm with a Monte Carlo algocrosses (with discussion). J. R. Stat. Soc. B 64: rithm. All these strategies can e applied to our setting. Churchill, G. A., and R. W. Doerge, 1994 Empirical threshold values for quantitative trait mapping. Genetics 138: Cox, D. R., and D. V. Hinkley, 1974 Theoretical Statistics. Chapman & DISCUSSION Hall, London. Davies, R. B., 1977 Hypothesis testing when a nuisance parameter We have descried our methods in the context of an is present only under the alternative. Biometrika 64: Davies, R. B., 1987 Hypothesis testing when a nuisance parameter F 2 population. All the proposed methods can e easily is present only under the alternative. Biometrika 74: generalized to other crosses such as BC, F 3, or even more Dempster, A. P., N. M. Laird and D. B. Ruin, 1977 Maximum complicated crosses such as comined crosses (Zou et likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: al. 2001). We may also extend our methods to accommo- Doerge, R. W., Z-B. Zeng and B. S. Weir, 1997 Statistical issues in date covariates, such as lock factors in an agriculture the search for genes affecting quantitative traits in experimental field trial or cage numer in a mouse cross. populations. Stat. Sci. 12: Symons et al. (2002) formulated the effects of the Dupuis, J., and D. Siegmund, 1999 Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151: QTL on the failure time through the semiparametric proportional hazards model. They utilized a variant of Ferreira, M. E., J. Satagopan, B. S. Yandell, P. H. Williams and the EM algorithm developed y Lipsitz and Irahim T. C. Osorn, 1995 Mapping loci controlling vernalization re- quirement and flower time in Brassica napus. Theor. Appl. Genet. (1998), in which Monte Carlo simulation is used to 90: approximate the conditional expectation of the com- Haley, C. S., and S. A. Knott, 1992 A simple regression method

9 QTL Mapping With Censored Data 1697 for mapping quantitative trait loci in line crosses using flanking Lynch, M., and B. Walsh, 1998 Genetics and Analysis of Quantitative markers. Heredity 69: Traits. Sinauer, Sunderland, MA. Hilert, P., K. Lindpaintner, J. S. Beckmann, T. Serikawa, F. Sou- Reai, A., B. Gofinet and B. Mangin, 1994 Approximate thresholds rier et al., 1991 Chromosomal mapping of two genetic loci of interval mapping test for QTL detection. Genetics 138: 235 associated with lood-pressure regulation in hereditary hyperten sive rats. Nature 353: Schwarz, G., 1978 Estimating the dimension of a model. Ann. Stat. Jaco, H. J., K. Lindpaintner, S. E. Lincoln, K. Kusumi, R. K. Bunker 6: et al., 1991 Genetic mapping of a gene causing hypertension in Sen, S., and G. A. Churchill, 2001 A statistical framework of quanti- the stroke-prone spontaneously hypertensive rat. Cell 67: 213 tative trait mapping. Genetics 159: Shepel, L. A., H. Lan, J. D. Haag, G. M. Brasic, M. E. Gheen et al., Jansen, R. C., 1993 Interval mapping of multiple quantitative trait 1998 Genetic identification of multiple loci that control reast loci. Genetics 135: cancer susceptiility. Genetics 149: Kalfleish, J. D., and R. L. Prentice, 2002 The Statistical Analysis Siegmund, D., 1985 Sequential Analysis: Tests and Confidence Intervals. of Failure Time Data, Ed. 2. Wiley, Hooken, NJ. Springer-Verlag, New York. Kao, C. H., Z-B. Zeng and R. D. Teasdale, 1999 Multiple interval Symons, R. C., M. J. Daly, J. Fridlyand, T. P. Speed, W. D. Cook mapping for quantitative trait loci. Genetics 152: et al., 2002 Multiple genetic loci modify susceptiility to plas- Lander, E. S., and D. Botstein, 1989 Mapping Mendelian factors macytoma-related moridity in E-v-al transgenic mice. Proc. underlying quantitative traits using RFLP linkage maps. Genetics Natl. Acad. Sci. USA 99: : Zeng, Z-B., 1993 Theoretical asis of precision mapping of quantita- Lehmann, E. L., 1975 Nonparametrics: Statistical Methods Based on tive trait loci. Proc. Natl. Acad. Sci. USA 90: Ranks. Holden-Day, San Francisco. Zeng, Z-B., 1994 Precision mapping of quantitative traits loci. Genet- Lin, D. Y., L. J. Wei and Z. Ying, 1993 Checking the Cox model ics 136: with cumulative sums of martingale-ased residuals. Biometrika Zou, F., B. S. Yandell and J. P. Fine, 2001 Statistical issues in the 80: analysis of quantitative traits in comined crosses. Genetics 158: Lipsitz, S. R., and J. G. Irahim, 1998 Estimating equations with incomplete categorical covariates in the Cox model. Biometrics 54: Communicating editor: G. Churchill APPENDIX A: COMPUTATIONS OF MLE AND COVARIANCE MATRIX In this section, we present the formulas for the M-step of the EM algorithm under the Weiull model. In addition, we provide the oserved information matrix as well as a consistent estimator of the covariance matrix of MLE ˆ. In the M-step of the (k 1)th iteration, the first and second derivatives of the expected value of the log-likelihood (k) given the oserved data and current estimate ˆ are given y U (k1) () n p i,g (ˆ (k) )U i,g (), (A1) where I (k1) () n i1 g1,0,1 i1 g1,0,1 ig i,gg U i,g () i Yi i,g i (1 e 2 log Yi ) i,g e, 2 log p i,g (ˆ (k) )I i,g (), i,ggg T i,gg i,ge 2 log Yi g I i,g () i,g g T i,g i,g e 2 log Yi i,g e 2 log Yi g T i,g e 2 log Yi e 2 log Yi { i i,g (1 e 2 log Yi )}, g (g, 1 g ) T, and i,g Y i 0 (t g)dt, which is the cumulative hazard function conditional on the QTL genotype g. Then, we can apply the Newton-Raphson algorithm to update the current estimate with the new maximizer ˆ (k1). The oserved information matrix is given y I() n p i,g ()I i,g () n p i,g (){U i,g ()} 2 n p i,g ()U i,g () 2, i1 g1,0,1 i1 g1,0,1 i1 g1,0,1 where for a column vector a, a 2 denotes the matrix aa T. Thus, a consistent estimator of the covariance matrix of ˆ is given y the inverse of the oserved information matrix evaluated at ˆ, i.e., Cov(ˆ) I 1 (ˆ). (A2) APPENDIX B: ASYMPTOTIC PROPERTIES OF SCORE AND LIKELIHOOD-RATIO STATISTICS In this section, we show that LR(d) is asymptotically 2 2-distriuted under H 0 and provide the necessary ingredients for deriving the thresholds. Let U(; d) and I(; d) e the oserved-data score function and information matrix at location d with the following partitions to conform with the partition (, ) of,

10 1698 G. Diao, D. Lin and F. Zou U(; d) U (; d) U (; d), and I(; d) I (; d) I (; d) I (; d) I (; d). Under H 0, n 1 I( ) converges to (d). Denote (d) 1 (d) (d) (d) (d). It can e shown that LR(d) n 1 U T ( ; d) (d)u ( ; d) asymptotically (Cox and Hinkley 1974, Sect. 9.3). Through Taylor series expansions, n 1/2 U ( ; d) n 1/2 i1ũ n i ( 0 ; d) asymptotically, where Ũ i ( 0 ; d) U,i ( 0 ; d) (d) 1 (d)u,i ( 0 ; d) and 0 (0, 0 ). The replacements of the unknown quantities in Ũ i ( 0 ; d) with their sample estimators yield Û i (d) U,i ( ; d) I ( ; d)i 1 ( ; d)u,i ( ; d). Let z(d) ( (d)) 1/2 (n 1/2 U ( 0 ; d) (d) 1 (d)n 1/2 U ( 0 ; d)). Then z(d) converges to a normal distriution with mean 0 and an identity 2 2 covariance matrix. Thus, LR(d) is asymptotically distriuted as 2 2 under H 0. APPENDIX C: ANALYTICAL APPROXIMATIONS OF THRESHOLDS We show in this appendix that, for infinitely dense markers, the null distriution of LR(d) can e approximated y an Ornstein-Uhleneck process. Under H 0, (d) does not depend on d. Let d 1 and d 2 denote two points on the chromosome, and r e the recomination fraction corresponding to the genetic distance d 1 d 2. Under the assumption of no crossover interference, it is easy to show that the correlation etween g(d 1 ) and g(d 2 ) is given y Corr(g(d 1 ), g(d 2 )) 1 2r r provided that r is small. Since U,i (; d) ( i 0 (Y i ))g i (d) under H 0, we have Corr(z(d 1 ), z(d 2 )) Corr(g(d 1 ), g(d 2 )) 1 2r r. The aove result implies that z 1 (d) and z 2 (d), the first and second components of z(d), are approximately independent Ornstein-Uhleneck processes with means zero and variances 1 2r and 1 4r, respectively. By the arguments of Dupuis and Siegmund (1999), the tail distriution of sup d LR(d) under H 0 satisfies P(sup LR(d) a 2 ) 1 exp{(c 3va 2 L)exp(a 2 /2)}, (C1) d where is the average marker distance (in morgans), C is the numer of chromosomes, L is the total length of the genome (in morgans), and v v(a(6) 1/2 ), the definition of which can e found in Siegmund (1985). When 0, the aove formula reduces to that of Lander and Botstein (1989). These results imply that all the analytical thresholds for the normal trait can e applied to our case.

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations,

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations, Copyright 2004 by the Genetics Society of America DOI: 10.1534/genetics.104.031427 An Efficient Resampling Method for Assessing Genome-Wide Statistical Significance in Mapping Quantitative Trait Loci Fei