A Robust Identity-by-Descent Procedure Using Affected Sib Pairs: Multipoint Mapping for Complex Diseases


Original Paper
Hum Hered 2001;51:64–78
Received: May 1, 1999. Revision received: September 10, 1999. Accepted: October 6, 1999

Kung-Yee Liang (a), Yen-Feng Chiu (a), Terri H. Beaty (b)
Departments of (a) Biostatistics and (b) Epidemiology, School of Hygiene and Public Health, Johns Hopkins University, Baltimore, Md., USA

Key Words: Affected sib pairs · Generalized estimating equations · Identity by descent · Multipoint · Robustness · Sample size and power

Abstract

Multipoint linkage analysis is a powerful tool to localize susceptibility genes for complex diseases. However, the conventional lod score method relies critically on the correct specification of the mode of inheritance for accurate estimation of gene position. On the other hand, allele-sharing methods, as currently practiced, are designed to test the null hypothesis of no linkage rather than to estimate the location of the susceptibility gene(s). In this paper, we propose an identity-by-descent (IBD)-based procedure to estimate the location of an unobserved susceptibility gene within a chromosomal region framed by multiple markers. Here we deal with the practical situation where some of the markers might not be fully informative. Rather, the IBD statistic at an arbitrary locus within the region is imputed using the multipoint marker information. The method is robust in that no assumption about the genetic mechanism is required other than that the region contains no more than one susceptibility gene. In particular, this approach builds upon a simple representation for the expected IBD at any arbitrary locus within the region using data from affected sib pairs. With this representation, one can carry out a parametric inference procedure to locate an unobserved susceptibility gene. In addition, we derive a sample size formula for the number of affected sib pairs needed to detect linkage with multiple markers.
Throughout, the proposed method is illustrated through simulated data. We have implemented this method, including exploratory and formal model-fitting procedures to locate susceptibility genes plus sample size and power calculations, in a program, GENEFINDER, which will be made available shortly.

Copyright © 2000 S. Karger AG, Basel

Correspondence: Dr. Kung-Yee Liang, Department of Biostatistics, School of Hygiene and Public Health, Johns Hopkins University, Baltimore, MD 21205 (USA)

Introduction

Likelihood-based linkage analysis and allele-sharing methods remain the two most commonly used tools to test whether markers with known chromosomal locations are linked to unobservable genes controlling susceptibility to a complex disease. Recent advances in molecular biology have generated dense maps of polymorphic markers which can be used individually or in multipoint linkage analysis (where multiple markers are considered simultaneously) to identify and map susceptibility genes for complex disorders. Logical connections between these two approaches were drawn [e.g., Lander and Kruglyak, 1995; Whittemore, 1996; Kruglyak et al., 1996] and, consequently, a statistical package, GENEHUNTER, was made available for unified multipoint analyses of qualitative traits [Kruglyak et al., 1996].

When there is prior evidence that the region contains a susceptibility gene, it is intuitive that the additional information provided by simultaneously considering multiple genetic markers should yield greater power to pinpoint the unobserved disease locus. However, the parametric lod score approach requires the specification of the mode of inheritance, and it is well known that conclusions regarding the location of susceptibility genes drawn from this approach are sensitive to model misspecification. On the other hand, nonparametric allele-sharing methods were designed to test the null hypothesis that marker alleles shared by pairs (or larger sets) of relatives are independent of any putative disease gene; see Hauser and Boehnke [1998] for an excellent review of these methods. For individual markers, this does not provide specific information about map location or genetic distance. Even for multipoint analysis, these allele-sharing methods remain essentially a test of the null hypothesis of no linkage. The multipoint approach toward testing this null hypothesis does, however, create the temptation to conclude that the map location giving the highest evidence against $H_0$ represents the most likely site for the susceptibility locus. A commonly raised question has been whether or not the map location corresponding to the maximum nonparametric linkage (NPL) score from GENEHUNTER provides direct evidence for the location of the disease gene. The conventional wisdom is that the magnitude of this test statistic depends heavily, among other factors, on sample size.
In the context of allele-sharing methods, the notion of sample size has to do not only with the number of pedigrees (or affected individuals) but also with the informativeness of the individual genetic marker. Thus, a more polymorphic marker may give rise to a larger NPL test statistic value even though it is farther away from the disease locus than closer, less informative markers.

In this paper, we propose a method to estimate the location of an unobserved susceptibility gene when there is preliminary evidence that the chromosomal region framed by multiple markers includes a disease gene. The method is based upon the familiar identity-by-descent (IBD) statistic and hence, unlike the lod score method, it avoids the need to specify the mode of inheritance. Furthermore, the proposed method focuses on estimating the location of the disease gene rather than on testing the null hypothesis. As such, we capitalize upon the extra information provided by multiple markers compared to a single marker.

The paper has the following organization. First, we study the robustness of these IBD statistics for multipoint analysis. Here, a simple representation relating the expected IBD statistic from a single marker to its distance from the disease locus is derived. The robustness reflects the fact that this expression is the same regardless of the true mode of inheritance. On the other hand, by closely examining the key coefficient in this expression, one obtains insight as to why information for locating an unobserved disease gene is reduced in the presence of oligogenic inheritance and/or genetic heterogeneity. Second, motivated by the simple expression noted above, we propose a sequence of IBD-based statistics to approximate the expected IBD when not all markers are fully informative. This approach, which is exploratory in nature, provides a way to identify, by inspection, the interval formed by the flanking markers in the chromosomal region.
Third, a more formal inferential procedure is introduced to estimate the location of the disease gene. In so doing, we draw upon the analogy of this approach to longitudinal data analysis. Fourth, based on our proposed method, we outline how one may compute the sample size needed for multipoint linkage analysis. Here the sample size refers to the number of independent pairs of affected sibs needed to achieve the prespecified statistical power.

Robustness of IBD Statistics

Consider a chromosomal region $R$ of length $T$ centimorgans (cM) which contains no more than one unobserved susceptibility gene at some location $\tau$. For simplicity, we assume for now that the affected sib pair design has been adopted, in which $M$ markers at loci $0 \le t_1 < \dots < t_M \le T$ were genotyped for each individual; see figure 1. Extension to multiple affected relatives (other than full sibs) will be discussed later. Define $S(t)$ as the number of alleles (0, 1 or 2) shared IBD by an affected sib pair at an arbitrary locus $t$, $0 \le t \le T$. The following proposition is crucial for the subsequent development.

Fig. 1. Hypothetical locations of M observed markers and unobserved susceptibility gene in a chromosomal region of T cM.

Proposition 1. Under the conventional assumptions of random mating, linkage equilibrium and generalized single ascertainment [Hodge and Vieland, 1996], the expected number of alleles shared IBD, $S(t)$, has the form

$$\mu(t) = E(S(t) \mid \Phi) = 1 + (2\Psi_{t,\tau} - 1)\,\bigl(E(S(\tau) \mid \Phi) - 1\bigr), \tag{1}$$

where $\Phi$ denotes the event that both siblings are affected (a sampling criterion) and the map distance between $t$ and the location of the true susceptibility gene $\tau$ is

$$\Psi_{t,\tau} = \Psi(\theta_{t,\tau}) = \theta_{t,\tau}^2 + (1 - \theta_{t,\tau})^2, \tag{2}$$

with $\theta_{t,\tau}$ being the recombination fraction between the marker at $t$ and the unobserved disease gene at $\tau$.

The proof of Proposition 1 is given in Appendix 1. Obviously, when the region $R$ is unlinked to the postulated disease gene, $\theta_{t,\tau} = 1/2$, in which case $\mu(t) = 1$, as would be expected. Note that the expression in (1), i.e. $E(S(t) \mid \Phi)$ being linear in $\Psi_{t,\tau}$, holds regardless of the mode of inheritance for the disease. Furthermore:

Remark 1. The expected value of $S(t)$ is strictly decreasing in $|t - \tau|$, the genetic distance between loci $t$ and $\tau$, and attains its maximum value, $E(S(\tau) \mid \Phi)$, at $t = \tau$.

An important statistical implication of this observation is that if $S(t)$ is available for all $t \in [0, T]$, then one can examine the plot of the sample mean $\bar S(t)$ against $t$. The value $\hat t$ at which $\bar S$ reaches the peak of the plot would provide a consistent estimate of $\tau$, the location of the disease locus [e.g. Huber, 1967]. Here consistency corresponds to the ideal situation where the number of available affected sib pairs is sufficiently large.

The coefficient in (1), that is

$$C = E(S(\tau) \mid \Phi) - 1, \tag{3}$$

does depend on the underlying genetic mechanism. In particular:

Model 1 (Single Locus). For a single-locus model in which the disease gene resides within the region $R$, one has

$$C_{SL} = (\lambda_M - 1)/(4\lambda_S), \tag{4}$$

where $\lambda_M$ and $\lambda_S$ are the risk ratios for the MZ twin and a sibling of an affected individual, respectively [Risch, 1990a]; see also Suarez et al. [1978] for a different expression.

Model 2 (Single Locus with Heterogeneity). In the presence of heterogeneity as characterized by the admixture model of Smith [1963], one has

$$C_{SLH} = \alpha\, C_{SL}, \tag{5}$$

where $\alpha$ is the proportion of linked families, i.e. $0 < \alpha < 1$.

Model 3 (Two-Locus Additive Model). When two unlinked susceptibility loci are involved and act additively, formulas (30) and (31) of Risch [1990b] give

$$C_{TLA} = \left(\frac{K_1}{K}\right)^2 \frac{\lambda_{1M} - 1}{4\lambda_S}, \tag{6}$$

where $K$ is the population prevalence, $K_1$ is the prevalence summand for the first locus and $\lambda_{1M}$ is the risk ratio for an MZ twin attributed to locus 1 (the locus linked to the chromosomal region under consideration). This two-locus model can be thought of as a mapping exercise for one locus when a second unlinked locus exists.

Model 4 (Two-Locus Multiplicative Model). When the two unlinked loci operate multiplicatively, formulas (7) and (8) of Risch [1990b] give

$$C_{TLM} = (\lambda_{1M} - 1)/(4\lambda_{1S}). \tag{7}$$

Intuitively, one's ability to estimate $\tau$, in light of Remark 1, depends on the magnitude of $C$, which lies between zero and one. As shown in figure 2, the smaller the $C$, the flatter the plot of $\mu(t)$, the expected IBD statistic, against $t$, which makes it more difficult to distinguish $\mu(t)$ between $\tau$ and adjacent $t$ values. Here we have used Haldane's [1919] mapping function relating $\theta$ to the genetic distance between two loci, i.e.

$$\theta_{t,\tau} = \bigl(1 - e^{-0.02\,|t - \tau|}\bigr)/2. \tag{8}$$
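The shape of the curve described by (1), (2) and (8) can be tabulated with a few lines of code. The following is a minimal Python sketch, not the authors' GENEFINDER implementation; the function names and the values of `tau` and `c` are illustrative assumptions.

```python
import math

def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM, as in equation (8)."""
    return (1.0 - math.exp(-0.02 * abs(distance_cm))) / 2.0

def psi(theta: float) -> float:
    """Equation (2): Psi = theta^2 + (1 - theta)^2."""
    return theta ** 2 + (1.0 - theta) ** 2

def expected_ibd(t: float, tau: float, c: float) -> float:
    """Equation (1): mu(t) = 1 + (2 * Psi - 1) * C."""
    return 1.0 + (2.0 * psi(haldane_theta(t - tau)) - 1.0) * c

# Hypothetical inputs chosen for illustration only.
tau, c = 35.0, 0.25
for t in (0.0, 20.0, 35.0, 50.0, 100.0):
    print(f"t = {t:5.1f} cM   mu(t) = {expected_ibd(t, tau, c):.4f}")
```

Under Haldane's map, $2\Psi_{t,\tau} - 1 = (1 - 2\theta_{t,\tau})^2 = e^{-0.04|t-\tau|}$, so the curve decays exponentially on either side of $\tau$ and its height above 1 scales with $C$.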

Fig. 2. Plot of $\mu(t)$, the expected IBD statistic at locus $t$, versus $t$. The location of the susceptibility gene is at $\tau$ = 35 cM.

This observation is consistent with the conventional wisdom that heterogeneity would in general reduce the power to detect linkage; see (4) and (5). Furthermore, considering single-locus models as a special case of the two-locus model, one can show (see Appendix 2) from (4) and (6) that

$$0 < \alpha^* = \frac{C_{TLA}}{C_{SL}} = \left(\frac{K_1}{K}\right)^2 \frac{\lambda_{1S}}{\lambda_S} < 1. \tag{9}$$

Thus, the power for locating an unobserved susceptibility gene is compromised (as reflected by $\alpha^*$) if the true mechanism is a two-locus additive model. The fact that $C_{TLA}$ can be re-expressed as $\alpha^* C_{SL}$ also suggests that the reduction in power due to the presence of the second additive locus is equivalent to the impact of heterogeneity of magnitude $\alpha^*$. We note that similar observations have been made for likelihood-based linkage analysis [e.g., Goldin, 1992; Vieland et al., 1992; Schork et al., 1993].

Exploratory Analysis for Locating $\tau$

With the expression in (1), i.e. $\mu(t) = 1 + (2\Psi_{t,\tau} - 1) \cdot C$, and Remark 1, a natural question is how one can approximate $\mu(t)$ from the data at hand. Here the investigators observe, for each of $n$ sib pairs, $Y_i = (Y_i(t_1), \dots, Y_i(t_j), \dots, Y_i(t_M))$, where $Y_i(t_j)$ represents the marker information at locus $t_j$, $j = 1, \dots, M$, for the $i$th pedigree. In addition, we denote by $\Phi_i$ the affected status of the sib pair (both are affected in this case) along with the parents' genotypes, if available, $i = 1, \dots, n$. If all individuals were typed for a highly polymorphic marker at locus $t$, IBD sharing can be counted directly and $S_i(t)$ is observed. In this case, one can simply estimate $\mu(t)$ by $\bar S(t) = \sum_{i=1}^{n} S_i(t)/n$, where $S_i(t)$ is the number of alleles shared IBD at locus $t$ by the $i$th sib pair. However, only the marker loci $t_1 < t_2 < \dots < t_M$ are available, and some markers may be less than completely informative about IBD sharing.
In this situation, one needs to consider all possible IBD configurations at locus $t$ consistent with the observed marker information $Y_i$ to make inference about $S_i(t)$. To this end, we propose the following statistic to estimate $\mu(t)$ by imputing $S_i(t)$ given $Y_i$, namely

$$S^*(t) = \sum_{i=1}^{n} S_i(t \mid Y_i)/n, \tag{10}$$

where

$$S_i(t \mid Y_i) = \sum_{l=0}^{2} l \,\Pr(S_i(t) = l \mid Y_i) \tag{11}$$

and

$$\Pr(S_i(t) = l \mid Y_i) = \sum_{l_1=0}^{2} \cdots \sum_{l_M=0}^{2} \bigl\{\Pr(S_i(t) = l \mid S_i(t_1) = l_1, \dots, S_i(t_M) = l_M) \cdot \Pr(S_i(t_1) = l_1, \dots, S_i(t_M) = l_M \mid Y_i)\bigr\}. \tag{12}$$
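Equations (10) and (11) amount to averaging posterior-mean IBD counts. The sketch below assumes the posterior probabilities $\Pr(S_i(t) = l \mid Y_i)$ in (12) have already been produced by a multipoint program; the function names and the three example sib pairs are hypothetical.

```python
from typing import Sequence

def imputed_ibd(posterior: Sequence[float]) -> float:
    """Equation (11): sum over l = 0, 1, 2 of l * Pr(S_i(t) = l | Y_i)."""
    p0, p1, p2 = posterior
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("posterior probabilities must sum to 1")
    return 1.0 * p1 + 2.0 * p2

def s_star(posteriors: Sequence[Sequence[float]]) -> float:
    """Equation (10): average of the imputed IBD counts over the n sib pairs."""
    return sum(imputed_ibd(p) for p in posteriors) / len(posteriors)

# Hypothetical posteriors (Pr(S=0), Pr(S=1), Pr(S=2)) at one locus t.
pairs = [
    (0.00, 0.00, 1.00),  # fully informative: two alleles shared IBD
    (0.25, 0.50, 0.25),  # completely uninformative: prior probabilities
    (0.10, 0.60, 0.30),  # partially informative
]
print(s_star(pairs))
```

When a marker at $t$ is fully informative, each posterior is degenerate and `s_star` reduces to the simple average of observed IBD counts.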

Fig. 3. Plot of $\mu^*(t)$, the expected value of $S^*(t)$ at locus $t$, versus $t$. The flanking markers are at 30 and 50 cM. a $\tau$ = 35 cM. b $\tau$ = 40 cM. c $\tau$ = 45 cM.

Fig. 4. Plot of $\mu^*(t)$, the expected value of $S^*(t)$ at locus $t$, versus $t$. The flanking markers are at 30 and 40 cM; $\tau$ = 35 cM.

The computation of (12) through inheritance vectors has been discussed in detail in Whittemore [1996] and Kruglyak et al. [1996] and has been implemented in the GENEHUNTER program. We note that the first term in (12) involves the recombination fractions among markers at loci $t_1, \dots, t_M$ and $t$, whereas the second term involves the population allele frequencies of the markers at $t_j$, $j = 1, \dots, M$, which are assumed to be known. In the special case that $t = t_j$ for a particular $j$ and the marker at $t_j$ is fully polymorphic, $S_i(t \mid Y_i) = S_i(t)$, in which case $S^*(t) = \bar S(t)$. It is straightforward to see that $S^*(t)$ is an unbiased estimate of

$$\mu^*(t) = E(S(t \mid Y) \mid \Phi). \tag{13}$$

The following proposition examines the connection between $\mu^*(t)$ and $\mu(t)$.

Proposition 2. Assuming, without loss of generality, that $\tau$ is flanked by $t_l$ and $t_{l+1}$ for a particular $l$, $l = 1, \dots, M - 1$, we have

(i) when $t \le t_l$ or $t \ge t_{l+1}$, i.e. $t$ is outside the interval formed by the flanking markers,

$$\mu^*(t) = \mu(t) = 1 + (2\Psi_{t,\tau} - 1) \cdot C;$$

(ii) when $t_l < t < \tau < t_{l+1}$, i.e. $t$ is within the same interval as $\tau$ and is to the left of $\tau$,

$$\mu^*(t) = 1 + C\,(2\Psi_{t,\tau} - 1)\left\{1 - \frac{4\,\Psi_{t_l,t}(1 - \Psi_{t_l,t})\,\Psi_{\tau,t_{l+1}}(1 - \Psi_{\tau,t_{l+1}})}{\Psi_{t_l,t_{l+1}}(1 - \Psi_{t_l,t_{l+1}})}\right\}; \tag{14}$$

(iii) when $t_l < \tau < t < t_{l+1}$,

$$\mu^*(t) = 1 + C\,(2\Psi_{t,\tau} - 1)\left\{1 - \frac{4\,\Psi_{t_l,\tau}(1 - \Psi_{t_l,\tau})\,\Psi_{t,t_{l+1}}(1 - \Psi_{t,t_{l+1}})}{\Psi_{t_l,t_{l+1}}(1 - \Psi_{t_l,t_{l+1}})}\right\}, \tag{15}$$

where $\Psi$ is defined in (2).

A sketch of the proof is given in Appendix 3. This proposition shows that our proposed statistic $S^*(t)$, $0 \le t \le T$, provides, irrespective of the true mode of inheritance, unbiased estimation of $\mu(t)$ for loci outside the flanking interval $(t_l, t_{l+1})$. For an arbitrary point $t$ in the same interval as $\tau$, $S^*(t)$ tends to underestimate $\mu(t)$ because the last term in (14) and (15) is less than 1; see Appendix 3. Figure 3 presents plots of $\mu^*(t)$ against $t$ for selected $C$ values. Here we assumed $\tau$ is flanked by two markers at 30 and 50 cM.
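The behaviour of $\mu^*(t)$ in Proposition 2 can be explored numerically. The sketch below evaluates cases (i) to (iii) under Haldane's map for a single flanking interval; the function names and the value of `c` are assumptions made for illustration, and the flanking positions follow the 30 and 50 cM setting of figure 3.

```python
import math

def psi_between(a: float, b: float) -> float:
    """Psi of equation (2) for loci a and b (in cM) under Haldane's map (8)."""
    theta = (1.0 - math.exp(-0.02 * abs(a - b))) / 2.0
    return theta ** 2 + (1.0 - theta) ** 2

def mu(t: float, tau: float, c: float) -> float:
    """Equation (1)."""
    return 1.0 + (2.0 * psi_between(t, tau) - 1.0) * c

def mu_star(t: float, tau: float, c: float, left: float, right: float) -> float:
    """Proposition 2, assuming left < tau < right are the flanking markers."""
    if t <= left or t >= right:
        return mu(t, tau, c)                     # case (i)

    def g(a: float, b: float) -> float:
        p = psi_between(a, b)
        return p * (1.0 - p)

    if t < tau:                                  # case (ii), equation (14)
        shrink = 4.0 * g(left, t) * g(tau, right) / g(left, right)
    else:                                        # case (iii), equation (15)
        shrink = 4.0 * g(left, tau) * g(t, right) / g(left, right)
    return 1.0 + c * (2.0 * psi_between(t, tau) - 1.0) * (1.0 - shrink)

# Hypothetical C; flanking markers at 30 and 50 cM with tau in the middle.
c, left, right, tau = 0.25, 30.0, 50.0, 40.0
for t in (25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 55.0):
    print(f"t = {t:4.1f}  mu = {mu(t, tau, c):.4f}  mu* = {mu_star(t, tau, c, left, right):.4f}")
```

In this configuration the value of `mu_star` at the midpoint falls below its value at either flanking marker, while `mu` itself peaks at `tau`.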
For loci within the interval (30, 50), $\mu^*(t)$ does not climb from both ends to a peak at $t = \tau$; instead, a volcano shape is created. This volcano would not be symmetric unless $\tau$ were in the middle of the interval, i.e. $\tau$ = 40 cM in this case; see figure 3. Figure 4 demonstrates the benefit of having a denser marker map. Here $\tau$ at 35 cM is flanked by markers at 30 and 40 cM, resulting in a smaller spread of the volcano rim.

Armed with the characteristics of $\mu^*(t)$ stated above, the plot of $S^*(t)$ versus $t$ is useful for locating, by inspection, the flanking interval formed by adjacent markers around a disease locus at $\tau$. To illustrate, we simulated fully informative marker data from a single chromosome of length 100 cM for samples of $n$ = 50, 100 and 200 affected sib pairs. We assumed a single-locus model with incomplete penetrance of 0.9 for genotype Dd and a phenocopy rate of 0.1 for genotype dd. No specification of the penetrance of DD is needed in this simple example, as we further assumed that all siblings were offspring of a Dd × dd mating. This simple model was considered in MacLean et al. [1993], Hodge and Elston [1994] and Liang et al. [1996]. Figures 5a, b give plots of $S^*(t)$ versus $t$ over a 100-cM region with $M$ = 10 and 20 equally spaced markers, respectively. In all cases, the estimated curves $S^*(t)$ resemble the theoretical ones, $\mu^*(t)$, and the true location of the susceptibility gene ($\tau$ = 45 cM when $M$ = 10 and 47.5 cM when $M$ = 20) was demarcated well by the corresponding flanking markers, whose $S^*(t)$ values are higher than those at all other locations. The plots also demonstrate the benefit of denser markers and larger numbers of affected sib pairs, both of which produce smoother curves.

Modelling Approach for Locating $\tau$

The approach suggested earlier is exploratory in nature, as it provides a means to visually identify the map interval that may contain $\tau$.
With multiple markers ($M \ge 2$) typed and the fact that $\mu(t)$ can be characterized parametrically by two parameters, $\tau$ and $C$, a more formal statistical inference for $\tau$ may be warranted.

Remark 2. It is worth reiterating the interpretation of $\tau$ and $C$ before proceeding further. Here $\tau$ is the location of an unobserved susceptibility gene. No assumption has been made as to whether more than one locus is involved in the disease process. The parameter $C$, as defined in (3), is one less than the expected number of alleles shared IBD at locus $\tau$ by an affected sib pair. While estimable, as will be seen below, the magnitude of the estimated $C$, regardless of its precision, would not necessarily reveal a single true genetic mechanism. For example, even if a one-locus model is correct, one cannot rule out the possibility of linkage heterogeneity, as $\alpha$, the proportion of linked families, and $C_{SL}$ are totally confounded with each other; see (5).

Fig. 5. Plots of simulated $S^*(t)$ versus $t$ for $n$ = 50, 100 and 200, respectively. a 10 equally spaced markers and $\tau$ = 45 cM. b 20 equally spaced markers and $\tau$ = 47.5 cM.

To this end, we propose the use of $S_i(t_j \mid Y_i)$, $j = 1, \dots, M$, $i = 1, \dots, n$, as the basis for inference on $\delta = (\tau, C)$. The primary reason for utilizing the proposed statistics $S_i(t \mid Y_i)$ at loci $t_1, \dots, t_M$ only is that, according to Proposition 2,

$$E(S_i(t_j \mid Y_i) \mid \Phi_i) = \mu(t_j). \tag{16}$$

This property, i.e. $S_i(t_j \mid Y_i)$ being unbiased for $\mu(t_j)$, is crucial for the subsequent development. On the other hand, without knowing which of the intervals formed by the $t_j$'s covers $\tau$, one cannot be sure, according to Proposition 2, whether $S_i(t \mid Y_i)$ is unbiased for $\mu(t)$ at an arbitrary locus $t$ (where no marker data are available). We propose to estimate $\delta = (\tau, C)$ by solving the following estimating equations for $\delta$:

$$\sum_{i=1}^{n} \left(\frac{\partial \mu(\delta)}{\partial \delta}\right)' \mathrm{Cov}^{-1}(S_i(Y_i) \mid \Phi_i)\,\bigl(S_i(Y_i) - \mu(\delta)\bigr) = 0, \tag{17}$$

where $S_i(Y_i) = (S_i(t_1 \mid Y_i), \dots, S_i(t_M \mid Y_i))'$ and $\mu(\delta) = (\mu(t_1; \delta), \dots, \mu(t_M; \delta))'$, both of which are $M \times 1$ vectors; the symbol $'$ denotes the transpose of a matrix of arbitrary dimension. Here we have stressed the dependence of $\mu(t_j)$ on $\delta \equiv (\tau, C)$ by re-expressing it as $\mu(t_j; \delta)$. This approach was developed in the context of longitudinal data analysis, where it is known as the generalized estimating equations (GEE) method [Liang and Zeger, 1986]; there $n$ represents the number of individuals and $M$ the number of repeated observations, here $S_i(t_j \mid Y_i)$, $j = 1, \dots, M$, at occasions $t_1, t_2, \dots, t_M$. This method has the desirable property that the derived estimates of $\tau$ and $C$ and their estimated precision are valid so long as (16) holds, which is the case as suggested by Proposition 2.

One minor modification is needed when employing this method (or any method requiring differentiability with respect to the parameters): strictly speaking, $\mu(t)$ in (1) is not differentiable with respect to $\tau$; see the Haldane function in (8). This can be fixed by replacing $|t - \tau|$ in (8) by

$$\begin{cases} |t - \tau| & \text{if } |t - \tau| \ge \varepsilon, \\[4pt] \dfrac{1}{2\varepsilon}(t - \tau)^2 + \dfrac{1}{2}\cdot\varepsilon & \text{if } |t - \tau| < \varepsilon, \end{cases} \tag{18}$$

where $\varepsilon$ is some prespecified positive number. Such a modification is commonly used in robust regression analysis as a means to reduce the impact of potential outliers [e.g. Huber, 1964]. Figure 6 contrasts $\mu(t)$ with the curve obtained when (18) with $\varepsilon$ = 1, instead of $|t - \tau|$, is employed in computing $\Psi(\theta_{t,\tau})$. As expected, the two curves are identical except at loci $t$ within $\varepsilon$ cM of $\tau$, the true location. The difference appears to be negligible and, more importantly, the new curve peaks at $\tau$ as well.

Fig. 6. Plots of $\mu(t)$ and approximated $\mu(t)$ based on equation 18, respectively, versus $t$. Here $C$ = 0.5 and $\tau$ = 35 cM.

Table 1. GEE estimates and their standard errors of $\tau$ and $C$ for the six simulated data sets in figure 5. Columns: number of markers, $M$; number of pedigrees, $n$; true location for $\tau$, cM; estimate ± s.e. of $\tau$; estimate ± s.e. of $C$.

We have applied this GEE method to the six simulated data sets presented in figure 5; the estimates of $\tau$ and $C$ and the corresponding standard error (s.e.) estimates are given in table 1. In all six cases considered, the proposed method provides reliable estimates of $\tau$, the true (but unobserved) location of the susceptibility gene. The s.e. estimates of $\tau$ strongly suggest the benefit, as expected, of having a greater sample size and denser markers. For instance, the variance estimate of $\tau$ reduces from 22.76 [= $(4.77)^2$] for $n$ = 50 to 5.76 [= $(2.40)^2$] for $n$ = 200 with $M$ = 10 markers, a 74% reduction in uncertainty. Meanwhile, a 39.2% reduction in uncertainty is achieved if the number of equally spaced markers increases from 10 to 20 with $n$ = 50. As a side remark, we have also applied the same GEE method with $\varepsilon$ values of 0.5 and 0.1.
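The smoothing in (18) and a simplified version of the fit can be sketched as follows. This is not the GEE procedure of (17): it swaps in an identity working covariance, under which solving the estimating equations corresponds to least squares on the marker-wise means, and it profiles over a grid of candidate $\tau$ values instead of iterating. No standard errors are computed. Function names, the marker grid and the noiseless test data are all hypothetical.

```python
import math

def smooth_abs(d: float, eps: float = 1.0) -> float:
    """Equation (18): differentiable replacement for |d| near zero."""
    if abs(d) >= eps:
        return abs(d)
    return d * d / (2.0 * eps) + eps / 2.0

def mu_smooth(t: float, tau: float, c: float, eps: float = 1.0) -> float:
    """Equation (1) with (18) plugged into Haldane's map.

    Uses 2 * Psi - 1 = exp(-0.04 * distance), which follows from (2) and (8).
    """
    return 1.0 + c * math.exp(-0.04 * smooth_abs(t - tau, eps))

def fit_profile(markers, mean_ibd, grid, eps: float = 1.0):
    """Least-squares fit of (tau, C) to marker-wise mean imputed IBD counts.

    For each candidate tau the best C has a closed form; the pair with the
    smallest residual sum of squares is returned.
    """
    best = None
    for tau in grid:
        x = [math.exp(-0.04 * smooth_abs(t - tau, eps)) for t in markers]
        c = sum(xi * (y - 1.0) for xi, y in zip(x, mean_ibd)) / sum(xi * xi for xi in x)
        rss = sum((y - 1.0 - c * xi) ** 2 for xi, y in zip(x, mean_ibd))
        if best is None or rss < best[0]:
            best = (rss, tau, c)
    return best[1], best[2]

# Hypothetical noiseless data: 10 markers, true tau = 45 cM, true C = 0.25.
markers = [5.0 + 10.0 * j for j in range(10)]
mean_ibd = [mu_smooth(t, 45.0, 0.25) for t in markers]
grid = [0.5 * k for k in range(201)]          # candidate tau: 0, 0.5, ..., 100
tau_hat, c_hat = fit_profile(markers, mean_ibd, grid)
print(tau_hat, c_hat)
```

A full implementation would weight the residuals by the inverse covariance of the $M$ imputed statistics and use a sandwich variance estimate for the standard errors, as in the GEE literature cited above.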
Results with these alternative $\varepsilon$ values are virtually identical and therefore are not reported here. Finally, figure 7 gives plots of $S^*(t)$ and the fitted $\mu(t)$, along with the estimated $\tau$ and $C$ values, versus $t$ for two simulated data sets. These plots present graphical evidence of the complementary information provided by the exploratory and confirmatory approaches.

Fig. 7. Plots of simulated $S^*(t)$ and fitted $\mu(t)$ versus $t$. a $n$ = 100, $M$ = 10 and $\tau$ = 45 cM. b $n$ = 200, $M$ = 20 and $\tau$ = 47.5 cM.

Power to Detect Linkage

Another use of formula (1) in the GEE approach is that the sample size (in terms of the number of affected sib pairs $n$) needed to detect linkage in multipoint analysis can be readily computed. Under the null hypothesis $H_0$ of no linkage between the region and the susceptibility gene, for which $\mu(t) = 1$ at every locus, the estimating function in (17) reduces to

$$L = \sum_{i=1}^{n} \mathbf{1}'\,\mathrm{Cov}^{-1}(S_i(Y_i) \mid \Phi_i; H_0)\,\bigl(S_i(Y_i) - \mathbf{1}\bigr) = \sum_{i=1}^{n} L_i, \tag{19}$$

where $\mathbf{1} = (1, \dots, 1)'$ is an $M \times 1$ matrix. This statistic has the feature of equating $S_i(t_j \mid Y_i)$, which is observable, to its expected value (namely 1) under $H_0$. This suggests that $L$, which combines IBD information across markers and pedigrees, can serve as the basis for testing $H_0$. Specifically, one may test $H_0$ by referring

$$L^* = L \Big/ \left\{\sum_{i=1}^{n} \mathbf{1}'\,\mathrm{Cov}^{-1}(S_i(Y_i) \mid \Phi_i; H_0)\,\mathbf{1}\right\}^{1/2} \tag{20}$$

to the standard normal distribution in a one-sided test. Straightforward derivation gives the following sample size formula with type I error $\gamma$ and type II error $\beta$:

$$n = \frac{\bigl(\sqrt{v_0}\,z_\gamma + \sqrt{v_1}\,z_\beta\bigr)^2}{\bigl\{\mathbf{1}'\,\mathrm{Cov}^{-1}(S_1(Y_1) \mid \Phi; H_0)\,(\mu(\delta) - \mathbf{1})\bigr\}^2}, \tag{21}$$

where $z_\gamma$ denotes the $(1 - \gamma)$th quantile of the standard normal distribution and

$$v_0 = \mathrm{Var}(L_1; H_0) = \mathbf{1}'\,\mathrm{Cov}^{-1}(S_1(Y_1) \mid \Phi; H_0)\,\mathbf{1},$$

$$v_1 = \mathrm{Var}(L_1; H_A) = \mathbf{1}'\,\mathrm{Cov}^{-1}(S_1(Y_1) \mid \Phi; H_0)\,\mathrm{Cov}(S_1(Y_1) \mid \Phi; H_A)\,\mathrm{Cov}^{-1}(S_1(Y_1) \mid \Phi; H_0)\,\mathbf{1}.$$

Explicit expressions for $\mathrm{Cov}(S_1(Y_1) \mid \Phi; H_0)$ and $\mathrm{Cov}(S_1(Y_1) \mid \Phi; H_A)$, both of which are $M \times M$ matrices, are given in Appendix 4. For simplicity, we have assumed, as in Risch [1990b], complete polymorphism for all $M$ markers, so that $S(t_j \mid Y) \equiv S(t_j)$.

Remark 3. The following three pieces of information are needed in order to employ the sample size formula shown in (21):
(i) the number of markers, $M$, along with their relative locations, i.e. $t_1, t_2, \dots, t_M$, in the chromosomal region;
(ii) the postulated location of the targeted susceptibility gene, i.e. $\tau$, which is within the region $R$ formed by the $M$ markers;
(iii) the postulated genetic mechanism via
$$E(S(\tau) \mid \Phi) = 1 + g_2 - g_0, \qquad \mathrm{var}(S(\tau) \mid \Phi) = g_2 + g_0 - (g_2 - g_0)^2,$$
where $g_l = \Pr(S(\tau) = l \mid \Phi)$ for $l = 0, 1, 2$.

Remark 4. The denominator of (21) depends on $\delta$ only through the terms $\mu(t_j) - 1 = C\,(2\Psi_{t_j,\tau} - 1)$, $j = 1, \dots, M$, where $C$ is defined in (3). In light of the comments made after (3), it is our speculation that it is the magnitude of $E(S(\tau) \mid \Phi)$, rather than that of $\mathrm{var}(S(\tau) \mid \Phi)$, which appears in the numerator of (21), that will have the greater impact on the final sample size in any situation. In particular:

Model 1 (single locus)
$$E(S(\tau) \mid \Phi) = 1 + (\lambda_M - 1)/(4\lambda_S) \equiv E, \qquad \mathrm{var}(S(\tau) \mid \Phi) = \frac{-(\lambda_M - 1)^2 + 4\lambda_M\lambda_S + 4\lambda_S}{16\lambda_S^2} \equiv V.$$

Model 2 (single locus with heterogeneity)
$$E(S(\tau) \mid \Phi) = 1 + \alpha(\lambda_M - 1)/(4\lambda_S), \qquad \mathrm{var}(S(\tau) \mid \Phi) = \alpha\bigl(V + (1 - \alpha)(E - 1)^2 - 1/2\bigr) + 1/2.$$

Model 3 (two-locus additive model)
$$E(S(\tau) \mid \Phi) = 1 + \frac{1}{4\lambda_S}\left(\frac{K_1}{K}\right)^2(\lambda_{1M} - 1),$$
$$\mathrm{var}(S(\tau) \mid \Phi) = -\frac{(\lambda_{1M} - 1)^2}{16\lambda_S^2}\left(\frac{K_1}{K}\right)^4 + \frac{\lambda_{1M} - 2\lambda_{1S} + 1}{4\lambda_S}\left(\frac{K_1}{K}\right)^2 + \frac{1}{2}.$$

Model 4 (two-locus multiplicative model)
$$E(S(\tau) \mid \Phi) = 1 + (\lambda_{1M} - 1)/(4\lambda_{1S}), \qquad \mathrm{var}(S(\tau) \mid \Phi) = \frac{-(\lambda_{1M} - 1)^2 + 4\lambda_{1M}\lambda_{1S} + 4\lambda_{1S}}{16\lambda_{1S}^2}.$$

For the most part, the above expressions are derived from formulas provided in Risch [1990b]. Figure 8 shows plots of the sample sizes needed, in log scale, versus $\tau$, the true location of the susceptibility gene. Here the type I and type II errors are taken as 0.0001 and 0.2, respectively, the former corresponding approximately to a lod of 3.
A single-locus model with $\lambda_S = \lambda_O$ [Risch, 1990b] was assumed, so that

$$E(S(\tau) \mid \Phi) = 1 + (\lambda_S - 1)/(2\lambda_S), \qquad \mathrm{Var}(S(\tau) \mid \Phi) = \frac{1}{2} - \frac{(\lambda_S - 1)^2}{4\lambda_S^2}.$$

We consider three cases:
Case I: $M$ = 1 with $t_1$ = 45 cM;
Case II: $M$ = 2 with $t_1$ = 45 cM and $t_2$ = 55 cM;
Case III: $M$ = 4 with $t_1$ = 35 cM, $t_2$ = 45 cM, $t_3$ = 55 cM and $t_4$ = 65 cM.

Several remarks are worth noting. First, whether having more markers helps to reduce the sample size necessary to detect linkage depends heavily on two considerations: (i) whether $\tau$ is within the region spanned by the markers and (ii) whether one of the observed markers is adjacent to $\tau$. For example, when $\tau$ = 46 cM and $\lambda_S$ = 2, fewer affected sib pairs are required in Case I (where a single marker is at $t$ = 45 cM), which requires $n$ = 176, compared to Case II ($n$ = 199) and Case III ($n$ = 65). However, when $\tau$ = 55 cM, one needs 36 pairs to detect linkage to a single marker at $t$ = 45 cM ($\theta \approx 0.09$), as opposed to $n$ = 196 and 6 for Cases II and III, respectively.

Second, one important advantage of having multiple markers for mapping the susceptibility gene is that the sample size is remarkably stable over the range spanned by the markers so long as $\tau$ is within that region; see the numbers quoted above for Cases II and III. This is to be contrasted with the single-marker situation (Case I), in which the logarithm of the sample size increases approximately linearly in $|t - \tau|$. Given that one is never certain about the exact location of $\tau$, the multiple-marker approach provides a conservative yet more robust approach for detecting linkage.

Third, while the advantage of having multiple (or more) markers for detecting linkage (hypothesis testing) is not overwhelming, its advantage for more precisely locating a susceptibility gene (mapping) is rather convincing, as demonstrated in the previous sections.
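For a single marker, formula (21) can be evaluated directly, because with $M$ = 1 the covariance matrices reduce to scalar variances. The sketch below is an illustration under several stated assumptions rather than the authors' implementation: a fully informative marker, Haldane's map (8), the single-locus model with $\lambda_S = \lambda_O$ (so that $\lambda_M = 2\lambda_S - 1$), the standard affected-sib-pair IBD probabilities $\lambda_M/(4\lambda_S)$, $\lambda_O/(2\lambda_S)$ and $1/(4\lambda_S)$ at the disease locus, and a type I error of 0.0001 as the level commonly associated with a lod score of 3. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def single_marker_sample_size(t, tau, lambda_s, type1=1e-4, type2=0.2):
    """Formula (21) with M = 1 under a single-locus model with lambda_S = lambda_O."""
    # IBD distribution at the disease locus for an affected sib pair.
    lambda_m = 2.0 * lambda_s - 1.0          # implied by lambda_S = lambda_O
    g2 = lambda_m / (4.0 * lambda_s)
    g1 = 0.5                                 # lambda_O / (2 * lambda_S)
    g0 = 1.0 / (4.0 * lambda_s)

    # IBD distribution at the marker, via the two-locus transition probabilities.
    theta = (1.0 - math.exp(-0.02 * abs(t - tau))) / 2.0
    psi = theta ** 2 + (1.0 - theta) ** 2
    p2 = psi ** 2 * g2 + psi * (1.0 - psi) * g1 + (1.0 - psi) ** 2 * g0
    p0 = (1.0 - psi) ** 2 * g2 + psi * (1.0 - psi) * g1 + psi ** 2 * g0

    mean_alt = 1.0 + p2 - p0                 # mu(t) under linkage
    var_alt = p2 + p0 - (p2 - p0) ** 2       # Var(S(t)) under linkage
    var_null = 0.5                           # Var(S(t)) when IBD is (1/4, 1/2, 1/4)

    v0 = 1.0 / var_null                      # Var(L_1; H_0)
    v1 = var_alt / var_null ** 2             # Var(L_1; H_A)
    drift = (mean_alt - 1.0) / var_null      # E(L_1; H_A)

    z_gamma = NormalDist().inv_cdf(1.0 - type1)
    z_beta = NormalDist().inv_cdf(1.0 - type2)
    return (math.sqrt(v0) * z_gamma + math.sqrt(v1) * z_beta) ** 2 / drift ** 2

# Case I setting from the text: marker at 45 cM, tau = 46 cM, lambda_S = 2.
print(single_marker_sample_size(45.0, 46.0, 2.0))
```

With these inputs the sketch evaluates to roughly 176 pairs, in line with the Case I figure quoted above. Extending it to several markers would require the $M \times M$ covariance matrices of Appendix 4.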
Thus one should not discount the importance of having multiple and dense markers when estimating the location of $\tau$ is just as critical as, if not more so than, simply testing the hypothesis of linkage to the region.

Fourth, the vertical scales of the three figures match well with the intuition that smaller sample sizes are needed for larger $\lambda_S$ [e.g. Risch, 1990b]. One striking result is the similarity of the patterns shown in figures 8a–c for different $\lambda_S$ values. To further explore this observation, express $n$ as a function of $\lambda_S$, i.e. $n(\lambda_S)$. Figure 9 shows plots of $n(\lambda_S = 2)/n(\lambda_S)$ versus $\lambda_S$, $2 \le \lambda_S \le 10$, for the three cases with $\tau$ = 47 cM; results are very similar for other choices of $\tau$. It suggests that the ratios of sample sizes for $\lambda_S$ = 2 versus other $\lambda_S$ values are very similar regardless of the number of markers and the true $\tau$ value. Furthermore, this ratio is well approximated by

$$\left\{\frac{C(\lambda_S)}{C(\lambda_S = 2)}\right\}^2 = \left\{\frac{(\lambda_S - 1)/(2\lambda_S)}{1/4}\right\}^2 = \frac{4(\lambda_S - 1)^2}{\lambda_S^2},$$

where we have expressed $C = E(S(\tau) \mid \Phi) - 1$, which appears in the denominator of the sample size formula (21), as a function of $\lambda_S$ for single-locus models. Thus the quantity $C = E(S(\tau) \mid \Phi) - 1$ is not only critical for determining the minimum sample size, as reflected by (21), but also provides a meaningful approximation for contrasting sample sizes with different $\lambda_S$ values; see Remark 4.

Fig. 8. Plots of sample size, in log scale, versus $\tau$. Case I: $M$ = 1 and $t_1$ = 45 cM; Case II: $M$ = 2 and $t_1$ = 45 cM and $t_2$ = 55 cM; Case III: $M$ = 4 and $t_1$ = 35 cM, $t_2$ = 45 cM, $t_3$ = 55 cM and $t_4$ = 65 cM. a $\lambda_S$ = 2. b $\lambda_S$ = 5. c $\lambda_S$ = 10.

Fig. 9. Plots of the sample size ratio for $\lambda_S$ = 2 versus other $\lambda_S$ values. Curves shown: Case I, Case II, Case III and the approximation $4(\lambda_S - 1)^2/\lambda_S^2$.

Discussion

In this paper, we propose an IBD-based method to locate an unobserved susceptibility gene when data from multiple markers are available. The main novelty of the proposed work lies in the representation seen in (1), which has the following feature: regardless of the true genetic mechanism, the expected IBD for an affected sib pair at any arbitrary locus $t$ is linear in $\Psi_{t,\tau}$, the distance between it and the true susceptibility locus $\tau$, so long as the region formed by the markers contains no more than one susceptibility gene. Based upon this representation, we developed both an exploratory and a formal model-fitting procedure to locate a susceptibility gene within the chromosomal region of interest. Also presented is the sample size formula (in terms of the number of affected sib pairs) for detecting linkage with multiple markers.
Extension of the proposed method to the situation in which some pedigrees possess three or more affected siblings is straightforward, as one may replace $S_i(t \mid Y_i)$ in (10) by

$$\sum_{l=1}^{m_i} S_{il}(t \mid Y_i)/m_i, \tag{22}$$

where $m_i$ is the number of affected sib pairs in the $i$th pedigree. For designs containing affected relative pairs other than siblings, an expression similar to (1) can be established as well. For example, it is easy to show that for a grandparent-grandchild affected pair, denoted as $\Phi^*$, one has

$$E(S(t) \mid \Phi^*) = \frac{1}{2} + (1 - 2\theta_{t,\tau})\left(E(S(\tau) \mid \Phi^*) - \frac{1}{2}\right) = \frac{1}{2} + (1 - 2\theta_{t,\tau}) \cdot D.$$

Thus the GEE method can be applied to estimate $\tau$, $C$ and $D$ in situations where both affected sib pairs and grandparent-grandchild pairs were sampled. Whether other scoring functions such as $S_{all}$ [Whittemore and Halpern, 1994] may be more efficient in detecting linkage than (22) in more complicated situations, including multiple affected relatives, is still under investigation. It is worth noting that in order to compute these imputed IBD statistics, one needs to assume knowledge of the ordering and distances of the multiple markers and of their allele frequencies.

The proposed work should not be viewed as a competitor to existing methods such as the lod score and NPL methods as implemented in the GENEHUNTER program. Rather, our method implicitly assumes that there is some preliminary evidence of linkage within the chromosomal region. Our main goal is to estimate the map position of a single unobserved susceptibility locus while providing a conventional confidence interval for its map position. The assumption of evidence for linkage can be validated by testing the null hypothesis of no linkage using either the methods noted above or, for example, the test statistics considered in Kruglyak et al. [1996], Kong and Cox [1997] and Teng and Siegmund [1998].
Thus, we view the proposed method as a supplement to the existing methods, with the ultimate goal of locating a susceptibility gene in a robust fashion. Obviously, this approach depends on the mapping function used, and we have considered only Haldane's mapping function here. Further work is needed to explore the impact of this assumption and of conditions such as variable levels of independence across the region of interest, possible gender differences in recombination fractions, and imprinting.

Finally, the proposed work, including the exploratory plots, the GEE method to estimate $\tau$, and the sample size and power calculations, has been implemented in a FORTRAN program, GENEFINDER. This program will be made available through the web site when it is properly documented and tested.

Acknowledgments

This work is supported by NIH grant GM. The authors are grateful to Paul Rathouz and Steve Self for helpful discussions and to Chiung-Yu Huang for computing assistance.

Appendix 1: Proof of Proposition 1

Define $f_l = \Pr(\Phi | S(\tau) = l)/\Pr(\Phi)$, $l = 0, 1, 2$, where $\Phi$ is the event that both siblings are affected. Under the assumption that $R$ contains no more than one susceptibility gene, one has for $k = 0, 1, 2$,

$$\Pr(S(t) = k | \Phi) = \sum_{l=0}^{2} f_l \cdot \Pr(S(t) = k, S(\tau) = l),$$

where the joint distribution of $S(t)$ and $S(\tau)$ has been derived by Haseman and Elston [1972] as a simple function of $\Psi_{t,\tau}$. Consequently, we have

$$\mu(t) = E(S(t)|\Phi) = \tfrac{1}{2} f_2 \Psi_{t,\tau} + \tfrac{1}{2} f_1 + \tfrac{1}{2} f_0 (1 - \Psi_{t,\tau}).$$

But it is straightforward to show that

$$\Pr(S(\tau) = 2|\Phi) = f_2 \cdot \tfrac{1}{4}, \quad \Pr(S(\tau) = 1|\Phi) = f_1 \cdot \tfrac{1}{2}, \quad \Pr(S(\tau) = 0|\Phi) = f_0 \cdot \tfrac{1}{4}.$$

Thus, $\mu(t)$ can be expressed as

$$\begin{aligned}
\mu(t) &= 2\Psi_{t,\tau}\Pr(S(\tau) = 2|\Phi) + \Pr(S(\tau) = 1|\Phi) + 2(1 - \Psi_{t,\tau})\Pr(S(\tau) = 0|\Phi) \\
&= (2\Psi_{t,\tau} - 1)\left[2\Pr(S(\tau) = 2|\Phi) + \Pr(S(\tau) = 1|\Phi)\right] + 2(1 - \Psi_{t,\tau}) \\
&= 1 + (2\Psi_{t,\tau} - 1)(E(S(\tau)|\Phi) - 1).
\end{aligned}$$

Appendix 2: The Inequality in (9)

Using formulas (5) and (14) of Risch [1990a] repeatedly, we have

$$\begin{aligned}
(6) &= \frac{(K_1/K)^2(\lambda_{1M} - 1)}{4\lambda_S}
= \frac{(K_1/K)^2(\lambda_{1M} - 1)}{(K_1/K)^2\left[(\lambda_{1M} - 1) + 2(\lambda_{1O} - 1)\right] + (K_2/K)^2\left[(\lambda_{2M} - 1) + 2(\lambda_{2O} - 1)\right] + 4} \\
&< \frac{(K_1/K)^2(\lambda_{1M} - 1)}{(K_1/K)^2(4\lambda_{1S} - 4) + 4}
< \frac{(K_1/K)^2(\lambda_{1M} - 1)}{(K_1/K)^2 \cdot 4\lambda_{1S}}
= \frac{\lambda_{1M} - 1}{4\lambda_{1S}} = (4).
\end{aligned}$$

Here we have adopted the notation that the single locus assumed in Model 1 corresponds to locus 1 in Model 3, the two-locus additive model. Consequently,

$$0 < \frac{(6)}{(4)} = \left(\frac{K_1}{K}\right)^2 \frac{\lambda_{1S}}{\lambda_S} < 1.$$

Appendix 3: A Sketch of the Proof of Proposition 2

According to (11), it is easy to see that

$$\mu^*(t) = E(S(t|Y)|\Phi) = 1 + E(\Pr(S(t) = 2|Y)|\Phi) - E(\Pr(S(t) = 0|Y)|\Phi).$$

For notational simplicity, we assume here that $\Pr(S(t)|Y)$ follows the Markov chain of order 1 (MC1) assumption, i.e., the distribution of $S(t)$ depends only on the information provided by the flanking markers. For locus $t$, let $t_j$ and $t_{j+1}$ denote the marker loci that flank $t$. Without loss of generality, we further assume that $Y(t_j) = S(t_j)$ and $Y(t_{j+1}) = S(t_{j+1})$, i.e. the markers at loci $t_j$ and $t_{j+1}$ are fully polymorphic. With these assumptions, $\mu^*(t)$ can be reexpressed as

$$\begin{aligned}
\mu^*(t) &= 1 + \sum_{a=0}^{2}\sum_{b=0}^{2}\left[\Pr(S(t) = 2|S(t_j) = a, S(t_{j+1}) = b) - \Pr(S(t) = 0|S(t_j) = a, S(t_{j+1}) = b)\right] \cdot \Pr(S(t_j) = a, S(t_{j+1}) = b|\Phi) \\
&= 1 + \sum_{a=0}^{2}\sum_{b=0}^{2}\left[\Pr(S(t) = 2|a,b) - \Pr(S(t) = 0|a,b)\right]\sum_{l=0}^{2} f_l \Pr(S(\tau) = l, S(t_j) = a, S(t_{j+1}) = b) \\
&= 1 + \sum_{l=0}^{2} f_l \sum_{a=0}^{2}\sum_{b=0}^{2}\left[\Pr(S(t) = 2|a,b) - \Pr(S(t) = 0|a,b)\right] \cdot \Pr(S(\tau) = l, S(t_j) = a, S(t_{j+1}) = b) \\
&= 1 + \sum_{l=0}^{2} f_l B_l, \qquad \text{(A1)}
\end{aligned}$$

where $B_l$ denotes the inner double sum. Here $f_l$ is defined in Appendix 1 as $\Pr(\Phi|S(\tau) = l)/\Pr(\Phi)$, and we have used $\Pr(S(t) = 2|a,b)$ to denote $\Pr(S(t) = 2|S(t_j) = a, S(t_{j+1}) = b)$ for simplicity. We consider three exhaustive and exclusive situations.

Situation I: $\tau$ is outside of $(t_j, t_{j+1})$. This is equivalent to (i) in Proposition 2. We consider only the case that $\tau$ is to the right of $(t_j, t_{j+1})$, i.e. $t_j \le t \le t_{j+1} < \tau$, as the results apply to the other case as well. Applying the MC1 assumption repeatedly, we have

$$\begin{aligned}
B_2 &= \sum_a \sum_b \big\{\Pr(S(\tau) = 2|S(t_{j+1}) = b, S(t) = 2, S(t_j) = a) \cdot \Pr(S(t_{j+1}) = b|S(t) = 2, S(t_j) = a) \cdot \Pr(S(t) = 2, S(t_j) = a) \\
&\qquad - \Pr(S(\tau) = 2|S(t_{j+1}) = b, S(t) = 0, S(t_j) = a) \cdot \Pr(S(t_{j+1}) = b|S(t) = 0, S(t_j) = a) \cdot \Pr(S(t) = 0, S(t_j) = a)\big\} \\
&= \sum_a \sum_b \big\{\Pr(S(\tau) = 2, S(t_{j+1}) = b, S(t) = 2, S(t_j) = a) - \Pr(S(\tau) = 2, S(t_{j+1}) = b, S(t) = 0, S(t_j) = a)\big\} \\
&= \Pr(S(\tau) = 2, S(t) = 2) - \Pr(S(\tau) = 2, S(t) = 0) \\
&= \tfrac{1}{4}\Psi_{t,\tau}^2 - \tfrac{1}{4}(1 - \Psi_{t,\tau})^2 = \tfrac{1}{4}(2\Psi_{t,\tau} - 1).
\end{aligned}$$

Similarly, one can show that

$$B_1 = \Pr(S(\tau) = 1, S(t) = 2) - \Pr(S(\tau) = 1, S(t) = 0) = \tfrac{1}{2}\Psi_{t,\tau}(1 - \Psi_{t,\tau}) - \tfrac{1}{2}\Psi_{t,\tau}(1 - \Psi_{t,\tau}) = 0$$

and

$$B_0 = \Pr(S(\tau) = 0, S(t) = 2) - \Pr(S(\tau) = 0, S(t) = 0) = \tfrac{1}{4}(1 - \Psi_{t,\tau})^2 - \tfrac{1}{4}\Psi_{t,\tau}^2 = \tfrac{1}{4}(1 - 2\Psi_{t,\tau}).$$

Thus, one has

$$\begin{aligned}
\mu^*(t) &= 1 + f_2 \cdot \tfrac{1}{4}(2\Psi_{t,\tau} - 1) + f_1 \cdot 0 + f_0 \cdot \tfrac{1}{4}(1 - 2\Psi_{t,\tau}) \\
&= 1 + (2\Psi_{t,\tau} - 1)\left(\Pr(S(\tau) = 2|\Phi) - \Pr(S(\tau) = 0|\Phi)\right) \\
&= 1 + (2\Psi_{t,\tau} - 1)(E(S(\tau)|\Phi) - 1) = \mu(t).
\end{aligned}$$

Situation II: $t_j \le t \le \tau \le t_{j+1}$. This situation corresponds to (ii) of Proposition 2. Before proceeding, we state the following lemma, which will prove to be useful.

Lemma: For any three loci at $t_1 < t_2 < t_3$,

$$2\Psi_{t_1,t_3} - 1 = (2\Psi_{t_1,t_2} - 1)(2\Psi_{t_2,t_3} - 1).$$

Proof. From (10), one has

$$2\Psi_{t_1,t_3} - 1 = (1 - 2\theta_{t_1,t_3})^2 = e^{-0.04|t_1 - t_3|} = e^{-0.04(|t_1 - t_2| + |t_2 - t_3|)} = (2\Psi_{t_1,t_2} - 1)(2\Psi_{t_2,t_3} - 1).$$

Long and tedious algebraic manipulations give

$$B_2 = -B_0 = \tfrac{1}{4} \cdot (2\Psi_{t,\tau} - 1)\left\{1 - 4\,\Psi_{t_j,t}(1 - \Psi_{t_j,t})\,\frac{\Psi_{\tau,t_{j+1}}(1 - \Psi_{\tau,t_{j+1}})}{\Psi_{t_j,t_{j+1}}(1 - \Psi_{t_j,t_{j+1}})}\right\} \qquad \text{(A2)}$$

and $B_1 = 0$. Denoting the last term in (A2) as $1 - F$, we have from (A1)

$$\mu^*(t) = 1 + \tfrac{1}{4}(f_2 - f_0)(2\Psi_{t,\tau} - 1)(1 - F) = 1 + (E(S(\tau)|\Phi) - 1)(2\Psi_{t,\tau} - 1)(1 - F).$$

To show that $0 \le 1 - F \le 1$, note that $\Psi(1 - \Psi) = \left[1 - (2\Psi - 1)^2\right]/4$ and $0 \le (2\Psi - 1)^2 \le 1$. Thus,

$$\frac{\Psi_{\tau,t_{j+1}}(1 - \Psi_{\tau,t_{j+1}})}{\Psi_{t_j,t_{j+1}}(1 - \Psi_{t_j,t_{j+1}})} = \frac{1 - (2\Psi_{\tau,t_{j+1}} - 1)^2}{1 - (2\Psi_{t_j,t_{j+1}} - 1)^2} = \frac{1 - (2\Psi_{\tau,t_{j+1}} - 1)^2}{1 - (2\Psi_{t_j,\tau} - 1)^2(2\Psi_{\tau,t_{j+1}} - 1)^2} \le 1.$$

Furthermore, $0 \le \Psi(1 - \Psi) \le 1/4$. As a result, $0 \le F < 1$ and this implies that $0 \le 1 - F \le 1$.

Situation III: $t_j \le \tau \le t \le t_{j+1}$. This situation corresponds to (iii) of Proposition 2. The proof is similar to that in situation II and is therefore omitted.

Appendix 4: Expressions for $\mathrm{Cov}(S(Y)|\Phi; H_0)$ and $\mathrm{Cov}(S(Y)|\Phi; H_A)$

As noted in the text, the $j$th component of $S(Y) = (S(t_1|Y), \ldots, S(t_M|Y))$ reduces to $S(t_j)$, i.e. $S(t_j|Y) = S(t_j)$, if the markers are fully polymorphic, as we assume here. Under $H_A$, i.e. $\tau$ is within the region $R$,

$$\mathrm{Var}(S(t_j)|\Phi; H_A) = (2\Psi_{t_j,\tau} - 1)^2\left(\mathrm{Var}(S(\tau)|\Phi) - 1/2\right) + 1/2, \quad j = 1, \ldots, M,$$

and $\mathrm{Cov}(S(t_j), S(t_l)|\Phi; H_A)$, $j < l = 1, \ldots, M$, equals

$$\begin{cases}
(2\Psi_{t_j,t_l} - 1)\,\mathrm{Var}(S(\tau)|\Phi) & \text{if } \tau \in [t_j, t_l], \\
(2\Psi_{t_j,\tau} - 1)(2\Psi_{t_l,\tau} - 1)\left(\mathrm{Var}(S(\tau)|\Phi) - 1/2\right) + \tfrac{1}{2}(2\Psi_{t_j,t_l} - 1) & \text{if } \tau \notin [t_j, t_l].
\end{cases}$$

Under the null hypothesis $H_0$ of no linkage, $\mathrm{Var}(S(\tau)|\Phi) = 1/2$ and consequently

$$\mathrm{Cov}(S(t_j), S(t_l)|\Phi; H_0) = (2\Psi_{t_j,t_l} - 1)/2, \quad j < l = 1, \ldots, M.$$
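The identities in Appendices 1, 3 and 4 can be checked numerically. The sketch below is illustrative only: it assumes Haldane's map function with distances in cM, takes $\Psi = \theta^2 + (1 - \theta)^2$ as in Haseman and Elston, and uses a made-up IBD distribution at the susceptibility locus for affected pairs. All function names are hypothetical.

```python
import math

PRIOR = (0.25, 0.5, 0.25)  # Pr(S = 0), Pr(S = 1), Pr(S = 2) for a random sib pair


def psi(distance_cm):
    """Psi = theta^2 + (1 - theta)^2 under Haldane's map (distance in cM)."""
    theta = 0.5 * (1.0 - math.exp(-0.02 * abs(distance_cm)))
    return theta ** 2 + (1.0 - theta) ** 2


def joint_ibd(p):
    """Haseman-Elston joint distribution Pr(S(t) = k, S(tau) = l), indexed [k][l]."""
    q = 1.0 - p
    return [
        [p * p / 4, p * q / 2, q * q / 4],
        [p * q / 2, (p * p + q * q) / 2, p * q / 2],
        [q * q / 4, p * q / 2, p * p / 4],
    ]


def ibd_given_affected(p, z):
    """Pr(S(t) = k | affected pair) = sum_l f_l * Pr(S(t) = k, S(tau) = l).

    z is the IBD distribution at tau given that both sibs are affected,
    so f_l = z_l / PRIOR[l].
    """
    f = [z[l] / PRIOR[l] for l in range(3)]
    joint = joint_ibd(p)
    return [sum(f[l] * joint[k][l] for l in range(3)) for k in range(3)]


def mean(dist):
    return sum(k * pk for k, pk in enumerate(dist))


def variance(dist):
    m = mean(dist)
    return sum((k - m) ** 2 * pk for k, pk in enumerate(dist))


if __name__ == "__main__":
    z = (0.15, 0.50, 0.35)  # hypothetical distribution of S(tau) for affected pairs
    p = psi(10.0)           # locus t lies 10 cM from tau
    at_t = ibd_given_affected(p, z)
    # Proposition 1: both numbers should agree.
    print(mean(at_t), 1 + (2 * p - 1) * (mean(z) - 1))
    # Appendix 4 variance formula: both numbers should agree.
    print(variance(at_t), (2 * p - 1) ** 2 * (variance(z) - 0.5) + 0.5)
    # Lemma: multiplicativity of 2*Psi - 1 over adjacent intervals.
    print(2 * psi(25.0) - 1, (2 * psi(10.0) - 1) * (2 * psi(15.0) - 1))
```

The variance identity holds because the conditional variance of $S(t)$ given $S(\tau)$ equals $2\Psi(1 - \Psi)$ whatever the value of $S(\tau)$, so only the spread of the conditional mean depends on the disease model.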

References

Goldin LR: Detection of linkage under heterogeneity: Comparison of two-locus vs. admixture models. Genet Epidemiol 1992;9:
Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet 1919;8:
Haseman JK, Elston RC: The investigation of linkage between a quantitative trait and a marker locus. Behav Genet 1972;2:3–19.
Hauser ER, Boehnke M: Genetic linkage analysis of complex genetic traits by using affected sibling pairs. Biometrics 1998;54:
Hodge SE, Elston R: Lods, wrods, and mods: The interpretation of lod scores calculated under different models. Genet Epidemiol 1994;11:
Hodge SE, Vieland VJ: The essence of single ascertainment. Genetics 1996;144:
Huber PJ: Robust estimation of a location parameter. Ann Math Statist 1964;35:
Huber PJ: The behavior of maximum likelihood estimators under non-standard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, vol 1, pp
Kong A, Cox NJ: Allele-sharing models: Lod scores and accurate linkage tests. Am J Hum Genet 1997;61:
Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES: Parametric and nonparametric linkage analysis: A unified multipoint approach. Am J Hum Genet 1996;58:
Lander ES, Kruglyak L: Genetic dissection of complex traits: Guidelines for interpreting and reporting linkage results. Nat Genet 1995;11:
Liang KY, Rathouz PJ, Beaty TH: Determining linkage and mode of inheritance: Mod scores and other methods. Genet Epidemiol 1996;13:
Liang KY, Zeger SL: Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
MacLean CJ, Bishop DT, Sherman SL, Diehl SR: Distribution of lod scores under uncertain mode of inheritance. Am J Hum Genet 1993;52:
Risch N: Linkage strategies for genetically complex traits. I. Multilocus models. Am J Hum Genet 1990a;46:222–228.
Risch N: Linkage strategies for genetically complex traits. II. The power of affected relative pairs. Am J Hum Genet 1990b;46:229–241.
Schork NJ, Boehnke M, Terwilliger JD, Ott J: Two-trait-locus linkage analysis: A powerful strategy for mapping complex genetic traits. Am J Hum Genet 1993;53:
Smith CAB: Testing for heterogeneity of recombination fraction values in human genetics. Ann Hum Genet 1963;27:
Suarez BK, Rice J, Reich T: The generalized sib pair IBD distribution: Its use in the detection of linkage. Ann Hum Genet 1978;42:
Teng J, Siegmund D: Multipoint linkage analysis using affected relative pairs and partially informative markers. Biometrics 1998;54:
Vieland VJ, Hodge SE, Greenberg DA: Adequacy of single-locus approximations for linkage analysis of oligogenic traits. Genet Epidemiol 1992;9:
Whittemore AS: Genome scanning for linkage: An overview. Am J Hum Genet 1996;59:
Whittemore AS, Halpern J: A class of tests for linkage using affected pedigree members. Biometrics 1994;50:
