Mapping quantitative trait loci in oligogenic models


Biostatistics (2001), 2, 2. Printed in Great Britain

HSIU-KHUERN TANG, D. SIEGMUND
Department of Statistics, 390 Serra Mall, Sequoia Hall, Stanford University, Stanford, CA, USA

SUMMARY

We discuss strategies for mapping quantitative trait loci with emphasis on certain issues of study design that have recently received attention: e.g. genotyping only selected pedigrees and the comparative value of large pedigrees versus sib pairs. We use a standard variance components model and a parametrization of the genetic effects in which the segregation parameters are locally orthogonal to the linkage parameters. This permits simple explicit expressions for the expectation of the score statistic, which we use to compare the power of different strategies. We also discuss robustness of the score statistic.

Keywords: Gene mapping; Genome scan; Quantitative trait; Variance components.

1. INTRODUCTION

The goal of genetic mapping is to locate the genes affecting particular traits by analysis of the correlation between phenotypic values and genetic markers distributed throughout the genome. The traits can involve a 0-1 phenotype (e.g. human diseases) or can be based on quantitative measurement. One expects relatives who have similar phenotypes to have similar genotypes at marker loci close to genetic loci that influence the trait, while the markers behave stochastically according to the rules of Mendelian inheritance at distant loci. Until recently, the theory and practice of mapping quantitative trait loci (QTLs) in humans have been relatively undeveloped. The purpose of this paper is to discuss, essentially from first principles, the statistical theory for mapping QTLs in humans under the simplifying assumptions of an oligogenic model of inheritance and completely informative genetic markers.
Although this model has the deficiency of a strong assumption of normality, its interpretability and its computational tractability in the incorporation of covariates, multivariate phenotypes, interactions, etc. are sufficiently attractive to have encouraged substantial recent development (e.g. Amos (1994); Fulker and Cardon (1994); Kruglyak and Lander (1995); Fulker and Cherny (1996); Almasy and Blangero (1998); Page et al. (1998); Williams and Blangero (1999)). We try insofar as possible to give explicit analytic accounts of a number of issues that have usually been treated in the literature by numerical methods or by simulation, especially the relative values of large versus small pedigrees (cf. Almasy and Blangero (1998); Page et al. (1998); Williams and Blangero (1999)) and of genotyping selected versus random sibships (Risch and Zhang, 1995; Eaves and Meyer, 1994). In the process we generalize the notion of a discordant sib pair to a discordant sibship.

In the following section we describe the model to be studied. We discuss sib pairs in Section 3. Sibships of arbitrary size are the subject of Section 4. Selective genotyping is the subject of Section 5. In Section 6 we give a brief discussion of robustness of the components of variance model when a biallelic major gene model is correct. Our main conclusions are discussed in Section 7.

To facilitate our calculations we introduce a parametrization in which the segregation parameters (those that can be estimated from segregation data) and linkage parameters (those requiring data from linked markers for their estimation) are orthogonal under the null hypothesis of no linkage (Cox and Hinkley, 1974, p. 324). We also use the standard asymptotic framework of statistical large sample theory, in which the sample size $N$ is large and the noncentrality parameter for a single observation is inversely proportional to $N^{1/2}$. These techniques allow us to compute score statistics, their asymptotic expectations, and Fisher information matrices comparatively explicitly.

To whom correspondence should be addressed. © Oxford University Press (2001)

2. MODEL

We assume Hardy-Weinberg equilibrium throughout. Suppose there is a QTL at the locus $\tau$, which is an unknown parameter. The phenotypic value $Y$ is assumed to be given by

$$Y = \mu + \alpha_U + \alpha_V + \delta_{U,V} + e. \tag{2.1}$$

The mean value $\mu$ may be expanded as a linear model with only minor changes to what follows. The parameter $\alpha_a$ denotes the additive genetic effect of allele $a$ at locus $\tau$, and $\delta_{a,b}$ the dominance deviation of alleles $a, b$. A subscript $U$ denotes the allele contributed by the mother, while a subscript $V$ refers to the father. The phenotypic variance is $\sigma_Y^2 = E[(Y - \mu)^2]$. The variances of the additive and dominance effects associated with the QTL at $\tau$ are $\sigma_A^2 = 2E\alpha_U^2$ and $\sigma_D^2 = E\delta_{U,V}^2$. Implicitly we expect that there are several QTLs, which may interact. For this paper we assume that other QTLs are on other chromosomes, are in linkage equilibrium with the QTL at $\tau$, and do not interact with it. Under these assumptions their contribution to the phenotype $Y$ can be assumed to be a part of the residual term $e$, which we assume is uncorrelated with the other terms in (2.1) and has variance $\sigma_e^2$.
Then $\sigma_Y^2 = \sigma_A^2 + \sigma_D^2 + \sigma_e^2$. The locus-specific heritability associated with the QTL at $\tau$ is $h^2 = (\sigma_A^2 + \sigma_D^2)/\sigma_Y^2$. The term "oligogenic" in the title is intended to suggest that at least one QTL has a substantial locus-specific heritability, say $h^2 > 0.1$, and hence might be detectable by linkage analysis with reasonable sample sizes.

In order to have some idea of the magnitudes of different components of variance and their relations to more directly interpretable genetic parameters, we shall occasionally be interested in the special case of two alleles $A_1$ and $A_2$ with frequencies $p$ and $q = 1 - p$. Using $a$ and $-a$ for the genotypic values of the homozygotes $A_1A_1$ and $A_2A_2$, respectively, and $d$ for the dominance deviation of the heterozygote, we have the standard formulas:

$$\sigma_A^2 = 2pq[a + (q - p)d]^2, \qquad \sigma_D^2 = 4p^2q^2d^2. \tag{2.2}$$

Consider a pair of siblings satisfying the model (2.1). Recall that at any locus two relatives share alleles identical by descent if they inherit the same alleles from a common ancestor. Two siblings can share 2, 1 or 0 alleles identical by descent depending on whether they inherit the same alleles from both mother and father, from one but not both, or from neither. Let $\nu = \nu(\tau)$ denote the number of alleles identical by descent at $\tau$. Letting $Y_i$ denote the phenotypic value of the $i$th sibling ($i = 1, 2$), we have (Fisher, 1918; Kempthorne, 1955)

$$\mathrm{Cov}[Y_1, Y_2 \mid \nu] = \sigma_e^2 r + \sigma_A^2\,\nu/2 + \sigma_D^2\,1_{\{\nu = 2\}}, \tag{2.3}$$

where $r = \mathrm{corr}(e_1, e_2)$ accounts for the correlation between sibs arising from other QTLs and from a shared environment.
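As a small numerical illustration of (2.2) and (2.3), the following sketch computes the two variance components for a biallelic locus and checks them against a direct enumeration of the three genotypes under Hardy-Weinberg equilibrium. The function names and the example parameter values are illustrative choices, not part of the paper.

```python
def biallelic_variance_components(a, d, p):
    """Additive and dominance variances from display (2.2)."""
    q = 1.0 - p
    sigma2_a = 2.0 * p * q * (a + (q - p) * d) ** 2
    sigma2_d = 4.0 * p ** 2 * q ** 2 * d ** 2
    return sigma2_a, sigma2_d


def genotypic_variance(a, d, p):
    """Variance of the genotypic value, enumerating A1A1, A1A2, A2A2."""
    q = 1.0 - p
    freqs = (p * p, 2.0 * p * q, q * q)  # Hardy-Weinberg proportions
    values = (a, d, -a)
    mean = sum(f * v for f, v in zip(freqs, values))
    return sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))


def sib_covariance_given_ibd(nu, sigma2_a, sigma2_d, sigma2_e, r):
    """Conditional covariance (2.3) for nu = 0, 1 or 2 alleles shared IBD."""
    dominance_term = sigma2_d if nu == 2 else 0.0
    return sigma2_e * r + sigma2_a * nu / 2.0 + dominance_term


# Hypothetical example values: a = 1, d = 0.5, allele frequency p = 0.3.
sigma2_a, sigma2_d = biallelic_variance_components(1.0, 0.5, 0.3)
print(sigma2_a, sigma2_d, genotypic_variance(1.0, 0.5, 0.3))
```

The additive and dominance components should sum to the enumerated genotypic variance, and averaging (2.3) over the IBD probabilities 1/4, 1/2, 1/4 gives the unconditional sib covariance.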

Taking the expectation of (2.3), and using $E\nu = 1$ and $P(\nu = 2) = 1/4$, we find the unconditional covariance $\mathrm{Cov}(Y_1, Y_2) = \sigma_e^2 r + \sigma_A^2/2 + \sigma_D^2/4$. Hence we can rewrite (2.3) as

$$\mathrm{Cov}[Y_1, Y_2 \mid \nu] = \mathrm{Cov}(Y_1, Y_2) + [(\sigma_A^2 + \sigma_D^2)/2](\nu - 1) - (\sigma_D^2/2)[1_{\{\nu = 1\}} - \tfrac{1}{2}], \tag{2.4}$$

so the terms involving $\nu$ have mean 0 and are uncorrelated.

In the following we will be interested in marker loci $t$ distributed throughout the genome and $\nu(t)$ for a sib pair as a stochastic process in $t$. For markers $t_1$ and $t_2$ on different chromosomes, $\nu(t_1)$ and $\nu(t_2)$ are stochastically independent. For markers on the same chromosome $\mathrm{Cov}[\nu(t_1), \nu(t_2)] = 2^{-1}[1 - 2\varphi]$, where $\varphi$ is a function of the recombination frequency. We find it convenient to assume that recombination follows the Haldane model of no interference, so $1 - 2\varphi = \exp(-4\lambda|t_1 - t_2|)$. Here the marker location $t$ denotes genetic distance in centimorgans (cM) from a designated end of the chromosome, and $\lambda = 0.01$/cM.

3. LIKELIHOOD THEORY FOR SIB PAIRS

In this section we consider genome scans to detect the QTL that is explicitly modelled in (2.1), although we acknowledge that several QTLs may contribute to the trait. Since the location $\tau$ of the QTL is unknown, our statistic takes the form of a stochastic process indexed by markers at loci $t$ distributed throughout the genome. Large values of this process indicate that a QTL is likely to be located near the markers where the large values occur. Initially we consider $N$ pairs of siblings, and for simplicity we assume that markers are completely informative, so the values of $\nu(t)$ are known with certainty.

Consider a sample of $N$ sib pairs with phenotypes $(Y_{11}, Y_{12}), \dots, (Y_{N1}, Y_{N2})$ and data $\nu_1(t_i), \dots, \nu_N(t_i)$ from markers $t_i$ distributed throughout the genome. Our basic modelling assumption is that the $(Y_{n1}, Y_{n2})$ are independent with a common distribution, which conditional on $\nu = \nu_n(\tau)$ is bivariate normal with common means $\mu$, variances $\sigma_Y^2$, and correlation (cf. (2.4))

$$\rho_\nu = \rho + \{\alpha_0(\nu - 1) - \delta_0[1_{\{\nu = 1\}} - \tfrac{1}{2}]\}/\sigma_Y^2, \tag{3.1}$$

where

$$\alpha_0 = [\sigma_A^2 + \sigma_D^2]/2, \qquad \delta_0 = \sigma_D^2/2. \tag{3.2}$$

It is convenient to define $D_n = (Y_{n1} - Y_{n2})/2^{1/2}$ and $S_n = (Y_{n1} + Y_{n2} - 2\mu)/2^{1/2}$. For simplicity we assume initially that the parameters $\mu$ and $\sigma_Y^2$ are known and equal to 0 and 1, respectively. As we show below, this has no effect on the asymptotic theory. We also make the working assumption that the QTL $\tau$ is one of the $t_i$ (cf. Remark (iii) below), so the marginal log likelihood function, $l = l(\tau, \alpha_0, \delta_0, \rho)$, is given by

$$l = -2^{-1}\sum_n [\log(1 - \rho_{\nu_n}^2) + D_n^2/(1 - \rho_{\nu_n}) + S_n^2/(1 + \rho_{\nu_n})]. \tag{3.3}$$

Here $\nu_n = \nu_n(\tau)$, and $\rho_\nu$ is given by (3.1). This can be regarded as the conditional likelihood of the phenotypic data given $(\nu_n, n = 1, \dots, N)$ or as the unconditional joint log likelihood of the phenotypic data and the $(\nu_n)$. All expectations are taken with respect to the joint distribution. Partial derivatives with respect to unknown parameters are denoted by appropriate subscripts. Let $C_n = C_n(\alpha_0, \delta_0, \rho)$ be defined by

$$C_n = \rho_{\nu_n}/(1 - \rho_{\nu_n}^2) + S_n^2/2(1 + \rho_{\nu_n})^2 - D_n^2/2(1 - \rho_{\nu_n})^2. \tag{3.4}$$
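The quantity $C_n$ in (3.4) is the derivative of a single pair's contribution to (3.3) with respect to its correlation $\rho_\nu$. A minimal sketch, assuming $\mu = 0$ and $\sigma_Y^2 = 1$ as in the text, checks this by finite differences; the function names and the numerical values of $D$, $S$ and $\rho$ are illustrative only.

```python
import math


def rho_nu(nu, rho, alpha0, delta0, sigma2_y=1.0):
    """Conditional sib correlation (3.1) given nu alleles shared IBD."""
    indicator = 1.0 if nu == 1 else 0.0
    return rho + (alpha0 * (nu - 1) - delta0 * (indicator - 0.5)) / sigma2_y


def pair_loglik(d, s, rho_v):
    """One sib pair's term in the log likelihood (3.3)."""
    return -0.5 * (math.log(1.0 - rho_v ** 2)
                   + d * d / (1.0 - rho_v)
                   + s * s / (1.0 + rho_v))


def c_term(d, s, rho_v):
    """C_n from (3.4): derivative of pair_loglik with respect to rho_v."""
    return (rho_v / (1.0 - rho_v ** 2)
            + s * s / (2.0 * (1.0 + rho_v) ** 2)
            - d * d / (2.0 * (1.0 - rho_v) ** 2))


# Finite-difference check at arbitrary illustrative values.
d, s, rho_v, h = 0.7, -1.2, 0.3, 1e-6
numeric = (pair_loglik(d, s, rho_v + h) - pair_loglik(d, s, rho_v - h)) / (2 * h)
print(numeric, c_term(d, s, rho_v))
```

Because $\partial\rho_\nu/\partial\alpha_0 = \nu - 1$ when $\sigma_Y^2 = 1$, multiplying `c_term` by $\nu_n - 1$ and summing over pairs gives the additive score component.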

Regarding $\tau$ momentarily as known, we obtain the components of the efficient score:

$$l_\alpha(\tau) = \sum_n [\nu_n(\tau) - 1]\,C_n, \tag{3.5}$$

$$l_\delta(\tau) = \sum_n [\tfrac{1}{2} - 1_{\{\nu_n(\tau) = 1\}}]\,C_n, \tag{3.6}$$

and

$$l_\rho = \sum_n C_n. \tag{3.7}$$

When $\alpha_0 = 0$ (hence also $\delta_0 = 0$), the three coordinates of the score vector are uncorrelated, so the Fisher information matrix is diagonal, with easily computed entries.

When it is assumed that there is no dominance variance, so $\delta_0 = 0$, the score statistic for testing $\alpha_0 = 0$ at a putative QTL $\tau = t$ is

$$Z_1(t) = l_\alpha(t)/I_{\alpha\alpha}^{1/2}, \tag{3.8}$$

where now $C_n = C_n(0, 0, \hat\rho)$ and $\hat\rho = N^{-1}[\sum_n Y_{1,n}Y_{2,n}]^+$ is the maximum likelihood estimator under the null hypothesis. Since the true value of $\tau$ is unknown, to scan an entire genome for linkage, we can use

$$\max_t Z_1(t), \tag{3.9}$$

where the max is taken over all marker loci. Linkage is detected whenever $\max_t Z_1(t) \ge b$ for a suitable threshold $b$. Thresholds to control the false detection rate have been discussed by Feingold et al. (1993) for markers equally spaced at distance $\Delta$. See also Lander and Kruglyak (1995) and Lander and Schork (1994).

To allow for the possibility that $\delta_0 > 0$, we can define

$$Z_2(t) = l_\delta(t)/I_{\delta\delta}^{1/2} \tag{3.10}$$

and use the two degree of freedom statistic

$$[Z_1^2(t) + Z_2^2(t)]^{1/2}, \tag{3.11}$$

constrained by the relation that $0 \le \delta_0 \le \alpha_0$ (cf. (3.2)). Dupuis and Siegmund (2000) discuss this possibility in the context of sib pair analysis of qualitative traits.

REMARKS

(i) For unknown $\sigma_Y^2$, $\mu$, and $\rho$ one uses $\hat\mu = (2N)^{-1}\sum_n (Y_{1,n} + Y_{2,n})$, $\hat\sigma_Y^2 = (2N)^{-1}\sum_n [(Y_{1,n} - \hat\mu)^2 + (Y_{2,n} - \hat\mu)^2]$ and $\hat\rho = N^{-1}[\sum_n (Y_{1,n} - \hat\mu)(Y_{2,n} - \hat\mu)]^+/\hat\sigma_Y^2$. The asymptotic results are unchanged since the scores for the segregation parameters $\sigma_Y^2, \mu, \rho$ are uncorrelated with those for the linkage parameters $\alpha_0, \delta_0$ under $\alpha_0 = \delta_0 = 0$ and asymptotically for local alternatives.
(ii) The components of variance analysis uses the relation for normally distributed $X$ with mean 0 that the variance of $X^2$ is twice the square of the variance of $X$, and hence it is not robust to violations of the assumed normality. A simple device to obtain a more robust test would be to refer the statistic given above to its conditional distribution given $(C_1, \dots, C_N)$, where $C_n = C_n(0, 0, \hat\rho)$. This would make the type I error probability nonparametric with respect to the distribution of the phenotypes while maintaining full asymptotic efficiency if the normality hypothesis is satisfied. Since $\nu_n - 1$ has variance $1/2$ and $1_{\{\nu_n = 1\}}$ has variance $1/4$ under the null hypothesis, asymptotically it amounts to replacing (3.8) and (3.10) by

$$Z_1(t) = l_\alpha(t)\Big/\Big[\sum_n C_n^2/2\Big]^{1/2}, \qquad Z_2(t) = l_\delta(t)\Big/\Big[\sum_n C_n^2/4\Big]^{1/2}, \tag{3.12}$$

which has the effect of using fourth moments of the phenotypic values to estimate variability.

(iii) We have assumed completely informative markers in order to simplify the analysis and have made the working assumption that the QTL $\tau$ is one of the markers. If either of these assumptions fails to be true, the likelihood function involves a mixture based on the conditional distribution of $\nu_n(\tau)$ given the marker data, say $M_n$, in the $n$th family. A convenient representation for the likelihood function is

$$E_0[\exp(l(\tau, \alpha_0, \delta_0, \rho) - l(0, 0, \rho)) \mid M, Y], \tag{3.13}$$

where $M = (M_1, \dots, M_N)$, $Y = (Y_{11}, Y_{21}, \dots, Y_{1N}, Y_{2N})$, $l$ is given by (3.3), and the subscript 0 denotes that the expectation is computed under the assumption that $\alpha_0 = \delta_0 = 0$. For the case of partially informative markers, on the evidence of simulations, Fulker and Cherny (1996) report very good success with the test to detect an additive effect that simply replaces $\nu_n$ in (3.3) by $\hat\nu_n = E_0(\nu_n \mid M_n, Y_n)$. Equation (3.5) provides an explanation for the Fulker-Cherny results, since the efficient score for testing $\alpha_0 = 0$ is precisely (3.5) with $C_n = C_n(0, 0, \hat\rho)$ and with $\nu_n$ replaced by $\hat\nu_n$. To prove this, differentiate the logarithm of (3.13) with respect to $\alpha_0$ and evaluate the result at $\alpha_0 = \delta_0 = 0$. See Teng and Siegmund (1998) for a discussion of the impact of marker informativeness and intermarker distance on the power to detect linkage. The same calculation gives the efficient score when the QTL $\tau$ is located between markers. In principle one might extend the max in (3.9) from marker loci to all loci $t$, but in most cases there seems to be very little power gained by this device (cf. Darvasi et al. (1993); Dupuis and Siegmund (1999)), so we do not pursue the more complicated analysis.

Display (4.8) derived below gives an expression for the noncentrality $\xi = E[Z_1(\tau)]$. At a marker $t$ linked to $\tau$, $E[Z_1(t)] = \xi\exp[-4\lambda|t - \tau|]$. These relations are illustrated in Figure 1, where $\xi > b$, so the probability of detection is large. The asymptotic squared noncentrality of (3.11) at the QTL $\tau$ is given by (4.8) with $\alpha_0^2/2$ replaced by $\alpha_0^2/2 + \delta_0^2/4$.

Fig. 1. Expected value of the score statistic: QTL location is $\tau$, expectation at $\tau$ is $\xi$, detection threshold is $b$, and the plotted symbols denote simulated values at equally spaced marker loci.

In Table 1 we have used (4.8) to evaluate the sample size necessary to achieve 90% power for the score statistic (3.8) and various values of $\alpha_0$ and $\rho$. We consider a genome scan at 1 cM and assume a genome of 23 chromosomes of average length 140 cM. This yields a detection threshold of $b \approx 3.91$ (cf. Feingold et al. (1993) or the Appendix). We assume the trait locus is at zero recombination distance from the nearest marker. For 90% power the approximation of Feingold et al. (1993), given in the Appendix, leads to the value of $N$ that makes the squared noncentrality parameter given by (4.8) about equal to 25. The required sample size would increase by about 2% if the QTL is midway between markers. We have also included the sample size required for 90% power when markers are spaced at 10 cM and the QTL is midway between markers. The detection threshold decreases to approximately $b = 3.6$; but for a QTL midway between markers, we need an approximately 16-18% increase in sample size compared to the 1 cM spacing. Simulations indicate that these approximations are very accurate.

Table 1. Number of sib pairs for 90% power. The overall correlation between two siblings is $\rho$, and the locus-specific heritability is $2\alpha_0/\sigma_Y^2$. For $\Delta = 1$, the QTL is assumed to lie at a marker; a 2% larger sample size is required for a QTL midway between two flanking markers. For $\Delta = 10$, the QTL is assumed to lie midway between two markers. Columns: $\rho$; $\alpha_0/\sigma_Y^2$; $N(\Delta = 1)$; $N(\Delta = 10)$. [Table entries are missing from this copy.]

An interesting comparison is available from Page et al. (1998), who have estimated the sample size required to have a probability of 0.9 for the LOD score evaluated at a single marker to exceed the conventional level of 3, or equivalently $b = 3.72$ on the normal scale, which we are using here. Although this is in principle quite different from the situation that arises when multiple, closely spaced markers are used, in the special case of 1 cM spacing and the QTL exactly at a marker, by a numerical accident their definition also leads to the value of $N$ that makes the squared noncentrality parameter equal to $(3.72 + 1.28)^2 = 25$. Hence our analytic approximations can be compared with the results in Table 4 of Page et al. (1998), which were based on simulations of an additive model with one biallelic major gene and a polygenic component. The agreement is generally excellent.
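The rule of thumb just described (choose $N$ so that the squared noncentrality is about 25) can be sketched numerically as follows. The sketch assumes the closed form of (4.8) derived in Section 4 and the decay $E[Z_1(t)] = \xi\exp[-4\lambda|t - \tau|]$; it implements only this first-order criterion, not the refined approximation used for the tables, whose entries are not reproduced here. Function names and the example inputs are illustrative.

```python
import math


def noncentrality_sq(n, alpha0, rho, s=2, sigma2_y=1.0):
    """Squared asymptotic noncentrality (4.8) for n sibships of size s."""
    pairs = s * (s - 1) / 2.0
    numerator = (1.0 + (s - 2) * rho) ** 2 + rho ** 2
    denominator = ((1.0 - rho) * (1.0 + (s - 1) * rho)) ** 2
    return n * alpha0 ** 2 / (2.0 * sigma2_y ** 2) * pairs * numerator / denominator


def families_needed(alpha0, rho, s=2, target_xi_sq=25.0, sigma2_y=1.0):
    """Smallest n whose squared noncentrality reaches target_xi_sq."""
    per_family = noncentrality_sq(1, alpha0, rho, s, sigma2_y)
    return math.ceil(target_xi_sq / per_family)


def expected_score(xi, t, tau, lam=0.01):
    """Expected score at marker t (cM) for a QTL at tau, Haldane map function."""
    return xi * math.exp(-4.0 * lam * abs(t - tau))


# Hypothetical inputs: alpha0 / sigma_Y^2 = 0.1, sib correlation 0.25.
print(families_needed(alpha0=0.1, rho=0.25))
print(expected_score(xi=5.0, t=0.0, tau=5.0))
```

Setting `target_xi_sq` to $(b + 1.28)^2$ for another threshold $b$ gives the analogous single-marker calculation.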
If the true mode of inheritance is roughly additive, the two-dimensional statistic based on (3.11), which under the conditions given above requires a threshold of about $b = 4.11$ (Dupuis and Siegmund, 2000), has less power than the corresponding one-dimensional statistic. Some numerical experimentation indicates that (3.11) is more efficient than (3.8) only for a fairly rare recessive allele, which must have a large effect for the QTL to be detectable with a reasonable sample size. Hence the simpler one-dimensional statistic would seem adequate in most situations.

4. LIKELIHOOD THEORY FOR SIBSHIPS OF ARBITRARY SIZE

Starting with Blackwelder and Elston (1982), a number of authors have observed on the basis of simulations that sibships of size $s$ provide considerably more power than sib pairs, perhaps as much as $s(s - 1)/2$ independent sib pairs. See also Page et al. (1998); Williams and Blangero (1999). Suppose we have a sample of $N$ sibships, each of size $s$. We index sibs within a sibship by $i$ and $j$ and sibships by $n = 1, \dots, N$. The subscript $n$ is often suppressed in our notation. For ease of exposition we again assume that $\mu = 0$ and $\sigma_Y^2 = 1$. Let $\nu_{ij}(t)$ denote the number of alleles shared identical by descent at the marker locus $t$ by the $i$th and $j$th sibs in the $n$th sibship. Let $A_\nu$ denote the $s \times s$ matrix with entries $\nu_{ij} - 1$ for $i \ne j$ and zeros along the diagonal. Let $\Sigma_\nu = E(YY' \mid A_\nu)$. The log likelihood for a single QTL at $\tau$ is $l = l(\tau, \alpha_0, \delta_0, \rho)$ given by

$$l = -2^{-1}\sum_{n=1}^N \{\log|\Sigma_\nu| + \mathrm{tr}(\Sigma_\nu^{-1}YY')\}, \tag{4.1}$$

where $\nu = \nu(\tau)$. Recall that if $G$ is a nonsingular matrix depending on $x$, then $\partial\log|G|/\partial x = \mathrm{tr}(G^{-1}\partial G/\partial x)$ and $\partial G^{-1}/\partial x = -G^{-1}(\partial G/\partial x)G^{-1}$. By differentiation of (4.1) we obtain the score equations

$$l_\alpha = 2^{-1}\sum_n \{-\mathrm{tr}(\Sigma_\nu^{-1}A_\nu) + \mathrm{tr}(\Sigma_\nu^{-1}A_\nu\Sigma_\nu^{-1}YY')\} \tag{4.2}$$

and

$$l_\rho = 2^{-1}\sum_n \{-\mathrm{tr}(\Sigma_\nu^{-1}B) + \mathrm{tr}(\Sigma_\nu^{-1}B\Sigma_\nu^{-1}YY')\}, \tag{4.3}$$

where $B = \partial\Sigma_\nu/\partial\rho = 11' - I$. We omit the similar expression for $l_\delta$ (cf. (3.6)). Also

$$l_{\alpha\alpha} = \sum_n \{2^{-1}\mathrm{tr}(\Sigma_\nu^{-1}A_\nu\Sigma_\nu^{-1}A_\nu) - \mathrm{tr}(\Sigma_\nu^{-1}A_\nu\Sigma_\nu^{-1}A_\nu\Sigma_\nu^{-1}YY')\}. \tag{4.4}$$

Similar expressions for $l_{\rho\rho}$, $l_{\alpha\rho}$, $l_{\delta\delta}$, etc. are easily obtained.

Let $E_0$ denote expectation under the hypothesis that $\alpha_0 = 0$ (hence also $\delta_0 = 0$) and let $\Sigma = E_0(YY') = (1 - \rho)I + \rho 11'$. It is easy to see that

$$I_{\alpha\alpha} = E_0(-l_{\alpha\alpha}) = (N/2)\,\mathrm{tr}E_0(\Sigma^{-1}A_\nu\Sigma^{-1}A_\nu) \tag{4.5}$$

and $I_{\alpha\rho} = E_0(l_\alpha l_\rho) = 0$. Hence the asymptotic noncentrality of the score statistic

$$Z_t = l_\alpha(t, 0, 0, \hat\rho)/I_{\alpha\alpha}^{1/2}(0, 0, \hat\rho) \tag{4.6}$$

can be evaluated by taking the expectation of (4.6) with $\hat\rho$ replaced by $\rho$. From (3.1) and (4.2) we obtain for $t = \tau$

$$E[l_\alpha(\tau, 0, 0, \rho)] = (N/2)\,\alpha_0\,\mathrm{tr}E(\Sigma^{-1}A_\nu\Sigma^{-1}A_\nu).$$

Hence by (4.5) the noncentrality of (4.6) is

$$\alpha_0\{(N/2)\,\mathrm{tr}E_0(\Sigma^{-1}A_\nu\Sigma^{-1}A_\nu)\}^{1/2}. \tag{4.7}$$

To evaluate (4.7), we first observe that

$$\Sigma^{-1} = K_\rho\{[1 + (s - 1)\rho]I - \rho 11'\},$$

where $K_\rho = \{(1 - \rho)[1 + (s - 1)\rho]\}^{-1}$. See, for example, Rao (1973, p. 67). It is easy to see that $\nu_{ij} = \nu_{ji}$, and for $i < j$ the $\nu_{ij}$ are pairwise independent whenever they differ in at least one subscript. Hence $EA_\nu^2 = [(s - 1)/2]I$. By straightforward and somewhat tedious algebra we find that $E(A_\nu 11' A_\nu) = (s/2 - 1)I + (1/2)11'$. Combining these results, we obtain

$$\mathrm{tr}E_0(\Sigma^{-1}A_\nu)^2 = K_\rho^2\binom{s}{2}\{[1 + (s - 2)\rho]^2 + \rho^2\},$$

which can be substituted into (4.7) to obtain the square of the asymptotic noncentrality parameter for $N$ sibships of size $s$:

$$\xi^2 = N\,\frac{\alpha_0^2}{2\sigma_Y^4}\binom{s}{2}\frac{\{[1 + (s - 2)\rho]^2 + \rho^2\}}{\{(1 - \rho)[1 + (s - 1)\rho]\}^2}. \tag{4.8}$$

Although (4.8) increases rapidly with $s$, there are dependencies among the $\nu_{ij}$ within a sibship, so $l_\alpha$ has a skewed distribution when $s > 2$, and hence a larger threshold is required to maintain a fixed false positive error rate. In addition, standard asymptotic theory to the effect that the asymptotic variance of the score statistic for small, positive $\alpha_0$ is effectively the same as when $\alpha_0 = 0$ is not a reasonable approximation for large $s$ (hence relatively small sample sizes). Consequently, the increase in power with increasing $s$ is less than is suggested by considering only the increase in the asymptotic noncentrality parameter.

Numerical examples are given in Table 2, for which we used a more refined asymptotic analysis, which we describe in an Appendix and have spot checked for accuracy by simulations. (Even for $s = 2$ there is a small discrepancy between the sample sizes in Tables 1 and 2 because of the different approximations used.) We see from Table 2 that, except for large $s$ or large $\alpha_0$, a sibship of size $s$ turns out to be roughly as powerful as $\binom{s}{2}$ independent sib pairs.

For small sibships Williams and Blangero have also obtained the noncentrality parameter (4.8) and have computed sample sizes directly from (4.8) without the corrections mentioned in the preceding paragraph.
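The trace identity behind (4.8) can be checked exactly for small sibships by enumerating all equally likely inheritance patterns. The sketch below assumes fully informative markers and $\sigma_Y^2 = 1$; each sib's pattern is encoded as one of four (maternal, paternal) allele choices, a representation chosen here for illustration.

```python
from itertools import product


def trace_expectation_by_enumeration(s, rho):
    """Average tr[(Sigma^{-1} A_nu)^2] over all 4**s inheritance patterns."""
    c = 1.0 + (s - 1) * rho
    k = 1.0 / ((1.0 - rho) * c)
    # Sigma^{-1} = K_rho {[1 + (s - 1) rho] I - rho 11'}
    sigma_inv = [[k * ((c if i == j else 0.0) - rho) for j in range(s)]
                 for i in range(s)]
    total = 0.0
    count = 0
    for pattern in product(range(4), repeat=s):
        maternal = [g // 2 for g in pattern]  # which maternal allele each sib got
        paternal = [g % 2 for g in pattern]   # which paternal allele each sib got
        # Off-diagonal entries are nu_ij - 1; the diagonal is zero.
        a = [[0.0 if i == j else
              (maternal[i] == maternal[j]) + (paternal[i] == paternal[j]) - 1.0
              for j in range(s)] for i in range(s)]
        m = [[sum(sigma_inv[i][l] * a[l][j] for l in range(s)) for j in range(s)]
             for i in range(s)]
        total += sum(m[i][j] * m[j][i] for i in range(s) for j in range(s))
        count += 1
    return total / count


def trace_expectation_closed_form(s, rho):
    """K_rho^2 * C(s, 2) * {[1 + (s - 2) rho]^2 + rho^2}."""
    k = 1.0 / ((1.0 - rho) * (1.0 + (s - 1) * rho))
    return k ** 2 * (s * (s - 1) / 2.0) * ((1.0 + (s - 2) * rho) ** 2 + rho ** 2)


for size in (2, 3, 4):
    print(size, trace_expectation_by_enumeration(size, 0.25),
          trace_expectation_closed_form(size, 0.25))
```

The enumeration grows as $4^s$, so it is meant only as a check for small $s$.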
Sample sizes computed directly from (4.8) in this way can be misleading, especially for larger $s$. For example, for the third row of Table 2 their method would yield $N = 100$; for the sixth row it would yield $N = 12$. Our more precise asymptotic analysis seems to be consistent with the simulations of Page et al. (1998).

Exactly as for sib pairs, we can obtain a distribution-free false positive rate if we consider the conditional distribution of $l_\alpha$ given the phenotypic values. Asymptotically that means that $l_\alpha$ should be standardized by $\{E_0[l_\alpha^2 \mid Y_1, \dots, Y_N]\}^{1/2}$, which is given by the square root of a sum of terms of the form

$$
\begin{aligned}
E_0\{[-\mathrm{tr}(\Sigma^{-1}A_\nu) + Y'\Sigma^{-1}A_\nu\Sigma^{-1}Y]^2 \mid Y\}
={}& -\frac{s\mu_4}{(1 - \rho)^4} - \frac{4s\mu_3\bar Y}{[1 + (s - 1)\rho](1 - \rho)^3} + \frac{2s(s - 3)\mu_2\bar Y^2}{[1 + (s - 1)\rho]^2(1 - \rho)^2} \\
& + \frac{s^2\mu_2^2}{(1 - \rho)^4} - \frac{2s\rho\mu_2}{[1 + (s - 1)\rho](1 - \rho)^3} \\
& + \frac{s(s - 1)\{\rho[1 + (s - 1)\rho] + (1 - \rho)\bar Y^2\}^2}{[1 + (s - 1)\rho]^4(1 - \rho)^2},
\end{aligned}
\tag{4.9}
$$

where $\mu_k = s^{-1}\sum_i (Y_i - \bar Y)^k$ for $k = 2, 3, 4$. In the following section we find (4.9) useful for a completely different purpose.

Table 2. Sample sizes required for 90% power. Sibships of size $s$; sib pair correlation is $\rho$; sample size is $N$; threshold is $b$; noncentrality is $\xi$; rel. eff. is the number of independent sib pairs needed to have the same power as one sibship of size $s$; see the text for definitions of other parameters. Columns: $\rho$; $\alpha_0/\sigma_Y^2$; $s$; $b$; $\xi$; $N$; $\sigma_0^2$; $\sigma_1^2$; rel. eff. [Table entries are missing from this copy.]

Our methods can be adapted to study extended pedigrees, although it is more difficult to obtain explicit analytic results. An exception is the case of nuclear families consisting of parents and their children. In addition to the inter-sib correlation $\rho$, we let $\rho_p$ denote the parental correlation (due to shared environment only) and $\tilde\rho = (\sigma_A^2/2 + r\sigma_e^2)/\sigma_Y^2$ the parent-sib correlation. For $N$ nuclear families, each containing $s$ sibs, the squared noncentrality equals

$$
N\,\frac{\alpha_0^2}{2\sigma_Y^4}\binom{s}{2}\frac{[1 - 2\tilde\rho^2/(1 + \rho_p) + (s - 2)(\rho - 2\tilde\rho^2/(1 + \rho_p))]^2 + (\rho - 2\tilde\rho^2/(1 + \rho_p))^2}{[1 - 2\tilde\rho^2/(1 + \rho_p) + (s - 1)(\rho - 2\tilde\rho^2/(1 + \rho_p))]^2(1 - \rho)^2}.
$$

This is somewhat larger than (4.8) for the siblings alone. For a completely additive trait with $\rho = 0.25$, for two sibs the squared noncentrality with parents included is 15% larger than without parents. For five sibs it is only 7% larger.
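The two percentages quoted for nuclear families can be reproduced from the displayed formula under the following assumptions, which are interpretations rather than statements from the text: the parent-sib correlation equals 0.25 (a purely additive trait with no shared environment, so it coincides with the sib correlation) and the parental correlation $\rho_p$ is 0. Conditioning on the parents then replaces the sib variance 1 by $1 - 2\tilde\rho^2/(1 + \rho_p)$ and the sib covariance $\rho$ by $\rho - 2\tilde\rho^2/(1 + \rho_p)$.

```python
def sibship_factor(s, variance, covariance, rho):
    """C(s, 2) times the bracketed ratio in the squared noncentrality."""
    numerator = (variance + (s - 2) * covariance) ** 2 + covariance ** 2
    denominator = (variance + (s - 1) * covariance) ** 2 * (1.0 - rho) ** 2
    return (s * (s - 1) / 2.0) * numerator / denominator


def gain_from_parents(s, rho, rho_parent_sib, rho_p):
    """Ratio of squared noncentralities: sibs plus parents versus sibs alone."""
    shrink = 2.0 * rho_parent_sib ** 2 / (1.0 + rho_p)
    with_parents = sibship_factor(s, 1.0 - shrink, rho - shrink, rho)
    sibs_only = sibship_factor(s, 1.0, rho, rho)  # reduces to (4.8)
    return with_parents / sibs_only


# Assumed inputs: rho = 0.25, parent-sib correlation 0.25, rho_p = 0.
for s in (2, 5):
    print(s, gain_from_parents(s, 0.25, 0.25, 0.0))
```

Other choices of $\rho_p$ or of the parent-sib correlation would give different gains; the values above are simply those that match the figures in the text.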

5. SELECTIVE GENOTYPING

In some cases it may be relatively easy and inexpensive to phenotype individuals. When the cost of phenotyping is indeed small compared to the cost of genotyping, it is possible to achieve an increase in power by genotyping only sib pairs with particularly favourable phenotypes. See, for example, Risch and Zhang (1995), who recommend using sib pairs with discordant phenotypes, and Xu et al. (1999) for an application. In this section we study Risch and Zhang's suggestion.

We begin by considering instead of $l_\alpha$ the simpler statistic suggested by Risch and Zhang,

$$\sum (\nu - 1)/(N/2)^{1/2}, \tag{5.1}$$

where the summation extends over all genotyped sib pairs. Also assume for simplicity that $\mu = 0$, $\sigma_Y^2 = 1$. For a given sib pair, by Bayes' formula one can easily show that $E(\nu - 1 \mid D, S)$ is given by a fraction, the numerator of which equals

$$\sum_i P(\nu = i)(i - 1)\,\phi[S/(1 + \rho_i)^{1/2}]\,\phi[D/(1 - \rho_i)^{1/2}]/(1 - \rho_i^2)^{1/2},$$

while the denominator is a similar expression without the factor $(i - 1)$. For small values of $\alpha_0$, we see from the first term of a Taylor series expansion that

$$E(\nu - 1 \mid D, S) \approx \alpha_0 E_0(\nu - 1)^2[\rho/(1 - \rho^2) + S^2/2(1 + \rho)^2 - D^2/2(1 - \rho)^2]. \tag{5.2}$$

It is evident from (5.2) that sib pairs with large values of $|D|$ are particularly informative. If we genotype only those sib pairs whose phenotypes satisfy $|D| > t$, for each genotyped pair we have at the QTL an asymptotic noncentrality of

$$2^{1/2}|E(\nu - 1 \mid |D| > t)| \approx [\alpha_0/2\cdot 2^{1/2}(1 - \rho)]\,t^*\phi(t^*)/\{1 - \Phi(t^*)\}, \tag{5.3}$$

where $t^* = t/(1 - \rho)^{1/2}$. For a numerical example suppose that $\rho = 0.25$ (corresponding to a heritability of 0.50 for a purely additive trait) and $t =$ [value missing in this copy]. Then (5.3) equals $2.16\alpha_0$, while the noncentrality of a random sib pair is $0.78\alpha_0$. Hence only about one-eighth as many discordant sib pairs need be genotyped as random sib pairs. On the other hand, roughly 20 random sib pairs must be phenotyped to find one discordant sib pair.
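The figures in this example can be checked numerically. The threshold value is not legible in this copy, so the sketch below assumes the standardized threshold $t^* = 1.96$, chosen because it selects about 5% of pairs (one in 20) and reproduces the quoted $2.16\alpha_0$; this choice is an inference, not a value taken from the text. Function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def random_pair_noncentrality(rho):
    """Per-pair noncentrality in units of alpha_0: (4.8) with s = 2, N = 1."""
    return math.sqrt((1.0 + rho ** 2) / 2.0) / (1.0 - rho ** 2)


def discordant_pair_noncentrality(t_star, rho):
    """Right-hand side of (5.3) in units of alpha_0, t_star = t / sqrt(1 - rho)."""
    hazard = STD_NORMAL.pdf(t_star) / (1.0 - STD_NORMAL.cdf(t_star))
    return t_star * hazard / (2.0 * math.sqrt(2.0) * (1.0 - rho))


def fraction_discordant(t_star):
    """Null probability that |D| exceeds t, i.e. the share of pairs genotyped."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(t_star))


rho, t_star = 0.25, 1.96  # t_star is an assumed value, see the lead-in
random_nc = random_pair_noncentrality(rho)
selected_nc = discordant_pair_noncentrality(t_star, rho)
print(random_nc, selected_nc, (selected_nc / random_nc) ** 2, fraction_discordant(t_star))
```

The squared ratio of the two noncentralities is the factor by which the number of genotyped pairs can be reduced, and the reciprocal of `fraction_discordant` is the number of pairs that must be phenotyped per genotyped pair.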
Other methods of selecting the sib pairs to be genotyped can be handled by modifications of the preceding argument. For the situation described above, the preferred definition of Risch and Zhang (1995) is about 82% as efficient as our suggestion. The case of concordant sib pairs, defined by $\max(Y_1, Y_2) < -t$ or $\min(Y_1, Y_2) > t$, can be treated similarly. Now the value of $t$ corresponding to the preceding examples is 1.19, and the asymptotic noncentrality is $1.49\alpha_0$.

Instead of the ad hoc statistic (5.1), one might consider the score statistic (3.8). Now the unknown nuisance parameter $\rho$ (and in general $\mu$ and $\sigma_Y$ also) must be estimated. This poses no problem if, as we assume, the sib pairs to be genotyped are selected from a random sample of sib pairs of known phenotypes, which are available to estimate the nuisance parameters. This will typically be a very large sample, ensuring that the nuisance parameters are estimated accurately. By some routine Taylor series expansions one sees that for purposes of asymptotic analysis one can regard them as known. Before proceeding, it is worth noting, however, that the situation would be quite different if the sib pairs were ascertained through their phenotypes, so that the natural estimates of nuisance parameters are biased, perhaps severely. See, for example, Beaty and Liang (1987) for ascertainment corrections.

An advantage of the score statistic over (5.1) is that it generalizes naturally to the case of larger sibships. For the score statistic (4.2), the arguments given above show that

$$E(l_\alpha \mid Y_1, \dots, Y_N) \approx (\alpha_0/4)\sum_{n=1}^N E_0\{[-\mathrm{tr}(\Sigma^{-1}A_\nu) + Y'\Sigma^{-1}A_\nu\Sigma^{-1}Y]^2 \mid Y\}. \tag{5.4}$$

We define a discordant sibship to be one in which the squared norm of the $(s - 1)$-dimensional vector of orthogonal contrasts exceeds a threshold $t$. To facilitate the analysis of (5.4) we make the (Helmert) orthogonal transformation from $Y = (Y_1, \dots, Y_s)'$ defined by $Z_s = (Y_1 + \cdots + Y_s)/s^{1/2} = s^{1/2}\bar Y$, and for $i = 1, \dots, s - 1$

$$Z_i = [(i + 1)i]^{-1/2}\Big[\sum_{j=1}^i Y_j - iY_{i+1}\Big].$$

Under the probability $P_0$, $Z_1, \dots, Z_s$ are independent and normally distributed with $\mathrm{Var}_0 Z_s = 1 + (s - 1)\rho$, and $\mathrm{Var}_0 Z_i = 1 - \rho$, $i = 1, \dots, s - 1$. The variables $Z_1, \dots, Z_{s-1}$ are the orthogonal contrasts in the $Y$, so our definition of discordant is that $Z_1^2 + \cdots + Z_{s-1}^2 > t$.

An expression for the expectation in (5.4) is given in (4.9). To find the asymptotic noncentrality parameter, we evaluate the expectation of (4.9) given that $Z_1^2 + \cdots + Z_{s-1}^2 > t$. It is straightforward to compute each term, except possibly those involving $\sum_{i=1}^s Y_i^k$ for $k = 3, 4$. Since our definition of discordance is symmetric in the $Y$, all terms in the sum have the same (conditional) expectation, and from the inverse of the Helmert transformation, we see that $Y_s = -(s - 1)^{1/2}Z_{s-1}/s^{1/2} + Z_s/s^{1/2}$. Hence these expectations are also readily evaluated, and we obtain the following expression:

$$
\begin{aligned}
E_0\{[-\mathrm{tr}(\Sigma^{-1}A_\nu) + Y'\Sigma^{-1}A_\nu\Sigma^{-1}Y]^2 \mid Z_1^2 + \cdots + Z_{s-1}^2 > t\}
={}& \frac{1}{(1 - \rho)^2}\Big[1 - \frac{3(s - 1)}{s(s + 1)}\Big]\frac{2t^{*2}f_{s-1}(t^*)}{F_{s-1}(t^*)} \\
& + \frac{[(s^3 - 3s^2 + s + 3)\rho + s^2 - 3]\,2t^*f_{s-1}(t^*)}{s[1 + (s - 1)\rho](1 - \rho)^2F_{s-1}(t^*)} \\
& + \frac{s(s - 1)\{[1 + (s - 2)\rho]^2 + \rho^2\}}{[1 + (s - 1)\rho]^2(1 - \rho)^2},
\end{aligned}
$$

where $t^* = t/(1 - \rho)$, and $f_s$, $F_s$ are respectively the density and right tail distribution functions of a $\chi_s^2$ variable.

Some examples of the efficiency gained by selective genotyping of the most discordant 10% of a sample of sibships of size $s$ are given in Figure 2.
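The Helmert transformation and the resulting discordance rule can be sketched as follows. The code builds the $s \times s$ orthogonal matrix row by row from the definitions of $Z_1, \dots, Z_s$ above; the helper names and the example phenotype vector are hypothetical.

```python
import math


def helmert_matrix(s):
    """Rows give Z_1, ..., Z_{s-1} (orthogonal contrasts) and Z_s (scaled mean)."""
    rows = []
    for i in range(1, s):
        scale = 1.0 / math.sqrt((i + 1) * i)
        # Z_i = [(i + 1) i]^{-1/2} (Y_1 + ... + Y_i - i * Y_{i+1})
        rows.append([scale] * i + [-i * scale] + [0.0] * (s - i - 1))
    rows.append([1.0 / math.sqrt(s)] * s)  # Z_s = s^{1/2} * mean(Y)
    return rows


def transform(y):
    """Apply the Helmert transformation to a phenotype vector y."""
    return [sum(h * v for h, v in zip(row, y)) for row in helmert_matrix(len(y))]


def is_discordant(y, t):
    """True if the squared norm of the s - 1 contrasts exceeds the threshold t."""
    z = transform(y)
    return sum(v * v for v in z[:-1]) > t


y = [1.0, -0.5, 2.0, 0.3]  # hypothetical standardized phenotypes of four sibs
z = transform(y)
print(z, is_discordant(y, t=2.0))
```

Because the transformation is orthogonal, the squared norm of the contrasts equals $\sum_i (Y_i - \bar Y)^2$, so the rule can also be applied without forming the matrix.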
For small $s$ the gain in efficiency compared to genotyping random sibships is quite large; but as the size of the sibship increases the relative value of selective genotyping decreases while, as we saw in the preceding section, the unconditional value of the sibship increases. For example, a random sibship of size 4 is about as powerful as a selected sib pair from the most discordant 10% of the population.

6. ROBUSTNESS

As we indicated above, by conditioning on the phenotypic values it is possible to make the statistics we have considered nonparametric with respect to the false-positive error rate. In this section we make a brief study of robustness of the power of these tests when a different model is assumed to be true, in particular the model having a major gene with two alleles and a normally distributed residual.

For simplicity we consider only sib pairs. The standardized nonparametric version of the score statistic is $l_\alpha/[\sum_n C_n^2/2]^{1/2}$. Since the expected value of the numerator, $E(l_\alpha)$, is computed without making distributional assumptions, to evaluate the noncentrality parameter we need only evaluate $E_0C^2$, which under the assumption of normality equals $(1 + \rho^2)/(1 - \rho^2)^2$. Consider the difference $E_0C^2 - (1 + \rho^2)/(1 - \rho^2)^2$. Algebraic expressions for this difference are straightforward, albeit somewhat complicated, for the case of a single biallelic QTL and normally distributed residual $e$. Table 3 contains numerical examples of the percentage increase in $E_0C^2$, which is also the percentage increase in sample size that would be required to maintain the same power as determined for the assumed components of variance model. For the most part, using the components of variance model when a biallelic major gene model is correct has a negligible effect on the power, but the effect can be substantial in the case of a rare recessive allele of large effect.

Fig. 2. Number of genotyped sibships for selected (proportion = 0.1) and unselected genotyping.

It is possible that a likelihood analysis of the correct model would produce a completely different and more efficient statistic. To simplify the notation we assume there is no dominance deviation. Let $g_i$ denote the indicator that the allele inherited from the $i$th parent is $A_1$. Then we can write $Y = \mu + a(g_1 + g_2 - 2p) + e$. Let $p(g_1, g_2 \mid \nu)$ denote the conditional distribution of $(g_1, g_2)$ given $\nu$. The likelihood function for a single pedigree is the mixture

$$\sum_{g_1, g_2}\frac{p(g_1, g_2 \mid \nu)}{\sigma_e^2(1 - r^2)^{1/2}}\,\phi\Big[\frac{S - a(g_1 + g_2 - 2p)/2^{1/2}}{\sigma_e(1 + r)^{1/2}}\Big]\,\phi\Big[\frac{D - a(g_1 - g_2)/2^{1/2}}{\sigma_e(1 - r)^{1/2}}\Big]. \tag{6.1}$$

If we take the first two terms of the Taylor series expansion of (6.1) about $a = 0$, we obtain after some calculation that

$$
(6.1) \approx \frac{\phi[S/\sigma_e(1 + r)^{1/2}]\,\phi[D/\sigma_e(1 - r)^{1/2}]}{\sigma_e^2(1 - r^2)^{1/2}}\Big\{1 + \alpha_0\Big(\frac{D^2 - \sigma_e^2(1 - r)}{(1 - r)^2\sigma_e^4}\Big[\frac{1}{2} - \frac{\nu - 1}{2}\Big] + \frac{S^2 - \sigma_e^2(1 + r)}{(1 + r)^2\sigma_e^4}\Big[\frac{3}{2} + \frac{\nu - 1}{2}\Big]\Big)\Big\}, \tag{6.2}
$$

where, as above, $\alpha_0 = \sigma_A^2/2 = pqa^2$. The efficient score for testing $\alpha_0 = 0$ is the logarithmic derivative of the likelihood function evaluated at $\alpha_0 = 0$. This is just the coefficient of $\alpha_0$ in (6.2). Any term not involving $\nu$ can be omitted, and unknown parameters must be estimated, so after summing over all sib pairs the statistic becomes

$$\sum (\nu - 1)\Big\{-\frac{D^2 - \hat\sigma_e^2(1 - \hat r)}{2\hat\sigma_e^4(1 - \hat r)^2} + \frac{S^2 - \hat\sigma_e^2(1 + \hat r)}{2\hat\sigma_e^4(1 + \hat r)^2}\Big\}. \tag{6.3}$$

Table 3. Percentage increase in sample size for biallelic major gene. The column headed % change gives the percentage increase in sample size for a biallelic major gene (cf. (2.2)) relative to the assumed oligogenic model with the same variance components. The major gene contributes 25% of the trait variance; its additive effect is a, dominance deviation is d, and allele frequency is p; the sib correlation is ρ.

a   d   p   ρ   % change

The statistic (6.3) is exactly of the form of the score statistic for the components of variance model, except that $\hat\sigma_e^2$ appears in place of $\hat\sigma_Y^2$ and $\hat r$ appears in place of $\hat\rho$ (cf. (3.5) ff.). However, the estimates for these parameters are calculated under the condition $\alpha_0 = 0$, and in spite of the difference in notation the parameters, hence the estimates, are the same under that hypothesis. Thus the score statistics for the two-allele model with normal residuals and for our components of variance model are the same statistic. This provides further evidence of the robustness of our components of variance test.

7. DISCUSSION

For an oligogenic model with normally distributed phenotypic data, we have introduced a parametrization that makes the linkage parameters orthogonal to the segregation parameters and hence allows us to compute explicitly score statistics, Fisher information matrices, and noncentrality parameters in a number of important special cases. We have evaluated the asymptotic noncentrality parameter for sibships of arbitrary size, which suggests what others have observed as a result of simulations: that a sibship of size s can be roughly as powerful as $\binom{s}{2}$ independent sib pairs. Our more precise analysis shows that this assessment is overly optimistic when s and $\alpha_0$ are large, but large sibships are, nevertheless, extremely valuable even in these cases.

We have evaluated the power of genotyping a selected subset of sibships defined by their phenotypes, which we select from a large random sample of sibships. The relative value of selective genotyping decreases rapidly as the sibship size increases.

The Haseman–Elston regression statistic (Haseman and Elston, 1972) can be derived as a special case of the calculations of Section 3. One ignores $S_1, \ldots, S_N$ and starts from the likelihood function for $D_1, \ldots, D_N$, then uses the robust version of the score statistic suggested in (3.12). Compared to the fully efficient likelihood analysis, for moderate phenotypic correlation (0.25) between sibs, Haseman–Elston regression is about 75% efficient when the mode of inheritance is additive. The modified Haseman–Elston statistic (Elston et al., 2000) is more (less) efficient than the classical statistic for small (large) correlation between sibs. It is also about 75% efficient for moderate correlation. See also Teng (1996) and Wright (1997).

In sibships the dominance component of variance contributes to the noncentrality of the score statistic designed to detect an additive component. Consequently, even if there is a large dominance component but we model only the additive components, our loss of efficiency is usually relatively modest. Even for rare recessively acting alleles of relatively large effect, the loss rarely exceeds 10–20% of the sample size.

Based on a conditioning argument, we have suggested a modified statistic, which is nonparametric under the hypothesis of no linkage and can be expected to be robust to moderate departures from normality. We have briefly discussed robustness against a true model involving a (biallelic) major gene. It appears that our model is robust if the allelic substitutions have small phenotypic effect, or modest effect but small dominance deviation. An interesting case deserving more careful attention involves a major gene having rare alleles of large effect.
We expect to discuss gene–gene and gene–environment interactions in a future paper.

ACKNOWLEDGEMENTS

This research was partly supported by NIH Grant HG. The authors thank two referees and the associate editor for their thoughtful suggestions.

APPENDIX: BETTER APPROXIMATIONS FOR SIBSHIPS OF SIZE s

Because of the dependence among the $\nu_{ij}$ (which are pairwise independent), the null distribution of $l_\alpha$ is skewed when the number of siblings is $s \ge 3$. To deal with a similar problem involving qualitative traits, Tu and Siegmund (1999) suggested a p-value approximation that uses the third moment to correct for skewness. Let $\beta$ be the one-sided derivative at 0 of $\mathrm{Cov}(Z_0, Z_t)$, $\gamma$ be $N^{1/2}$ times the third moment of $Z_t$ under the hypothesis of no linkage, $\theta = [-1 + (1 + 2b\gamma/N^{1/2})^{1/2}]/\gamma$, and $\nu = \nu[b(2\beta\Delta)^{1/2}]$ the special function defined by Siegmund (1985, p. 82). The following is a slight modification of the approximation of Tu and Siegmund (1999): for a chromosome of length $L$ with markers equally spaced at distance $\Delta$,

$$P_0\Big\{ \max_{0 \le i\Delta < L} Z_{i\Delta} > b \Big\} \approx [2\pi(1 + \gamma\theta)]^{-1/2} \left\{ \frac{1}{\theta N^{1/2}} + \nu \beta L b \right\} \exp\!\left[ -N\theta^2 (1 + 2\gamma\theta/3)/2 \right].$$

Substantial calculation shows that the value of $\gamma$ equals

$$\frac{[3/2] \binom{s}{3} \left\{ [1 + (s-2)\rho]^3 + (3s - 10)\rho^3 + 3\rho^2 \right\}}{\left\{ \left[ \binom{s}{2}/2 \right] \left[ (1 + (s-2)\rho)^2 + \rho^2 \right] \right\}^{3/2}}.$$

As a function of $\rho$ this ratio is practically constant, so in evaluating the thresholds in Table 2 we have used the value for $\rho = 0$.
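The skewness-corrected tail approximation above is simple to evaluate numerically. The following Python sketch is one possible reading of the formulas, not the authors' implementation. It assumes that the signs lost in the typeset formulas are as in the limiting case γ → 0, where the expression reduces to φ(b)/b + νβLbφ(b), and it replaces the special function ν(x) of Siegmund (1985) by the commonly used small-x approximation exp(−0.583x). The function names and argument conventions (β per unit of genetic distance, L and Δ in the same units) are hypothetical.

```python
import math


def nu_approx(x):
    """Small-x approximation to Siegmund's nu(x); an assumption of this sketch."""
    return math.exp(-0.583 * x)


def gamma_skew(s, rho=0.0):
    """Skewness constant gamma for sibships of size s and sib correlation rho."""
    core = 1.0 + (s - 2) * rho
    numerator = 1.5 * math.comb(s, 3) * (core ** 3 + (3 * s - 10) * rho ** 3
                                         + 3.0 * rho ** 2)
    denominator = ((math.comb(s, 2) / 2.0) * (core ** 2 + rho ** 2)) ** 1.5
    return numerator / denominator


def theta(b, gamma, n):
    """Root of theta + gamma * theta^2 / 2 = b / sqrt(n)."""
    if gamma == 0.0:
        return b / math.sqrt(n)  # limiting value when there is no skewness
    return (-1.0 + math.sqrt(1.0 + 2.0 * b * gamma / math.sqrt(n))) / gamma


def tail_probability(b, n, s, beta, length, delta, rho=0.0):
    """Approximate P_0{max Z > b} over equally spaced markers on one chromosome."""
    g = gamma_skew(s, rho)
    th = theta(b, g, n)
    nu = nu_approx(b * math.sqrt(2.0 * beta * delta))
    lead = (2.0 * math.pi * (1.0 + g * th)) ** -0.5
    bracket = 1.0 / (th * math.sqrt(n)) + nu * beta * length * b
    return lead * bracket * math.exp(-n * th ** 2 * (1.0 + 2.0 * g * th / 3.0) / 2.0)
```

For sib pairs (s = 2) the binomial coefficient in the numerator of γ vanishes, so γ = 0 and the sketch falls back to the uncorrected Gaussian approximation; the skewness correction only matters for s ≥ 3.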

To determine the sample sizes in Table 2, we have used suitable versions of the power approximations provided by Feingold et al. (1993):

$$P(\max_k Z_k > b) \approx 1 - \Phi\!\left( \frac{b - \xi}{\sigma_0} \right) + \varphi\!\left( \frac{b - \xi}{\sigma_0} \right) \left[ \frac{2\nu}{b\sigma_0/\sigma_1^2 + (\xi - b)/\sigma_0} - \frac{\nu^2}{2b\sigma_0/\sigma_1^2 + (\xi - b)/\sigma_0} \right],$$

where $\nu = \nu[b(2\beta\Delta)^{1/2}/\sigma_1]$. This approximation, which is valid when there is a marker at the QTL $\tau$, involves (i) the probability that the statistic $Z_\tau$ is above the detection threshold $b$ (or that the statistic exceeds $b$ at one or both of the two markers flanking $\tau$ in the case that $\tau$ is not itself a marker) and (ii) the probability that the statistic is below the threshold at $\tau$ but $Z_t \ge b$ at some nearby marker $t$. To implement this approximation we require the mean and variance ($\sigma_0^2$) of $Z_\tau$ and the conditional mean and variance ($\sigma_1^2$) of $Z_t - Z_\tau$ given $Z_\tau$. See Tang (2000) for details.

REFERENCES

ALMASY, L. AND BLANGERO, J. (1998). Multipoint quantitative-trait linkage analysis in general pedigrees. American Journal of Human Genetics 62.
AMOS, C. I. (1994). Robust variance-components approach for assessing genetic linkage in pedigrees. American Journal of Human Genetics 54.
BEATY, T. H. AND LIANG, K. Y. (1987). Robust inference for variance components models in families ascertained through probands: I. Conditioning on the proband's phenotype. Genetic Epidemiology 4.
BLACKWELDER, W. C. AND ELSTON, R. C. (1982). Power and robustness of sib-pair linkage tests and extension to larger sibships. Commun. Statist.-Theor. Meth. 11.
COX, D. R. AND HINKLEY, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
DARVASI, A., WEINREB, A., MINKE, V., WELLER, J. I. AND SOLLER, M. (1993). Detecting marker-QTL linkage and estimating QTL gene effect and map location using a saturated genetic map. Genetics 134.
DUPUIS, J. AND SIEGMUND, D. (2000). Boundary crossing probabilities in linkage analysis. In Thomas Bruss, F. and Le Cam, L. (eds), Game Theory, Optimal Stopping, Probability and Statistics, Hayward, CA: Institute of Mathematical Statistics.
DUPUIS, J. AND SIEGMUND, D. (1999). Statistical methods for mapping quantitative trait loci from a dense set of markers. Genetics 151.
DUPUIS, J., BROWN, P. AND SIEGMUND, D. (1995). Statistical methods for linkage analysis of complex traits from high resolution maps of identity by descent. Genetics 140.
EAVES, L. AND MEYER, J. (1994). Locating human quantitative trait loci: guidelines for the selection of sibling pairs for genotyping. Behavior Genetics 24.
ELSTON, R., BUXBAUM, S., JACOBS, K. B. AND OLSON, J. M. (2000). Haseman and Elston revisited. Genetic Epidemiology 19.
FEINGOLD, E., BROWN, P. O. AND SIEGMUND, D. (1993). Gaussian models for genetic linkage analysis using complete high resolution maps of identity-by-descent. American Journal of Human Genetics 53.
FISHER, R. A. (1918). The correlation of relatives on the assumption of Mendelian inheritance. Proc. Roy. Soc. Edinburgh.
FULKER, D. W. AND CHERNY, S. S. (1996). An improved multipoint sib pair analysis of quantitative traits. Behavior Genetics 26.

FULKER, D. W. AND CARDON, L. R. (1994). A sib-pair approach to interval mapping of quantitative trait loci. American Journal of Human Genetics 54.
HASEMAN, J. K. AND ELSTON, R. C. (1972). The investigation of linkage between a quantitative trait and a marker locus. Behavior Genetics 2.
KEMPTHORNE, O. (1955). Genetic Statistics. New York: Wiley.
KRUGLYAK, L. AND LANDER, E. S. (1995). Complete multipoint sib pair analysis of qualitative and quantitative traits. American Journal of Human Genetics 57.
LANDER, E. S. AND KRUGLYAK, L. (1995). Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nature Genetics 11.
LANDER, E. S. AND SCHORK, N. J. (1994). Genetic dissection of complex traits. Science 265.
PAGE, G. P., AMOS, C. I. AND BOERWINKLE, E. (1998). The quantitative LOD score: test statistic and sample size for exclusion and linkage of quantitative traits in human sibships. American Journal of Human Genetics 62.
RAO, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd edn. New York: Wiley.
RISCH, N. AND ZHANG, H. P. (1995). Extreme discordant sib pairs for mapping quantitative trait loci in humans. Science 268.
SIEGMUND, D. (1985). Sequential Analysis: Tests and Confidence Intervals. New York: Springer.
TANG, H.-K. (2000). Using variance components to map quantitative trait loci in humans. Ph.D. Thesis, Stanford University.
TENG, J. (1996). Statistical methods in linkage analysis. Ph.D. Thesis, Stanford University.
TENG, J. AND SIEGMUND, D. (1998). Multipoint linkage analysis using affected relative pairs and partially informative markers. Biometrics 54.
TU, I-PING AND SIEGMUND, D. (1999). The maximum of a function of a Markov chain and application to linkage analysis. Advances in Applied Probability 31.
WILLIAMS, J. T. AND BLANGERO, J. (1999). Power of variance component linkage analysis to detect quantitative trait loci. Annals of Human Genetics 63.
WRIGHT, F. (1997). The phenotypic difference discards sib-pair QTL linkage information. American Journal of Human Genetics 60.
XU, X., ROGUS, J. J., TERWEDOW, H. A., YANG, J., WANG, Z., CHEN, C., NIU, T., WANG, B., XU, H., WEISS, S., SCHORK, N. J. AND FANG, Z. (1999). An extreme-sib-pair genome scan for genes regulating blood pressure. American Journal of Human Genetics 64.

[Received March 6, 2000; revised June 28, 2000; accepted for publication June 29, 2000]


QTL mapping under ascertainment J. PENG Department of Statistics, University of California, Davis, CA 95616 D. SIEGMUND Department of Statistics, Stanford University, Stanford, CA 94305 February 15, 2006