
Original Paper

Hum Hered 2003;55:27–36
DOI: /
Received: October 23, 2002
Accepted after revision: March 27, 2003

Choosing Haplotype-Tagging SNPs Based on Unphased Genotype Data Using a Preliminary Sample of Unrelated Subjects with an Example from the Multiethnic Cohort Study

Daniel O. Stram a, Christopher A. Haiman a, Joel N. Hirschhorn b,c,e, David Altshuler b,c,d,f, Laurence N. Kolonel g, Brian E. Henderson a, Malcolm C. Pike a

a Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, Calif.; b Center for Genome Research, Whitehead Institute, Massachusetts Institute of Technology, Cambridge, Mass.; Departments of c Genetics and d Medicine, Harvard Medical School, e Divisions of Genetics and Endocrinology, Children's Hospital, f Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Mass.; g Hawaii Cancer Research Center, University of Hawaii, Honolulu, Hawaii, USA

Key Words
Haplotypes · Case-control studies · Linkage disequilibrium · Candidate gene analysis · htSNP

Abstract
We describe an approach for picking haplotype-tagging single nucleotide polymorphisms (htSNPs) that is presently being taken in two large nested case-control studies within a multiethnic cohort (MEC), which are engaged in a search for associations between risk of prostate and breast cancer and common genetic variations in candidate genes. Based on a preliminary sample of 70 control subjects chosen at random from each of the 5 ethnic groups in the MEC we estimate haplotype frequencies using a variant of the Excoffier-Slatkin E-M algorithm after genotyping a high density of SNPs selected every 3–5 kb in and surrounding a candidate gene.
In order to evaluate the performance of a candidate set of htSNPs (which will be genotyped in the much larger case-control sample) we treat the haplotype frequencies estimated above as known, and carry out a formal calculation of the uncertainty of the number of copies of common haplotypes carried by an individual, summarizing this calculation as a coefficient of determination, $R_h^2$. A candidate set of htSNPs of a given size is chosen so as to maximize the minimum value of $R_h^2$ over the common haplotypes, h.

Copyright 2003 S. Karger AG, Basel

Correspondence: Daniel O. Stram, PhD, Department of Preventive Medicine, University of Southern California, 1540 Alcazar Street, Suite 220, Los Angeles, CA (USA), stram@usc.edu

Introduction

The underlying premise of the common-disease common-variant hypothesis is that common variants at multiple loci contribute to susceptibility to common disease. These causal variants are most likely to be old and to predate the divergence of human populations, and thus are positioned on ancestral chromosomal segments (haplotypes) that are shared today across ethnically diverse populations. By exploiting the underlying pattern of linkage disequilibrium (LD) within a gene, physical regions that harbor disease susceptibility alleles may be isolated [1] using a haplotype-based association approach. This approach is usually more powerful than traditional linkage studies for identifying alleles of moderate risk [2]. The pattern of LD varies across the human genome [3, 4] and new studies suggest that discrete regions of high LD, haplotype blocks, exist that are characterized by restricted haplotype diversity [5–7]. In regions of high LD within these blocks, a reduced set of haplotype tag SNPs (htSNPs) may be selected to efficiently identify the common haplotypes [8].

There is little published material available giving explicit suggestions for choosing htSNPs; two exceptions are Zhang et al., 2002 [9] and Ke and Cardon, 2003 [10]. These and others that have been implemented for public use (c.f. snp.cgb.ki.se/tagntell/, haplotype/) generally treat the choice of appropriate tag SNPs as a pattern recognition problem, rather than a statistical problem, in which htSNP haplotypes, rather than genotypes, will be read directly. (Some material, supplementary to [8], that does not appear to make this assumption, is available on David Clayton's web site.) These algorithms often implicitly suggest the use of the smallest set of htSNPs distinguishing the common haplotypes. This smallest number is no greater than the number of common haplotypes minus 1. As discussed below, this smallest set of htSNPs is not always appropriate even if the common haplotypes represent close to 100% of the haplotype diversity in a block, and further problems arise when multiple rarer haplotypes exist that are not specified by the htSNPs. Consequently, with this approach, dissimilar haplotypes will be combined, leading to the potential for misclassification of haplotypes and the underestimation of haplotype-specific effects in association studies.

Here we describe an approach for choosing htSNPs from unphased genotype data that optimizes the predictability of the common haplotypes as defined by a statistic that is analogous to the usual coefficient of determination, $R^2$, in a multiple linear regression.
We provide an example of this approach for the candidate breast cancer susceptibility gene CYP19. We also describe how the level of haplotype predictability affects the power of case-control studies.

Analysis

Maximum likelihood estimation of population haplotype frequencies from unphased genotype data from randomly sampled (unrelated) subjects is performed, under the assumption of Hardy-Weinberg equilibrium, by the Excoffier-Slatkin implementation of the expectation-maximization (E-M) algorithm [11]. Recent work [12, 13] on tests for association between haplotypes and disease risk has suggested that data should be analyzed by pooling cases and controls and performing a two-step estimation procedure. In the first step, haplotype frequency estimates are obtained by use of the E-M algorithm applied to the combined data for cases and controls. In the second step an estimate of the haplotypes carried by each of the cases and controls, coming directly from the E-M algorithm, is used as the independent variable in a logistic analysis of case-control status, to form score tests of the null hypothesis of no additional (excess) risk associated with carrying specific haplotypes. The justification for combining cases and controls in the first step of this procedure is that under the null hypothesis there is no difference in haplotype frequencies between cases and controls, so that using the pooled data yields a better estimate of the haplotype frequencies than would be obtained by, for example, using only the data from the controls.

The expectation step of the Excoffier-Slatkin E-M algorithm involves the calculation, for each possible haplotype h, of an estimate of the haplotype dosage, $\delta_h(H)$, which is the count of the number of copies of h contained in the true (but generally unknown) pair of haplotypes H carried by that individual (i.e. $\delta_h(H)$ = 0, 1 or 2).
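The estimation step described above can be illustrated with a minimal sketch of an E-M iteration for haplotype frequencies. This is not the authors' software: the function names, the 0/1 allele coding, the uniform starting values and the fixed iteration count are all assumptions made for illustration, and the full enumeration of haplotypes is only practical for a handful of SNPs.

```python
from itertools import product
from collections import defaultdict


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) compatible with an unphased genotype.

    genotype: tuple of allele counts (0, 1 or 2), one entry per SNP.
    """
    per_snp = []
    for g in genotype:
        if g == 0:
            per_snp.append([(0, 0)])
        elif g == 2:
            per_snp.append([(1, 1)])
        else:  # heterozygous SNP: either phase is possible
            per_snp.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_snp):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200):
    """Illustrative E-M estimate of haplotype frequencies under HWE."""
    n_snps = len(genotypes[0])
    haplotypes = list(product((0, 1), repeat=n_snps))
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # assumed start
    for _ in range(n_iter):
        dose_total = defaultdict(float)
        for g in genotypes:
            # E-step: posterior weight of each compatible pair given G.
            pairs = compatible_pairs(g)
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                dose_total[h1] += w / total
                dose_total[h2] += w / total
        # M-step: expected haplotype count divided by 2N chromosomes.
        freq = {h: dose_total[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq
```

For example, `em_haplotype_frequencies([(0, 0)] * 5 + [(2, 2)] * 5 + [(1, 1)])` resolves the single double heterozygote almost entirely into the two haplotypes already seen in the homozygous subjects.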
In the E-M calculations the estimate of $\delta_h(H)$ is computed conditionally on the genotype data G for each subject, treating the set, $P_h$, of current estimates of the haplotype frequencies as if they were known. The haplotype dose estimate (based on the assumption of Hardy-Weinberg equilibrium) for subject i with genotype $G_i$ is equal to

$$E[\delta_h(H_i) \mid G_i] = \frac{\sum_{H \sim G_i} \delta_h(H)\, p_{h_1} p_{h_2}}{\sum_{H \sim G_i} p_{h_1} p_{h_2}}$$

where $\sum_{H \sim G_i}$ indicates a summation over the (ordered) haplotype pairs, $H = (h_1, h_2)$, with frequencies $p_{h_1}$ and $p_{h_2}$, respectively, that are compatible with the observed genotype data.

For any given set of true haplotype frequencies, $P_h$, we can make a formal calculation (again assuming Hardy-Weinberg equilibrium) of the squared correlation, $R_h^2$, between the estimate, $E[\delta_h(H_i) \mid G_i]$, and the true value, $\delta_h(H_i)$, of the number of copies of h carried by a randomly sampled subject. Note that the assumption of Hardy-Weinberg equilibrium for the haplotypes is equivalent to assuming that the marginal distribution (found by summing over the distribution of $H_i$ with weights equal to $p_{h_1} \times p_{h_2}$) of $\delta_h(H_i)$ is equal to that of a binomial random variable with parameters n = 2 and p = $p_h$, so that $\delta_h$ has mean and variance equal to $2p_h$ and $2p_h(1 - p_h)$, respectively, where $p_h$ is the frequency of haplotype h. The squared correlation $R_h^2$ between true and predicted haplotype dosage, i.e. between $\delta_h(H_i)$ and its estimate $E[\delta_h(H_i) \mid G_i]$, can be expressed as the ratio of the variance of $\delta_h$ that is explained by the genotype data to the total variance of $\delta_h(H_i)$, i.e.

$$R_h^2 = \frac{\mathrm{Var}[E(\delta_h(H_i) \mid G_i)]}{2p_h(1 - p_h)}. \qquad (1)$$

Here the variance of the expectation is computed by averaging $E(\delta_h(H) \mid G)^2$ over all possible genotypes G, weighting by the probability of each genotype. For example consider the two SNP case, with possible haplotypes coded as $h_0$ = (0,0), $h_1$ = (0,1), $h_2$ = (1,0), and $h_3$ = (1,1), with 0 and 1 indicating the major and minor alleles respectively. Table 1 details the calculation of $\mathrm{Var}[E(\delta_h(H_i) \mid G_i)]$ in this simple case.

Table 1. Details of the calculation of $\mathrm{Var}[E(\delta_h(H_i) \mid G_i)]$ and $R_h^2$ for two SNPs

| Genotype, G | Haplotype pair(s), H | P(G) | $E(\delta_{h_0}(H) \mid G)$ | $E(\delta_{h_1}(H) \mid G)$ | $E(\delta_{h_2}(H) \mid G)$ | $E(\delta_{h_3}(H) \mid G)$ |
|---|---|---|---|---|---|---|
| (0,0) | (0,0),(0,0) | $p_0^2$ | 2 | 0 | 0 | 0 |
| (0,1) | (0,0),(0,1) | $2p_0p_1$ | 1 | 1 | 0 | 0 |
| (0,2) | (0,1),(0,1) | $p_1^2$ | 0 | 2 | 0 | 0 |
| (1,0) | (1,0),(0,0) | $2p_2p_0$ | 1 | 0 | 1 | 0 |
| (1,1) | (1,0),(0,1) or (1,1),(0,0) | $2(p_2p_1 + p_3p_0)$ | $\frac{p_3p_0}{p_2p_1 + p_3p_0}$ | $\frac{p_2p_1}{p_2p_1 + p_3p_0}$ | $\frac{p_2p_1}{p_2p_1 + p_3p_0}$ | $\frac{p_3p_0}{p_2p_1 + p_3p_0}$ |
| (1,2) | (1,1),(0,1) | $2p_3p_1$ | 0 | 1 | 0 | 1 |
| (2,0) | (1,0),(1,0) | $p_2^2$ | 0 | 0 | 2 | 0 |
| (2,1) | (1,1),(1,0) | $2p_3p_2$ | 0 | 0 | 1 | 1 |
| (2,2) | (1,1),(1,1) | $p_3^2$ | 0 | 0 | 0 | 2 |

We compute $\mathrm{Var}[E(\delta_h(H) \mid G)]$ over the distribution of G as

$$\sum_G E(\delta_h \mid G)^2 P(G) - (2p_h)^2 \qquad (2)$$

which, for haplotype $h_0$ = (0,0), is computed as

$$\mathrm{Var}[E(\delta_{h_0}(H) \mid G)] = 4p_0^2 + 2p_1p_0 + 2p_2p_0 + \left[\frac{p_3p_0}{p_2p_1 + p_3p_0}\right]^2 \times 2(p_2p_1 + p_3p_0) - 4p_0^2.$$

Remembering that $p_0 + p_1 + p_2 + p_3 = 1$, this can be usefully rearranged as

$$2p_0\left[1 - p_0 + p_3\left(\frac{p_3p_0}{p_2p_1 + p_3p_0} - 1\right)\right]. \qquad (3)$$

Note that this expression equals the binomial variance, $2p_0(1 - p_0)$, when any of $p_1$, $p_2$, or $p_3$ equals 0, yielding an $R_h^2$ of 1 for $h_0$.
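As a numerical check on the two-SNP calculation, the sketch below evaluates the genotype-by-genotype sum for $h_0$ and the rearranged expression (3), and confirms that they agree. The function names are illustrative assumptions, and the code assumes $p_2p_1 + p_3p_0 > 0$ so that the ambiguous genotype has positive probability.

```python
def var_expected_dose_h0(p0, p1, p2, p3):
    """Var[E(delta_h0 | G)] summed over the two-SNP genotypes."""
    ambiguous = p2 * p1 + p3 * p0      # half of P(G = (1,1))
    e11 = p3 * p0 / ambiguous          # E(delta_h0 | G = (1,1))
    second_moment = (
        4 * p0 ** 2                    # G = (0,0): dose 2, P = p0^2
        + 2 * p0 * p1                  # G = (0,1): dose 1
        + 2 * p2 * p0                  # G = (1,0): dose 1
        + e11 ** 2 * 2 * ambiguous     # G = (1,1): uncertain dose
    )
    return second_moment - (2 * p0) ** 2


def var_closed_form_h0(p0, p1, p2, p3):
    """The rearranged expression (3)."""
    return 2 * p0 * (1 - p0 + p3 * (p3 * p0 / (p2 * p1 + p3 * p0) - 1))


def r2_h0(p0, p1, p2, p3):
    """Formula (1) for haplotype h0 = (0,0)."""
    return var_expected_dose_h0(p0, p1, p2, p3) / (2 * p0 * (1 - p0))
```

With frequencies (0.4, 0.3, 0.2, 0.1) both variance functions give 0.432, so that $R_h^2$ for $h_0$ is 0.432/0.48 = 0.9; setting $p_1$ = 0 gives an $R_h^2$ of 1.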
From table 1, we see that for two SNPs the only genotype for which the haplotype dosages are uncertain given G is G = (1,1), and that the uncertainty disappears when any of $p_1$, $p_2$ or $p_3$ equals 0, which agrees with our calculation that $R_h^2$ = 1 in these cases. For two independent SNPs of equal frequency, p, we have $p_0 = p^2$, $p_1 = p(1 - p)$, $p_2 = (1 - p)p$, and $p_3 = (1 - p)^2$, so that formula (1) simplifies to $(3p + 1)/(2p + 2)$; with p = 1/2 this equals 5/6. This relatively high value of certainty in the haplotypes reflects the relative infrequency of the uncertain genotype G = (1,1), with P(G) = 1/4. As the number of independent SNPs increases, the probability of having more than one heterozygous SNP increases markedly, and the values of $R_h^2$ correspondingly decline (see fig. 1).

The calculation of $R_h^2$ can readily be extended to the problem of interest, the prediction of the haplotype dosage variable, $\delta_h(H_i)$, when using only a subset of the SNPs. For any subset of SNPs we can formally calculate $R_h^2$ using formula (1). For the reduced set of SNP data there will be more haplotype pairs, $H_i$, which are compatible with the genotype $G_{ir}$ based solely on the reduced set of SNP data than with $G_i$ based on all SNP data. This results in a lower $\mathrm{Var}[E(\delta_h(H_i) \mid G_{ir})]$ and hence a lower $R_h^2$. This is illustrated for two SNPs in table 2, where we assume that (in the main case-control study) only the first SNP is measured in $G_{ir}$.
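The extension to a subset of SNPs can be sketched by brute-force enumeration of ordered haplotype pairs. This is a minimal illustration under Hardy-Weinberg equilibrium; the function name `r2_h`, its arguments and the dictionary representation of the haplotype frequencies are assumptions for the example rather than part of the published software.

```python
from collections import defaultdict


def r2_h(freqs, target, snps=None):
    """Squared correlation between true and predicted dosage of `target`.

    freqs:  dict mapping haplotype tuples (0/1 per SNP) to frequencies.
    target: haplotype h whose dosage is to be predicted.
    snps:   indices of the genotyped SNPs (default: all SNPs).
    """
    n = len(target)
    snps = tuple(range(n)) if snps is None else tuple(snps)
    prob = defaultdict(float)   # P(G) for each (reduced) genotype
    dose = defaultdict(float)   # sum over H compatible with G of dose * P(H)
    for h1, p1 in freqs.items():
        for h2, p2 in freqs.items():
            g = tuple(h1[s] + h2[s] for s in snps)
            w = p1 * p2
            prob[g] += w
            dose[g] += w * ((h1 == target) + (h2 == target))
    p = freqs[target]
    # E(dose | G)^2 * P(G) equals dose[g]^2 / prob[g]
    second_moment = sum(dose[g] ** 2 / prob[g] for g in prob if prob[g] > 0)
    return (second_moment - (2 * p) ** 2) / (2 * p * (1 - p))


# Two independent SNPs, each with allele frequency 1/2.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
both_snps = r2_h(independent, (0, 0))         # 5/6
first_only = r2_h(independent, (0, 0), [0])   # 1/3
```

Dropping the second SNP in this example lowers $R_h^2$ for $h_0$ from 5/6 to 1/3, because many more haplotype pairs become compatible with each reduced genotype.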

Fig. 1. $R_h^2$ for predicting haplotype $h_0$ in the case of n independent SNPs each with frequency = 1/2.

Fig. 2. $R_h^2$ for predicting haplotypes composed of two SNPs each with allele frequency of 1/2 according to the standardized linkage disequilibrium coefficient (D′) when two SNPs are genotyped (a) or when only one SNP is genotyped (b).

Table 2. Details of calculation of $R_h^2$ for the two SNP case when only the first SNP is genotyped

| | G = (0, ·) | G = (1, ·) | G = (2, ·) |
|---|---|---|---|
| Haplotype pairs, H | (0,0),(0,0); (0,0),(0,1); (0,1),(0,1) | (1,0),(0,0); (1,0),(0,1); (1,1),(0,0); (1,1),(0,1) | (1,0),(1,0); (1,1),(1,0); (1,1),(1,1) |
| P(G) | $p_0^2 + 2p_0p_1 + p_1^2$ | $2(p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1)$ | $p_2^2 + 2p_3p_2 + p_3^2$ |
| $E(\delta_{h_0}(H) \mid G)$ | $\frac{2p_0^2 + 2p_0p_1}{p_0^2 + 2p_0p_1 + p_1^2}$ | $\frac{p_2p_0 + p_3p_0}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | 0 |
| $E(\delta_{h_1}(H) \mid G)$ | $\frac{2p_0p_1 + 2p_1^2}{p_0^2 + 2p_0p_1 + p_1^2}$ | $\frac{p_2p_1 + p_3p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | 0 |
| $E(\delta_{h_2}(H) \mid G)$ | 0 | $\frac{p_2p_0 + p_2p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | $\frac{2p_2^2 + 2p_3p_2}{p_2^2 + 2p_3p_2 + p_3^2}$ |
| $E(\delta_{h_3}(H) \mid G)$ | 0 | $\frac{p_3p_0 + p_3p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | $\frac{2p_3p_2 + 2p_3^2}{p_2^2 + 2p_3p_2 + p_3^2}$ |

For two SNPs, expressions for $\mathrm{Var}[E(\delta_h(H_i) \mid G_{ir})]$ and $R_h^2$ readily follow from table 2. In particular, for haplotype $h_0$ we have

$$R_{h_0}^2 = \frac{p_0(1 - p_0 - p_1)}{(1 - p_0)(p_0 + p_1)}$$

which equals 1 now only if $p_1$ = 0. Figure 2 compares the expression for $R_h^2$ using both SNPs to $R_h^2$ using only the first, according to the amount of linkage disequilibrium (measured by D′), in the special case when both SNPs have frequency equal to 1/2.

Computational Considerations

As the number of SNPs available for each candidate gene increases, the computations involved in the E-M algorithm increase.
In order to allow as many as 30 or more SNPs to be used to estimate haplotype frequencies in regions of restricted haplotype diversity (which sometimes do include this many SNPs [7], and which would present a nearly insurmountable computational burden on an unmodified E-M algorithm) it is useful to break up the calculations into pseudo-blocks; this is referred to as the partition-ligation method by J. Liu and colleagues [14, 15]. Within each pseudo-block (of perhaps 5 contiguous SNPs) the usual E-M algorithm is run, providing estimates of the frequency of the $2^5$ = 32 possible haplotypes. If (as is usually the case for densely placed SNPs) one or more of these haplotype frequencies are estimated to be very close to or equal to zero after a reasonable number of iterations of the E-M, these haplotypes are ignored when the blocks are subsequently merged. Thus if in each of the first two pseudo-blocks 10 haplotypes are estimated to have nonzero frequency, the 10 × 10 = 100 possible combined haplotypes (rather than 32 × 32 = 1,024) are all that are considered in the merging stage. This divide-and-conquer process, repeated over many nearby pseudo-blocks, greatly simplifies the calculations required to compute $E[\delta_h(H_i) \mid G_i]$ in the E-M steps, and makes the computation of $\mathrm{Var}[E(\delta_h(H_i) \mid G_i)]$ feasible for many combinations of potential htSNPs during an optimization phase.

Note that our use of the term pseudo-block here does not refer to distinct blocks of restricted haplotype diversity, in the sense described by Gabriel et al. [7]. Rather, the pseudo-blocks will typically all be contained within a single block of restricted diversity; the number of SNPs in each pseudo-block is chosen solely to maximize the speed of the merging algorithm. For example, using 5 SNPs in each pseudo-block will usually give the same result as using 10 SNPs in (half as many) pseudo-blocks, with the only difference being in the speed of the algorithm; the first approach will generally be faster. (Occasional minor differences in results, see [15] for a simple example, have to do with the potential multi-modal nature of the likelihood being maximized.)
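The saving at the merging stage can be illustrated with a short sketch that forms the candidate haplotypes for two adjacent pseudo-blocks. The function name and the `threshold` cut-off are hypothetical choices for the example; the text above specifies only that haplotypes with estimated frequency very close to or equal to zero are ignored.

```python
from itertools import product


def ligation_candidates(block_a, block_b, threshold=1e-6):
    """Candidate haplotypes for the merger of two pseudo-blocks.

    block_a, block_b: dicts mapping haplotype tuples to E-M frequency
    estimates within each pseudo-block.
    threshold: assumed cut-off below which a haplotype is ignored.
    """
    kept_a = [h for h, p in block_a.items() if p > threshold]
    kept_b = [h for h, p in block_b.items() if p > threshold]
    # Concatenate every surviving left haplotype with every surviving right one.
    return [a + b for a, b in product(kept_a, kept_b)]


# Two 5-SNP pseudo-blocks, each with 10 of the 32 haplotypes at nonzero frequency.
all_haps = list(product((0, 1), repeat=5))
block = {h: (0.1 if i < 10 else 0.0) for i, h in enumerate(all_haps)}
candidates = ligation_candidates(block, block)
```

Here `candidates` holds 100 ten-SNP haplotypes rather than the 1,024 that would arise from all 32 × 32 combinations; the E-M algorithm would then be rerun over this reduced candidate set.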
In general the success in estimating haplotype frequencies of any implementation of the E-M algorithm for haplotype reconstruction that uses large numbers of SNPs will depend upon the true state of nature being that of high linkage disequilibrium between all the SNPs considered.

In our selection of htSNPs (as for the CYP19 example below) we first identify blocks of restricted haplotype diversity (i.e. high linkage disequilibrium) by the method described in Gabriel et al. [7], and define the common haplotypes in this block as those with greater than 5% frequency as estimated by the E-M algorithm. For a block involving n SNPs in high LD, we define the best set of m htSNPs (m < n) as those m SNPs that maximize the minimum value of $R_h^2$ calculated for each common haplotype.

The calculation of $R_h^2$ for any given haplotype requires generating the full set of possible haplotype pairs, H, for a given set of non-zero haplotype frequencies and a summation of $\delta_h(H)$ over the values compatible with each of the possible resulting SNP genotypes. This then must be done for each common haplotype h of interest, and for each set of candidate htSNPs. Depending upon the number of SNPs and non-zero haplotype frequencies, a full enumeration can be fairly tedious in many instances.

In order to optimize the choice of m htSNPs we have implemented a modified stepwise inclusion method rather than an exhaustive search of all $n!/[(n - m)!\,m!]$ choices of m tag SNPs. In the modified stepwise procedure we enter, as the kth candidate htSNP (k ≤ n), the SNP giving the greatest increase in the max min $R_h^2$ from that obtained using the set of k − 1 candidate SNPs currently selected. Upon entry of this candidate we then look backwards to see if max min $R_h^2$ can be further increased by substitution of any of the previously entered k − 1 SNPs with any SNP not presently included as an htSNP.
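One reading of the modified stepwise procedure is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: the function names are invented, haplotype frequencies are passed as a dictionary, ties are broken by SNP order, and the backward look applies at most one substitution (the best one) after each entry.

```python
from collections import defaultdict


def min_r2(freqs, common, snps):
    """Minimum R_h^2 over `common` haplotypes when only `snps` are genotyped."""
    snps = tuple(snps)
    prob = defaultdict(float)
    dose = {h: defaultdict(float) for h in common}
    for h1, p1 in freqs.items():
        for h2, p2 in freqs.items():
            g = tuple(h1[s] + h2[s] for s in snps)
            w = p1 * p2
            prob[g] += w
            for h in common:
                dose[h][g] += w * ((h1 == h) + (h2 == h))
    worst = 1.0
    for h in common:
        p = freqs[h]
        m2 = sum(dose[h][g] ** 2 / prob[g] for g in prob if prob[g] > 0)
        worst = min(worst, (m2 - 4 * p * p) / (2 * p * (1 - p)))
    return worst


def stepwise_htsnps(freqs, common, n_snps, m):
    """Greedy entry of m htSNPs with a backward substitution look."""
    chosen = []
    score = 0.0
    for _ in range(m):
        # Forward step: enter the SNP giving the largest min R_h^2.
        remaining = [s for s in range(n_snps) if s not in chosen]
        best = max(remaining, key=lambda s: min_r2(freqs, common, chosen + [s]))
        chosen = chosen + [best]
        score = min_r2(freqs, common, chosen)
        # Backward look: swap an earlier SNP for an unused one if that helps.
        swaps = [chosen[:i] + [s] + chosen[i + 1:]
                 for i in range(len(chosen) - 1)
                 for s in range(n_snps) if s not in chosen]
        if swaps:
            best_swap = max(swaps, key=lambda t: min_r2(freqs, common, t))
            swap_score = min_r2(freqs, common, best_swap)
            if swap_score > score + 1e-12:
                chosen, score = best_swap, swap_score
    return sorted(chosen), score


# Hypothetical block: SNPs 0 and 1 are redundant, SNP 2 tags the third haplotype.
freqs = {(0, 0, 0): 0.5, (1, 1, 0): 0.3, (0, 0, 1): 0.2}
chosen, score = stepwise_htsnps(freqs, list(freqs), n_snps=3, m=2)
```

In this toy block any pair made up of SNP 2 and one of the redundant SNPs distinguishes all three haplotypes, so the procedure returns such a pair with a minimum $R_h^2$ of 1.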
To see that this approach is far less computationally intensive than a full search, consider choosing the best 5 htSNPs from among 20 possibilities. Of a total of 15,504 possible choices of 5 from 20 potential htSNPs, our modified stepwise inclusion method considers just 251 (20 to find the single best htSNP, with further evaluations at each of the four subsequent steps). We recognize that the stepwise algorithm is not mathematically guaranteed to find the single best set of m tag SNPs, but our experience with the method has been very favorable. For the data described below (section 5) we found that the exhaustive search took approximately 12 min on a 2.0-GHz laptop computer, compared with approximately 10 s using the stepwise algorithm, while producing the same result.

Effect of Haplotype Uncertainty on Case-Control Study Sample Size

There are standard formulae (Breslow and Day [16]) which may be used to provide sample sizes for risk estimation when dealing with known exposures or covariates. If (as motivated above) we are interested in estimating haplotype-specific relative risks, we have to deal with the additional uncertainty that comes about because the haplotype dosage, $\delta_h(H_i)$, is not completely known for all individuals. There are two sources of uncertainty: first the formal uncertainty, based upon the $R_h^2$ calculation, which treats the set of estimated haplotype frequencies, $P_h$, as known in the estimation of $\delta_h(H_i)$ conditional on $G_i$, and second the uncertainty in the estimates of the haplotype frequencies themselves.

Let us first consider only the formal uncertainty in the estimation of $\delta_h(H_i)$. We consider disease models in which the expected value of the disease outcome, $D_i$, is nearly linear in $\delta_h(H_i)$. This assumption of (near) linearity would include estimation of the log odds ratio or log relative risk in a logistic or Cox regression of disease on $\delta_h(H_i)$, so long as the true odds ratio or relative risk is not extraordinarily large. Such additivity would also approximately hold for a dominant penetrance model so long as the haplotype is relatively rare. If the model for disease is

$$E(D_i) = \alpha + \beta_h \delta_h(H_i) \qquad (4)$$

then a general approach for estimation [13, 17] of $\beta_h$ is to replace $\delta_h(H_i)$ with $E[\delta_h(H_i) \mid G_i]$ as the independent variable in the appropriate regression algorithm. (This procedure is sometimes known as the regression substitution method.) Now under the null hypothesis that $\beta_h$ = 0, neither the mean nor the variance of the disease outcome, Var(D), depends upon $\delta_h(H_i)$, so that we can approximate the sampling variance of the estimate, $\hat\beta$, of $\beta$ under the null as

$$\mathrm{Var}(\hat\beta) \approx \frac{\mathrm{Var}(D)}{N\,\mathrm{Var}[E(\delta_h(H_i) \mid G_i)]} = \frac{\mathrm{Var}(D)}{N\,2p_h(1 - p_h)R_h^2}, \qquad (5)$$

where N is the number of subjects in the main study of disease and haplotype-specific risk. This expression is a modification of a standard formula for $\mathrm{Var}(\hat\beta)$, quite generally applicable in linear regression, in which the residual variance of the outcome is divided by N times the variance of the independent variable in the regression (here the residual variance is equal to the total variance because $\beta$ = 0). Formula (5) holds asymptotically for both binary and continuous outcomes when $\beta$ is zero.
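Formula (5) can be evaluated directly. The following minimal sketch, with illustrative function and parameter names that are not taken from the paper, computes the approximate null variance and the number of subjects needed to match the variance that would be obtained with known haplotype dosage.

```python
def null_variance_beta(var_d, n, p_h, r2_h):
    """Approximate Var(beta_hat) under the null hypothesis, formula (5)."""
    return var_d / (n * 2 * p_h * (1 - p_h) * r2_h)


def subjects_needed(n_known, r2_h):
    """Subjects needed with uncertain dosage to match the variance
    achieved by n_known subjects whose dosage is known exactly."""
    return n_known / r2_h


# Example: binary outcome with Var(D) = 0.25, haplotype frequency 0.2.
v_known = null_variance_beta(0.25, 1000, 0.2, 1.0)
v_uncertain = null_variance_beta(0.25, subjects_needed(1000, 0.9), 0.2, 0.9)
```

In this example the 1,000 subjects required with known dosage become roughly 1,111 when $R_h^2$ = 0.90, and `v_known` and `v_uncertain` coincide.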
We see immediately then that $1/R_h^2$ is a sample size inflator reflecting the effect of uncertainty in the estimation of $\delta_h(H_i)$ on the estimation of the risk parameter $\beta$ under the null hypothesis that the true $\beta$ equals 0. That is, in order to achieve the same sampling variance which would theoretically be achieved with $\delta_h(H_i)$ known, we have to increase our sample size N by a factor of $1/R_h^2$ when the haplotype dosage is uncertain. Since expression (5) nearly holds for moderate values of $\beta$ as well, we see that nearly the same inflation factor applies for power calculations so long as the hypothesized alternative value of $\beta$ is not too large. (This is a standard result in the measurement error literature valid under local alternatives, see [18] for applications of this result in binary regression.) If it requires N individuals to detect a given non-zero $\beta$ with a given level of power when the haplotype dosage is certain, then expression (5) indicates that it will take approximately $N/R_h^2$ subjects to detect this same $\beta$ given the uncertainty measured by $R_h^2$. Figure 3 gives a modification of standard sample size computations for 1-1 frequency-matched case-control studies to account for haplotype uncertainty.

There is of course additional uncertainty in estimating $\delta_h(H_i)$ due to sampling errors in the estimation of $P_h$. These errors are shared across many subjects, and it may only be by simulation that the influence of errors in estimating $P_h$ on the power to detect a given non-zero $\beta$ can be addressed. One relatively simple approach to calculate an $R_h^2$ which takes account of variability in the E-M estimates is to repeatedly simulate true values of $\delta_h(H_i)$ based on the estimated $P_h$, and then randomly combine these to produce sets of $G_i$.
Running the E-M algorithm on each of these sets of data allows a brute force computation of the squared correlation between the simulated true $\delta_h(H_i)$ and the estimates $E[\delta_h(H_i) \mid G_i]$ computed from the resulting E-M estimates of $P_h$ obtained for each set of data. We perform this experiment for test data below.

Example: CYP19 Data from the Multiethnic Cohort Study

The data are from 70 Japanese-American participants in the Multiethnic Cohort study (the University of Southern California Institutional Review Board has approved this study) for which 74 informative SNPs were genotyped. These 74 SNPs appear to fall into 4 regions of reduced haplotype diversity as judged by the methods of Gabriel et al. [7]. Table 3 shows the haplotypes estimated to have frequency > 0 by the E-M algorithm for one of these regions, which includes 19 of these SNPs. Treating these haplotype frequencies as fixed, we compute $R_h^2$ for the first 5 haplotypes given in table 3 (those with estimated frequency > 5%). The choice of htSNPs is optimized by maximizing the minimum $R_h^2$ for these 5 common haplotypes (see table 4).

Table 3. Haplotypes and haplotype frequencies estimated for 19 SNPs in CYP19 for Japanese-American members of the Multiethnic Cohort study (columns: Haplotype, $P_h$, Cumulative probability)

Fig. 3. Sample size requirements for estimating a logistic regression model, log additive in $\delta_h(H)$, after adjustment for uncertainty in prediction of haplotype count with $R_h^2$ = 0.90 (a) and $R_h^2$ = 0.70 (b). A 1-1 matching of controls to cases is assumed.

With the best set of 4 SNPs (1, 16, 17, 18) chosen, the minimum value of $R_h^2$ is 0.94, indicating a loss of efficiency (relative to knowing $\delta_h$ completely) for estimating a (linear) relative risk model of no more than approximately 6%.

Table 4. Best choices, by the $R_h^2$ criterion, of htSNPs for a region of reduced haplotype diversity in the CYP19 gene among Japanese-American members of the Multiethnic Cohort study (columns: number of htSNPs, best set, haplotype, $R_h^2$)

Simulation

Based on these 19 SNPs for CYP19 we performed the simulation experiment described above 1,000 times using SNPs 1, 16, 17, 18 as the set of htSNPs used in the prediction. Correlating the true simulated values of $\delta_h(H_i)$ with the estimates $E[\delta_h(H_i) \mid G_i]$ leads to the simulated values of $R_h^2$ for the 5 most common haplotypes in table 3 over the 1,000 simulations shown in table 5. The simulation results are highly suggestive that $R_h^2$ is well estimated using our limited number (70) of subjects in the preliminary set of controls for each ethnic group when, as in our simulation, the true state of nature is reflective of limited haplotype diversity.

Discussion

We have simplified here the issues involved in the selection of htSNPs for studies such as the Multiethnic Cohort. For example there may be SNPs that are of special interest. These may include missense SNPs within the coding regions, SNPs in the untranslated regions within exons or at intron-exon boundaries, and SNPs which are in conserved human-mouse homologous regions. These SNPs are forced in as htSNPs, and additional SNPs are chosen to ensure that the common haplotypes seen in the 70 controls remain well predicted by the $R_h^2$ criterion.

A number of recent papers [8, 19, 20] have discussed the role of htSNPs in the search for genetic variants that are related to common diseases.
The present paper is the first to suggest and utilize, for haplotype tagging, a formal measure ($R_h^2$) of the uncertainty in the prediction of common haplotypes from unphased SNP genotypes, and to relate this measure to sample size requirements for the design of case-control studies.

Once htSNPs are selected and genotyped in the case-control study, our approach towards estimation of haplotype-specific relative risks is to first use the regression substitution method described by Zaykin et al. [13] and Schaid et al. [12]. Both these papers specify that the E-M algorithm should be used to estimate haplotype frequencies in the cases and controls in one combined data set. Applying this to the htSNP problem, the haplotype frequencies will be re-estimated using the combined cases and controls by the E-M algorithm, using only the htSNPs to redefine haplotypes (the relationship between the common haplotypes defined by the htSNPs alone and those seen in the full set of SNPs considered originally should readily be identifiable by eye so long as $R_h^2$ is high). For each common haplotype, the expectations $E(\delta_h(H) \mid G)$ are then computed for all subjects and used in ordinary logistic regression software to test the null hypothesis that $\beta_h$ = 0. If this hypothesis is rejected using the score test [12], then, in order to remove biases in the estimate of $\beta_h$ which occur because of the enrichment of high risk haplotypes in the cases (implying both biased estimates of $p_h$ and a failure of Hardy-Weinberg equilibrium in the combined sample), we re-estimate haplotype frequencies using only the data from the controls. We then re-compute $E(\delta_h(H) \mid G)$ based on the control haplotype frequencies and re-estimate $\beta_h$ by logistic regression, but rely on the p value from the (more powerful) combined-data score test to judge its statistical significance. In addition we are currently developing an extension to the Excoffier-Slatkin E-M algorithm to give approximations to the full likelihood of the case-control data, for jointly estimating $\beta_h$ and the haplotype frequencies in an efficient one-stage procedure. This latter method is especially important for providing appropriate upper and lower confidence limits for $\beta_h$.

Table 5. Simulation study $R_h^2$ results using htSNPs 1, 16, 17, and 18, 1,000 replications (see text for further description of the simulation experiment). Columns: haplotypes 1 to 5. Rows: formal value of $R_h^2$; simulated values (mean, SD, median, highest, lowest).

Other statistical criteria for choosing htSNPs are possible. For example we may define $\delta_s(H)$ as the allele dosage of SNP s (so that this equals 0, 1 or 2, depending upon the number of copies of the variant allele at position s carried by the pair of haplotypes, H). Then, under HWE, for any potential set of htSNPs we may compute min $R_s^2$ over all the SNPs in the region of interest, simply by substitution of $\delta_s$ for $\delta_h$ in our formulae above. This measure of the performance of the htSNPs might be more appropriate than $R_h^2$ if it was considered very likely that one (or more) of the SNPs measured in the preliminary sample was in fact related to disease in a causal fashion but that it was unknown a priori which SNP was the most likely candidate. In this case statistical control of all SNPs (rather than of all common haplotypes) would be the goal of the htSNP selection.
At this point, our approach towards choosing htSNPs emphasizes maximizing the predictability of common haplotypes, although we have also implemented the maximization of min $R_s^2$, as well as of min $R_h^2$, in the software that we have developed and are making available (see below).

Acknowledgements

This work has been supported by grants CA63464, Genetic Susceptibility to Cancer in Multiethnic Cohorts, and GM58897, Computational Methods in Genetic Epidemiology, from the National Cancer Institute, National Institutes of Health. Part of this work was completed while Daniel Stram was on sabbatical visiting the Center for Genome Research of the Whitehead Institute, Massachusetts Institute of Technology, Cambridge, Mass. Collaborators on the gene association studies now in progress using cases and controls from the Multiethnic Cohort study include Matthew Freedman from the Center for Genome Research, and Abraham M. Nomura and Loic Le Marchand at the Hawaii Cancer Research Center, University of Hawaii, Honolulu, Hawaii.

Software

Windows/DOS-based software and documentation for the selection of htSNPs based on the procedures described here may be downloaded at

References

1 Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, Cohen Z, Delmonte T, et al: Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 2001;29(2).
2 Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996;273(5281).
3 Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 1999;22(2).
4 Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, et al: Linkage disequilibrium in the human genome. Nature 2001;411(6834).
5 Daly MJ, Rioux J, Schaffner S, Hudson T, Lander E: High-resolution haplotype structure in the human genome. Nat Genet 2001;29.
6 Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001;294(5547).
7 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, et al: The structure of haplotype blocks in the human genome. Science 2002;296(5576).
8 Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G, Ueda H, et al: Haplotype tagging for the identification of common disease genes. Nat Genet 2001;29(2).
9 Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 2002;99(11).
10 Ke X, Cardon LR: Efficient selective screening of haplotype tag SNPs. Bioinformatics 2003;19(2).
11 Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995;12(5).
12 Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 2002;70(2).
13 Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 2002;53(2).
14 Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002;70(1).
15 Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 2002;71(5).
16 Breslow N, Day W (eds): Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. IARC Scientific Publications, ed W Davis, Vol , International Agency for Cancer Research: Lyon.
17 Rosner B, Spiegelman D, Willett W: Correction of logistic relative risk estimates and confidence intervals for random within-person measurement error. Am J Epidemiol 1992;136.
18 Tosteson T, Ware J: Designing a logistic regression study using surrogate measures for exposure and outcome. Biometrika 1990;77.
19 Judson R, Salisbury B, Schneider J, Windemuth A, Stephens J: How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 2002;3(3).
20 Rohde K, Fuerst R: Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum Mutat 2001;17(4).
