
Original Paper

Hum Hered 2003;55:27-36
DOI: /
Received: October 23, 2002
Accepted after revision: March 27, 2003

Choosing Haplotype-Tagging SNPs Based on Unphased Genotype Data Using a Preliminary Sample of Unrelated Subjects with an Example from the Multiethnic Cohort Study

Daniel O. Stram (a), Christopher A. Haiman (a), Joel N. Hirschhorn (b,c,e), David Altshuler (b,c,d,f), Laurence N. Kolonel (g), Brian E. Henderson (a), Malcolm C. Pike (a)

(a) Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, Calif.; (b) Center for Genome Research, Whitehead Institute, Massachusetts Institute of Technology, Cambridge, Mass.; Departments of (c) Genetics and (d) Medicine, Harvard Medical School, (e) Divisions of Genetics and Endocrinology, Children's Hospital, (f) Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Mass.; (g) Hawaii Cancer Research Center, University of Hawaii, Honolulu, Hawaii, USA

Key Words
Haplotypes · Case-control studies · Linkage disequilibrium · Candidate gene analysis · htSNP

Abstract
We describe an approach for picking haplotype-tagging single nucleotide polymorphisms (htSNPs) that is presently being taken in two large nested case-control studies within a multiethnic cohort (MEC), which are engaged in a search for associations between risk of prostate and breast cancer and common genetic variations in candidate genes. Based on a preliminary sample of 70 control subjects chosen at random from each of the 5 ethnic groups in the MEC, we estimate haplotype frequencies using a variant of the Excoffier-Slatkin E-M algorithm after genotyping a high density of SNPs selected every 3-5 kb in and surrounding a candidate gene.
In order to evaluate the performance of a candidate set of htSNPs (which will be genotyped in the much larger case-control sample) we treat the haplotype frequencies estimated above as known, and carry out a formal calculation of the uncertainty of the number of copies of common haplotypes carried by an individual, summarizing this calculation as a coefficient of determination, $R_h^2$. A candidate set of htSNPs of a given size is chosen so as to maximize the minimum value of $R_h^2$ over the common haplotypes, $h$.

Copyright 2003 S. Karger AG, Basel

Correspondence: Daniel O. Stram, PhD, Department of Preventive Medicine, University of Southern California, 1540 Alcazar Street, Suite 220, Los Angeles, CA (USA), stram@usc.edu

Introduction

The underlying premise of the common-disease common-variant hypothesis is that common variants at multiple loci contribute to susceptibility to common disease. These causal variants are most likely to be old and to predate the divergence of human populations, and thus are positioned on ancestral chromosomal segments (haplotypes) that are shared today across ethnically diverse populations. By exploiting the underlying pattern of linkage disequilibrium (LD) within a gene, physical regions that harbor disease susceptibility alleles may be isolated [1] using a haplotype-based association approach. This approach is usually more powerful than traditional linkage studies for identifying alleles of moderate risk [2].

The pattern of LD varies across the human genome [3, 4], and new studies suggest that discrete regions of high LD, haplotype blocks, exist that are characterized by restricted haplotype diversity [5-7]. In regions of high LD within these blocks, a reduced set of haplotype tag SNPs (htSNPs) may be selected to efficiently identify the common haplotypes [8].

There is little published material available giving explicit suggestions for choosing htSNPs; two exceptions are Zhang et al., 2002 [9] and Ke and Cardon, 2003 [10]. These and other methods that have been implemented for public use (cf. snp.cgb.ki.se/tagntell/, haplotype/) generally treat the choice of appropriate tag SNPs as a pattern recognition problem, rather than a statistical problem, in which htSNP haplotypes, rather than genotypes, will be read directly. (Some material, supplementary to [8], that does not appear to make this assumption, is available on David Clayton's web site.) These algorithms often implicitly suggest the use of the smallest set of htSNPs distinguishing the common haplotypes. This smallest number is no greater than the number of common haplotypes minus 1. As discussed below, this smallest set of htSNPs is not always appropriate even if the common haplotypes represent close to 100% of the haplotype diversity in a block, and further problems arise when multiple rarer haplotypes exist that are not specified by the htSNPs. Consequently, with this approach, dissimilar haplotypes will be combined, leading to the potential for misclassification of haplotypes and the underestimation of haplotype-specific effects in association studies.

Here we describe an approach for choosing htSNPs from unphased genotype data that optimizes the predictability of the common haplotypes as defined by a statistic that is analogous to the usual coefficient of determination, $R^2$, in a multiple linear regression.
We provide an example of this approach for the candidate breast cancer susceptibility gene CYP19. We also describe how the level of haplotype predictability affects the power of case-control studies.

Analysis

Maximum likelihood estimation of population haplotype frequencies from unphased genotype data from randomly sampled (unrelated) subjects is performed, under the assumption of Hardy-Weinberg equilibrium, by the Excoffier-Slatkin implementation of the expectation-maximization (E-M) algorithm [11]. Recent work [12, 13] on tests for association between haplotypes and disease risk has suggested that data should be analyzed by pooling cases and controls and performing a two-step estimation procedure. In the first step, haplotype frequency estimates are obtained by use of the E-M algorithm applied to the combined data for cases and controls. In the second step, an estimate of the haplotypes carried by each of the cases and controls, coming directly from the E-M algorithm, is used as the independent variable in a logistic analysis of case-control status, to form score tests of the null hypothesis of no additional (excess) risk associated with carrying specific haplotypes. The justification for combining cases and controls in the first step of this procedure is that under the null hypothesis there is no difference in haplotype frequencies between cases and controls, so that using the pooled data yields a better estimate of the haplotype frequencies than would be obtained by, for example, using only the data from the controls.

The expectation step of the Excoffier-Slatkin E-M algorithm involves the calculation, for each possible haplotype $h$, of an estimate of the haplotype dosage, $\delta_h(H)$, which is the count of the number of copies of $h$ contained in the true (but generally unknown) pair of haplotypes $H$ carried by that individual (i.e. $\delta_h(H) = 0$, 1 or 2).
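The E-M iteration described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the variant used in the study: it enumerates all $2^n$ haplotypes (so it only suits a handful of SNPs), codes genotypes as minor-allele counts per SNP, assumes no missing data, and the function names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """Ordered haplotype pairs (h1, h2) consistent with an unphased genotype.

    genotype: tuple of minor-allele counts (0, 1 or 2), one entry per SNP.
    """
    per_snp = []
    for g in genotype:
        if g == 0:
            per_snp.append([(0, 0)])
        elif g == 2:
            per_snp.append([(1, 1)])
        else:  # heterozygote: either phase is possible
            per_snp.append([(0, 1), (1, 0)])
    pairs = []
    for combo in product(*per_snp):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=200, tol=1e-10):
    """Plain E-M under Hardy-Weinberg equilibrium (illustrative sketch)."""
    n_snps = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_snps))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        counts = defaultdict(float)
        for g in genotypes:
            pairs = compatible_pairs(g)
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            # E-step: expected haplotype dosage given the genotype
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: expected counts divided by the number of chromosomes
        new_freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
        change = max(abs(new_freq[h] - freq[h]) for h in haps)
        freq = new_freq
        if change < tol:
            break
    return freq
```

For example, `em_haplotype_frequencies([(0, 0), (0, 0), (2, 2), (0, 1)])` involves no double heterozygotes, so the estimates are simply the observed haplotype proportions.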
In the E-M calculations the estimate of $\delta_h(H)$ is computed conditionally on the genotype data $G$ for each subject, treating the set, $P_h$, of current estimates of the haplotype frequencies as if they were known. The haplotype dose estimate (based on the assumption of Hardy-Weinberg equilibrium) for subject $i$ with genotype $G_i$ is equal to

$$E[\delta_h(H_i) \mid G_i] = \frac{\sum_{H \sim G_i} \delta_h(H)\, p_{h_1} p_{h_2}}{\sum_{H \sim G_i} p_{h_1} p_{h_2}}$$

where $\sum_{H \sim G_i}$ indicates a summation over the (ordered) haplotype pairs, $H = (h_1, h_2)$, with frequencies $p_{h_1}$ and $p_{h_2}$, respectively, that are compatible with the observed genotype data.

For any given set of true haplotype frequencies, $P_h$, we can make a formal calculation (again assuming Hardy-Weinberg equilibrium) of the squared correlation, $R_h^2$, between the estimate, $E[\delta_h(H_i) \mid G_i]$, and the true value, $\delta_h(H_i)$, of the number of copies of $h$ carried by a randomly sampled subject. Note that the assumption of Hardy-Weinberg equilibrium for the haplotypes is equivalent to assuming that the marginal distribution of $\delta_h(H_i)$ (found by summing over the distribution of $H_i$ with weights equal to $p_{h_1} \times p_{h_2}$) is equal to that of a binomial random variable with parameters $n = 2$ and $p = p_h$, so that $\delta_h$ has mean and variance equal to $2p_h$ and $2p_h(1 - p_h)$, respectively, where $p_h$ is the frequency of haplotype $h$. The squared correlation $R_h^2$ between true and predicted haplotype dosage, i.e. between $\delta_h(H_i)$ and its estimate $E[\delta_h(H_i) \mid G_i]$, can be expressed as the ratio of the variance of $\delta_h$ that is explained by the genotype data to the total variance of $\delta_h(H_i)$, i.e.

$$R_h^2 = \frac{\mathrm{Var}\{E[\delta_h(H_i) \mid G_i]\}}{2p_h(1 - p_h)}. \qquad (1)$$

Here the variance of the expectation is computed by averaging $E[\delta_h(H) \mid G]^2$ over all possible genotypes $G$, weighting by the probability of each genotype, and subtracting the squared mean. For example, consider the two SNP case, with possible haplotypes coded as $h_0 = (0,0)$, $h_1 = (0,1)$, $h_2 = (1,0)$, and $h_3 = (1,1)$, with 0 and 1 indicating the major and minor alleles respectively, and with frequencies $p_0$, $p_1$, $p_2$ and $p_3$. Table 1 details the calculation of $\mathrm{Var}\{E[\delta_h(H_i) \mid G_i]\}$ in this simple case.

Table 1. Details of the calculation of $\mathrm{Var}\{E[\delta_h(H_i) \mid G_i]\}$ and $R_h^2$ for two SNPs

| Genotype, G | Haplotype pair(s), H | P(G) | $E[\delta_{h_0}(H) \mid G]$ | $E[\delta_{h_1}(H) \mid G]$ | $E[\delta_{h_2}(H) \mid G]$ | $E[\delta_{h_3}(H) \mid G]$ |
| (0,0) | (0,0),(0,0) | $p_0^2$ | 2 | 0 | 0 | 0 |
| (0,1) | (0,0),(0,1) | $2p_0p_1$ | 1 | 1 | 0 | 0 |
| (0,2) | (0,1),(0,1) | $p_1^2$ | 0 | 2 | 0 | 0 |
| (1,0) | (1,0),(0,0) | $2p_2p_0$ | 1 | 0 | 1 | 0 |
| (1,1) | (1,0),(0,1) or (1,1),(0,0) | $2(p_2p_1 + p_3p_0)$ | $\frac{p_3p_0}{p_2p_1 + p_3p_0}$ | $\frac{p_2p_1}{p_2p_1 + p_3p_0}$ | $\frac{p_2p_1}{p_2p_1 + p_3p_0}$ | $\frac{p_3p_0}{p_2p_1 + p_3p_0}$ |
| (1,2) | (1,1),(0,1) | $2p_3p_1$ | 0 | 1 | 0 | 1 |
| (2,0) | (1,0),(1,0) | $p_2^2$ | 0 | 0 | 2 | 0 |
| (2,1) | (1,1),(1,0) | $2p_3p_2$ | 0 | 0 | 1 | 1 |
| (2,2) | (1,1),(1,1) | $p_3^2$ | 0 | 0 | 0 | 2 |

We compute $\mathrm{Var}\{E[\delta_h(H) \mid G]\}$ over the distribution of $G$ as

$$\sum_G E[\delta_h \mid G]^2\, P(G) - (2p_h)^2 \qquad (2)$$

which, for haplotype $h_0 = (0,0)$, is computed as

$$\mathrm{Var}\{E[\delta_{h_0}(H) \mid G]\} = 4p_0^2 + 2p_1p_0 + 2p_2p_0 + \left[\frac{p_3p_0}{p_2p_1 + p_3p_0}\right]^2 \times 2(p_2p_1 + p_3p_0) - 4p_0^2.$$

Remembering that $p_0 + p_1 + p_2 + p_3 = 1$, this can be usefully rearranged as

$$2p_0\left\{1 - p_0 + p_3\left(\frac{p_3p_0}{p_2p_1 + p_3p_0} - 1\right)\right\}. \qquad (3)$$

Note that this expression equals the binomial variance, $2p_0(1 - p_0)$, when any of $p_1$, $p_2$, or $p_3$ equals 0, yielding an $R_h^2$ of 1 for $h_0$.
From table 1, we see that for two SNPs the only genotype for which the haplotype dosages are uncertain given $G$ is $G = (1,1)$, and that the uncertainty disappears when any of $p_1$, $p_2$ or $p_3$ equals 0, which agrees with our calculation that $R_h^2 = 1$ in these cases. For two independent SNPs of equal frequency, $p$, we have $p_0 = p^2$, $p_1 = p(1 - p)$, $p_2 = (1 - p)p$, and $p_3 = (1 - p)^2$, so that formula (1) simplifies to $\frac{3p + 1}{2p + 2}$; with $p = 1/2$ this equals 5/6. This relatively high value of certainty in the haplotypes reflects the relative infrequency of the uncertain genotype $G = (1,1)$, with $P(G) = 1/4$. As the number of independent SNPs increases, the probability of having more than one heterozygous SNP increases markedly, and the values of $R_h^2$ correspondingly decline (see fig. 1).

The calculation of $R_h^2$ can readily be extended to the problem of interest, the prediction of the haplotype dosage variable, $\delta_h(H_i)$, when using only a subset of the SNPs. For any subset of SNPs we can formally calculate $R_h^2$ using formula (1). There will be more haplotype pairs, $H_i$, compatible with the genotype $G_{ir}$ based solely on the reduced set of SNP data than with $G_i$ based on all SNP data. This results in a lower $\mathrm{Var}\{E[\delta_h(H_i) \mid G_{ir}]\}$ and hence a lower $R_h^2$. This is illustrated for two SNPs in table 2, where we assume that (in the main case-control study) only the first SNP is measured in $G_{ir}$.
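The calculation in formula (1), for all SNPs or for a subset, can be checked numerically by brute-force enumeration of ordered haplotype pairs. The sketch below is illustrative only; the function name `r2_h` and the dictionary representation of the haplotype frequencies are assumptions made for the example, not part of the authors' software.

```python
from itertools import product


def r2_h(freqs, target, snps=None):
    """R_h^2 from formula (1) for haplotype `target` under HWE.

    freqs: dict mapping haplotype tuples (0/1 alleles) to frequencies.
    snps:  indices of the genotyped SNPs (default: all SNPs).
    """
    n = len(target)
    snps = range(n) if snps is None else snps
    num = {}  # genotype -> sum over compatible pairs of delta_h(H) * p_h1 * p_h2
    den = {}  # genotype -> P(G)
    for (h1, p1), (h2, p2) in product(freqs.items(), repeat=2):
        g = tuple(h1[s] + h2[s] for s in snps)
        dose = (h1 == target) + (h2 == target)
        w = p1 * p2
        num[g] = num.get(g, 0.0) + dose * w
        den[g] = den.get(g, 0.0) + w
    p = freqs[target]
    # E[delta | G]^2 * P(G) equals num^2 / den for each genotype
    second_moment = sum(num[g] ** 2 / den[g] for g in den if den[g] > 0)
    var_expect = second_moment - (2 * p) ** 2
    return var_expect / (2 * p * (1 - p))


# Two independent SNPs, each with allele frequency 1/2
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
both = r2_h(independent, (0, 0))             # both SNPs genotyped
first_only = r2_h(independent, (0, 0), [0])  # only the first SNP genotyped
```

With both SNPs genotyped this reproduces the value 5/6 given in the text; dropping the second SNP lowers the value, as the argument about compatible haplotype pairs predicts.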

Fig. 1. $R_h^2$ for predicting haplotype $h_0$ in the case of $n$ independent SNPs each with frequency = 1/2.

Fig. 2. $R_h^2$ for predicting haplotypes composed of two SNPs each with allele frequency of 1/2, according to the standardized linkage disequilibrium coefficient $D'$, when two SNPs are genotyped (a) or when only one SNP is genotyped (b).

Table 2. Details of calculation of $R_h^2$ for the two SNP case when only the first SNP is genotyped

| | G = (0,·) | G = (1,·) | G = (2,·) |
| H | (0,0),(0,0); (0,0),(0,1); (0,1),(0,1) | (1,0),(0,0); (1,0),(0,1); (1,1),(0,0); (1,1),(0,1) | (1,0),(1,0); (1,1),(1,0); (1,1),(1,1) |
| P(G) | $p_0^2 + 2p_0p_1 + p_1^2$ | $2(p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1)$ | $p_2^2 + 2p_3p_2 + p_3^2$ |
| $E[\delta_{h_0}(H) \mid G]$ | $\frac{2p_0^2 + 2p_0p_1}{p_0^2 + 2p_0p_1 + p_1^2}$ | $\frac{p_2p_0 + p_3p_0}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | 0 |
| $E[\delta_{h_1}(H) \mid G]$ | $\frac{2p_0p_1 + 2p_1^2}{p_0^2 + 2p_0p_1 + p_1^2}$ | $\frac{p_2p_1 + p_3p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | 0 |
| $E[\delta_{h_2}(H) \mid G]$ | 0 | $\frac{p_2p_0 + p_2p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | $\frac{2p_2^2 + 2p_3p_2}{p_2^2 + 2p_3p_2 + p_3^2}$ |
| $E[\delta_{h_3}(H) \mid G]$ | 0 | $\frac{p_3p_0 + p_3p_1}{p_2p_0 + p_2p_1 + p_3p_0 + p_3p_1}$ | $\frac{2p_3p_2 + 2p_3^2}{p_2^2 + 2p_3p_2 + p_3^2}$ |

For two SNPs, expressions for $\mathrm{Var}\{E[\delta_h(H_i) \mid G_{ir}]\}$ and $R_h^2$ readily follow from table 2. In particular, for haplotype $h_0$ we have

$$R_{h_0}^2 = \frac{p_0\,[1 - (p_0 + p_1)]}{(1 - p_0)(p_0 + p_1)}$$

which equals 1 now only if $p_1 = 0$. Figure 2 compares the expression for $R_h^2$ using both SNPs to $R_h^2$ using only the first, according to the amount of linkage disequilibrium (measured by $D'$), in the special case when both SNPs have frequency equal to 1/2.

Computational Considerations

As the number of SNPs available for each candidate gene increases, the computations involved in the E-M algorithm increase.
In order to allow as many as 30 or more SNPs to be used to estimate haplotype frequencies in regions of restricted haplotype diversity (which sometimes do include this many SNPs [7], and which would present a nearly insurmountable computational burden for an unmodified E-M algorithm), it is useful to break up the calculations into pseudo-blocks; this is referred to as the partition-ligation method by J. Liu and colleagues [14, 15]. Within each pseudo-block (of perhaps 5 contiguous SNPs) the usual E-M algorithm is run, providing estimates of the frequencies of the $2^5 = 32$ possible haplotypes. If (as is usually the case for densely placed SNPs) one or more of these haplotype frequencies are estimated to be very close to or equal to zero after a reasonable number of iterations of the E-M, these haplotypes are ignored when the blocks are subsequently merged. Thus if in each of the first two pseudo-blocks 10 haplotypes are estimated to have nonzero frequency, the $10 \times 10 = 100$ possible combined haplotypes (rather than $32 \times 32 = 1{,}024$) are all that are considered in the merging stage. This divide-and-conquer process, repeated over many nearby pseudo-blocks, greatly simplifies the calculations required to compute $E[\delta_h(H_i) \mid G_i]$ in the E-M steps, and makes the computation of $\mathrm{Var}\{E[\delta_h(H_i) \mid G_i]\}$ feasible for many combinations of potential htSNPs during an optimization phase.

Note that our use of the term pseudo-block here does not refer to distinct blocks of restricted haplotype diversity, in the sense described by Gabriel et al. [7]. Rather, the pseudo-blocks will typically all be contained within a single block of restricted diversity; the number of SNPs in each pseudo-block is chosen solely to maximize the speed of the merging algorithm. For example, using 5 SNPs in each pseudo-block will usually give the same result as using 10 SNPs in (half as many) pseudo-blocks, with the only difference being in the speed of the algorithm: the first approach will generally be faster. (Occasional minor differences in results, see [15] for a simple example, have to do with the potential multi-modal nature of the likelihood being maximized.)
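The saving from discarding near-zero haplotypes before merging can be illustrated with a short sketch of the candidate-generation step only. The function name `ligate` and the cut-off `min_freq` are hypothetical choices made for this illustration; the text does not specify a threshold, and the full partition-ligation method in [14, 15] would go on to re-run the E-M over the candidates returned here.

```python
from itertools import product


def ligate(block_a, block_b, min_freq=1e-6):
    """Candidate haplotypes spanning two adjacent pseudo-blocks.

    block_a, block_b: dicts mapping haplotype tuples to E-M frequency estimates.
    Only haplotypes with estimated frequency above `min_freq` (an assumed
    cut-off) are carried forward into the merged candidate list.
    """
    kept_a = [h for h, f in block_a.items() if f > min_freq]
    kept_b = [h for h, f in block_b.items() if f > min_freq]
    return [a + b for a, b in product(kept_a, kept_b)]


# A 5-SNP pseudo-block in which only 10 of the 32 haplotypes have nonzero frequency
all_haps = list(product((0, 1), repeat=5))
block = {h: (0.1 if i < 10 else 0.0) for i, h in enumerate(all_haps)}
candidates = ligate(block, block)
```

Here `candidates` holds the 10 × 10 combined haplotypes of length 10 rather than all 32 × 32 combinations.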
In general, the success in estimating haplotype frequencies of any implementation of the E-M algorithm for haplotype reconstruction that uses large numbers of SNPs will depend upon the true state of nature being that of high linkage disequilibrium between all the SNPs considered.

In our selection of htSNPs (as for the CYP19 example below) we first identify blocks of restricted haplotype diversity (i.e. high linkage disequilibrium) by the method described in Gabriel et al. [7], and define the common haplotypes in this block as those with greater than 5% frequency as estimated by the E-M algorithm. For a block involving $n$ SNPs in high LD, we define the best set of $m$ htSNPs ($m < n$) as those $m$ SNPs that maximize the minimum value of $R_h^2$ calculated for each common haplotype.

The calculation of $R_h^2$ for any given haplotype requires generating the full set of possible haplotype pairs, $H$, for a given set of non-zero haplotype frequencies, and a summation of $\delta_h(H)$ over the values compatible with each of the possible resulting SNP genotypes. This must then be done for each common haplotype $h$ of interest, and for each set of candidate htSNPs. Depending upon the number of SNPs and non-zero haplotype frequencies, a full enumeration can be fairly tedious in many instances.

In order to optimize the choice of $m$ htSNPs we have implemented a modified stepwise inclusion method rather than an exhaustive search of all $\frac{n!}{(n - m)!\,m!}$ choices of $m$ tag SNPs. In the modified stepwise procedure we enter, as the $k$th candidate htSNP ($k \le n$), the SNP giving the greatest increase in the max min $R_h^2$ from that obtained using the set of $k - 1$ candidate SNPs currently selected. Upon entry of this candidate we then look backwards to see if max min $R_h^2$ can be further increased by substitution of any of the previously entered $k - 1$ SNPs with any SNP not presently included as an htSNP.
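The stepwise procedure just described might be sketched as follows. This is an illustrative reading of the text rather than the authors' implementation: the backward look is repeated until no substitution improves the criterion (the text does not say whether one pass or several are made), the haplotype frequencies are passed as a dictionary, and the names `r2_h`, `min_r2` and `stepwise_htsnps` are hypothetical.

```python
from itertools import product


def r2_h(freqs, target, snps):
    """R_h^2 of formula (1) when only the SNPs in `snps` are genotyped."""
    num, den = {}, {}
    for (h1, p1), (h2, p2) in product(freqs.items(), repeat=2):
        g = tuple(h1[s] + h2[s] for s in snps)
        w = p1 * p2
        num[g] = num.get(g, 0.0) + ((h1 == target) + (h2 == target)) * w
        den[g] = den.get(g, 0.0) + w
    p = freqs[target]
    var = sum(num[g] ** 2 / den[g] for g in den if den[g] > 0) - (2 * p) ** 2
    return var / (2 * p * (1 - p))


def min_r2(freqs, common, snps):
    """Selection criterion: the minimum R_h^2 over the common haplotypes."""
    return min(r2_h(freqs, h, snps) for h in common)


def stepwise_htsnps(freqs, common, m):
    """Choose m htSNPs by forward entry followed by a backward swap check."""
    n = len(next(iter(freqs)))
    chosen = []
    while len(chosen) < m:
        # Forward step: enter the SNP giving the largest min R_h^2
        rest = [s for s in range(n) if s not in chosen]
        best = max(rest, key=lambda s: min_r2(freqs, common, chosen + [s]))
        chosen.append(best)
        # Backward look: try replacing each earlier SNP with an unused one
        improved = True
        while improved:
            improved = False
            current = min_r2(freqs, common, chosen)
            unused = [t for t in range(n) if t not in chosen]
            for i in range(len(chosen) - 1):
                for s in unused:
                    trial = chosen[:i] + [s] + chosen[i + 1:]
                    if min_r2(freqs, common, trial) > current + 1e-12:
                        chosen, improved = trial, True
                        break
                if improved:
                    break
    return sorted(chosen)


# Toy block: SNPs 0 and 1 carry identical information, SNP 2 tags the third haplotype
toy = {(0, 0, 0): 0.5, (1, 1, 0): 0.3, (0, 0, 1): 0.2}
tags = stepwise_htsnps(toy, list(toy), 2)
```

In the toy block the two selected SNPs distinguish all three haplotypes, so the criterion reaches 1.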
To see that this approach is far less computationally intensive than a full search, consider choosing the best 5 htSNPs from among 20 possibilities. Of a total of 15,504 possible choices of 5 from 20 potential htSNPs, our modified stepwise inclusion method considers just 251 (20 to find the single best htSNP, with further sets evaluated at each subsequent step to find the best two, three, four and, finally, five htSNPs). We recognize that the stepwise algorithm is not mathematically guaranteed to find the single best set of $m$ tag SNPs, but our experience with the method has been very favorable. For the data described below (section 5) we found that the exhaustive search took approximately 12 min on a 2.0-GHz laptop computer, compared with approximately 10 s using the stepwise algorithm, while producing the same result.

Effect of Haplotype Uncertainty on Case-Control Study Sample Size

There are standard formulae (Breslow and Day [16]) which may be used to provide sample sizes for risk estimation when dealing with known exposures or covariates. If (as motivated above) we are interested in estimating haplotype-specific relative risks, we have to deal with the additional uncertainty that comes about because the haplotype dosage, $\delta_h(H_i)$, is not completely known for all individuals. There are two sources of uncertainty: first, the formal uncertainty, based upon the $R_h^2$ calculation, which treats the set of estimated haplotype frequencies, $P_h$, as known in the estimation of $\delta_h(H_i)$ conditional on $G_i$; and second, the uncertainty in the estimates of the haplotype frequencies themselves.

Let us first consider only the formal uncertainty in the estimation of $\delta_h(H_i)$. We consider disease models in which the expected value of the disease outcome, $D_i$, is nearly linear in $\delta_h(H_i)$. This assumption of (near) linearity would include estimation of the log odds ratio or log relative risk in a logistic or Cox regression of disease on $\delta_h(H_i)$, so long as the true odds ratio or relative risk is not extraordinarily large. Such additivity would also approximately hold for a dominant penetrance model so long as the haplotype is relatively rare. If the model for disease is

$$E(D_i) = \alpha + \beta_h\, \delta_h(H_i) \qquad (4)$$

then a general approach for estimation [13, 17] of $\beta_h$ is to replace $\delta_h(H_i)$ with $E[\delta_h(H_i) \mid G_i]$ as the independent variable in the appropriate regression algorithm. (This procedure is sometimes known as the regression substitution method.) Now under the null hypothesis that $\beta_h = 0$, neither the mean nor the variance of the disease outcome, $\mathrm{Var}(D)$, depends on $\delta_h(H_i)$, so that we can approximate the sampling variance of the estimate, $\hat\beta$, of $\beta$ under the null as

$$\mathrm{Var}(\hat\beta) \approx \frac{\mathrm{Var}(D)}{N\, \mathrm{Var}\{E[\delta_h(H_i) \mid G_i]\}} = \frac{\mathrm{Var}(D)}{N\, 2p_h(1 - p_h)\, R_h^2}, \qquad (5)$$

where $N$ is the number of subjects in the main study of disease and haplotype-specific risk. This expression is a modification of a standard formula for $\mathrm{Var}(\hat\beta)$, quite generally applicable in linear regression, in which the residual variance of the outcome is divided by $N$ times the variance of the independent variable in the regression (here the residual variance is equal to the total variance because $\beta = 0$). Formula (5) holds asymptotically for both binary and continuous outcomes when $\beta$ is zero.
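Formula (5) is simple enough to evaluate directly. The sketch below, with hypothetical function names `var_beta_null` and `inflated_sample_size`, computes the approximate null variance for given $N$, $p_h$ and $R_h^2$, and the number of subjects that would give the same variance as $N$ subjects with fully known haplotype dosage. It applies only under the near-null approximation stated in the text.

```python
import math


def var_beta_null(var_d, n, p_h, r2_h=1.0):
    """Approximate Var(beta_hat) under the null hypothesis, formula (5).

    var_d: variance of the disease outcome D
    n:     number of subjects in the main study
    p_h:   frequency of haplotype h
    r2_h:  R_h^2 for the chosen htSNPs (1.0 means dosage is known exactly)
    """
    return var_d / (n * 2 * p_h * (1 - p_h) * r2_h)


def inflated_sample_size(n_known, r2_h):
    """Subjects needed to match the variance obtained with known dosage."""
    return math.ceil(n_known / r2_h)


# Example: 1,000 subjects would suffice if haplotype dosage were known exactly
n_needed_094 = inflated_sample_size(1000, 0.94)
n_needed_070 = inflated_sample_size(1000, 0.70)
```

With these inputs `n_needed_094` is 1064 and `n_needed_070` is 1429.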
We see immediately then that $1/R_h^2$ is a sample size inflator reflecting the effect of uncertainty in the estimation of $\delta_h(H_i)$ on the estimation of the risk parameter $\beta$ under the null hypothesis that the true $\beta$ equals 0. That is, in order to achieve the same sampling variance that would theoretically be achieved with $\delta_h(H_i)$ known, we have to increase our sample size $N$ by a factor of $1/R_h^2$ when the haplotype dosage is uncertain. Since expression (5) nearly holds for moderate values of $\beta$ as well, nearly the same inflation factor applies for power calculations so long as the hypothesized alternative value of $\beta$ is not too large. (This is a standard result in the measurement error literature, valid under local alternatives; see [18] for applications of this result in binary regression.) If it requires $N$ individuals to detect a given non-zero $\beta$ with a given level of power when the haplotype dosage is certain, then expression (5) indicates that it will take approximately $N/R_h^2$ subjects to detect this same $\beta$ given the uncertainty measured by $R_h^2$. Figure 3 gives a modification of standard sample size computations for 1:1 frequency-matched case-control studies to account for haplotype uncertainty.

There is of course additional uncertainty in estimating $\delta_h(H_i)$ due to sampling errors in the estimation of $P_h$. These errors are shared across many subjects, and it may only be by simulation that the influence of errors in estimating $P_h$ on the power to detect a given non-zero $\beta$ can be addressed. One relatively simple approach to calculating an $R_h^2$ that takes account of variability in the E-M estimates is to repeatedly simulate true values of $\delta_h(H_i)$ based on the estimated $P_h$, and then randomly combine these to produce sets of $G_i$.
Running the E-M algorithm on each of these sets of data allows a brute force computation of the squared correlation between the simulated true $\delta_h(H_i)$ and the estimates $E[\delta_h(H_i) \mid G_i]$ computed from the resulting E-M estimates of $P_h$ obtained for each set of data. We perform this experiment for test data below.

Example: CYP19 Data from the Multiethnic Cohort Study

The data are from 70 Japanese-American participants in the Multiethnic Cohort study (the University of Southern California Institutional Review Board has approved this study) for which 74 informative SNPs were genotyped. These 74 SNPs appear to fall into 4 regions of reduced haplotype diversity as judged by the methods of Gabriel et al. [7]. Table 3 shows the haplotypes estimated to have frequency > 0 by the E-M algorithm for one of these regions, which includes 19 of these SNPs. Treating these haplotype frequencies as fixed, we compute $R_h^2$ for the first 5 haplotypes given in table 3 (those with estimated frequency greater than 5%). The choice of htSNPs is optimized by maximizing the minimum $R_h^2$ for these 5 common haplotypes (see table 4).

Table 3. Haplotypes and haplotype frequencies estimated for 19 SNPs in CYP19 for Japanese-American members of the multiethnic cohort study. Columns: Haplotype; $P_h$; Cumulative probability.

Fig. 3. Sample size requirements for estimating a logistic regression model, log additive in $\delta_h(H)$, after adjustment for uncertainty in prediction of haplotype count with $R_h^2 = 0.90$ (a) and $R_h^2 = 0.70$ (b). A 1:1 matching of controls to cases is assumed.

With the best set of 4 SNPs (1, 16, 17, 18) chosen, the minimum value of $R_h^2$ is 0.94, indicating a loss of efficiency (relative to knowing $\delta_h$ completely) for estimating a (linear) relative risk model of no more than approximately 6%.

Table 4. Best choices, by the $R_h^2$ criterion, of htSNPs for a region of reduced haplotype diversity in the CYP19 gene among Japanese-American members of the multiethnic cohort study. Columns: number of htSNPs; best set; $R_h^2$ for each haplotype.

Simulation

Based on these 19 SNPs for CYP19, we performed the simulation experiment described above 1,000 times using SNPs 1, 16, 17 and 18 as the set of htSNPs used in the prediction. Correlating the true simulated values of $\delta_h(H_i)$ with the estimates $E[\delta_h(H_i) \mid G_i]$ leads to the simulated values of $R_h^2$, for the 5 most common haplotypes in table 3 over the 1,000 simulations, shown in table 5. The simulation results strongly suggest that $R_h^2$ is well estimated using our limited number (70) of subjects in the preliminary set of controls for each ethnic group when, as in our simulation, the true state of nature is reflective of limited haplotype diversity.

Discussion

We have simplified here the issues involved in the selection of htSNPs for studies such as the multiethnic cohort. For example, there may be SNPs that are of special interest. These may include missense SNPs within the coding regions, SNPs in the untranslated regions within exons or at intron-exon boundaries, and SNPs which are in conserved human-mouse homologous regions. These SNPs are forced in as htSNPs, and additional SNPs are chosen to ensure that the common haplotypes seen in the 70 controls remain well predicted by the $R_h^2$ criterion.

A number of recent papers [8, 19, 20] have discussed the role of htSNPs in the search for genetic variants that are related to common diseases.
The present paper is the first to suggest and utilize, for haplotype tagging, a formal measure ($R_h^2$) of the uncertainty in the prediction of common haplotypes from unphased SNP genotypes, and to relate this measure to sample size requirements for the design of case-control studies.

Table 5. Simulation study $R_h^2$ results using htSNPs 1, 16, 17, and 18 (see text for further description of the simulation experiment), 1,000 replications. Columns: haplotypes 1 to 5. Rows: formal value of $R_h^2$; simulated values (mean, SD, median, highest, lowest).

Once htSNPs are selected and genotyped in the case-control study, our approach towards estimation of haplotype-specific relative risks is to first use the regression substitution method described by Zaykin et al. [13] and Schaid et al. [12]. Both these papers specify that the E-M algorithm should be used to estimate haplotype frequencies in the cases and controls in one combined data set. Applying this to the htSNP problem, the haplotype frequencies will be re-estimated by the E-M algorithm using the combined cases and controls, using only the htSNPs to redefine haplotypes (the relationship between the common haplotypes defined by the htSNPs alone and those seen in the full set of SNPs considered originally should readily be identifiable by eye so long as $R_h^2$ is high). For each common haplotype, the expectations $E[\delta_h(H) \mid G]$ are then computed for all subjects and used in ordinary logistic regression software to test the null hypothesis that $\beta_h = 0$. If this hypothesis is rejected using the score test [12], then, in order to remove biases in the estimate of $\beta_h$ which occur because of the enrichment of high risk haplotypes in the cases (implying both biased estimates of $p_h$ and a failure of Hardy-Weinberg equilibrium in the combined sample), we re-estimate haplotype frequencies using only the data from the controls. We then re-compute $E[\delta_h(H) \mid G]$ based on the control haplotype frequencies and re-estimate $\beta_h$ by logistic regression, but rely on the p value from the (more powerful) combined-data score test to judge its statistical significance. In addition, we are currently developing an extension to the Excoffier-Slatkin E-M algorithm to give approximations to the full likelihood of the case-control data, for jointly estimating $\beta_h$ and the haplotype frequencies in an efficient one-stage procedure. This latter method is especially important for providing appropriate upper and lower confidence limits for $\beta_h$.

Other statistical criteria for choosing htSNPs are possible. For example, we may define $\delta_s(H)$ as the allele dosage of SNP $s$ (so that this equals 0, 1 or 2, depending upon the number of copies of the variant allele at position $s$ carried by the pair of haplotypes, $H$). Then, under HWE, for any potential set of htSNPs we may compute min $R_s^2$ over all the SNPs in the region of interest, simply by substitution of $\delta_s$ for $\delta_h$ in our formulae above. This measure of the performance of the htSNPs might be more appropriate than $R_h^2$ if it was considered very likely that one (or more) of the SNPs measured in the preliminary sample was in fact related to disease in a causal fashion, but that it was unknown a priori which SNP was the most likely candidate. In this case statistical control of all SNPs (rather than of all common haplotypes) would be the goal of the htSNP selection.
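The substitution of an allele dosage for the haplotype dosage can be illustrated by making the dosage a function argument. The sketch below is an assumption-laden illustration, not the released software: the names `r2_dosage` and `min_r2_s` are hypothetical, haplotype frequencies are given as a dictionary, and every SNP is assumed to be polymorphic (otherwise the total variance in the denominator would be zero).

```python
from itertools import product


def r2_dosage(freqs, dose, snps):
    """Squared correlation between a dosage and its expectation given the
    genotypes at `snps`, under HWE. `dose` maps one haplotype to 0 or 1."""
    num, den = {}, {}
    mean = second = 0.0
    for (h1, p1), (h2, p2) in product(freqs.items(), repeat=2):
        g = tuple(h1[s] + h2[s] for s in snps)
        w = p1 * p2
        d = dose(h1) + dose(h2)
        num[g] = num.get(g, 0.0) + d * w
        den[g] = den.get(g, 0.0) + w
        mean += d * w
        second += d * d * w
    explained = sum(num[g] ** 2 / den[g] for g in den if den[g] > 0) - mean ** 2
    return explained / (second - mean ** 2)


def min_r2_s(freqs, snps):
    """Minimum R_s^2 over all SNPs in the region (assumes each is polymorphic)."""
    n = len(next(iter(freqs)))
    return min(r2_dosage(freqs, lambda h, s=s: h[s], snps) for s in range(n))


# Toy block: SNP 1 is perfectly predicted by SNP 0, SNP 2 is not
toy = {(0, 0, 0): 0.5, (1, 1, 0): 0.3, (0, 0, 1): 0.2}
with_two_tags = min_r2_s(toy, [0, 2])
with_one_tag = min_r2_s(toy, [0])
```

Passing `lambda h: h == target` as `dose` recovers the haplotype-based $R_h^2$, so the same routine serves both criteria.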
At this point, our approach towards choosing htSNPs emphasizes maximizing the predictability of common haplotypes, although we have also implemented the maximization of min $R_s^2$ as well as $R_h^2$ in the software that we have developed and are making available (see below).

Acknowledgements

This work has been supported by grants CA63464, Genetic Susceptibility to Cancer in Multiethnic Cohorts, and GM58897, Computational Methods in Genetic Epidemiology, from the National Cancer Institute, National Institutes of Health. Part of this work was completed while Daniel Stram was on sabbatical visiting the Center for Genome Research of the Whitehead Institute, Massachusetts Institute of Technology, Cambridge, Mass. Collaborators on the gene association studies now in progress using cases and controls from the Multiethnic Cohort study include Matthew Freedman from the Center for Genome Research, and Abraham M. Nomura and Loic Le Marchand at the Hawaii Cancer Research Center, University of Hawaii, Honolulu, Hawaii.

Software

Windows/DOS-based software and documentation for the selection of htSNPs based on the procedures described here may be downloaded at

References

1 Rioux JD, Daly MJ, Silverberg MS, Lindblad K, Steinhart H, Cohen Z, Delmonte T, et al: Genetic variation in the 5q31 cytokine gene cluster confers susceptibility to Crohn disease. Nat Genet 2001;29(2).
2 Risch N, Merikangas K: The future of genetic studies of complex human diseases. Science 1996;273(5281).
3 Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 1999;22(2).
4 Reich DE, Cargill M, Bolk S, Ireland J, Sabeti PC, Richter DJ, Lavery T, et al: Linkage disequilibrium in the human genome. Nature 2001;411(6834).
5 Daly MJ, Rioux J, Schaffner S, Hudson T, Lander E: High-resolution haplotype structure in the human genome. Nat Genet 2001;29.
6 Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, et al: Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 2001;294(5547).
7 Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, et al: The structure of haplotype blocks in the human genome. Science 2002;296(5576).
8 Johnson GC, Esposito L, Barratt BJ, Smith AN, Heward J, Di Genova G, Ueda H, et al: Haplotype tagging for the identification of common disease genes. Nat Genet 2001;29(2).
9 Zhang K, Deng M, Chen T, Waterman MS, Sun F: A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 2002;99(11).
10 Ke X, Cardon LR: Efficient selective screening of haplotype tag SNPs. Bioinformatics 2003;19(2).
11 Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995;12(5).
12 Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet 2002;70(2).
13 Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered 2002;53(2).
14 Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am J Hum Genet 2002;70(1).
15 Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. Am J Hum Genet 2002;71(5).
16 Breslow N, Day W (eds): Statistical Methods in Cancer Research: The Analysis of Case-Control Studies. IARC Scientific Publications, ed W Davis, Vol , International Agency for Cancer Research: Lyon.
17 Rosner B, Spiegelman D, Willett W: Correction of logistic relative risk estimates and confidence intervals for random within-person measurement error. Am J Epidemiol 1992;136.
18 Tosteson T, Ware J: Designing a logistic regression study using surrogate measures for exposure and outcome. Biometrika 1990;77.
19 Judson R, Salisbury B, Schneider J, Windemuth A, Stephens J: How many SNPs does a genome-wide haplotype map require? Pharmacogenomics 2002;3(3).
20 Rohde K, Fuerst R: Haplotyping and estimation of haplotype frequencies for closely linked biallelic multilocus genetic phenotypes including nuclear family information. Hum Mutat 2001;17(4).


More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Dose-response modeling with bivariate binary data under model uncertainty

Dose-response modeling with bivariate binary data under model uncertainty Dose-response modeling with bivariate binary data under model uncertainty Bernhard Klingenberg 1 1 Department of Mathematics and Statistics, Williams College, Williamstown, MA, 01267 and Institute of Statistics,

More information

Statistical Analysis of Haplotypes, Untyped SNPs, and CNVs in Genome-Wide Association Studies

Statistical Analysis of Haplotypes, Untyped SNPs, and CNVs in Genome-Wide Association Studies Statistical Analysis of Haplotypes, Untyped SNPs, and CNVs in Genome-Wide Association Studies by Yijuan Hu A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in

More information

Calculation of IBD probabilities

Calculation of IBD probabilities Calculation of IBD probabilities David Evans and Stacey Cherny University of Oxford Wellcome Trust Centre for Human Genetics This Session IBD vs IBS Why is IBD important? Calculating IBD probabilities

More information

Statistical Methods and Software for Forensic Genetics. Lecture I.1: Basics

Statistical Methods and Software for Forensic Genetics. Lecture I.1: Basics Statistical Methods and Software for Forensic Genetics. Lecture I.1: Basics Thore Egeland (1),(2) (1) Norwegian University of Life Sciences, (2) Oslo University Hospital Workshop. Monterrey, Mexico, Nov

More information

What is the expectation maximization algorithm?

What is the expectation maximization algorithm? primer 2008 Nature Publishing Group http://www.nature.com/naturebiotechnology What is the expectation maximization algorithm? Chuong B Do & Serafim Batzoglou The expectation maximization algorithm arises

More information

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs

Matched-Pair Case-Control Studies when Risk Factors are Correlated within the Pairs International Journal of Epidemiology O International Epidemlologlcal Association 1996 Vol. 25. No. 2 Printed In Great Britain Matched-Pair Case-Control Studies when Risk Factors are Correlated within

More information

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate.

Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. OEB 242 Exam Practice Problems Answer Key Q1) Explain how background selection and genetic hitchhiking could explain the positive correlation between genetic diversity and recombination rate. First, recall

More information

Optimal Methods for Using Posterior Probabilities in Association Testing

Optimal Methods for Using Posterior Probabilities in Association Testing Digital Collections @ Dordt Faculty Work: Comprehensive List 5-2013 Optimal Methods for Using Posterior Probabilities in Association Testing Keli Liu Harvard University Alexander Luedtke University of

More information

8. Genetic Diversity

8. Genetic Diversity 8. Genetic Diversity Many ways to measure the diversity of a population: For any measure of diversity, we expect an estimate to be: when only one kind of object is present; low when >1 kind of objects

More information

Analysis of the Seattle SNP, Perlegen, and HapMap data sets

Analysis of the Seattle SNP, Perlegen, and HapMap data sets A population genetics model with recombination hotspots that are heterogeneous across the population Peter Calabrese Molecular and Computational Biology, University of Southern California, 050 Childs Way,

More information

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Chapter 236 Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Introduction This module provides power analysis and sample size calculation for non-inferiority tests

More information

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure)

Previous lecture. Single variant association. Use genome-wide SNPs to account for confounding (population substructure) Previous lecture Single variant association Use genome-wide SNPs to account for confounding (population substructure) Estimation of effect size and winner s curse Meta-Analysis Today s outline P-value

More information