On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease


Yuehua Cui (1) and Dong-Yun Kim (2)

(1) Department of Statistics and Probability, Michigan State University, East Lansing, MI
(2) Department of Statistics, Virginia Tech, Blacksburg, Virginia

Abstract

Detecting the pattern and distribution of DNA variants across the genome is essential in understanding the etiology of complex human disease. Recently Liu et al. (2005) and Cui et al. (2007) developed a novel nucleotide mapping method under the mixture model framework to target specific DNA sequence variants underlying complex disease. The likelihood ratio test (LRT) was applied to test the association between a risk haplotype and a complex disease, and permutation tests were used to assess the significance of the LRT. This, however, imposes a heavy computational burden when the nucleotide mapping method is extended to large-scale, high-density genome-wide single nucleotide polymorphism data. Here we theoretically investigate the limiting distribution of the LRT under the mixture model-based nucleotide mapping framework and show that it asymptotically follows a $\chi^2$ distribution. Simulations show good finite-sample properties of the limiting distribution. The study contributes to the theory of gene mapping.

Key words: Asymptotic threshold, Mixture model, Risk haplotype, Single nucleotide polymorphism

AMS 2000 subject classifications: 62F05, 60F05, 92D10.

1 Introduction

Recent developments in biotechnology have produced massive amounts of high-dimensional genetic data. The hunt for disease genes has shifted radically from traditional approaches focusing on chromosome segments to single nucleotide variants called single nucleotide polymorphisms (SNPs). In a broad sense, methods for disease-gene association studies have focused either on single-SNP analysis or on combinations of SNPs, termed haplotype analysis. Statistical analysis focusing on single SNPs can be done by testing allele or genotype frequency differences between affected and unaffected samples with a chi-square test or logistic regression (e.g. Olson and Wijsman, 1994). When multiple disease variants function in a cis-acting format, haplotype-based analysis appears to be more powerful than single SNP-based analysis (Schaid et al., 2002). The relative merit of haplotype-based analysis over the single-locus approach has been shown in a number of studies (e.g. Akey et al., 2001; Clark, 2004; Schaid, 2004).

With the development of the human HapMap project, large numbers of DNA variants can be generated with different array genotyping platforms. As SNP genotyping becomes denser, a comprehensive human sequence variant map will eventually be made available. It is essential to target specific DNA sequence variants that underlie complex diseases. In previous studies, Liu et al. (2005) and Cui et al. (2007) developed a nucleotide mapping approach by targeting SNP variants that are structured in a haplotype format. Specific risk haplotypes that trigger significant effects on a disease trait can be formulated and tested, which is one of the advantages of the methods over traditional haplotype-based analysis. Based on the patterns of the combination of risk and non-risk haplotypes, a novel grouping technique is applied to group diplotypes with common genetic effects. Thus, the degrees of freedom of an association test are greatly reduced, regardless of the number of SNPs fitted in the model. The method was first developed for phenotypic data arising from a normal distribution (Liu et al., 2005). This assumption is relaxed in Cui et al. (2007), where data arising from an exponential family can be fitted. Statistical inference procedures are derived to quantify the relative risk of the different haplotypes an individual may carry (Cui et al., 2007). Li et al. (2007) recently applied the nucleotide mapping idea to longitudinal data. Other extensions and applications of the mapping method have also been proposed (e.g. Hou et al., 2007; Wu et al., 2007; Pinedo et al., 2008).

A common issue in haplotype-based analysis is the unknown linkage phase. When two or more heterozygous loci are involved, the linkage phase cannot be determined explicitly and is instead treated as missing data. Statistical mixture models are commonly applied to data with missing values, and the nucleotide mapping methods adopt them to deal with haplotype phase uncertainty. The likelihood ratio test (LRT) was applied to test the association of different nucleotide patterns with a disease trait. It is commonly recognized that the usual regularity conditions for the asymptotic $\chi^2$ distribution of the LRT do not hold in this case. Thus, permutation tests were used to assess statistical significance in the earlier works. Even though tagging SNPs can be used to reduce the dimension of SNP variants, the extensive computation required by permutation tests is still a huge burden. The computational cost greatly hinders the application of the methods, especially when extending them to a high-density genome-wide scale. A fast asymptotic approach for threshold determination is highly desirable to make these methods more practical.

It is thus the purpose of this paper to study the limiting distribution of the LRT under the nucleotide mapping framework. In the next section, we start with a brief review of the nucleotide mapping methods. Then, using the local asymptotic normality (LAN) of the test statistic, we show that the limiting distribution of the LRT is a $\chi^2$ distribution with two degrees of freedom, regardless of the number of SNPs fitted in the model, and that the asymptotic result holds for a wide range of phenotype distributions. Simulations are conducted to evaluate the finite-sample properties of the test statistic under the mixture model nucleotide mapping framework in comparison with distribution-free permutation tests. Results show that for moderate sample sizes the thresholds for the test statistic based on the limiting distribution are virtually indistinguishable from those based on permutation tests. Thus fast threshold determination can be based on the $\chi^2$ approximation rather than on time-consuming permutation tests.

2 The nucleotide mapping framework and the LRT

We begin this section by briefly introducing the nucleotide mapping method in a generalized linear model framework. Consider $K$ ($K \ge 2$) SNPs (or tag SNPs) within a haplotype block constructed from a number of bi-allelic loci. SNPs within a haplotype block are correlated due to strong linkage disequilibrium (LD), whereas SNPs in different blocks are less correlated, with weak LD. Denote the two alleles for the $k$th SNP within a block as $Q^k_{r_k}$ ($r_k = 1, 2$; $k = 1, \ldots, K$), with allele frequencies denoted by $p^{(k)}_{r_k}$. The subscript $r_k$ ($= 1, 2$) denotes the allele. There are at most $2^K$ possible haplotypes formed by the random combination of these $K$ SNPs. In reality, the number of observed haplotypes may be much smaller because of LD among the SNPs in a block. Let us denote the haplotype structure as $[Q^1_{r_1} Q^2_{r_2} \cdots Q^K_{r_K}]$, with corresponding haplotype frequencies denoted as $p_{r_1 r_2 \cdots r_K}$. We assume that the alleles indexed by $r_k$ are located on the same chromosome.

In practice, haplotypes are unobservable and only unphased multi-locus genotypes are observed, with the form $Q^1_{r_1}Q^1_{s_1}/Q^2_{r_2}Q^2_{s_2}/\cdots/Q^K_{r_K}Q^K_{s_K}$ ($r_k, s_k = 1, 2$), where $s_k$ denotes the alleles located on the other chromosome, homologous to the one carrying $r_k$. The corresponding genotype frequency and number of observations are expressed as $P_{r_1s_1/r_2s_2/\cdots/r_Ks_K}$ and $n_{r_1s_1/r_2s_2/\cdots/r_Ks_K}$, respectively. Note that we use the capital letter $P$ to denote the observed genotype frequency and the lower case $p$ to denote the haplotype frequency. The combination of two haplotypes forms a diplotype, which is denoted as $[Q^1_{r_1} Q^2_{r_2} \cdots Q^K_{r_K}][Q^1_{s_1} Q^2_{s_2} \cdots Q^K_{s_K}]$. Assuming Hardy-Weinberg equilibrium, the diplotype frequency can be expressed as a product of the haplotype frequencies, i.e., $P_{[r_1 r_2 \cdots r_K][s_1 s_2 \cdots s_K]} = p_{r_1 r_2 \cdots r_K}\, p_{s_1 s_2 \cdots s_K}$. When two or more of the SNPs are heterozygous, the linkage phase is unknown and inference about the linkage phase is necessary for haplotype-based analysis. The problem of unknown phase leads to a natural mixture distribution in statistical modeling.
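The phase ambiguity described here can be made concrete with a short enumeration. The following is a minimal sketch, not the authors' implementation: the function name `compatible_diplotypes` and the encoding of a genotype as a list of allele pairs (alleles coded 1 or 2, as in the text) are assumptions made for illustration.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """List the diplotypes consistent with an unphased multi-SNP genotype.

    genotype: one (allele, allele) pair per SNP, alleles coded 1 or 2.
    Returns a sorted list of unordered haplotype pairs.
    """
    diplotypes = set()
    # For every SNP, decide which of its two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        hap_a = tuple(snp[c] for snp, c in zip(genotype, choice))
        hap_b = tuple(snp[1 - c] for snp, c in zip(genotype, choice))
        # Sorting the pair makes the diplotype unordered, so mirror images collapse.
        diplotypes.add(tuple(sorted((hap_a, hap_b))))
    return sorted(diplotypes)


# One homozygous SNP followed by two heterozygous SNPs: two possible phases.
two_het = compatible_diplotypes([(1, 1), (1, 2), (1, 2)])
# Three heterozygous SNPs: four possible phases.
three_het = compatible_diplotypes([(1, 2), (1, 2), (1, 2)])
print(len(two_het), len(three_het))
```

A genotype with at most one heterozygous SNP yields a single diplotype, which is why only multiply heterozygous genotypes contribute mixture components to the likelihood.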
For example, consider three SNPs in a haplotype block. The genotype $Q^1_1Q^1_1/Q^2_1Q^2_2/Q^3_1Q^3_2$ could form two different diplotypes, $[Q^1_1Q^2_1Q^3_1][Q^1_1Q^2_2Q^3_2]$ and $[Q^1_1Q^2_1Q^3_2][Q^1_1Q^2_2Q^3_1]$, while the genotype $Q^1_1Q^1_2/Q^2_1Q^2_2/Q^3_1Q^3_2$ could form four different diplotypes. In nucleotide mapping, one haplotype is assumed to be the risk haplotype, and the selection of the risk haplotype can be done through statistical model selection (see Liu et al., 2005; Cui et al., 2007). For the three-SNP case, if we assume that $[Q^1_1Q^2_1Q^3_1]$ is the risk haplotype, we can then formulate three different composite diplotypes, $[Q^1_1Q^2_1Q^3_1][Q^1_1Q^2_1Q^3_1]$, $[Q^1_1Q^2_1Q^3_1][\overline{Q^1_1Q^2_1Q^3_1}]$ and $[\overline{Q^1_1Q^2_1Q^3_1}][\overline{Q^1_1Q^2_1Q^3_1}]$, where an overline denotes any non-risk haplotype. Note that the formulation of the composite diplotype is the foundation of the nucleotide mapping approach. By assuming one haplotype to be the risk haplotype and all the others to be non-risk ones, the effect of the risk haplotype can be modeled in terms of the three composite diplotypes. Applying traditional quantitative genetic theory (Lynch and Walsh, 1998), the genetic effect of the risk haplotype can be modelled through the additive effect (denoted as $a$) and the dominant effect (denoted as $d$) of the composite diplotypes.

The multilocus haplotype frequency can be formulated as a function of allele frequencies and LD parameters of different orders (Lou et al., 2003). For example, a haplotype frequency, denoted as $p_{r_1 r_2 \cdots r_K}$, can be decomposed into the following components:

$$
\begin{aligned}
p_{r_1 r_2 \cdots r_K} ={}& \underbrace{p_{r_1} p_{r_2} \cdots p_{r_K}}_{\text{No LD}} \\
&+ \underbrace{(-1)^{r_{K-1}+r_K}\, p_{r_1} \cdots p_{r_{K-2}}\, D_{(K-1)K} + \cdots + (-1)^{r_1+r_2}\, p_{r_3} \cdots p_{r_K}\, D_{12}}_{\text{Digenic LD}} \\
&+ \underbrace{(-1)^{r_{K-2}+r_{K-1}+r_K}\, p_{r_1} \cdots p_{r_{K-3}}\, D_{(K-2)(K-1)K} + \cdots + (-1)^{r_1+r_2+r_3}\, p_{r_4} \cdots p_{r_K}\, D_{123}}_{\text{Trigenic LD}} \\
&+ \cdots + \underbrace{(-1)^{r_1+\cdots+r_K}\, D_{1 \cdots K}}_{K\text{-genic LD}}
\end{aligned}
\tag{1}
$$

where the $D$'s are the linkage disequilibria of different orders among particular htSNPs. For a 2-SNP model, this reduces to $p_{r_1 r_2} = p_{r_1} p_{r_2} + (-1)^{r_1+r_2} D$, $r_1, r_2 = 1, 2$, where $r_1, r_2$ are the allele indices for SNP 1 and SNP 2, respectively.

Let $y$ denote a measured disease trait, which can be continuous or discrete depending on the nature of the disease status. For example, when studying obesity, the Body Mass Index can be a continuous phenotype, while for most human diseases the phenotype is measured as binary, corresponding to either affected or unaffected status. Let $X$ denote a matrix of numerical codes corresponding to the composite genotype, $G$ say, including the intercept as the first column. Suppose that the genetic covariates influence only the mean of the trait and not the scale, so that their effects can be summarized by a function of the linear predictor

$$\eta = X\beta \tag{2}$$

where $\beta$ contains the regression parameters for the genetic effect of composite diplotypes on the disease trait.
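The 2-SNP decomposition $p_{r_1 r_2} = p_{r_1} p_{r_2} + (-1)^{r_1+r_2} D$ is simple enough to check numerically. The sketch below is illustrative only: the function name `two_snp_haplotype_freqs` and the example values are assumptions, and the validity check simply rejects values of $D$ that would give a negative frequency.

```python
def two_snp_haplotype_freqs(p1, q1, D):
    """Haplotype frequencies for two bi-allelic SNPs.

    p1: frequency of allele 1 at SNP 1; q1: frequency of allele 1 at SNP 2;
    D: digenic linkage disequilibrium. Alleles are coded 1 and 2.
    """
    p = {1: p1, 2: 1.0 - p1}
    q = {1: q1, 2: 1.0 - q1}
    freqs = {
        (r1, r2): p[r1] * q[r2] + (-1) ** (r1 + r2) * D
        for r1 in (1, 2)
        for r2 in (1, 2)
    }
    if min(freqs.values()) < 0.0:
        raise ValueError("D is outside the range allowed by the allele frequencies")
    return freqs


freqs = two_snp_haplotype_freqs(p1=0.6, q1=0.5, D=0.1)
for hap, value in sorted(freqs.items()):
    print(hap, round(value, 4))
```

The four frequencies sum to one and reproduce the allele frequencies as marginals, for example $p_{11} + p_{12} = p^{(1)}_1$; setting $D = 0$ recovers the no-LD product form.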
Here we assume the disease phenotype has an exponential family distribution and can be modeled through the generalized linear model (McCullagh and Nelder, 1989). For a binary disease response, we can apply the logit model with the natural logit link function. For a normal or Poisson type trait, the identity or log link function is used, respectively (McCullagh and Nelder, 1989). With the three composite diplotype patterns, the effect of genetic association can be assessed by testing $H_0: a = d = 0$ using the likelihood ratio test.

For simplicity, we start with a 2-SNP model to show the derivation of the limiting distribution of the LRT. Extensions to cases with more than two SNPs are derived later. Table 1 shows a complete list of possible genotype and diplotype configurations as well as the genotypic means. The linear combination of genetic effects in (2) can be simplified as $\mu + a$, $\mu + d$ and $\mu - a$, corresponding to the composite diplotypes $[11][11]$, $[11][\overline{11}]$ and $[\overline{11}][\overline{11}]$ (risk/risk, risk/non-risk and non-risk/non-risk), respectively. The genotypic means can be reparameterized as $\lambda_1$ ($= \mu + a$), $\lambda_2$ ($= \mu + d$) and $\lambda_3$ ($= \mu - a$) for these three composite diplotypes. To test the association of a risk haplotype with a disease trait, we can test $H_0: a = d = 0$, or equivalently, test $H_0: \lambda_1 = \lambda_2 = \lambda_3 = \lambda$ for some unknown common $\lambda$.

With the configuration listed in Table 1, the observed disease phenotype data can be categorized into four groups. Three groups have distinct phase information and their phenotypes are represented as $y_1 = (y_{11}, \ldots, y_{1n_1})^T$, $y_2 = (y_{21}, \ldots, y_{2n_2})^T$ and $y_3 = (y_{31}, \ldots, y_{3n_3})^T$. The fourth group corresponds to the one with missing linkage phase information and is denoted as $y_4 = (y_{41}, \ldots, y_{4n_4})^T$. For the three groups with distinct phase information, the density functions are given by

$$y_{ji} \sim f_j(y; \lambda_1, \lambda_2, \lambda_3), \quad i = 1, \ldots, n_j;\ j = 1, 2, 3 \tag{3}$$

Specifically, $f_j(y; \lambda_1, \lambda_2, \lambda_3) = f_j(y \mid \lambda_j)$.
The fourth group $y_4$ involves a mixture distribution of the form

$$y_{4i} \sim \phi f_2(y \mid \lambda_2) + (1 - \phi) f_3(y \mid \lambda_3), \quad i = 1, \ldots, n_4, \tag{4}$$

where $\lambda_1, \lambda_2, \lambda_3$ are unknown and $\phi$ is the unknown mixture proportion with $0 < \phi < 1$, which can be estimated from the population frequency parameters. Assume independent samples, and let $n = \sum_{j=1}^{4} n_j$ denote the total sample size. The log-likelihood function $l_n$ can be expressed as

$$l_n(\lambda_1, \lambda_2, \lambda_3; \phi) = \sum_{j=1}^{3} \sum_{i=1}^{n_j} \log\left[f_j(y_{ji} \mid \lambda_j)\right] + \sum_{i=1}^{n_4} \log\left[\phi f_2(y_{4i} \mid \lambda_2) + (1 - \phi) f_3(y_{4i} \mid \lambda_3)\right] \tag{5}$$
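The log-likelihood (5) can be sketched in a few lines for one member of the exponential family. The example below assumes a normal phenotype with known standard deviation, which is only one of the distributions the framework covers; the function names and the toy data are hypothetical.

```python
import math


def normal_pdf(y, mean, sd=1.0):
    """Density of a normal distribution, used here as an example f(y | lambda)."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def log_likelihood(y_groups, lambdas, phi, sd=1.0):
    """Evaluate (5): three phase-known groups plus one two-component mixture group.

    y_groups: (y1, y2, y3, y4), each a list of phenotypes.
    lambdas:  (lambda_1, lambda_2, lambda_3).
    phi:      mixture proportion for the phase-ambiguous group y4.
    """
    y1, y2, y3, y4 = y_groups
    total = 0.0
    # Phase-known groups each contribute an ordinary log-density sum.
    for ys, lam in zip((y1, y2, y3), lambdas):
        total += sum(math.log(normal_pdf(y, lam, sd)) for y in ys)
    # The phase-ambiguous group mixes the lambda_2 and lambda_3 densities.
    _, lam2, lam3 = lambdas
    total += sum(
        math.log(phi * normal_pdf(y, lam2, sd) + (1.0 - phi) * normal_pdf(y, lam3, sd))
        for y in y4
    )
    return total


groups = ([0.2, 1.1], [0.5, -0.3, 0.9], [1.4, 0.0], [0.7, -0.6])
ll_null_a = log_likelihood(groups, (0.5, 0.5, 0.5), phi=0.2)
ll_null_b = log_likelihood(groups, (0.5, 0.5, 0.5), phi=0.9)
ll_alt = log_likelihood(groups, (0.8, 0.5, 0.2), phi=0.2)
print(ll_null_a, ll_null_b, ll_alt)
```

When the three means coincide, the mixture term collapses to a single density, so the value no longer depends on `phi`; this is the property that makes $\phi$ vanish from the null likelihood.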

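Table 1: Possible diplotype and composite genotype configurations of nine genotypes at two SNPs and their haplotype composition frequencies. In the composite diplotype column, $[11]$ is the risk haplotype and $[\overline{11}]$ is any non-risk haplotype.

| Genotype | Diplotype configuration | Diplotype frequency | Relative frequency | Composite diplotype | Mean parameter | Observation |
|---|---|---|---|---|---|---|
| 11/11 | [11][11] | $P_{[11][11]} = p_{11}^2$ | 1 | $[11][11]$ | $\lambda_1$ | $n_{11/11}$ |
| 11/12 | [11][12] | $P_{[11][12]} = 2p_{11}p_{12}$ | 1 | $[11][\overline{11}]$ | $\lambda_2$ | $n_{11/12}$ |
| 11/22 | [12][12] | $P_{[12][12]} = p_{12}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{11/22}$ |
| 12/11 | [11][21] | $P_{[11][21]} = 2p_{11}p_{21}$ | 1 | $[11][\overline{11}]$ | $\lambda_2$ | $n_{12/11}$ |
| 12/12 | [11][22] | $P_{[11][22]} = 2p_{11}p_{22}$ | $\phi$ | $[11][\overline{11}]$ | $\lambda_2$ | $n_{12/12}$ |
| 12/12 | [12][21] | $P_{[12][21]} = 2p_{12}p_{21}$ | $1 - \phi$ | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{12/12}$ |
| 12/22 | [12][22] | $P_{[12][22]} = 2p_{12}p_{22}$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{12/22}$ |
| 22/11 | [21][21] | $P_{[21][21]} = p_{21}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/11}$ |
| 22/12 | [21][22] | $P_{[21][22]} = 2p_{21}p_{22}$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/12}$ |
| 22/22 | [22][22] | $P_{[22][22]} = p_{22}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/22}$ |

Note that $\phi = p_{11}p_{22}/(p_{11}p_{22} + p_{12}p_{21})$, where $p_{ij}$ is the frequency of haplotype $ij$ for $i, j = 1, 2$. The relative frequency refers to the probability that a specific diplotype is observed. For an unambiguous genotype (with known phase), the relative frequency is 1. For the double heterozygous genotype 12/12, the probability of observing diplotype [11][22] is $\phi$, and that of observing diplotype [12][21] is $1 - \phi$.

Under $H_0$, the mixture distribution collapses to a single distribution free of $\phi$, so the log-likelihood function is

$$l_n(\lambda, \lambda, \lambda; \phi) = \sum_{i=1}^{n} \log\left[f(y_i \mid \lambda)\right]$$

Let $\hat\lambda_j$ ($j = 1, 2, 3$) be the maximum likelihood estimate (MLE) of $\lambda_j$ under $H_1$, and let $\tilde\lambda$ be the MLE of $\lambda$ under $H_0$. Following the notation of van der Vaart (1998), introduce $\lambda_j = \lambda + h_j n^{-1/2}$, where $h_j$, $j = 1, 2, 3$, are arbitrary real numbers. Define $\Lambda_n(\lambda; \phi)$ as the LRT statistic of the form

$$
\begin{aligned}
\Lambda_n(\lambda; \phi) &= -2\left( l_n(\tilde\lambda, \tilde\lambda, \tilde\lambda) - l_n(\hat\lambda_1, \hat\lambda_2, \hat\lambda_3; \phi) \right) \\
&= -2\left( \sup_{h_1=h_2=h_3=h} l_n\!\left(\lambda + \tfrac{h}{\sqrt{n}}, \lambda + \tfrac{h}{\sqrt{n}}, \lambda + \tfrac{h}{\sqrt{n}}\right) - \sup_{h_1,h_2,h_3} l_n\!\left(\lambda + \tfrac{h_1}{\sqrt{n}}, \lambda + \tfrac{h_2}{\sqrt{n}}, \lambda + \tfrac{h_3}{\sqrt{n}}; \phi\right) \right)
\end{aligned}
\tag{6}
$$

The test rejects $H_0$ if $\Lambda_n(\lambda; \phi)$ exceeds a critical value as identified below. Note that the likelihood function under the null is not nested under the alternative, and hence the usual regularity conditions for the asymptotic chi-square distribution do not directly apply in the current setting. Here we show that the LRT converges to a chi-square distribution with two degrees of freedom. We also generalize the results to the multiple-SNP case.

3 The limiting distribution of the LRT

3.1 Case when K = 2

Let $\xrightarrow{D}$ and $\xrightarrow{P}$ denote convergence in distribution and in probability, respectively. Let $Z = (Z_1, Z_2, Z_3, Z_4)^T$ where $Z_1, Z_2, Z_3, Z_4$ are iid standard normal random variables. Let $h = (h_1, h_2, h_3)^T$. Denote $\nabla$ and $\nabla^2$ as the gradient and Hessian operators, respectively, and let $I(\lambda)$ denote the Fisher information. Introduce

$$w_n(h_1, h_2, h_3) = \frac{1}{\sqrt{n}} \nabla l_n(\lambda, \lambda, \lambda)^T h + \frac{1}{2n} h^T \nabla^2 l_n(\lambda, \lambda, \lambda) h \tag{7}$$

and

$$w(h_1, h_2, h_3) = \sqrt{I(\lambda)}\,(BZ)^T h - \frac{I(\lambda)}{2} h^T A h \tag{8}$$

where

$$A = \begin{pmatrix} p_1 & 0 & 0 \\ 0 & p_2 + \phi^2 p_4 & \phi(1-\phi) p_4 \\ 0 & \phi(1-\phi) p_4 & p_3 + (1-\phi)^2 p_4 \end{pmatrix} \tag{9}$$

and

$$B = \begin{pmatrix} \sqrt{p_1} & 0 & 0 & 0 \\ 0 & \sqrt{p_2} & 0 & \phi\sqrt{p_4} \\ 0 & 0 & \sqrt{p_3} & (1-\phi)\sqrt{p_4} \end{pmatrix} \tag{10}$$

By the second-order Taylor expansion of the log-likelihood function about $(\lambda, \lambda, \lambda)^T$, we have

$$l_n\!\left(\lambda + \tfrac{h_1}{\sqrt{n}}, \lambda + \tfrac{h_2}{\sqrt{n}}, \lambda + \tfrac{h_3}{\sqrt{n}}; \phi\right) \approx l_n(\lambda, \lambda, \lambda) + w_n(h_1, h_2, h_3) \tag{11}$$

Lemma 1. Let $y_{ji}$, $i = 1, \ldots, n_j$, $j = 1, \ldots, 4$, be independent random variables having the densities given in (3) and (4). Under suitable regularity conditions on the density functions $f_j(y)$, $j = 1, 2, 3$ (as on page 118 of Lehmann (1991)), for any real numbers $h_1$, $h_2$ and $h_3$, $w_n(h_1, h_2, h_3) \xrightarrow{D} w(h_1, h_2, h_3)$ as $n \to \infty$.

Proof: Define $p_j$ ($j = 1, \ldots, 4$) as the limiting proportion for group $j$, i.e., $p_j = \lim_{n \to \infty} (n_j/n)$. Let $w_n(h_1, h_2, h_3)$ be as in (7). Then

$$\nabla l_n(\lambda, \lambda, \lambda) = \left( \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1=\lambda},\ \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2=\lambda},\ \frac{\partial l_n}{\partial \lambda_3}\bigg|_{\lambda_3=\lambda} \right)^T$$

and we can write

$$\frac{1}{\sqrt{n}} \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1=\lambda} = \sqrt{\frac{n_1}{n}}\, \frac{1}{\sqrt{n_1}} \sum_{i=1}^{n_1} \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})}$$

where $\dot f_\lambda$ denotes the first derivative of the density $f$ with respect to the parameter $\lambda$. Since $E_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = 0$ and $\mathrm{Var}_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = I(\lambda)$, by Lemma 6.1, page 118 in Lehmann (1991),

$$\frac{1}{\sqrt{n}} \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1=\lambda} \xrightarrow{D} \sqrt{p_1 I(\lambda)}\, Z_1$$

Similarly,

$$\frac{1}{\sqrt{n}} \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2=\lambda} = \sqrt{\frac{n_2}{n}}\, \frac{1}{\sqrt{n_2}} \sum_{i=1}^{n_2} \frac{\dot f_\lambda(y_{2i})}{f_\lambda(y_{2i})} + \phi \sqrt{\frac{n_4}{n}}\, \frac{1}{\sqrt{n_4}} \sum_{i=1}^{n_4} \frac{\dot f_\lambda(y_{4i})}{f_\lambda(y_{4i})}$$

Thus,

$$\frac{1}{\sqrt{n}} \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2=\lambda} \xrightarrow{D} \sqrt{I(\lambda)} \left( \sqrt{p_2}\, Z_2 + \phi \sqrt{p_4}\, Z_4 \right)$$

and

$$\frac{1}{\sqrt{n}} \frac{\partial l_n}{\partial \lambda_3}\bigg|_{\lambda_3=\lambda} \xrightarrow{D} \sqrt{I(\lambda)} \left( \sqrt{p_3}\, Z_3 + (1-\phi) \sqrt{p_4}\, Z_4 \right)$$

Let $\ddot f_\lambda$ denote the second derivative of the density $f$ with respect to the parameter $\lambda$. For the Hessian matrix $\nabla^2 l_n(\lambda, \lambda, \lambda)$,

$$\frac{\partial^2 l_n}{\partial \lambda_1^2}\bigg|_{\lambda_1=\lambda} = \sum_{i=1}^{n_1} \frac{\ddot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} - \sum_{i=1}^{n_1} \left( \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} \right)^2$$

Since $E_\lambda\!\left( \ddot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = 0$ and $E_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right)^2 = I(\lambda)$, by the Law of Large Numbers,

$$\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_1^2}\bigg|_{\lambda_1=\lambda} = \frac{n_1}{n}\, \frac{1}{n_1} \sum_{i=1}^{n_1} \frac{\ddot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} - \frac{n_1}{n}\, \frac{1}{n_1} \sum_{i=1}^{n_1} \left( \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} \right)^2 \xrightarrow{P} -p_1 I(\lambda)$$

and

$$\frac{\partial^2 l_n}{\partial \lambda_1 \partial \lambda_2} = \frac{\partial^2 l_n}{\partial \lambda_1 \partial \lambda_3} = 0$$

Following essentially the same idea, it can be shown that

$$
\begin{aligned}
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_2^2}\bigg|_{\lambda_2=\lambda} &\xrightarrow{P} -I(\lambda)\left(p_2 + \phi^2 p_4\right) \\
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_3^2}\bigg|_{\lambda_3=\lambda} &\xrightarrow{P} -I(\lambda)\left(p_3 + (1-\phi)^2 p_4\right) \\
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_2 \partial \lambda_3}\bigg|_{\lambda_2=\lambda_3=\lambda} &\xrightarrow{P} -I(\lambda)\,\phi(1-\phi)\, p_4
\end{aligned}
$$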

Thus, $w_n(h_1, h_2, h_3) \xrightarrow{D} w(h_1, h_2, h_3)$ as $n \to \infty$. Then we have

$$\sup_{h_1,h_2,h_3} w(h_1, h_2, h_3) = \frac{1}{2} (BZ)^T A^{-1} (BZ) \tag{12}$$

and

$$\sup_{h_1=h_2=h_3=h} w(h, h, h) = \frac{1}{2} (BZ)^T J\, BZ \tag{13}$$

where $J$ is a $3 \times 3$ matrix whose entries are all 1's and $A$, $B$ are as in (9) and (10). With Lemma 1, we have the following theorem.

Theorem 1. Let the LRT $\Lambda_n(\lambda; \phi)$ be as in (6). Under the same regularity conditions as in Lemma 1, if the null hypothesis is true, then $\Lambda_n(\lambda; \phi)$ converges in distribution to a chi-square distribution with two degrees of freedom when $\phi$ is known.

Proof. From equations (6) and (11),

$$\Lambda_n(\lambda; \phi) \approx -2\left( \sup_{h_1=h_2=h_3=h} w_n(h, h, h) - \sup_{h_1,h_2,h_3} w_n(h_1, h_2, h_3) \right)$$

By Lemma 1,

$$\Lambda_n(\lambda; \phi) \xrightarrow{D} \Lambda(\lambda; \phi) = -2\left( \sup_{h_1=h_2=h_3=h} w(h, h, h) - \sup_{h_1,h_2,h_3} w(h_1, h_2, h_3) \right) = Z^T M Z$$

where $M = B^T (A^{-1} - J) B$ and $w_n$ and $w$ are as in (7) and (8). Because $BB^T = A$ and the entries of $A$ sum to $p_1 + p_2 + p_3 + p_4 = 1$, routine algebra shows that $M$ is a $4 \times 4$ idempotent matrix with rank 2. Since $Z$ has a multivariate normal distribution with zero mean and identity covariance matrix, by standard multivariate distribution theory $Z^T M Z$ has a $\chi^2$ distribution with two degrees of freedom. This completes the proof of Theorem 1.

Theorem 1 was proved for known $\phi$. However, the parameter $\phi$ is often unknown and has to be estimated from data. The nucleotide mapping methods proposed in Liu et al. (2005) and Cui et al. (2007) applied a two-stage estimation procedure. The first stage is to estimate the haplotype frequencies. Denote $\hat p_{11}$, $\hat p_{12}$, $\hat p_{21}$ and $\hat p_{22}$ as the MLEs of the four corresponding haplotype frequencies, which can be estimated by formulating a multinomial likelihood function. Details can be found in Liu et al. (2005). Then the MLE of $\phi$ can be obtained by

$$\hat\phi = \frac{\hat p_{11} \hat p_{22}}{\hat p_{11} \hat p_{22} + \hat p_{12} \hat p_{21}}$$

and $\hat\phi \xrightarrow{P} \phi$ as $n \to \infty$. The estimated $\hat\phi$ is then plugged into (5) to estimate the quantitative parameters $\lambda_j$.
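The algebraic claim in the proof of Theorem 1 can be checked numerically. The sketch below builds $B$ from (10) and the inverse of $A$ from (9), forms $M = B^T(A^{-1} - J)B$, and compares $M$ with $M^2$. The group proportions and haplotype frequencies are arbitrary illustrative values (assumed, not taken from the paper), and the function names are hypothetical.

```python
import math


def phi_hat(p11, p12, p21, p22):
    """Plug-in mixture proportion from the four haplotype frequencies."""
    return p11 * p22 / (p11 * p22 + p12 * p21)


def lrt_quadratic_form_matrix(p1, p2, p3, p4, phi):
    """Return M = B^T (A^{-1} - J) B for the 2-SNP case as a 4x4 nested list."""
    # B from (10): rows index lambda_1..lambda_3, columns index the four groups.
    B = [
        [math.sqrt(p1), 0.0, 0.0, 0.0],
        [0.0, math.sqrt(p2), 0.0, phi * math.sqrt(p4)],
        [0.0, 0.0, math.sqrt(p3), (1.0 - phi) * math.sqrt(p4)],
    ]
    # Nonzero entries of A from (9); A is block diagonal, so its inverse is closed form.
    a22 = p2 + phi ** 2 * p4
    a33 = p3 + (1.0 - phi) ** 2 * p4
    a23 = phi * (1.0 - phi) * p4
    det = a22 * a33 - a23 ** 2
    A_inv = [
        [1.0 / p1, 0.0, 0.0],
        [0.0, a33 / det, -a23 / det],
        [0.0, -a23 / det, a22 / det],
    ]
    core = [[A_inv[i][j] - 1.0 for j in range(3)] for i in range(3)]  # A^{-1} - J
    return [
        [
            sum(B[i][r] * core[i][j] * B[j][c] for i in range(3) for j in range(3))
            for c in range(4)
        ]
        for r in range(4)
    ]


phi = phi_hat(0.4, 0.2, 0.1, 0.3)
M = lrt_quadratic_form_matrix(0.3, 0.3, 0.2, 0.2, phi)
M_squared = [[sum(M[r][k] * M[k][c] for k in range(4)) for c in range(4)] for r in range(4)]
max_gap = max(abs(M[r][c] - M_squared[r][c]) for r in range(4) for c in range(4))
trace = sum(M[r][r] for r in range(4))
print(max_gap, trace)
```

For an idempotent matrix the rank equals the trace, so a trace of 2 together with $M^2 = M$ gives the two degrees of freedom. The check requires only that the four proportions sum to one.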

Theorem 2. Let $\Lambda_n(\lambda; \hat\phi)$ be the LRT obtained by substituting $\hat\phi$ for $\phi$. With the same regularity conditions as in Lemma 1, under $H_0$, $\Lambda_n(\lambda; \hat\phi)$ converges in distribution to a $\chi^2$ distribution with two degrees of freedom as $n \to \infty$.

Proof: Write

$$\Lambda_n(\lambda; \hat\phi) = \left( \Lambda_n(\lambda; \hat\phi) - \Lambda_n(\lambda; \phi) \right) + \Lambda_n(\lambda; \phi).$$

Note that under $H_0$, $\Lambda_n(\lambda; \hat\phi) - \Lambda_n(\lambda; \phi) \xrightarrow{P} 0$ as $n \to \infty$. By Theorem 1, we have $\Lambda_n(\lambda; \phi) \xrightarrow{D} \chi^2_2$. Then by Slutsky's theorem, $\Lambda_n(\lambda; \hat\phi) \xrightarrow{D} \chi^2_2$. This completes the proof.

3.2 Case when K = 3

Now consider the case when three SNPs form a haplotype. The maximum number of possible haplotypes is $2^3 = 8$. A detailed list of the configurations for 3 SNPs, similar to Table 1, is tabulated in Table 1 of Li et al. (2007). When the number of SNPs increases, the number of possible heterozygous loci increases, resulting in an exponentially increasing number of mixture components in the likelihood function. With 3 SNPs considered, there are a total of 7 possible mixture components in the likelihood function. However, three mixtures involve the same mean parameters and hence are non-informative and can be collapsed. Thus, only four mixture components are informative. The relative merit of the nucleotide mapping methods is that even though the number of SNPs is increased, the number of association parameters does not increase.

To illustrate the idea, assume that [111] is the risk haplotype. As shown in Table 1 of Li et al. (2007), three composite diplotypes can be formulated based on this risk haplotype, namely $[111][111]$, $[111][\overline{111}]$ and $[\overline{111}][\overline{111}]$ (risk/risk, risk/non-risk and non-risk/non-risk). As in the 2-SNP case, the genetic effect of these three composite diplotypes can be modelled by the additive ($a$) and dominant ($d$) effects of the risk haplotype [111]. Thus, testing for association of the risk haplotype with a disease trait is the same as before: that is, testing $H_0: a = d = 0$, or by reparameterization, $H_0: \lambda_1 = \lambda_2 = \lambda_3 = \lambda$, as in the 2-SNP case. For $K = 3$, the data can be categorized into seven groups.

The log-likelihood function $l_n$ can be expressed as

$$l_n(\lambda_1, \lambda_2, \lambda_3; \phi) = \sum_{j=1}^{3} \sum_{i=1}^{n_j} \log\left[f_j(y_{ji} \mid \lambda_j)\right] + \sum_{l=4}^{7} \sum_{i=1}^{n_l} \log\left[\phi_l f_2(y_{li} \mid \lambda_2) + (1 - \phi_l) f_3(y_{li} \mid \lambda_3)\right] \tag{14}$$

where $f_j$ is defined in (3).

The likelihood function $l_n$ contains four mixtures, each of which is associated with one particular group of diplotypes. The four mixture proportions ($\phi_l$, $l = 4, \ldots, 7$) are functions of haplotype frequencies (see Li et al. (2007) for details). As in the 2-SNP case, a two-stage estimation procedure can be applied. The first stage estimates the haplotype frequencies and hence the four mixture proportions, which are then plugged into (14) for the second-stage estimation of the quantitative parameters $\lambda_j$.

Let $Z = (Z_1, \ldots, Z_7)^T$ where $Z_i$, $i = 1, \ldots, 7$, are iid standard normal random variables. Define

$$A = \begin{pmatrix} p_1 & 0 & 0 \\ 0 & A_{22} & A_{23} \\ 0 & A_{23} & A_{33} \end{pmatrix} \tag{15}$$

where $A_{22} = p_2 + \sum_{l=4}^{7} \phi_l^2 p_l$, $A_{33} = p_3 + \sum_{l=4}^{7} (1-\phi_l)^2 p_l$, and $A_{23} = \sum_{l=4}^{7} \phi_l (1-\phi_l) p_l$. Also define

$$B = \begin{pmatrix} \sqrt{p_1} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & \sqrt{p_2} & 0 & \phi_4\sqrt{p_4} & \phi_5\sqrt{p_5} & \phi_6\sqrt{p_6} & \phi_7\sqrt{p_7} \\ 0 & 0 & \sqrt{p_3} & (1-\phi_4)\sqrt{p_4} & (1-\phi_5)\sqrt{p_5} & (1-\phi_6)\sqrt{p_6} & (1-\phi_7)\sqrt{p_7} \end{pmatrix} \tag{16}$$

where $p_l = \lim_{n \to \infty} (n_l/n)$, $l = 1, \ldots, 7$.

Theorem 3. Define $\Lambda_n(\lambda; \hat\phi)$ as the LRT for testing $H_0: \lambda_1 = \lambda_2 = \lambda_3 = \lambda$, where $\hat\phi = (\hat\phi_4, \hat\phi_5, \hat\phi_6, \hat\phi_7)^T$. With the same regularity conditions as in Lemma 1, under $H_0$, $\Lambda_n(\lambda; \hat\phi)$ converges in distribution to $Z^T M Z \sim \chi^2_2$, where $M = B^T (A^{-1} - J) B$, $J$ is a $3 \times 3$ matrix whose entries are all 1's, and $A$, $B$ are as in (15) and (16).

Proof. The proof follows the same technique as shown in the 2-SNP case.

Remark 1. We have investigated the limiting distribution of the LRT under the 2-SNP and 3-SNP models. The generalization to multiple SNPs ($K > 3$) is straightforward by modifying the $A$ and $B$ matrices. Thus, the asymptotic $\chi^2_2$ distribution of the LRT holds in general and can be applied in practice for fast threshold determination with any number of SNPs.

Remark 2. The limiting distribution of the LRT is established without covariates. In genetic studies, clinical risk factors or other environmental factors can also influence an individual's disease risk. When testing the effect of genetic variants, these covariates can be considered as nuisance parameters. The limiting distribution of the LRT derived in this study still holds with nuisance parameters.
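Remark 1 rests on two identities that survive any number of mixture groups: $BB^T = A$, and the entries of $A$ sum to the total of the group proportions, which is 1. Given these, $M^2 = B^T(A^{-1} - J)A(A^{-1} - J)B = M$ and $\mathrm{tr}(M) = 3 - 1 = 2$. The sketch below checks both identities for the $A$ and $B$ of (15) and (16) with an arbitrary number of phase-ambiguous groups; the function names and numeric values are assumptions chosen for illustration.

```python
import math


def build_A_B(p_known, p_mix, phis):
    """Build A as in (15) and B as in (16) for any number of mixture groups.

    p_known: (p1, p2, p3), limiting proportions of the phase-known groups.
    p_mix:   limiting proportions of the phase-ambiguous groups.
    phis:    mixture proportion phi_l of each phase-ambiguous group.
    """
    p1, p2, p3 = p_known
    a22 = p2 + sum(f ** 2 * q for f, q in zip(phis, p_mix))
    a33 = p3 + sum((1.0 - f) ** 2 * q for f, q in zip(phis, p_mix))
    a23 = sum(f * (1.0 - f) * q for f, q in zip(phis, p_mix))
    A = [[p1, 0.0, 0.0], [0.0, a22, a23], [0.0, a23, a33]]
    B = [
        [math.sqrt(p1), 0.0, 0.0] + [0.0] * len(p_mix),
        [0.0, math.sqrt(p2), 0.0] + [f * math.sqrt(q) for f, q in zip(phis, p_mix)],
        [0.0, 0.0, math.sqrt(p3)] + [(1.0 - f) * math.sqrt(q) for f, q in zip(phis, p_mix)],
    ]
    return A, B


def check_identities(A, B, tol=1e-9):
    """Return (whether B B^T equals A, the sum of all entries of A)."""
    BBt = [[sum(x * y for x, y in zip(B[i], B[j])) for j in range(3)] for i in range(3)]
    matches = all(abs(BBt[i][j] - A[i][j]) < tol for i in range(3) for j in range(3))
    return matches, sum(sum(row) for row in A)


# Four informative mixture groups, as in the 3-SNP case.
A, B = build_A_B((0.25, 0.30, 0.25), (0.05, 0.05, 0.05, 0.05), (0.9, 0.6, 0.5, 0.2))
bbt_equals_a, total = check_identities(A, B)
print(bbt_equals_a, total)
```

The same call works with one mixture group (the 2-SNP case) or with more than four, which is the sense in which the two degrees of freedom do not depend on the number of SNPs.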

4 Simulation

To evaluate the finite-sample performance of the asymptotic chi-square distribution, we perform several simulation studies. The first simulation considers a binary disease phenotype, and a logistic regression model is fitted. The values of the allele frequencies, the LD among the tested SNPs and the quantitative genetic parameters used for the simulation study can be found in Cui et al. (2007). Three disease models are considered: the additive model ($d/a = 0$), the dominant model ($d/a = 1$) and the recessive model ($d/a = -1$), where $d$ and $a$ are the dominant and additive effects, respectively. Data were simulated under different sample sizes ($n = 100, 200, 500$). For each simulated data set under each gene action mode, 1000 permutations were performed, and the procedure was repeated 100 times. The average values over the 100 replications were recorded for each percentile. The permutation test is distribution free but data dependent. Thus, the threshold obtained by permutation was considered as the exact threshold, in comparison with the threshold calculated from the chi-square distribution.

Table 2: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with binary disease phenotype fitted with a logistic distribution assuming a 2-SNP model under different disease gene action modes. Columns: $d/a$, $n$, permutation percentiles, $\chi^2_2$ percentiles. Note: The permutation percentile is the averaged percentile out of 100 replicates.

A comparison of the permutation-based and $\chi^2$ approximation-based cutoff points is shown in Table 2 for the 2-SNP model and in Table 3 for the 3-SNP model. Overall, the chi-square cutoffs are consistent with the permutation cutoffs at the different percentiles, which indicates good performance of the approximation. With a large sample size ($n = 500$), the cutoffs obtained with the two methods are very close. Thus in real data analysis the large-sample test can effectively replace time-consuming permutation tests.
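The asymptotic cutoffs being compared here are available in closed form, because a $\chi^2$ distribution with two degrees of freedom has distribution function $1 - e^{-x/2}$. The sketch below computes them, together with a plain order-statistic percentile that could be applied to a list of permuted LRT values. The function names and the toy list of values are hypothetical, and the percentile rule (no interpolation) is an assumption rather than the rule used in the simulations.

```python
import math


def chi2_df2_quantile(q):
    """q-th quantile of the chi-square distribution with two degrees of freedom."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must lie in [0, 1)")
    return -2.0 * math.log(1.0 - q)


def chi2_df2_pvalue(stat):
    """Upper-tail probability of an observed LRT value under chi-square(2)."""
    return math.exp(-0.5 * max(stat, 0.0))


def empirical_quantile(values, q):
    """Order-statistic percentile of a sample, without interpolation."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


for level in (0.80, 0.90, 0.95, 0.99):
    print(level, round(chi2_df2_quantile(level), 3))

# Toy stand-in for LRT values recomputed on permuted phenotypes.
toy_permuted_lrt = [0.3, 2.9, 1.1, 0.7, 4.6, 1.8, 0.1, 6.2, 2.2, 0.9]
print(empirical_quantile(toy_permuted_lrt, 0.5))
```

Evaluating the asymptotic cutoff costs one logarithm per significance level, whereas each permutation cutoff requires refitting the mixture model on every permuted data set.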
Table 3: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with binary disease phenotype fitted with a logistic distribution assuming a 3-SNP model under different disease gene action modes. Columns: $\alpha_2/\alpha$, $n$, permutation percentiles, $\chi^2_2$ percentiles. Note: The permutation percentile is the averaged percentile out of 100 replicates.

For a normally distributed phenotype, we simulated data with different heritability levels. Details of the simulation can be found in Liu et al. (2005). Again, we assume different gene action modes (additive, dominant and recessive) as described above. Results are summarized in Tables 4 and 5. In general, the two methods produce fairly consistent cutoffs, where the consistency depends on heritability level and sample size. More consistent results are observed at larger heritability levels. For example, for the 2-SNP model when $n = 100$ and $d/a = 0$, the difference in the 80% threshold between the chi-square approximation and the permutation is 0.42 when $H^2$ is 0.1. This difference reduces to 0.1 when $H^2$ increases to 0.4 while the other conditions are held fixed. As the sample size increases, the cutoffs generated by the two methods become more consistent with each other.

Table 4: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with continuous disease phenotype fitted with a normal distribution assuming a 2-SNP model under different disease gene action modes. Columns: $d/a$, $n$, permutation percentiles at each heritability level $H^2$, $\chi^2_2$ percentiles. Note: The permutation percentile is the averaged percentile out of 100 replicates.

Table 5: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with continuous disease phenotype fitted with a normal distribution assuming a 3-SNP model under different disease gene action modes. Columns: $d/a$, $n$, permutation percentiles at each heritability level $H^2$, $\chi^2_2$ percentiles. Note: The permutation percentile is the averaged percentile out of 100 replicates.

The empirical type I error rate of the large-sample test was also investigated at the nominal 5% level. Figure 1 shows the performance of the chi-square approximation under different sample sizes and different disease trait distributions. Overall, the type I error rate is reasonably controlled for models fitted with different numbers of SNPs.

Figure 1: Type I error rate with chi-square approximation for disease phenotype simulated assuming logistic (L) and normal (N) distribution under the 2-SNP and 3-SNP models. Series: L 2SNP, L 3SNP, N 2SNP, N 3SNP. Horizontal axis: sample size. Vertical axis: type I error.

5 Conclusion

Statistical dissection of the association between genetic factors and disease phenotypes has been a long-term effort in gene mapping studies. With the development of the human HapMap project, massive amounts of high-throughput SNP data are being generated. The density of the SNP data is still increasing with advances in genotyping technology. The development of computationally efficient and statistically powerful analytical methods is critically important in unravelling causal disease variants. The methods developed by Liu et al. (2005) and Cui et al. (2007), termed nucleotide mapping in general, as well as various extensions and applications of the methods (e.g. Hou et al., 2007; Li et al., 2007; Wu et al., 2007; Pinedo et al., 2008), provide timely efforts in elucidating the genetic architecture of nucleotide patterns in association with a complex disease trait. However, given the increasing number of SNPs documented in public databases, the computational burden of assessing the statistical significance of the LRT in nucleotide mapping with permutation tests is substantial. In this article, we investigated the limiting distribution of the LRT and showed that the LRT is asymptotically chi-square with 2 degrees of freedom under the null hypothesis of no disease-gene association. We evaluated the performance of the chi-square approximation and compared it with the non-parametric permutation tests. The results indicate that the chi-square approximation performs well with moderate sample sizes, and hence can be applied in real data analysis for fast threshold determination.

Acknowledgement

The work of the first author was supported in part by NSF grant DMS

References

Akey, J., Jin, L., Xiong, M., Haplotypes vs. single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9.

Clark, A.G., The role of haplotypes in candidate gene studies. Genet. Epidemiol. 27.

Cui, Y.H., Fu, W., Sun, K.L., Romero, R., Wu, R., Mapping nucleotide sequences that encode complex binary disease traits with HapMap. Current Genomics 8.

Hou, W., Yap, J.S., Wu, S., Liu, T., Cheverud, J.M., Wu, R., Haplotyping a quantitative trait with a high-density map in experimental crosses. PLoS ONE 2(1): e

Lehmann, E.L., Theory of point estimation. Chapman and Hall, New York.

Li, H., Kim, B.R., Wu, R., Identification of quantitative trait nucleotides that regulate cancer growth: a simulation approach. J. Theor. Biol. 242.

Lou, X-Y., Casella, G., Littell, R.C., Yang, M.C.K., Wu, R., 2003. A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis in natural populations. Genetics 163.

Lynch, M., Walsh, B., Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA.

McCullagh, P., Nelder, J., Generalized Linear Models. London: Chapman and Hall.

Olson, J.M., Wijsman, E.M., Design and sample size considerations in the detection of linkage disequilibrium with a marker locus. Am. J. Hum. Genet. 55.

Pinedo, P., Wang, C., Li, Y., Rae, D., Wu, R., Risk haplotype analysis for bovine paratuberculosis. Mamm. Genome (in press).

Schaid, D.J., Evaluating associations of haplotypes with traits. Genet. Epidemiol. 27.

Schaid, D.J., Rowland, C.M., Tines, D.E., Jacobson, R.M., Poland, G.A., Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70.

van der Vaart, A. W., Asymptotic statistics. Cambridge University Press.

Wu, S., Yang, J., Wang, C., Wu, R., A general quantitative genetic model for haplotyping a complex trait in humans. Current Genomics 8.


Review We have covered so far: Single variant association analysis and effect size estimation GxE interaction and higher order >2 interaction Measurement error in dietary variables (nutritional epidemiology)

More information

Régression en grande dimension et épistasie par blocs pour les études d association

Régression en grande dimension et épistasie par blocs pour les études d association Régression en grande dimension et épistasie par blocs pour les études d association V. Stanislas, C. Dalmasso, C. Ambroise Laboratoire de Mathématiques et Modélisation d Évry "Statistique et Génome" 1

More information

Bi-level feature selection with applications to genetic association

Bi-level feature selection with applications to genetic association Bi-level feature selection with applications to genetic association studies October 15, 2008 Motivation In many applications, biological features possess a grouping structure Categorical variables may

More information

Logistic Regression Model for Analyzing Extended Haplotype Data

Logistic Regression Model for Analyzing Extended Haplotype Data Genetic Epidemiology 15:173 181 (1998) Logistic Regression Model for Analyzing Extended Haplotype Data Sylvan Wallenstein, 1 * Susan E. Hodge, 3 and Ainsley Weston 2 1 Department of Biomathematical Sciences,

More information

Affected Sibling Pairs. Biostatistics 666

Affected Sibling Pairs. Biostatistics 666 Affected Sibling airs Biostatistics 666 Today Discussion of linkage analysis using affected sibling pairs Our exploration will include several components we have seen before: A simple disease model IBD

More information

Binary trait mapping in experimental crosses with selective genotyping

Binary trait mapping in experimental crosses with selective genotyping Genetics: Published Articles Ahead of Print, published on May 4, 2009 as 10.1534/genetics.108.098913 Binary trait mapping in experimental crosses with selective genotyping Ani Manichaikul,1 and Karl W.

More information

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago

USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA. By Xiaoquan Wen and Matthew Stephens University of Chicago Submitted to the Annals of Applied Statistics USING LINEAR PREDICTORS TO IMPUTE ALLELE FREQUENCIES FROM SUMMARY OR POOLED GENOTYPE DATA By Xiaoquan Wen and Matthew Stephens University of Chicago Recently-developed

More information

Supporting Information

Supporting Information Supporting Information Hammer et al. 10.1073/pnas.1109300108 SI Materials and Methods Two-Population Model. Estimating demographic parameters. For each pair of sub-saharan African populations we consider

More information