On the limiting distribution of the likelihood ratio test in nucleotide mapping of complex disease


Yuehua Cui (1) and Dong-Yun Kim (2)

(1) Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824
(2) Department of Statistics, Virginia Tech, Blacksburg, Virginia 24061

Abstract

Detecting the pattern and distribution of DNA variants across the genome is essential in understanding the etiology of complex human disease. Recently Liu et al. (2005) and Cui et al. (2007) developed a novel nucleotide mapping method under the mixture model framework to target specific DNA sequence variants underlying complex disease. The likelihood ratio test (LRT) was applied to test the association between a risk haplotype and a complex disease, and permutation tests were used to assess the significance of the LRT. This, however, imposes a heavy computational burden when the nucleotide mapping method is extended to large-scale, high-density genome-wide single nucleotide polymorphism data. Here we theoretically investigate the limiting distribution of the LRT under the mixture model-based nucleotide mapping framework and show that it asymptotically follows a $\chi^2$ distribution. Simulations show that the limiting distribution has good finite-sample properties. The study contributes to the theory of gene mapping.

Key words: Asymptotic threshold, Mixture model, Risk haplotype, Single nucleotide polymorphism

AMS 2000 subject classifications: 62F05, 60F05, 92D10.

1 Introduction

Recent developments in biotechnology have produced massive amounts of high-dimensional genetic data. The hunt for disease genes has shifted radically from traditional approaches focusing on chromosome segments to single nucleotide variants called single nucleotide polymorphisms (SNPs). Broadly, methods for disease-gene association studies focus either on single-SNP analysis or on combinations of SNPs, termed haplotype analysis. Statistical analysis of single SNPs can be done by testing allele or genotype frequency differences between affected and unaffected samples with a chi-square test or logistic regression (e.g. Olson and Wijsman, 1994). When multiple disease variants act in cis, haplotype-based analysis appears to be more powerful than single-SNP analysis (Schaid et al., 2002). The relative merit of haplotype-based analysis over the single-locus approach has been shown in a number of studies (e.g. Akey et al., 2001; Clark, 2004; Schaid, 2004).

With the development of the human HapMap project, large numbers of DNA variants can be genotyped on different array platforms. As SNP genotyping density increases, a comprehensive human sequence variant map will eventually become available. It is essential to target specific DNA sequence variants that underlie complex diseases. In previous studies, Liu et al. (2005) and Cui et al. (2007) developed a nucleotide mapping approach that targets SNP variants structured in a haplotype format. Specific risk haplotypes that trigger significant effects on a disease trait can be formulated and tested, which is one of the advantages of these methods over traditional haplotype-based analysis. Based on the patterns of combination of risk and non-risk haplotypes, a novel grouping technique is applied to group diplotypes with common genetic effects. Thus, the degrees of freedom of an association test are greatly reduced, regardless of the number of SNPs fitted in the model.

The method was first developed for phenotypic data arising from a normal distribution (Liu et al., 2005). This assumption was relaxed in Cui et al. (2007), where data arising from an exponential family can be fitted. Statistical inference procedures were derived to quantify the relative risk of the different haplotypes an individual may carry (Cui et al., 2007). Li et al. (2007) recently applied the nucleotide mapping idea to longitudinal data. Other extensions and applications of the mapping method have also been proposed (e.g. Hou et al., 2007; Wu et al., 2007; Pinedo et al., 2008).

A common issue in haplotype-based analysis is the unknown linkage phase. When two or more heterozygous loci are involved, the linkage phase cannot be determined explicitly and is instead treated as missing data. Statistical mixture models are commonly applied to data with missing information, and the nucleotide mapping methods adopt them to deal with haplotype phase uncertainty. The likelihood ratio test (LRT) was applied to test the association of different nucleotide patterns with a disease trait. It is commonly recognized that the usual regularity conditions for the asymptotic $\chi^2$ distribution of the LRT do not hold in this case. Thus, permutation tests were used to assess statistical significance in the earlier works. Even though tagging SNPs can be used to reduce the dimension of the SNP variants, the extensive computation required by permutation tests remains a huge burden. The computational cost greatly hinders the application of the methods, especially in extending them to a high-density genome-wide scale. A fast asymptotic approach to threshold determination is highly desirable to make these methods more practical.

It is thus the purpose of this paper to study the limiting distribution of the LRT under the nucleotide mapping framework. In the next section, we start with a brief review of the nucleotide mapping methods. Then, using the local asymptotic normality (LAN) of the test statistic, we show that the limiting distribution of the LRT is a $\chi^2$ distribution with two degrees of freedom, regardless of the number of SNPs fitted in the model, and that the asymptotic result holds for a wide range of phenotype distributions. Simulations are conducted to evaluate the finite-sample properties of the test statistic under the mixture model nucleotide mapping framework in comparison with distribution-free permutation tests.
Results show that for moderate sample sizes the thresholds for the test statistic based on the limiting distribution are virtually indistinguishable from those based on permutation tests. Thus fast threshold determination can be based on the $\chi^2$ approximation rather than on time-consuming permutation tests.

2 The nucleotide mapping framework and the LRT

We begin this section by briefly introducing the nucleotide mapping method in a generalized linear model framework. Consider $K$ ($K \ge 2$) SNPs (or tag SNPs) within a haplotype block constructed from a number of bi-allelic loci. SNPs within a haplotype block are correlated due to strong linkage disequilibrium (LD), whereas SNPs in different blocks are less correlated, with weak LD. Denote the two alleles of the $k$th SNP within a block as $Q^k_{r_k}$ ($r_k = 1, 2$; $k = 1, \ldots, K$), with allele frequencies denoted by $p^{(k)}_{r_k}$. The subscript $r_k$ ($= 1, 2$) denotes the allele. There are at most $2^K$ possible haplotypes formed by the random combination of these $K$ SNPs. In reality, the number of observed haplotypes may be much smaller because of LD among the SNPs in a block.

Let us denote the haplotype structure as $[Q^1_{r_1} Q^2_{r_2} \cdots Q^K_{r_K}]$, with corresponding haplotype frequencies denoted as $p_{r_1 r_2 \cdots r_K}$. We assume that alleles with the same value of $r_k$ are located on the same chromosome. In practice, haplotypes are unobservable and only unphased multi-locus genotypes are observed, of the form $Q^1_{r_1} Q^1_{s_1} / Q^2_{r_2} Q^2_{s_2} / \cdots / Q^K_{r_K} Q^K_{s_K}$ ($r_k, s_k = 1, 2$), where $s_k$ denotes the allele located on the chromosome homologous to the one carrying $r_k$. The corresponding genotype frequency and number of observations are expressed as $P_{r_1 s_1 / r_2 s_2 / \cdots / r_K s_K}$ and $n_{r_1 s_1 / r_2 s_2 / \cdots / r_K s_K}$, respectively. Note that we use capital $P$ to denote the observed genotype frequency and lower case $p$ to denote the haplotype frequency.

The combination of two haplotypes forms a diplotype, denoted as $[Q^1_{r_1} Q^2_{r_2} \cdots Q^K_{r_K}][Q^1_{s_1} Q^2_{s_2} \cdots Q^K_{s_K}]$. Assuming Hardy-Weinberg equilibrium, the diplotype frequency can be expressed as a product of the haplotype frequencies, i.e., $P_{[r_1 r_2 \cdots r_K][s_1 s_2 \cdots s_K]} = p_{r_1 r_2 \cdots r_K}\, p_{s_1 s_2 \cdots s_K}$. When two or more of the SNPs are heterozygous, the linkage phase is unknown and must be inferred for haplotype-based analysis. The problem of unknown phase leads naturally to a mixture distribution in statistical modeling.
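To make the phase ambiguity concrete, the following minimal Python sketch enumerates the diplotypes consistent with an unphased genotype. The function name `compatible_diplotypes` and the tuple encoding of genotypes are illustrative assumptions, not part of the original method.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Enumerate unordered diplotypes consistent with an unphased genotype.

    `genotype` is a list of allele pairs, one per SNP, e.g. [(1, 1), (1, 2)].
    Alleles are coded 1 or 2, as in the text. (Hypothetical helper.)
    """
    diplotypes = set()
    # At each SNP, decide which allele of the pair sits on the first chromosome.
    for flips in product([False, True], repeat=len(genotype)):
        hap1 = tuple(b if flip else a for (a, b), flip in zip(genotype, flips))
        hap2 = tuple(a if flip else b for (a, b), flip in zip(genotype, flips))
        # A diplotype is an unordered pair of haplotypes.
        diplotypes.add(tuple(sorted((hap1, hap2))))
    return sorted(diplotypes)


# Three SNPs with two heterozygous loci (genotype 11/12/12).
two_het = compatible_diplotypes([(1, 1), (1, 2), (1, 2)])
# Three SNPs, all heterozygous (genotype 12/12/12).
three_het = compatible_diplotypes([(1, 2), (1, 2), (1, 2)])
print(len(two_het), len(three_het))
```

A genotype with $h \ge 1$ heterozygous SNPs is consistent with $2^{h-1}$ diplotypes, so ambiguity arises only when $h \ge 2$.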
For example, consider three SNPs in a haplotype block. The genotype $Q^1_1 Q^1_1 / Q^2_1 Q^2_2 / Q^3_1 Q^3_2$ could form two different diplotypes, $[Q^1_1 Q^2_1 Q^3_1][Q^1_1 Q^2_2 Q^3_2]$ and $[Q^1_1 Q^2_1 Q^3_2][Q^1_1 Q^2_2 Q^3_1]$, while the genotype $Q^1_1 Q^1_2 / Q^2_1 Q^2_2 / Q^3_1 Q^3_2$ could form four different diplotypes. In nucleotide mapping, one haplotype is assumed to be the risk haplotype, and the risk haplotype can be selected through statistical model selection (see Liu et al. 2005; Cui et al. 2007). For the three-SNP case, if we assume that $[Q^1_1 Q^2_1 Q^3_1]$ is the risk haplotype, we can then formulate three different composite diplotypes, expressed as $[Q^1_1 Q^2_1 Q^3_1][Q^1_1 Q^2_1 Q^3_1]$, $[Q^1_1 Q^2_1 Q^3_1][\overline{Q^1_1 Q^2_1 Q^3_1}]$ and $[\overline{Q^1_1 Q^2_1 Q^3_1}][\overline{Q^1_1 Q^2_1 Q^3_1}]$, where the overbar denotes any non-risk haplotype. Note that the formulation of the composite diplotype is the foundation of the nucleotide mapping approach. By assuming one haplotype to be the risk haplotype and all the others to be non-risk ones, the effect of the risk haplotype can be modeled in terms of the three composite diplotypes. Applying traditional quantitative genetic theory (Lynn and Walsh, 1989), the genetic effect of the risk haplotype can be modeled through the additive effect (denoted as $a$) and the dominant effect (denoted as $d$) of the composite diplotypes.

The multilocus haplotype frequency can be formulated as a function of allele frequencies and LD parameters of different orders (Lou et al., 2003). For example, a haplotype frequency, denoted as $p_{r_1 r_2 \cdots r_K}$, can be decomposed into the following components:

$$
\begin{aligned}
p_{r_1 r_2 \cdots r_K} ={}& \underbrace{p_{r_1} p_{r_2} \cdots p_{r_K}}_{\text{No LD}} \\
&+ \underbrace{(-1)^{r_{K-1}+r_K}\, p_{r_1} \cdots p_{r_{K-2}}\, D_{(K-1)K} + \cdots + (-1)^{r_1+r_2}\, p_{r_3} \cdots p_{r_K}\, D_{12}}_{\text{Digenic LD}} \\
&+ \underbrace{(-1)^{r_{K-2}+r_{K-1}+r_K}\, p_{r_1} \cdots p_{r_{K-3}}\, D_{(K-2)(K-1)K} + \cdots + (-1)^{r_1+r_2+r_3}\, p_{r_4} \cdots p_{r_K}\, D_{123}}_{\text{Trigenic LD}} \\
&+ \cdots + \underbrace{(-1)^{r_1+\cdots+r_K}\, D_{1 \cdots K}}_{K\text{-genic LD}}
\end{aligned}
\tag{1}
$$

where the $D$'s are the linkage disequilibria of different orders among particular htSNPs. For a 2-SNP model, this reduces to $p_{r_1 r_2} = p_{r_1} p_{r_2} + (-1)^{r_1+r_2} D$, $r_1, r_2 = 1, 2$, where $r_1, r_2$ are indicator variables for SNP 1 and 2, respectively.

Let $y$ denote a measured disease trait, which can be continuous or discrete depending on the nature of the disease status. For example, when studying obesity, the Body Mass Index can be a continuous phenotype, while for most human diseases the phenotype is binary, corresponding to either affected or unaffected status. Let $X$ denote a matrix of numerical codes corresponding to the composite genotype, $G$ say, including the intercept as the first column. Suppose that the genetic covariates influence only the mean of the trait and not the scale, so that their effects can be summarized by a function of the linear predictor

$$\eta = X\beta \tag{2}$$

where $\beta$ contains the regression parameters for the genetic effect of the composite diplotypes on the disease trait.
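As a small illustration of the 2-SNP reduction $p_{r_1 r_2} = p_{r_1} p_{r_2} + (-1)^{r_1+r_2} D$, the sketch below computes the four haplotype frequencies from two allele frequencies and a digenic LD value. The function name and the numerical inputs are assumptions chosen only for illustration.

```python
def two_snp_haplotype_freqs(p1, q1, D):
    """Haplotype frequencies p_{r1 r2} = p_{r1} p_{r2} + (-1)^(r1 + r2) * D.

    p1: frequency of allele 1 at SNP 1; q1: frequency of allele 1 at SNP 2;
    D: digenic linkage disequilibrium between the two SNPs.
    (Hypothetical helper; inputs are illustrative.)
    """
    allele_freq_snp1 = {1: p1, 2: 1.0 - p1}
    allele_freq_snp2 = {1: q1, 2: 1.0 - q1}
    freqs = {}
    for r1 in (1, 2):
        for r2 in (1, 2):
            sign = (-1) ** (r1 + r2)
            freqs[(r1, r2)] = allele_freq_snp1[r1] * allele_freq_snp2[r2] + sign * D
    if min(freqs.values()) < 0:
        raise ValueError("D is outside the range allowed by the allele frequencies")
    return freqs


hap = two_snp_haplotype_freqs(p1=0.6, q1=0.7, D=0.05)
print(hap)
```

The signs make the four frequencies sum to one and preserve the marginal allele frequencies, whatever the value of $D$ within its admissible range.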
Here we assume the disease phenotype has an exponential family distribution and can be modeled through the generalized linear model (McCullagh and Nelder, 1989). For a binary disease response, we can apply the logit model with the natural logit link function. For a normal or Poisson type trait, the identity or log link function is used, respectively (McCullagh and Nelder, 1989). With the three composite diplotype patterns, the effect of genetic association can be assessed by testing $H_0: a = d = 0$ using the likelihood ratio test.

For simplicity, we start with a 2-SNP model to show the derivation of the limiting distribution of the LRT. Extensions to cases with more than two SNPs are derived later. Table 1 shows a complete list of possible genotype and diplotype configurations as well as the genotypic means. With $[11]$ as the risk haplotype and $[\overline{11}]$ denoting any non-risk haplotype, the linear combination of genetic effects in (2) can be simplified as $\mu + a$, $\mu + d$ and $\mu - a$, corresponding to composite diplotypes $[11][11]$, $[11][\overline{11}]$ and $[\overline{11}][\overline{11}]$, respectively. The genotypic means can be reparameterized as $\lambda_1$ ($= \mu + a$), $\lambda_2$ ($= \mu + d$) and $\lambda_3$ ($= \mu - a$) for the same three composite diplotypes. To test the association of a risk haplotype with a disease trait, we can test $H_0: a = d = 0$, or equivalently, $H_0: \lambda_1 = \lambda_2 = \lambda_3 = \lambda$ for some unknown common $\lambda$.

With the configuration listed in Table 1, the observed disease phenotype data can be categorized into four groups. Three groups have distinct phase information and their phenotypes are represented as $y_1 = (y_{11}, \ldots, y_{1n_1})^T$, $y_2 = (y_{21}, \ldots, y_{2n_2})^T$ and $y_3 = (y_{31}, \ldots, y_{3n_3})^T$. The fourth group corresponds to the one with missing linkage phase information and is denoted as $y_4 = (y_{41}, \ldots, y_{4n_4})^T$. For the three groups with distinct phase information, the density functions are given by

$$y_{ji} \sim f_j(y; \lambda_1, \lambda_2, \lambda_3), \quad i = 1, \ldots, n_j;\ j = 1, 2, 3 \tag{3}$$

Specifically, $f_j(y; \lambda_1, \lambda_2, \lambda_3) = f_j(y \mid \lambda_j)$.
The fourth group $y_4$ involves a mixture distribution of the form

$$y_{4i} \sim \phi f_2(y \mid \lambda_2) + (1 - \phi) f_3(y \mid \lambda_3), \quad i = 1, \ldots, n_4, \tag{4}$$

where $\lambda_1, \lambda_2, \lambda_3$ are unknown and $\phi$ is the unknown mixture proportion, with $0 < \phi < 1$, which can be estimated from the population frequency parameters. Assume independent samples, and let $n = \sum_{j=1}^{4} n_j$ denote the total sample size. The log-likelihood function $l_n$ can be expressed as

$$l_n(\lambda_1, \lambda_2, \lambda_3; \phi) = \sum_{j=1}^{3} \sum_{i=1}^{n_j} \log[f_j(y_{ji} \mid \lambda_j)] + \sum_{i=1}^{n_4} \log[\phi f_2(y_{4i} \mid \lambda_2) + (1 - \phi) f_3(y_{4i} \mid \lambda_3)] \tag{5}$$
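The structure of (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a normal trait with a common, known standard deviation, toy phenotype values, and hypothetical function names; the original method allows general exponential family densities.

```python
import math


def normal_logpdf(y, mean, sd):
    """Log density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_likelihood(groups, lam, phi, sd=1.0):
    """Log-likelihood (5) for a normal trait with common, known sd (assumed).

    groups: dict with keys 1..4 mapping to lists of phenotypes y_j.
    lam: (lambda_1, lambda_2, lambda_3); phi: mixture proportion for group 4.
    """
    total = 0.0
    # Groups 1-3 have known phase: each contributes a single-density term.
    for j in (1, 2, 3):
        total += sum(normal_logpdf(y, lam[j - 1], sd) for y in groups[j])
    # Group 4 (double heterozygotes) is a two-component mixture.
    for y in groups[4]:
        mix = (phi * math.exp(normal_logpdf(y, lam[1], sd))
               + (1.0 - phi) * math.exp(normal_logpdf(y, lam[2], sd)))
        total += math.log(mix)
    return total


# Toy data, purely illustrative.
toy = {1: [1.2, 0.8], 2: [0.5, 0.1, 0.9], 3: [-0.3, 0.2], 4: [0.4, -0.1]}
ll_null_a = log_likelihood(toy, (0.4, 0.4, 0.4), phi=0.2)
ll_null_b = log_likelihood(toy, (0.4, 0.4, 0.4), phi=0.9)
ll_alt = log_likelihood(toy, (1.0, 0.5, 0.0), phi=0.2)
print(ll_null_a, ll_null_b, ll_alt)
```

When the three means coincide, the two mixture components are identical, so the value of the log-likelihood no longer depends on $\phi$; the two null evaluations above agree up to rounding.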

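Table 1: Possible diplotype and composite genotype configurations of nine genotypes at two SNPs and their haplotype composition frequencies

| Genotype | Diplotype configuration | Diplotype frequency | Relative frequency | Composite diplotype symbol | Mean parameter | Observation |
|---|---|---|---|---|---|---|
| 11/11 | $[11][11]$ | $P_{[11][11]} = p_{11}^2$ | 1 | $[11][11]$ | $\lambda_1$ | $n_{11/11}$ |
| 11/12 | $[11][12]$ | $P_{[11][12]} = 2p_{11}p_{12}$ | 1 | $[11][\overline{11}]$ | $\lambda_2$ | $n_{11/12}$ |
| 11/22 | $[12][12]$ | $P_{[12][12]} = p_{12}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{11/22}$ |
| 12/11 | $[11][21]$ | $P_{[11][21]} = 2p_{11}p_{21}$ | 1 | $[11][\overline{11}]$ | $\lambda_2$ | $n_{12/11}$ |
| 12/12 | $[11][22]$ | $P_{[11][22]} = 2p_{11}p_{22}$ | $\phi$ | $[11][\overline{11}]$ | $\lambda_2$ | $n_{12/12}$ |
| 12/12 | $[12][21]$ | $P_{[12][21]} = 2p_{12}p_{21}$ | $1 - \phi$ | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{12/12}$ |
| 12/22 | $[12][22]$ | $P_{[12][22]} = 2p_{12}p_{22}$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{12/22}$ |
| 22/11 | $[21][21]$ | $P_{[21][21]} = p_{21}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/11}$ |
| 22/12 | $[21][22]$ | $P_{[21][22]} = 2p_{21}p_{22}$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/12}$ |
| 22/22 | $[22][22]$ | $P_{[22][22]} = p_{22}^2$ | 1 | $[\overline{11}][\overline{11}]$ | $\lambda_3$ | $n_{22/22}$ |

Note that $\phi = p_{11} p_{22} / (p_{11} p_{22} + p_{12} p_{21})$, where $p_{ij}$ is the frequency of haplotype $ij$ for $i, j = 1, 2$. The relative frequency refers to the probability that a specific diplotype is observed given the genotype. For an unambiguous genotype (with known phase), the relative frequency is 1. For the double heterozygous genotype 12/12, the probability of observing diplotype $[11][22]$ is $\phi$, and that of observing diplotype $[12][21]$ is $1 - \phi$. The two 12/12 rows share the single observed count $n_{12/12}$.

Under $H_0$, the mixture distribution collapses to a single distribution free of $\phi$. So, the log-likelihood function is

$$l_n(\lambda, \lambda, \lambda; \phi) = \sum_{i=1}^{n} \log[f(y_i \mid \lambda)]$$

Let $\hat\lambda_j$ ($j = 1, 2, 3$) be the maximum likelihood estimate (MLE) of $\lambda_j$ under $H_1$, and let $\tilde\lambda$ be the MLE of $\lambda$ under $H_0$. Following the notation of Van Der Vaart (1998), introduce $\lambda_j = \lambda + h_j n^{-1/2}$, where $h_j$, $j = 1, 2, 3$, are arbitrary real numbers.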
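Define $\Lambda_n(\lambda; \phi)$ as the LRT statistic of the form

$$
\begin{aligned}
\Lambda_n(\lambda; \phi) &= -2\left( l_n(\tilde\lambda, \tilde\lambda, \tilde\lambda) - l_n(\hat\lambda_1, \hat\lambda_2, \hat\lambda_3; \phi) \right) \\
&= -2\left( \sup_{h_1 = h_2 = h_3 = h} l_n\!\left(\lambda + \tfrac{h}{\sqrt n}, \lambda + \tfrac{h}{\sqrt n}, \lambda + \tfrac{h}{\sqrt n}\right) - \sup_{h_1, h_2, h_3} l_n\!\left(\lambda + \tfrac{h_1}{\sqrt n}, \lambda + \tfrac{h_2}{\sqrt n}, \lambda + \tfrac{h_3}{\sqrt n}; \phi\right) \right)
\end{aligned}
\tag{6}
$$

The test rejects $H_0$ if $\Lambda_n(\lambda; \phi)$ exceeds a critical value as identified below. Note that the likelihood function under the null is not nested under the alternative in the usual sense, and hence the regularity conditions for the asymptotic chi-square distribution do not directly apply in the current setting. Here we show that the LRT converges to a chi-square distribution with two degrees of freedom. We also generalize the results to the multiple-SNP case.

3 The limiting distribution of the LRT

3.1 Case when K = 2

Let $\xrightarrow{D}$ and $\xrightarrow{P}$ denote convergence in distribution and in probability, respectively. Let $Z = (Z_1, Z_2, Z_3, Z_4)^T$, where $Z_1, Z_2, Z_3, Z_4$ are iid standard normal random variables. Let $h = (h_1, h_2, h_3)^T$. Denote $\nabla$ and $\nabla^2$ as the gradient and Hessian operators, respectively, and let $I(\lambda)$ denote the Fisher information. Introduce

$$w_n(h_1, h_2, h_3) = \frac{1}{\sqrt n} \nabla l_n(\lambda, \lambda, \lambda)^T h + \frac{1}{2n} h^T \nabla^2 l_n(\lambda, \lambda, \lambda)\, h \tag{7}$$

and

$$w(h_1, h_2, h_3) = \sqrt{I(\lambda)}\, (BZ)^T h - \frac{I(\lambda)}{2} h^T A h \tag{8}$$

where

$$A = \begin{pmatrix} p_1 & 0 & 0 \\ 0 & p_2 + \phi^2 p_4 & \phi(1 - \phi) p_4 \\ 0 & \phi(1 - \phi) p_4 & p_3 + (1 - \phi)^2 p_4 \end{pmatrix} \tag{9}$$

and

$$B = \begin{pmatrix} \sqrt{p_1} & 0 & 0 & 0 \\ 0 & \sqrt{p_2} & 0 & \sqrt{p_4}\,\phi \\ 0 & 0 & \sqrt{p_3} & \sqrt{p_4}\,(1 - \phi) \end{pmatrix} \tag{10}$$

By the second-order Taylor expansion of the log-likelihood function about $(\lambda, \lambda, \lambda)^T$, we have

$$l_n\!\left(\lambda + \tfrac{h_1}{\sqrt n}, \lambda + \tfrac{h_2}{\sqrt n}, \lambda + \tfrac{h_3}{\sqrt n}; \phi\right) \approx l_n(\lambda, \lambda, \lambda) + w_n(h_1, h_2, h_3) \tag{11}$$

Lemma 1. Let $y_{ji}$, $i = 1, \ldots, n_j$, $j = 1, \ldots, 4$, be independent random variables having the densities given in (3) and (4). Under suitable regularity conditions on the density functions $f_j(y)$, $j = 1, 2, 3$ (as on page 118 of Lehmann (1991)), for any real numbers $h_1$, $h_2$ and $h_3$,

$$w_n(h_1, h_2, h_3) \xrightarrow{D} w(h_1, h_2, h_3) \quad \text{as } n \to \infty.$$

Proof: Define $p_j$ ($j = 1, \ldots, 4$) as the limiting proportion for group $j$, i.e., $p_j = \lim_{n \to \infty} (n_j / n)$. Let $w_n(h_1, h_2, h_3)$ be as in (7). Then,

$$\nabla l_n(\lambda, \lambda, \lambda) = \left( \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1 = \lambda},\ \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2 = \lambda},\ \frac{\partial l_n}{\partial \lambda_3}\bigg|_{\lambda_3 = \lambda} \right)^T$$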

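and we can write

$$\frac{1}{\sqrt n} \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1 = \lambda} = \sqrt{\frac{n_1}{n}}\, \frac{1}{\sqrt{n_1}} \sum_{i=1}^{n_1} \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})}$$

where $\dot f_\lambda$ denotes the first derivative of the density $f$ with respect to the parameter $\lambda$. Since $E_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = 0$ and $\mathrm{Var}_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = I(\lambda)$, by Lemma 6.1, page 118 in Lehmann (1991),

$$\frac{1}{\sqrt n} \frac{\partial l_n}{\partial \lambda_1}\bigg|_{\lambda_1 = \lambda} \xrightarrow{D} \sqrt{p_1 I(\lambda)}\, Z_1$$

Similarly,

$$\frac{1}{\sqrt n} \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2 = \lambda} = \sqrt{\frac{n_2}{n}}\, \frac{1}{\sqrt{n_2}} \sum_{i=1}^{n_2} \frac{\dot f_\lambda(y_{2i})}{f_\lambda(y_{2i})} + \phi \sqrt{\frac{n_4}{n}}\, \frac{1}{\sqrt{n_4}} \sum_{i=1}^{n_4} \frac{\dot f_\lambda(y_{4i})}{f_\lambda(y_{4i})}$$

Thus,

$$\frac{1}{\sqrt n} \frac{\partial l_n}{\partial \lambda_2}\bigg|_{\lambda_2 = \lambda} \xrightarrow{D} \sqrt{I(\lambda)} \left( \sqrt{p_2}\, Z_2 + \phi \sqrt{p_4}\, Z_4 \right)$$

and

$$\frac{1}{\sqrt n} \frac{\partial l_n}{\partial \lambda_3}\bigg|_{\lambda_3 = \lambda} \xrightarrow{D} \sqrt{I(\lambda)} \left( \sqrt{p_3}\, Z_3 + (1 - \phi) \sqrt{p_4}\, Z_4 \right)$$

Let $\ddot f_\lambda$ denote the second derivative of the density $f$ with respect to the parameter $\lambda$. For the Hessian matrix $\nabla^2 l_n(\lambda, \lambda, \lambda)$,

$$\frac{\partial^2 l_n}{\partial \lambda_1^2}\bigg|_{\lambda_1 = \lambda} = \sum_{i=1}^{n_1} \frac{\ddot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} - \sum_{i=1}^{n_1} \left( \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} \right)^2$$

Since $E_\lambda\!\left( \ddot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right) = 0$ and $E_\lambda\!\left( \dot f_\lambda(y_{1i}) / f_\lambda(y_{1i}) \right)^2 = I(\lambda)$, by the Law of Large Numbers,

$$\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_1^2}\bigg|_{\lambda_1 = \lambda} = \frac{n_1}{n} \frac{1}{n_1} \sum_{i=1}^{n_1} \frac{\ddot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} - \frac{n_1}{n} \frac{1}{n_1} \sum_{i=1}^{n_1} \left( \frac{\dot f_\lambda(y_{1i})}{f_\lambda(y_{1i})} \right)^2 \xrightarrow{P} -p_1 I(\lambda)$$

and

$$\frac{\partial^2 l_n}{\partial \lambda_1 \partial \lambda_2} = \frac{\partial^2 l_n}{\partial \lambda_1 \partial \lambda_3} = 0$$

Essentially following the same idea, it can be shown that

$$
\begin{aligned}
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_2^2}\bigg|_{\lambda_2 = \lambda} &\xrightarrow{P} -I(\lambda)\,(p_2 + \phi^2 p_4) \\
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_3^2}\bigg|_{\lambda_3 = \lambda} &\xrightarrow{P} -I(\lambda)\,(p_3 + (1 - \phi)^2 p_4) \\
\frac{1}{n} \frac{\partial^2 l_n}{\partial \lambda_2 \partial \lambda_3}\bigg|_{\lambda_2 = \lambda_3 = \lambda} &\xrightarrow{P} -I(\lambda)\,\phi(1 - \phi)\, p_4
\end{aligned}
$$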

Thus, $w_n(h_1, h_2, h_3) \xrightarrow{D} w(h_1, h_2, h_3)$ as $n \to \infty$.

Then we have

$$\sup_{h_1, h_2, h_3} w(h_1, h_2, h_3) = \frac{1}{2} (BZ)^T A^{-1} (BZ) \tag{12}$$

and

$$\sup_{h_1 = h_2 = h_3 = h} w(h, h, h) = \frac{1}{2} (BZ)^T J BZ \tag{13}$$

where $J$ is a $3 \times 3$ matrix whose entries are all 1's and $A$, $B$ are as in (9) and (10). With Lemma 1, we have the following theorem.

Theorem 1. Let the LRT $\Lambda_n(\lambda; \phi)$ be as in (6). Under the same regularity conditions as in Lemma 1, if the null hypothesis is true, then $\Lambda_n(\lambda; \phi)$ converges in distribution to a chi-square distribution with two degrees of freedom when $\phi$ is known.

Proof. From equations (6) and (11),

$$\Lambda_n(\lambda; \phi) \approx -2 \left( \sup_{h_1 = h_2 = h_3 = h} w_n(h, h, h) - \sup_{h_1, h_2, h_3} w_n(h_1, h_2, h_3) \right)$$

By Lemma 1,

$$\Lambda_n(\lambda; \phi) \xrightarrow{D} \Lambda(\lambda; \phi) = -2 \left( \sup_{h_1 = h_2 = h_3 = h} w(h, h, h) - \sup_{h_1, h_2, h_3} w(h_1, h_2, h_3) \right) = Z^T M Z$$

where $M = B^T (A^{-1} - J) B$ and $w_n$ and $w$ are as in (7) and (8). It can be shown by routine algebra that $M$ is a $4 \times 4$ idempotent matrix with rank 2: since $BB^T = A$ and $\mathbf{1}^T A \mathbf{1} = p_1 + p_2 + p_3 + p_4 = 1$, we have $JAJ = J$, so that $M^2 = B^T (A^{-1} - 2J + JAJ) B = M$ and $\mathrm{tr}(M) = \mathrm{tr}(I_3 - JA) = 3 - 1 = 2$. Since $Z$ has a multivariate normal distribution with zero mean and identity covariance matrix, by standard multivariate distribution theory $Z^T M Z$ has a $\chi^2$ distribution with two degrees of freedom. This completes the proof of Theorem 1.

Theorem 1 was proved for known $\phi$. However, the parameter $\phi$ is often unknown and has to be estimated from data. The nucleotide mapping methods proposed in Liu et al. (2005) and Cui et al. (2007) applied a two-stage estimation procedure. The first stage is to estimate the haplotype frequencies. Denote $\hat p_{11}$, $\hat p_{12}$, $\hat p_{21}$ and $\hat p_{22}$ as the MLEs of the four haplotype frequencies, which can be obtained by formulating a multinomial likelihood function. Details can be found in Liu et al. (2005). Then the MLE of $\phi$ can be obtained by

$$\hat\phi = \frac{\hat p_{11} \hat p_{22}}{\hat p_{11} \hat p_{22} + \hat p_{12} \hat p_{21}}$$

and $\hat\phi \xrightarrow{P} \phi$ as $n \to \infty$. The estimated $\hat\phi$ is then plugged into (5) to estimate the quantitative parameters $\lambda_j$.
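The plug-in step for the mixture proportion can be sketched as follows. The function name and the haplotype frequency values are illustrative assumptions; in practice the frequencies would come from the first-stage multinomial likelihood described in Liu et al. (2005), which is not reproduced here.

```python
def mixture_proportion(p11, p12, p21, p22):
    """phi = p11*p22 / (p11*p22 + p12*p21) for the double heterozygote 12/12.

    Arguments are (estimated) haplotype frequencies. (Hypothetical helper.)
    """
    phase_a = p11 * p22  # weight of diplotype [11][22]
    phase_b = p12 * p21  # weight of diplotype [12][21]
    return phase_a / (phase_a + phase_b)


# Illustrative haplotype frequencies with positive LD between the two SNPs.
phi_hat = mixture_proportion(0.47, 0.13, 0.23, 0.17)
# Under linkage equilibrium, p11*p22 equals p12*p21, so both phases are equally likely.
phi_no_ld = mixture_proportion(0.42, 0.18, 0.28, 0.12)
print(phi_hat, phi_no_ld)
```

Because $\hat\phi$ depends only on the genotype data and not on the phenotypes, it can be computed once per SNP pair before the second-stage estimation of the $\lambda_j$.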

Theorem 2. Let $\Lambda_n(\lambda; \hat\phi)$ be the LRT obtained by substituting $\hat\phi$ for $\phi$. With the same regularity conditions as in Lemma 1, under $H_0$, $\Lambda_n(\lambda; \hat\phi)$ converges in distribution to a $\chi^2$ distribution with two degrees of freedom as $n \to \infty$.

Proof: Write

$$\Lambda_n(\lambda; \hat\phi) = \left( \Lambda_n(\lambda; \hat\phi) - \Lambda_n(\lambda; \phi) \right) + \Lambda_n(\lambda; \phi).$$

Note that under $H_0$, $\Lambda_n(\lambda; \hat\phi) - \Lambda_n(\lambda; \phi) \xrightarrow{P} 0$ as $n \to \infty$. By Theorem 1, we have $\Lambda_n(\lambda; \phi) \xrightarrow{D} \chi^2_2$. Then by Slutsky's theorem, $\Lambda_n(\lambda; \hat\phi) \xrightarrow{D} \chi^2_2$. This completes the proof.

3.2 Case when K = 3

Now consider the case when three SNPs form a haplotype. The maximum number of possible haplotypes is $2^3 = 8$. A detailed list of the configurations for 3 SNPs, similar to Table 1, is given in Table 1 of Li et al. (2007). As the number of SNPs increases, the number of possible heterozygous loci increases, resulting in an exponentially increasing number of mixture components in the likelihood function. With 3 SNPs considered, there are a total of 7 possible mixture components in the likelihood function. However, three mixtures involve the same mean parameters and hence are non-informative and can be collapsed. Thus, only four mixture components are informative.

The relative merit of the nucleotide mapping methods is that even though the number of SNPs increases, the number of association parameters does not. To illustrate the idea, assume that $[111]$ is the risk haplotype. As shown in Table 1 of Li et al. (2007), three composite diplotypes can be formulated based on this risk haplotype, namely $[111][111]$, $[111][\overline{111}]$ and $[\overline{111}][\overline{111}]$, where $[\overline{111}]$ denotes any non-risk haplotype. As in the 2-SNP model, the genetic effect of these three composite diplotypes can be modeled by the additive ($a$) and dominant ($d$) effects of the risk haplotype $[111]$. Thus, testing for association of the risk haplotype with a disease trait is the same as before: that is, testing $H_0: a = d = 0$, or by reparameterization, $H_0: \lambda_1 = \lambda_2 = \lambda_3 = \lambda$, as in the 2-SNP model. For $K = 3$, the data can be categorized into seven groups.

The log-likelihood function $l_n$ can be expressed as

$$l_n(\lambda_1, \lambda_2, \lambda_3; \phi) = \sum_{j=1}^{3} \sum_{i=1}^{n_j} \log[f_j(y_{ji} \mid \lambda_j)] + \sum_{l=4}^{7} \sum_{i=1}^{n_l} \log[\phi_l f_2(y_{li} \mid \lambda_2) + (1 - \phi_l) f_3(y_{li} \mid \lambda_3)] \tag{14}$$

where $f_j$ is defined in (3).

The likelihood function l n contains four mixtures each one of which is associated with one particular group of diplotypes. The four mixture proportions (φ l, l = 4,..., 7) are functions of haplotype frequencies (see Li et al. (2007) for details). Similar as the 2-SNP case, a two-stage estimation procedure can be applied. The first stage is to estimate the haplotype frequencies and so the four mixture proportions which are then plugged into (14) for the second stage estimation of the quantitative parameters λ j. Let Z = (Z 1,..., Z 7 ) T where Z i, i = 1,..., 7 are iid standard normal random variables. Define A = p 1 0 0 0 A 22 A 23 0 A 23 A 33 (15) where A 22 = p 2 + 7 l=4 φ2 l p l, A 33 = p 3 + 7 l=4 (1 φ l) 2 p l, and A 23 = 7 l=4 φ l(1 φ l )p l. Also define p1 0 0 0 0 0 0 B = 0 p2 0 p4 φ 4 p5 φ 5 p6 φ 6 p7 φ 7 0 0 p3 p4 (1 φ 4 ) p5 (1 φ 5 ) p6 (1 φ 6 ) p7 (1 φ 7 ) (16) where p l = lim n (n l/n), l = 1,..., 7. Theorem 3. Define Λ n (λ; ˆφ) as the LRT for testing H 0 : λ 1 = λ 2 = λ 3 = λ, where ˆφ = ( ˆφ 4, ˆφ 5, ˆφ 6, ˆφ 7 ) T. With the same regularity conditions as in Lemma 1, under H 0, Λ n (λ; ˆφ) converges in distribution to Z T MZ χ 2 2 where M = BT (A 1 J)B, J is a 3 3 matrix whose entries are all 1 s, and A, B are as in (15) and (16). Proof. The proof follows the same technique as have been shown in the 2-SNP case. Remark 1. We have investigated the limiting distribution of the LRT under the 2-SNP and 3-SNP model. The generalization to multiple SNPs (K > 3) is straightforward by modifying the A and B matrices. Thus, the asymptotic χ 2 2 distribution of the LRT is true in general and can be applied in practice for fast threshold determination considering any number of SNPs. Remark 2. The limiting distribution of the LRT is established without covariates. In genetic studies, clinical risk factors or other environmental factors can also have influence on an individual s disease risk. 
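The algebraic claims behind Theorems 1 and 3 can be checked numerically. The sketch below, which assumes NumPy is available, builds $A$ and $B$ for an arbitrary number of mixture groups and verifies that $M = B^T(A^{-1} - J)B$ is idempotent with trace 2. The function names and the numerical proportions are illustrative assumptions; the group proportions must sum to one.

```python
import numpy as np


def build_A_B(p_known, p_mix, phi):
    """Build A and B following (9)-(10) for K = 2 and (15)-(16) for K = 3.

    p_known: limiting proportions (p_1, p_2, p_3) of the phase-known groups.
    p_mix:   limiting proportions of the mixture groups (p_4, ...).
    phi:     mixture proportions (phi_4, ...), one per mixture group.
    (Hypothetical helper; the proportions must sum to one.)
    """
    p_known = np.asarray(p_known, dtype=float)
    p_mix = np.asarray(p_mix, dtype=float)
    phi = np.asarray(phi, dtype=float)

    A = np.diag(p_known)
    A[1, 1] += np.sum(phi ** 2 * p_mix)
    A[2, 2] += np.sum((1.0 - phi) ** 2 * p_mix)
    A[1, 2] = A[2, 1] = np.sum(phi * (1.0 - phi) * p_mix)

    B = np.zeros((3, 3 + len(p_mix)))
    B[:, :3] = np.diag(np.sqrt(p_known))
    B[1, 3:] = np.sqrt(p_mix) * phi
    B[2, 3:] = np.sqrt(p_mix) * (1.0 - phi)
    return A, B


def lrt_limit_matrix(A, B):
    """M = B^T (A^{-1} - J) B, with J the 3 x 3 matrix of ones."""
    J = np.ones((3, 3))
    return B.T @ (np.linalg.inv(A) - J) @ B


# K = 2: one mixture group. K = 3: four informative mixture groups.
A2, B2 = build_A_B([0.2, 0.3, 0.4], [0.1], [0.7])
A3, B3 = build_A_B([0.1, 0.2, 0.3], [0.1, 0.1, 0.1, 0.1], [0.2, 0.4, 0.6, 0.8])
M2 = lrt_limit_matrix(A2, B2)
M3 = lrt_limit_matrix(A3, B3)
for M in (M2, M3):
    print(M.shape, np.allclose(M @ M, M), float(np.trace(M)))
```

A symmetric idempotent matrix with trace 2 has rank 2, which is what gives the two degrees of freedom regardless of how many mixture groups enter $B$.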
When testing the effect of genetic variants, these covariates can be considered as nuisance parameters. The limiting distribution of the LRT derived in this study still holds with nuisance parameters. 12
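The $\chi^2_2$ limit in Theorem 3 can be checked numerically: if $M$ is symmetric and idempotent with trace 2, then $Z^T M Z$ is chi-square with 2 degrees of freedom. The sketch below builds $A$, $B$ and $M$ for arbitrary illustrative values of $p_l$ and $\phi_l$ (these numbers are assumptions, not values from the paper). It also assumes that the entries of $B$ in (16) are $\sqrt{p_l}$, which is what makes $BB^T = A$. The helper names are hypothetical.

```python
import math


def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def inverse3(A):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = A
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[v / det for v in row] for row in adj]


def limiting_matrix(p, phi):
    """Return A (15), B (16) and M = B^T (A^{-1} - J) B.

    p:   [p_1, ..., p_7], group proportions summing to 1.
    phi: [phi_4, ..., phi_7], mixture proportions.
    """
    root = [math.sqrt(v) for v in p]
    B = [
        [root[0], 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, root[1], 0.0] + [root[3 + k] * phi[k] for k in range(4)],
        [0.0, 0.0, root[2]] + [root[3 + k] * (1.0 - phi[k]) for k in range(4)],
    ]
    a22 = p[1] + sum(phi[k] ** 2 * p[3 + k] for k in range(4))
    a33 = p[2] + sum((1.0 - phi[k]) ** 2 * p[3 + k] for k in range(4))
    a23 = sum(phi[k] * (1.0 - phi[k]) * p[3 + k] for k in range(4))
    A = [[p[0], 0.0, 0.0], [0.0, a22, a23], [0.0, a23, a33]]
    A_inv = inverse3(A)
    core = [[A_inv[r][c] - 1.0 for c in range(3)] for r in range(3)]  # A^{-1} - J
    M = matmul(matmul(transpose(B), core), B)
    return A, B, M


# Illustrative (assumed) values only.
p = [0.20, 0.15, 0.15, 0.10, 0.10, 0.15, 0.15]
phi = [0.2, 0.4, 0.6, 0.8]
A, B, M = limiting_matrix(p, phi)
MM = matmul(M, M)
idempotent_gap = max(abs(MM[r][c] - M[r][c]) for r in range(7) for c in range(7))
trace_M = sum(M[r][r] for r in range(7))
print("max |M^2 - M| =", idempotent_gap)
print("trace(M) =", trace_M)
```

The idempotent gap should be zero up to rounding error and the trace should equal 2, because $1^T A 1 = \sum_l p_l = 1$ gives $JAJ = J$ and $\mathrm{tr}(M) = 3 - 1 = 2$.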

4 Simulation

To evaluate the finite-sample performance of the asymptotic chi-square distribution, we performed several simulation studies. The first simulation considers a binary disease phenotype, to which a logistic regression model is fitted. The allele frequencies, the LD among the tested SNPs, and the quantitative genetic parameters used in the simulation can be found in Cui et al. (2007). Three disease models are considered: the additive model ($d/a = 0$), the dominant model ($d/a = 1$) and the recessive model ($d/a = -1$), where $d$ and $a$ are the dominant and additive effects, respectively. Data were simulated under different sample sizes ($n = 100, 200, 500$). For each data set simulated under a given gene action mode, 1000 permutations were performed, and the whole procedure was repeated 100 times. The average over the 100 replicates was recorded for each percentile. The permutation test is distribution free but data dependent. Thus, the threshold obtained by permutation was treated as the exact threshold and compared with the threshold calculated from the chi-square distribution.

A comparison of the permutation-based and $\chi^2$ approximation-based cutoff points is shown in Table 2 for the 2-SNP model and in Table 3 for the 3-SNP model.

Table 2: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with binary disease phenotype fitted with logistic distribution assuming a 2-SNP model under different disease gene action modes.

| Percentile | $\chi^2_2$ | $d/a=0$, $n=100$ | $d/a=0$, $n=200$ | $d/a=0$, $n=500$ | $d/a=1$, $n=100$ | $d/a=1$, $n=200$ | $d/a=1$, $n=500$ | $d/a=-1$, $n=100$ | $d/a=-1$, $n=200$ | $d/a=-1$, $n=500$ |
|---|---|---|---|---|---|---|---|---|---|---|
| .80 | 3.22 | 3.37 | 3.31 | 3.23 | 3.35 | 3.31 | 3.23 | 3.34 | 3.29 | 3.23 |
| .90 | 4.61 | 4.78 | 4.73 | 4.65 | 4.81 | 4.68 | 4.59 | 4.78 | 4.69 | 4.62 |
| .95 | 5.99 | 6.24 | 6.11 | 6.05 | 6.27 | 6.12 | 5.99 | 6.22 | 6.11 | 6.01 |
| .99 | 9.21 | 9.75 | 9.46 | 9.28 | 9.49 | 9.37 | 9.29 | 9.48 | 9.26 | 9.18 |

Note: The permutation percentile is the averaged percentile out of 100 replicates.
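The comparison between the two kinds of threshold can be sketched as follows. The $\chi^2_2$ quantile has the closed form $-2\log(1 - q)$, which reproduces the $\chi^2_2$ column of the tables. The permutation part uses a deliberately simple stand-in statistic, the normal-theory LRT for equal means across three groups, and not the mixture-based LRT of this paper. The simulated data, the seed and all function names are assumptions made for illustration.

```python
import math
import random


def chi2_2df_quantile(q):
    """Quantile of the chi-square distribution with 2 df: F(x) = 1 - exp(-x/2)."""
    return -2.0 * math.log(1.0 - q)


def empirical_percentile(values, q):
    """q-th empirical quantile using a simple order-statistic rule."""
    ordered = sorted(values)
    index = int(math.ceil(q * len(ordered))) - 1
    return ordered[min(max(index, 0), len(ordered) - 1)]


def lrt_equal_means(y, groups):
    """Stand-in LRT: equal means across groups, normal errors, common variance."""
    n = len(y)
    grand = sum(y) / n
    sst = sum((v - grand) ** 2 for v in y)
    sse = 0.0
    for g in set(groups):
        members = [v for v, lab in zip(y, groups) if lab == g]
        mean_g = sum(members) / len(members)
        sse += sum((v - mean_g) ** 2 for v in members)
    return n * math.log(sst / sse)


def permutation_null(y, groups, statistic, n_perm=1000, seed=0):
    """Null distribution of `statistic` obtained by shuffling phenotypes."""
    rng = random.Random(seed)
    shuffled = list(y)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null_stats.append(statistic(shuffled, groups))
    return null_stats


# Assumed toy data: 200 individuals in three groups, no group effect.
rng = random.Random(1)
groups = [i % 3 for i in range(200)]
y = [rng.gauss(0.0, 1.0) for _ in groups]
null_stats = permutation_null(y, groups, lrt_equal_means, n_perm=1000, seed=2)

for q in (0.80, 0.90, 0.95, 0.99):
    print(q, round(chi2_2df_quantile(q), 2),
          round(empirical_percentile(null_stats, q), 2))
```

Averaging the permutation percentiles over repeated simulated data sets, as done for the tables in this section, would only require wrapping the last block in an outer loop.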
Overall, the chi-square cutoffs are consistent with the permutation cutoffs at all percentiles, which indicates good performance of the approximation. With a large sample size ($n = 500$), the cutoffs obtained by the two methods are very close. Thus, in real data analysis, the large-sample test can effectively replace time-consuming permutation tests.

Table 3: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with binary disease phenotype fitted with logistic distribution assuming a 3-SNP model under different disease gene action modes.

| Percentile | $\chi^2_2$ | $\alpha_2/\alpha_1=0$, $n=100$ | $\alpha_2/\alpha_1=0$, $n=200$ | $\alpha_2/\alpha_1=0$, $n=500$ | $\alpha_2/\alpha_1=1$, $n=100$ | $\alpha_2/\alpha_1=1$, $n=200$ | $\alpha_2/\alpha_1=1$, $n=500$ | $\alpha_2/\alpha_1=-1$, $n=100$ | $\alpha_2/\alpha_1=-1$, $n=200$ | $\alpha_2/\alpha_1=-1$, $n=500$ |
|---|---|---|---|---|---|---|---|---|---|---|
| .80 | 3.22 | 3.34 | 3.30 | 3.22 | 3.36 | 3.30 | 3.24 | 3.33 | 3.29 | 3.23 |
| .90 | 4.61 | 4.78 | 4.73 | 4.64 | 4.81 | 4.69 | 4.65 | 4.75 | 4.69 | 4.63 |
| .95 | 5.99 | 6.19 | 6.08 | 5.98 | 6.24 | 6.14 | 6.03 | 6.17 | 6.09 | 6.00 |
| .99 | 9.21 | 9.48 | 9.29 | 9.11 | 9.63 | 9.37 | 9.08 | 9.43 | 9.31 | 9.17 |

Note: The permutation percentile is the averaged percentile out of 100 replicates.

For a normally distributed phenotype, we simulated data with different heritability levels. Details of the simulation can be found in Liu et al. (2005). Again, we assume different gene action modes (additive, dominant and recessive) as described above. Results are summarized in Tables 4 and 5. In general, the two methods produce fairly consistent cutoffs, with the degree of agreement depending on heritability level and sample size. More consistent results are observed under the larger heritability level. For example, for the 2-SNP model with $n = 100$ and $d/a = 0$, the difference between the chi-square and permutation 80% thresholds is 0.42 when $H^2 = 0.1$. This difference drops to 0.10 when $H^2$ increases to 0.4 with the other conditions held fixed. As the sample size increases, the cutoffs generated by the two methods agree more closely.

Table 4: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with continuous disease phenotype fitted with normal distribution assuming a 2-SNP model under different disease gene action modes.

| Percentile | $\chi^2_2$ | $H^2$ | $d/a=0$, $n=100$ | $d/a=0$, $n=200$ | $d/a=0$, $n=500$ | $d/a=1$, $n=100$ | $d/a=1$, $n=200$ | $d/a=1$, $n=500$ | $d/a=-1$, $n=100$ | $d/a=-1$, $n=200$ | $d/a=-1$, $n=500$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| .80 | 3.22 | 0.1 | 2.80 | 3.27 | 3.25 | 3.02 | 3.27 | 3.22 | 2.94 | 3.18 | 3.24 |
| .80 | 3.22 | 0.4 | 3.12 | 3.20 | 3.22 | 2.39 | 3.26 | 3.24 | 2.96 | 3.30 | 3.23 |
| .90 | 4.61 | 0.1 | 4.21 | 4.68 | 4.64 | 4.44 | 4.67 | 4.61 | 4.37 | 4.58 | 4.62 |
| .90 | 4.61 | 0.4 | 4.57 | 4.61 | 4.61 | 3.80 | 4.66 | 4.65 | 4.40 | 4.70 | 4.62 |
| .95 | 5.99 | 0.1 | 5.64 | 6.06 | 6.00 | 5.86 | 6.07 | 6.00 | 5.76 | 6.00 | 5.99 |
| .95 | 5.99 | 0.4 | 5.96 | 6.02 | 6.01 | 5.22 | 6.07 | 6.05 | 5.87 | 6.10 | 5.99 |
| .99 | 9.21 | 0.1 | 8.99 | 9.35 | 9.14 | 9.08 | 9.33 | 9.19 | 8.98 | 9.31 | 9.10 |
| .99 | 9.21 | 0.4 | 9.20 | 9.17 | 9.16 | 8.52 | 9.35 | 9.16 | 9.15 | 9.33 | 9.23 |

Note: The permutation percentile is the averaged percentile out of 100 replicates.

Table 5: The permutation-based percentiles of the LRT statistic and the $\chi^2_2$ approximation with continuous disease phenotype fitted with normal distribution assuming a 3-SNP model under different disease gene action modes.

| Percentile | $\chi^2_2$ | $H^2$ | $d/a=0$, $n=100$ | $d/a=0$, $n=200$ | $d/a=0$, $n=500$ | $d/a=1$, $n=100$ | $d/a=1$, $n=200$ | $d/a=1$, $n=500$ | $d/a=-1$, $n=100$ | $d/a=-1$, $n=200$ | $d/a=-1$, $n=500$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| .80 | 3.22 | 0.1 | 3.32 | 3.27 | 3.25 | 3.33 | 3.27 | 3.24 | 3.36 | 3.26 | 3.23 |
| .80 | 3.22 | 0.4 | 3.33 | 3.26 | 3.22 | 3.31 | 3.27 | 3.23 | 3.35 | 3.27 | 3.23 |
| .90 | 4.61 | 0.1 | 4.76 | 4.67 | 4.63 | 4.76 | 4.68 | 4.64 | 4.81 | 4.68 | 4.61 |
| .90 | 4.61 | 0.4 | 4.75 | 4.66 | 4.61 | 4.74 | 4.67 | 4.63 | 4.81 | 4.70 | 4.63 |
| .95 | 5.99 | 0.1 | 6.18 | 6.07 | 6.02 | 6.19 | 6.13 | 6.02 | 6.26 | 6.08 | 5.98 |
| .95 | 5.99 | 0.4 | 6.15 | 6.04 | 6.00 | 6.18 | 6.08 | 5.99 | 6.24 | 6.10 | 6.01 |
| .99 | 9.21 | 0.1 | 9.33 | 9.31 | 9.13 | 9.38 | 9.36 | 9.05 | 9.46 | 9.31 | 9.12 |
| .99 | 9.21 | 0.4 | 9.39 | 9.23 | 9.21 | 9.52 | 9.28 | 9.17 | 9.52 | 9.35 | 9.16 |

Note: The permutation percentile is the averaged percentile out of 100 replicates.

The empirical type I error rate of the large-sample test was also investigated at the nominal 5% level. Figure 1 shows the performance of the chi-square approximation under different sample sizes and different disease trait distributions. Overall, the type I error rate is reasonably controlled for models fitted with different numbers of SNPs.

5 Conclusion

Statistical dissection of the association between genetic factors and disease phenotypes has been a long-term effort in gene mapping studies. With the development of the human HapMap project, massive amounts of high-throughput SNP data have been generated. The density of SNP data is still increasing with advances in genotyping technology. The development of computationally efficient and statistically powerful analytical methods is critically important for unravelling causal disease variants. The methods developed by Liu et al. (2005) and Cui et al. (2007), collectively termed nucleotide mapping, as well as various extensions and applications of these methods (e.g. Hou et al., 2007; Li et al., 2007; Wu et al., 2007; Pinedo et al., 2008), provide timely efforts in elucidating the genetic architecture of nucleotide patterns associated with a complex disease trait.
However, given the increasing number of SNPs documented in public databases, the computational burden of assessing the statistical significance of the LRT in nucleotide mapping with permutation tests is substantial. In this article, we investigated the limiting distribution of the LRT and showed that the LRT is asymptotically chi-square with 2 degrees of freedom under the null hypothesis of no disease-gene association. We evaluated the performance of the chi-square approximation and compared it with the non-parametric permutation tests. The results indicate that the chi-square approximation performs well with moderate sample sizes, and hence can be applied in real data analysis for fast threshold determination.

Figure 1: Type I error rate with chi-square approximation for disease phenotype simulated assuming logistic (L) and normal (N) distribution under the 2-SNP and 3-SNP models. (Horizontal axis: sample size, 100 to 500; vertical axis: type I error, 0.045 to 0.06; series: L 2SNP, L 3SNP, N 2SNP, N 3SNP.)

Acknowledgement

The work of the first author was supported in part by NSF grant DMS-0707031.

References

Akey, J., Jin, L., Xiong, M., 2001. Haplotypes vs. single marker linkage disequilibrium tests: what do we gain? Eur. J. Hum. Genet. 9, 291-300.

Clark, A.G., 2004. The role of haplotypes in candidate gene studies. Genet. Epidemiol. 27, 321-333.

Cui, Y.H., Fu, W., Sun, K.L., Romero, R., Wu, R., 2007. Mapping nucleotide sequences that encode complex binary disease traits with HapMap. Current Genomics 8, 307-322.

Hou, W., Yap, J.S., Wu, S., Liu, T., Cheverud, J.M., Wu, R., 2007. Haplotyping a quantitative trait with a high-density map in experimental crosses. PLoS ONE 2(1): e732.

Lehmann, E.L., 1991. Theory of point estimation. Chapman and Hall, New York.

Li, H., Kim, B.R., Wu, R., 2006. Identification of quantitative trait nucleotides that regulate cancer growth: a simulation approach. J. Theor. Biol. 242, 426-439.

Lou, X-Y., Casella, G., Littell, R.C., Yang, M.C.K., Wu, R., 2003. A haplotype-based algorithm for multilocus linkage disequilibrium mapping of quantitative trait loci with epistasis in natural populations. Genetics 163, 1533-1548.

Lynch, M., Walsh, B., 1998. Genetics and Analysis of Quantitative Traits. Sinauer, Sunderland, MA.

McCullagh, P., Nelder, J., 1989. Generalized Linear Models. Chapman and Hall, London.

Olson, J.M., Wijsman, E.M., 1994. Design and sample size considerations in the detection of linkage disequilibrium with a marker locus. Am. J. Hum. Genet. 55, 574-580.

Pinedo, P., Wang, C., Li, Y., Rae, D., Wu, R., 2008. Risk haplotype analysis for bovine paratuberculosis. Mamm. Genome (in press).

Schaid, D.J., 2004. Evaluating associations of haplotypes with traits. Genet. Epidemiol. 27, 348-364.

Schaid, D.J., Rowland, C.M., Tines, D.E., Jacobson, R.M., Poland, G.A., 2002. Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am. J. Hum. Genet. 70, 425-434.

van der Vaart, A.W., 1998. Asymptotic statistics. Cambridge University Press.

Wu, S., Yang, J., Wang, C., Wu, R., 2007. A general quantitative genetic model for haplotyping a complex trait in humans. Current Genomics 8, 343-350.