MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Saurabh Ghosh, Human Genetics Unit, Indian Statistical Institute, Kolkata

Most common diseases are caused by a combination of genes and environment: stroke, breast cancer, diabetes, obesity, inflammatory bowel disease, manic depression, myocardial infarction, hypertension, high cholesterol, schizophrenia.

Where are those genes?

QUANTITATIVE TRAITS AND COMPLEX TRAITS
Complex traits are controlled by multiple loci, some with minor gene effects, and genetic variation at any one locus does not completely determine the trait. Examples: diabetes, coronary artery disease, schizophrenia. Such traits are usually binary in nature.
Heritable quantitative characters, possibly correlated, are generally precursors of complex traits. Example: blood pressure and total cholesterol are precursors of coronary artery disease.

WHY STUDY QUANTITATIVE TRAITS?
To study a complex trait. Genetic studies of a binary trait (e.g., hypertension) often dichotomize a correlated quantitative variable (e.g., systolic blood pressure, s.b.p.) at some pre-determined threshold (e.g., s.b.p. > 140 mmHg). This discards information on trait variability and reduces statistical power.
For genetic analysis of a binary complex trait, it may be more powerful to analyze a quantitative trait correlated with the end-point trait. Genes found to control the quantitative trait may also be involved in the pathogenesis of the complex trait.

THE TWO PARADIGMS OF GENE-MAPPING
1. LINKAGE, measured in terms of the RECOMBINATION FRACTION (θ), the probability of recombination (an odd number of crossovers) between two loci. [Figure: alleles A/a and B/b at two loci.] From biological considerations, 0 ≤ θ ≤ 0.5; θ = 0.5 implies that the two loci are unlinked, θ = 0 implies complete linkage.
2. ASSOCIATION, measured in terms of LINKAGE DISEQUILIBRIUM (δ), the deviation of the probability of joint occurrence of alleles at two loci from that expected if the alleles occurred independently: δ = P(AB) − P(A)P(B).
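The definition δ = P(AB) − P(A)P(B) can be illustrated with a minimal Python sketch. It assumes that phased two-locus haplotype counts are available; the function name `ld_coefficient` and the counts are hypothetical and are not taken from the talk.

```python
def ld_coefficient(hap_counts):
    """Return (delta, P(A), P(B)) from counts of the haplotypes AB, Ab, aB, ab."""
    total = sum(hap_counts.values())
    p_ab = hap_counts["AB"] / total                       # P(AB)
    p_a = (hap_counts["AB"] + hap_counts["Ab"]) / total   # P(A)
    p_b = (hap_counts["AB"] + hap_counts["aB"]) / total   # P(B)
    return p_ab - p_a * p_b, p_a, p_b


# Hypothetical haplotype counts, for illustration only.
counts = {"AB": 40, "Ab": 10, "aB": 10, "ab": 40}
delta, p_a, p_b = ld_coefficient(counts)
print(f"P(A) = {p_a}, P(B) = {p_b}, delta = {delta:.3f}")

# Equal counts of all four haplotypes give delta = 0 (no association).
delta_null, _, _ = ld_coefficient({"AB": 25, "Ab": 25, "aB": 25, "ab": 25})
```

A value of δ near zero indicates that alleles at the two loci occur together about as often as expected by chance.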

LINKAGE ANALYSIS OF QUANTITATIVE TRAITS
AIM: to make inferences about the recombination fraction between a hypothetical QTL and a marker locus. Is the QTL physically close to the marker locus?
For a sib-pair, the QTL paradigm is that if the quantitative trait values of the sibs are close to each other, the sibs have similar allelic composition at the QTL. If their trait values differ significantly, their allelic compositions at the QTL are different.
If the marker locus is LINKED (physically close) to the QTL, this property will also be exhibited by the marker locus. If the marker locus is UNLINKED, there will NOT be any relation between the trait values and the allelic compositions at the marker.

THE BASIC QTL MODEL
A quantitative trait Y is controlled by an autosomal, biallelic QTL with alleles (A, a).
E(Y | AA) = α, Var(Y | AA) = σ²
E(Y | Aa) = β, Var(Y | Aa) = σ²
E(Y | aa) = −α, Var(Y | aa) = σ²
If β = 0, we say that Y has no dominance effect at the QTL.
π_t is the i.b.d. (identity-by-descent) score at the QTL (unobservable).
π_m is the i.b.d. score at the marker (it can sometimes be computed exactly, but often has to be estimated, as P_m).
AIM: to find a relationship between suitable functions of Y and P_m for relative-pairs.

POPULAR QTL MAPPING METHODS
1. VARIANCE COMPONENTS METHODS (Amos et al. 1990, Almasy and Blangero 1998)
Random effects model: Y = Σ_i G_i + E, i = 1, 2, …, k, where G_i is the genotypic effect of the i-th QTL, assumed to be normally distributed, and E is the environmental effect, also normally distributed.
Data: quantitative traits and marker genotypes of pedigrees. Quantitative traits within a pedigree are assumed to follow a MULTIVARIATE NORMAL DISTRIBUTION.
The test for linkage at the i-th QTL is a likelihood-ratio test of Var(G_i) = 0 versus Var(G_i) > 0.

2. HASEMAN-ELSTON APPROACH AND EXTENSIONS (Haseman and Elston 1972; extensions: Amos et al. 1989, Wijsman and Olson 1993, Olson 1995, Tiwari and Elston 1997, Elston et al. 2000)
Data: quantitative trait values and marker genotypes of independent sib-pairs (and preferably marker genotypes of both parents).
The regression equation: E(Z | P_m) = a + b P_m, where Z is the squared sib-pair difference in trait values.
When β = 0 (no dominance at the QTL), b = 0 ⇔ θ = 0.5 (unlinked) and b < 0 ⇔ θ < 0.5 (linked).
Testing b = 0 versus b < 0: distribution-free approach.
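As a minimal sketch of the regression step, the following Python code fits E(Z | P_m) = a + b P_m by ordinary least squares and reports the slope together with its t-ratio. The function name `haseman_elston` and the toy data are hypothetical. The t-ratio is shown only as a descriptive summary; it is an assumption of this sketch, not the distribution-free test mentioned on the slide.

```python
import math


def haseman_elston(z, p_m):
    """Least-squares fit of Z = a + b * P_m + error; returns (a, b, t-ratio of b)."""
    n = len(z)
    mean_x = sum(p_m) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in p_m)
    sxz = sum((x - mean_x) * (v - mean_z) for x, v in zip(p_m, z))
    b = sxz / sxx
    a = mean_z - b * mean_x
    rss = sum((v - a - b * x) ** 2 for x, v in zip(p_m, z))
    se_b = math.sqrt(rss / (n - 2) / sxx)  # usual OLS standard error of the slope
    return a, b, b / se_b


# Hypothetical data: estimated proportion of alleles shared i.b.d. at the marker
# and squared trait differences for eight sib-pairs.
p_m = [0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0]
z = [4.1, 3.7, 2.2, 1.9, 2.4, 1.8, 0.3, 0.5]
a, b, t = haseman_elston(z, p_m)
print(f"a = {a:.4f}, b = {b:.4f}, t = {t:.2f}")
```

A clearly negative slope, as in this toy example, is the pattern expected when sibs sharing more marker alleles i.b.d. have more similar trait values.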

A NON-PARAMETRIC ALTERNATIVE (Ghosh and Majumder 2000, Ghosh et al. 2003)
Non-parametric regression replaces the Haseman-Elston linear regression: Z = Ψ(P_m) + e, where Ψ is a real-valued function.
Ψ is estimated using kernel smoothing: Ψ(x_j) = [Σ_i κ((x_i − x_j)/h) y_i] / [Σ_i κ((x_i − x_j)/h)], where (x_i, y_i) are the observed values of (P_m, Z) for the i-th sib-pair.
The kernel function used: κ(x) = 0.75 (1 − x²) I(|x| < 1).
The optimal bandwidth h is selected by leave-one-out cross-validation.
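The kernel estimate and the bandwidth choice can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the candidate bandwidth grid and the toy data are assumptions, and the cross-validation criterion used here is the mean squared leave-one-out prediction error.

```python
def epanechnikov(u):
    """kappa(u) = 0.75 * (1 - u^2) for |u| < 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0


def kernel_smooth(x0, xs, ys, h, skip=None):
    """Kernel-weighted average of ys at x0; `skip` omits one index (used for CV)."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = epanechnikov((x - x0) / h)
        num += w * y
        den += w
    return num / den if den > 0 else None  # None: no data within the window


def loo_cv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errors = []
    for j, (x, y) in enumerate(zip(xs, ys)):
        fit = kernel_smooth(x, xs, ys, h, skip=j)
        if fit is None:
            return float("inf")  # bandwidth too small to predict this point
        errors.append((y - fit) ** 2)
    return sum(errors) / len(errors)


def choose_bandwidth(xs, ys, grid):
    """Pick the bandwidth in `grid` with the smallest leave-one-out error."""
    return min(grid, key=lambda h: loo_cv_score(xs, ys, h))


# Hypothetical i.b.d. estimates (xs) and squared trait differences (ys).
xs = [0.0, 0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0]
ys = [4.0, 3.6, 3.1, 2.2, 1.8, 1.2, 0.4, 0.2]
grid = [0.3, 0.6, 1.2]
best_h = choose_bandwidth(xs, ys, grid)
print("selected h:", best_h, "fit at 0.5:", kernel_smooth(0.5, xs, ys, best_h))
```

Because many sib-pairs share the same i.b.d. estimate, the leave-one-out step skips a single observation by index rather than by value.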

The test statistic: Δ = 1 − (residual sum of squares in the regression / total variation in Z), analogous to R² in linear regression.
The test for linkage is equivalent to testing Δ = 0 versus Δ > 0. The asymptotic distribution of Δ is difficult to obtain, so permutations are used to obtain the empirical p-value of the observed Δ.
CAVEAT: Δ does not take into account the direction of the relationship between Z and P_m. Under the null, a chance positive relationship will inflate the false-positive rate. Therefore, whenever Δ is significant, additionally test for correlation between Z and P_m.
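A permutation test for Δ might be sketched as follows. The sketch assumes a fixed bandwidth h, a user-chosen number of permutations and hypothetical toy data; all function names are illustrative. The sign of the sample correlation between Z and P_m is reported alongside Δ, in line with the caveat about direction.

```python
import random


def smooth_fit(xs, ys, h):
    """Kernel-smoothed fitted values using the Epanechnikov kernel."""
    fitted = []
    for x0 in xs:
        num = den = 0.0
        for x, y in zip(xs, ys):
            u = (x - x0) / h
            if abs(u) < 1.0:
                w = 0.75 * (1.0 - u * u)
                num += w * y
                den += w
        fitted.append(num / den)  # den > 0 because each point weights itself
    return fitted


def delta_statistic(xs, ys, h):
    """Delta = 1 - RSS / TSS for the kernel regression of ys on xs."""
    mean_y = sum(ys) / len(ys)
    tss = sum((y - mean_y) ** 2 for y in ys)
    rss = sum((y - f) ** 2 for y, f in zip(ys, smooth_fit(xs, ys, h)))
    return 1.0 - rss / tss


def correlation(xs, ys):
    """Pearson correlation, used only to check the direction of the relationship."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(xs, ys, h, n_perm=199, seed=0):
    """Empirical p-value of Delta from random re-pairings of ys with xs."""
    rng = random.Random(seed)
    observed = delta_statistic(xs, ys, h)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if delta_statistic(xs, shuffled, h) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: i.b.d. estimates and squared trait differences of 12 sib-pairs.
xs = [0.0] * 4 + [0.5] * 4 + [1.0] * 4
ys = [4.0, 4.2, 3.8, 4.1, 2.0, 2.2, 1.9, 2.1, 0.1, 0.3, 0.2, 0.0]
observed, p_value = permutation_p_value(xs, ys, h=0.4)
r = correlation(xs, ys)
print(f"Delta = {observed:.4f}, permutation p = {p_value:.3f}, correlation = {r:.3f}")
```

Linkage would be supported only when Δ is significant and the correlation is negative.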

POWER (200 sib-pairs, α = 3, σ = 1, p = 0.5, ρ = 0.5, θ = 0.01)
HE (sq diff), HE (cr pr): Haseman-Elston regression on squared differences and on cross products; KS: kernel smoothing.

Model     β    HE (sq diff)   HE (cr pr)   KS
Normal    0    0.899          0.912        0.878
          1    0.825          0.838        0.852
          2    0.685          0.702        0.769
Chi-sq    0    0.890          0.902        0.876
          1    0.817          0.823        0.849
          2    0.651          0.657        0.766
Poisson   0    0.872          0.881        0.875
          1    0.805          0.802        0.850
          2    0.613          0.604        0.764

EXTENSION OF HE TO SIBSHIP DATA (Ghosh and Reich 2002)
Data: quantitative traits Y = (y_1, y_2, …) on sibships of size ≥ 2.
Contrast function: Σ_i c_i y_i (= c′Y), where Σ_i c_i (= c′1) = 0.
Quadratic function of marker i.b.d. scores: c′Π_m c, where Π_m is the matrix of i.b.d. scores whose (i, j)-th element is the i.b.d. score of the i-th and j-th sibs (i ≠ j) and whose diagonal elements are 0.
NOTE: the i.b.d. scores of all possible sib-pairs in a sibship are not independent. Given a sibship of size n, there exists an independent set of (n − 1) pairwise i.b.d. scores. For example, {π_12, π_13, …, π_1n} is an independent set of i.b.d. scores. Given this set, all other pairwise i.b.d. scores are dependent.

Why choose the contrast function?
1. The classical HE can be viewed as a special case of a contrast function for n = 2 with c = (1, −1).
2. The contrast function corresponds to the second principal component of the quantitative trait values.
The corresponding regression equation: E{(c′Y)² | c′Π_m c} = a + b c′Π_m c, where, if there is no dominance at the QTL, b = 0 ⇔ θ = 0.5 and b > 0 ⇔ θ < 0.5.
Choice of c: (1, −1/(n − 1), …, −1/(n − 1)). This is consistent with HE: a higher coefficient is associated with the independent set of i.b.d. scores {π_12, π_13, …, π_1n} and a lower coefficient with the other pairs.
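The two quantities entering this regression can be computed directly. The Python sketch below uses hypothetical function names and toy values. It also checks the n = 2 special case, where (c′Y)² reduces to (y_1 − y_2)² and c′Π_m c reduces to −2π_12, which is why the negative Haseman-Elston slope becomes a positive slope here.

```python
def contrast_vector(n):
    """c = (1, -1/(n-1), ..., -1/(n-1)); the entries sum to zero."""
    return [1.0] + [-1.0 / (n - 1)] * (n - 1)


def squared_contrast(c, y):
    """(c'Y)^2 for the trait values y of one sibship."""
    return sum(ci * yi for ci, yi in zip(c, y)) ** 2


def ibd_quadratic_form(c, pi):
    """c' Pi_m c, where pi[i][j] is the i.b.d. score of sibs i and j (diagonal 0)."""
    n = len(c)
    return sum(c[i] * pi[i][j] * c[j] for i in range(n) for j in range(n))


# Special case n = 2: the classical Haseman-Elston quantities.
c2 = contrast_vector(2)                      # [1.0, -1.0]
z2 = squared_contrast(c2, [3.0, 1.0])        # (3 - 1)^2
q2 = ibd_quadratic_form(c2, [[0.0, 0.5],
                             [0.5, 0.0]])    # -2 * pi_12

# Hypothetical sibship of size 3 with i.b.d. proportions for each pair.
c3 = contrast_vector(3)
pi3 = [[0.0, 0.5, 1.0],
       [0.5, 0.0, 0.5],
       [1.0, 0.5, 0.0]]
z3 = squared_contrast(c3, [2.0, 1.0, 2.4])
q3 = ibd_quadratic_form(c3, pi3)
print(z2, q2, round(z3, 4), round(q3, 4))
```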

COMBINING CONTRAST AND MEAN FUNCTIONS
The mean of the quantitative trait values corresponds to the first PC.
E(ȳ² | 1′Π_m 1) = a + b (1′Π_m 1)/n², where b is the same as in the contrast function equation.
The mean function and the contrast function are uncorrelated.
Combined least squares minimization: Σ_j {u_j − β_0 − β_1 x_j}² + Σ_j {v_j − β_2 − β_1 z_j}², where u_j, v_j are the squared contrast and mean functions and x_j, z_j are the corresponding i.b.d. regressors.
The test for linkage is based on the least squares estimate of β_1.
Alternative: combined non-parametric regressions of u_j on x_j and v_j on z_j using kernel smoothing.
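Minimizing the combined criterion over the two intercepts and the common slope has a closed form: β_1 = (S_xu + S_zv) / (S_xx + S_zz), with β_0 and β_2 obtained from the two sets of sample means, where S denotes centred sums of squares and cross products. A minimal Python sketch, with hypothetical function names and noise-free toy data chosen so that the true values are recovered exactly:

```python
def centred_sums(x, y):
    """Return (mean of x, mean of y, S_xx, S_xy) using centred sums."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, sxy


def combined_least_squares(u, x, v, z):
    """Minimize sum (u - b0 - b1*x)^2 + sum (v - b2 - b1*z)^2 over (b0, b1, b2)."""
    mx, mu, sxx, sxu = centred_sums(x, u)
    mz, mv, szz, szv = centred_sums(z, v)
    beta1 = (sxu + szv) / (sxx + szz)   # common slope pooled over both regressions
    beta0 = mu - beta1 * mx             # intercept of the contrast regression
    beta2 = mv - beta1 * mz             # intercept of the mean regression
    return beta0, beta1, beta2


# Hypothetical noise-free data with u = 1 + 2x and v = 5 + 2z.
x = [-2.0, -1.0, -0.5, 0.0]
z = [0.1, 0.2, 0.4, 0.5]
u = [1.0 + 2.0 * xi for xi in x]
v = [5.0 + 2.0 * zi for zi in z]
beta0, beta1, beta2 = combined_least_squares(u, x, v, z)
print(beta0, beta1, beta2)
```

Pooling the two centred cross products is what allows the mean function to contribute information about the same slope as the contrast function.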

POWER (200 sibships with 4 sibs each, α = 3, σ = 1, p = 0.5, ρ = 0.5, θ = 0.01)

Model     β    KS      HE (cr pr)   CM
Normal    0    0.912   0.918        0.931
          1    0.875   0.860        0.868
          2    0.798   0.742        0.749
Chi-sq    0    0.903   0.909        0.915
          1    0.856   0.832        0.836
          2    0.754   0.717        0.711
Poisson   0    0.880   0.887        0.893
          1    0.826   0.811        0.813
          2    0.735   0.661        0.651

MULTIVARIATE PHENOTYPES
Complex traits are characterized by correlated quantitative variables constituting a multivariate phenotype. The end-point trait is usually binary in nature.
Since quantitative traits carry more information on variation within genotypes, it may be more prudent to use a vector of correlated phenotypes for linkage analysis of the complex trait.
Analyzing the individual components of a multivariate phenotype vector separately leads to the statistical problem of multiple comparisons.
Analyzing a genetically relevant phenotype is more powerful than a blind multivariate analysis with all the components (Ott and Rabinowitz 1999).

Existing Methods For Mapping Multivariate Phenotypes
Variance Components Methods (e.g., Amos et al. 1990): require modeling of the covariance structure of the components of the multivariate phenotype vector. Multivariate normality is generally assumed, which raises problems of robustness and makes the assumptions difficult to verify in high dimensions.
Data Reduction Techniques (e.g., Elston et al. 2000): principal components analysis, in which a linear combination of the components is defined as the new phenotype. Model-free methods (e.g., Haseman-Elston) are possible, but the linkage results are difficult to interpret.

Quantitative and Qualitative Phenotypes in a Multivariate Phenotype Vector
Example: EEG (quantitative), ERP (quantitative) and depression (qualitative) may comprise a multivariate phenotype vector for studying the genetic risk of alcoholism.
Variance components methods (assuming multivariate normality) are incompatible even if only one of the components of the vector is binary/qualitative.
Principal components obtained by decomposing a correlation matrix comprising both quantitative and qualitative phenotypes are unreliable.

The Proposed Method
Constructing the multivariate phenotype: (Y_1, Y_2, …, Y_k) is a vector of quantitative traits and Z is the binary end-point trait. A logistic regression of Z on (Y_1, Y_2, …, Y_k) identifies those traits significantly correlated with Z. The multivariate phenotype comprises only these traits.
The regression: G_1, G_2, …, G_m are genotypes at m marker loci. Most linkage methods model the dependence of (Y_1, Y_2, …, Y_k) on G_1, G_2, …, G_m. Since the response variable is some function of the Y_i's, inferences are sensitive to the distribution of the Y_i's.

Alternative formulation: reverse the response and explanatory variables (e.g., the single-trait version of Sham et al. 2002) in a linear regression set-up.
The regression model: Π_i = Σ_j β_j d_ij + e_i, where Π_i is the estimated i.b.d. score at a marker locus for the i-th sib-pair, d_ij is the squared difference of trait values for the i-th sib-pair and the j-th trait component of the multivariate phenotype vector, and e_i is a random component.
Linkage for the complex trait can be tested in terms of the β_j: H_0: β_j = 0 for all j versus H_1: β_j < 0.
The test is performed using a log-likelihood-ratio statistic. Since H_1 is constrained, the asymptotic distribution is difficult to obtain and is determined empirically using permutation principles.
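The reverse regression and its permutation test can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: an intercept is added to the model, the fit is unconstrained ordinary least squares (the one-sided constraint β_j < 0 is not enforced), the statistic is the Gaussian log-likelihood-ratio n·log(RSS_0 / RSS_1), and the data are simulated. All function names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def reverse_regression(pi, d):
    """Regress i.b.d. scores pi[i] on squared trait differences d[i][j].

    Returns (coefficients with the intercept first, likelihood-ratio statistic).
    """
    rows = [[1.0] + list(di) for di in d]
    p = len(rows[0])
    xtx = [[sum(r[s] * r[t] for r in rows) for t in range(p)] for s in range(p)]
    xty = [sum(r[s] * y for r, y in zip(rows, pi)) for s in range(p)]
    beta = solve(xtx, xty)
    rss1 = sum((y - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, y in zip(rows, pi))
    mean = sum(pi) / len(pi)
    rss0 = sum((y - mean) ** 2 for y in pi)  # intercept-only (null) model
    return beta, len(pi) * math.log(rss0 / rss1)


def permutation_p(pi, d, n_perm=499, seed=1):
    """Empirical p-value: permute i.b.d. scores across sib-pairs."""
    rng = random.Random(seed)
    observed = reverse_regression(pi, d)[1]
    shuffled = list(pi)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if reverse_regression(shuffled, d)[1] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated toy data: 60 sib-pairs, two trait components, true slopes -0.10 and -0.05.
rng = random.Random(7)
d = [[rng.uniform(0, 4), rng.uniform(0, 4)] for _ in range(60)]
pi = [min(1.0, max(0.0, 0.8 - 0.10 * d1 - 0.05 * d2 + rng.gauss(0, 0.1)))
      for d1, d2 in d]
beta, lr = reverse_regression(pi, d)
p_value = permutation_p(pi, d)
print("slopes:", [round(b, 3) for b in beta[1:]], "LR:", round(lr, 2), "p:", p_value)
```

Permuting the i.b.d. scores keeps the correlation structure among the trait components intact, which is why no model for their joint distribution is needed.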

Advantages:
Does not require modeling of the covariance structure of the component phenotypes.
Robust with respect to violations of distributional assumptions.
Linkage results are easier to interpret than analyses based on phenotypes defined by data reduction techniques.
Can incorporate both quantitative and qualitative traits.
Using i.b.d. scores as the response variable allows for (n − 1) independent observations for a sibship of size n (not true if functions of quantitative traits are the response variables).

SIMULATIONS
Example 1: (X, Y, Z) ~ N_3(μ, Σ).
Example 2: (X, Y, Z) such that (X, Y) ~ N_2(μ, Σ) and Z is chi-square distributed and correlated with (X, Y).
Example 3: (X, Y, Z) such that X is normal, Y is chi-square and Z is Poisson, pairwise correlated.
U is a binary trait defined by Φ_3(X, Y, Z) standardized. In each case, there is a common QTL.
Example 4: (X, Y, Z) ~ N_3(μ, Σ) such that (X, Y) is controlled by a common QTL and Z is not. U is defined as Φ_2(X, Y). Logistic regression gave significance correctly.
200 sibships (60 of size 2, 80 of size 3, 60 of size 4), major gene effect for each trait = 3, allele frequencies at the trait locus = (0.7, 0.3). Power at θ = 0.01 based on 1000 replications.

Model         β (dom)   HE (PC)   RR (PC)   RR (Mult)   HE (Bin)   RR (Bin)
1             0         0.802     0.854     0.899       0.721      0.735
              1.5       0.751     0.778     0.815       0.643      0.670
2             0         0.766     0.817     0.845       0.718      0.723
              1.5       0.714     0.751     0.779       0.626      0.621
3             0         0.683     0.702     0.770       0.608      0.628
              1.5       0.624     0.665     0.713       0.577      0.582
4 (all 3)     0         0.695     0.699     0.731       0.438      0.457
              1.5       0.617     0.632     0.690       0.375      0.361
4 (first 2)   0         0.754     0.784     0.816       0.609      0.624
              1.5       0.700     0.728     0.776       0.543      0.558

COLLABORATIVE STUDY ON THE GENETICS OF ALCOHOLISM (COGA)
Multicenter project to detect and map susceptibility genes for alcohol dependence (SUNY, Wash U, IUPUI, U Iowa, UCONN, Howard U, Rutgers U, SFBR).
262 families, each with a COGA proband and at least three full sibs and both parents, or larger sibships.
405 markers with an average inter-marker distance of 10.9 cM.
Qualitative and ordinal traits: COGA (DSM-III-R + Feighner), DSM-IV, ICD-10.

LINKAGE ANALYSES ON COGA DATA
No promising linkage findings with binary clinical phenotypes such as COGA, DSM-IV and ICD-10. The alternative strategy was to analyze quantitative precursors.
Quantitative traits:
Maximum number of drinks in a 24-hour period (a measure to grade non-alcoholic individuals)
Number of externalizing symptoms (symptom count associated with alcoholism)
Electroencephalogram (EEG) waves (data collected at 31 bipolar electrode positions)

MULTIVARIATE PHENOTYPES IN COGA
Univariate studies have provided linkage evidence on Chromosome 4 (near the ADH gene cluster) for:
Maximum number of drinks in a 24-hour period [Saccone et al. 2000]
Beta 2 EEG [Porjesz et al. 2002, Ghosh et al. 2003]
Number of externalizing symptoms [Ghosh et al. 2008]
Binary trait: COGA diagnosis.
Logistic regression: all 3 quantitative traits significantly correlated with the binary trait (p-value < 0.001).
Scan on Chromosome 4 using reverse regression: peak at 118 cM (p-value < 0.0001).

CLOSING REMARKS
A single quantitative phenotype correlated with a binary end-point trait may not be a sufficiently good surrogate for the end-point trait. A multivariate phenotype comprising both quantitative and binary phenotypes may be a better choice.
Using i.b.d. scores as the response variable and the components of a multivariate phenotype as explanatory variables in a regression set-up has both analytical and interpretational advantages.
No single gene-finding method is sufficient or complete. Multiple roads should lead to Rome.

Our Human Genetics Unit Team