Introduction to QTL mapping in model organisms

Similar documents
Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms

Gene mapping in model organisms

Statistical issues in QTL mapping in mice

Mapping multiple QTL in experimental crosses

Multiple QTL mapping

Use of hidden Markov models for QTL mapping

Mapping multiple QTL in experimental crosses

Prediction of the Confidence Interval of Quantitative Trait Loci Location

R/qtl workshop. (part 2) Karl Broman. Biostatistics and Medical Informatics University of Wisconsin Madison. kbroman.org

Methods for QTL analysis

Mapping QTL to a phylogenetic tree

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations,

The genomes of recombinant inbred lines

Lecture 9. QTL Mapping 2: Outbred Populations

Overview. Background

Causal Model Selection Hypothesis Tests in Systems Genetics

QTL Model Search. Brian S. Yandell, UW-Madison January 2017

Binary trait mapping in experimental crosses with selective genotyping

Linkage Mapping. Reading: Mather K (1951) The measurement of linkage in heredity. 2nd Ed. John Wiley and Sons, New York. Chapters 5 and 6.

Mixture Cure Model with an Application to Interval Mapping of Quantitative Trait Loci

Lecture 8. QTL Mapping 1: Overview and Using Inbred Lines

A Statistical Framework for Expression Trait Loci (ETL) Mapping. Meng Chen

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

Causal Graphical Models in Systems Genetics

Lecture 11: Multiple trait models for QTL analysis

On the mapping of quantitative trait loci at marker and non-marker locations

Lecture 6. QTL Mapping

QTL Mapping I: Overview and using Inbred Lines

BAYESIAN MAPPING OF MULTIPLE QUANTITATIVE TRAIT LOCI

Inferring Genetic Architecture of Complex Biological Processes

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

An applied statistician does probability:

QTL mapping of Poisson traits: a simulation study

. Also, in this case, p i = N1 ) T, (2) where. I γ C N(N 2 2 F + N1 2 Q)

Fast Bayesian Methods for Genetic Mapping Applicable for High-Throughput Datasets

Variance Component Models for Quantitative Traits. Biostatistics 666

Goodness of Fit Goodness of fit - 2 classes

The Admixture Model in Linkage Analysis

Affected Sibling Pairs. Biostatistics 666

theta H H H H H H H H H H H K K K K K K K K K K centimorgans

Chapter 1 Statistical Inference

One-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation

Multiple interval mapping for ordinal traits

Quantile based Permutation Thresholds for QTL Hotspots. Brian S Yandell and Elias Chaibub Neto 17 March 2012

Module 4: Bayesian Methods Lecture 9 A: Default prior selection. Outline

A new simple method for improving QTL mapping under selective genotyping

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Linkage analysis and QTL mapping in autotetraploid species. Christine Hackett Biomathematics and Statistics Scotland Dundee DD2 5DA

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

1 Springer. Nan M. Laird Christoph Lange. The Fundamentals of Modern Statistical Genetics

Lecture 2: Linear and Mixed Models

QTL model selection: key players

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Association studies and regression

BTRY 4830/6830: Quantitative Genomics and Genetics

MULTIPLE-TRAIT MULTIPLE-INTERVAL MAPPING OF QUANTITATIVE-TRAIT LOCI ROBY JOEHANES

Calculation of IBD probabilities

Locating multiple interacting quantitative trait. loci using rank-based model selection

Lecture WS Evolutionary Genetics Part I 1

Quantile-based permutation thresholds for QTL hotspot analysis: a tutorial

Calculation of IBD probabilities

F1 Parent Cell R R. Name Period. Concept 15.1 Mendelian inheritance has its physical basis in the behavior of chromosomes

On Computation of P-values in Parametric Linkage Analysis

Lecture 9. Short-Term Selection Response: Breeder s equation. Bruce Walsh lecture notes Synbreed course version 3 July 2013

Detection of multiple QTL with epistatic effects under a mixed inheritance model in an outbred population

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Bayesian Regression (1/31/13)

Model Selection for Multiple QTL

Generalized Linear Models Introduction

Linear Regression (1/1/17)

Linkage and Linkage Disequilibrium

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

THE problem of identifying the genetic factors un- QTL models because of their ability to separate linked

Selecting explanatory variables with the modified version of Bayesian Information Criterion

arxiv: v4 [stat.me] 2 May 2015

(Genome-wide) association analysis

2. Map genetic distance between markers

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

Statistical Genetics I: STAT/BIOST 550 Spring Quarter, 2014

Part 8: GLMs and Hierarchical LMs and GLMs

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

Genotype Imputation. Biostatistics 666

where x and ȳ are the sample means of x 1,, x n

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Notes on the Multivariate Normal and Related Topics

Joint Analysis of Multiple Gene Expression Traits to Map Expression Quantitative Trait Loci. Jessica Mendes Maia

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Major Genes, Polygenes, and

DNA polymorphisms such as SNP and familial effects (additive genetic, common environment) to

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Master s Written Examination

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Final Exam. Name: Solution:

The evolutionary success of flowering plants is to a certain

Semiparametric Generalized Linear Models

Quantitative Genetics. February 16, 2010

Transcription:

Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University kbroman@jhsph.edu www.biostat.jhsph.edu/ kbroman Outline Experiments and data Models ANOVA at marker loci Interval mapping Epistasis LOD thresholds CIs for QTL location Selection bias The X chromosome Selective genotyping Covariates Non-normal traits The need for good data

Backcross experiment P 1 P 2 F 1 A A B B A B BC BC BC BC Intercross experiment P 1 P 2 F 1 A A B B A B F 2 F 2 F 2 F 2

Phenotype distributions Within each of the parental and F 1 strains, individuals are genetically identical. Environmental variation may or may not be constant with genotype. The backcross generation exhibits genetic as well as environmental variation. Parental strains A B 0 20 40 60 80 100 F 1 generation 0 20 40 60 80 100 Backcross generation 0 20 40 60 80 100 Phenotype Data Phenotypes: Genotypes: Genetic map: y i = trait value for individual i x ij = 0/1 if mouse i is BB/AB at marker j (or 0/1/2, in an intercross) Locations of markers

Genetic map 0 20 Location (cm) 40 60 80 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Chromosome Genotype data 120 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X 100 80 Individuals 60 40 20 20 40 60 80 100 120 Markers

Goals Detect QTLs (and interactions between QTLs) Confidence intervals for QTL location Estimate QTL effects (effects of allelic substitution) Statistical structure QTL Covariates Markers Phenotype The missing data problem: Markers QTL The model selection problem: QTL, covariates phenotype

Models: Recombination We assume no crossover interference. = Points of exchange (crossovers) are according to a Poisson process. = The {x ij } (marker genotypes) form a Markov chain Crossover locations Crossover locations follow a Poisson process 1. Number of XOs Poisson( L / 100 ) Locations of XOs (given number) iid uniform 2. Inter-XO distances are iid exponential( mean = 100 cm ) Note: In reality, crossovers are more spread out, but this no interference (NI) model is extremely convenient.

Marker genotypes Under the no interference model, the marker genotypes follow a Markov chain. x 1 x 2 x 3 x j x M Pr(x j+1 x j,x j 1,...,x 1 ) = Pr(x j+1 x j ) The past and future are conditionally independent, given the present. Example A B B A A B A A B A A A A A A B?

Models: Genotype Phenotype Let y = phenotype g = whole genome genotype Imagine a small number of QTLs with genotypes g 1,...,g p. (2 p distinct genotypes) E(y g) = µ g1,...,g p var(y g) = σ 2 g 1,...,g p Models: Genotype Phenotype Homoscedasticity (constant variance): σ 2 g σ 2 Normally distributed residual variation: y g N(µ g,σ 2 ). Additivity: Epistasis: µ g1,...,g p = µ + p j=1 j g j (g j = 1 or 0) Any deviations from additivity.

The simplest method: ANOVA Also known as marker regression. Split mice into groups according to genotype at a marker. Do a t-test / ANOVA. Repeat for each marker. Phenotype 80 70 60 50 40 BB AB Genotype at D1M30 BB AB Genotype at D2M99 Effect at a marker Consider the case of a single QTL with effect = µ BB µ AB. Consider a marker linked to the QTL, with r = recomb. frac. Of individuals with marker genotype BB, mean phenotype is: µ BB (1 r) + µ AB r = µ BB r Of individuals with marker genotype AB, mean phenotype is: µ AB (1 r) + µ BB r = µ AB + r Difference: (µ BB r ) (µ AB + r ) = (1 2r)

ANOVA at marker loci Advantages Simple. Easily incorporates covariates. Easily extended to more complex models. Doesn t require a genetic map. Disadvantages Must exclude individuals with missing genotype data. Imperfect information about QTL location. Suffers in low density scans. Only considers one QTL at a time. Lander & Botstein (1989) Interval mapping (IM) Take account of missing genotype data Interpolate between markers Maximum likelihood under a mixture model 8 6 LOD score 4 2 0 0 20 40 60 Map position (cm)

Lander & Botstein (1989) Assume a single QTL model. Interval mapping (IM) Each position in the genome, one at a time, is posited as the putative QTL. Let z = 1/0 if the (unobserved) QTL genotype is BB/AB. Assume y N(µ z,σ) Given genotypes at linked markers, y mixture of normal dist ns with mixing proportion Pr(z = 1 marker data): QTL genotype M 1 M 2 BB AB BB BB (1 r L )(1 r R )/(1 r) r L r R /(1 r) BB AB (1 r L )r R /r r L (1 r R )/r AB BB r L (1 r R )/r (1 r L )r R /r AB AB r L r R /(1 r) (1 r L )(1 r R )/(1 r) The normal mixtures 7 cm 13 cm M 1 Q M 2 AB/AB Two markers separated by 20 cm, with the QTL closer to the left marker. AB/BB µ 0 µ 1 The figure at right show the distributions of the phenotype conditional on the genotypes at the two markers. BB/AB µ 0 µ 1 The dashed curves correspond to the components of the mixtures. BB/BB µ 0 µ 1 20 40 µ 0 60 µ 1 80 100 Phenotype

Interval mapping (continued) Let p i = Pr(z i = 1 marker data) y i z i N(µ zi,σ 2 ) Pr(y i marker data,µ 0,µ 1,σ) = p i f(y i ;µ 1,σ) + (1 p i )f(y i ;µ 0,σ) where f(y;µ,σ) = exp[ (y µ) 2 /(2σ 2 )]/ 2πσ 2 Log likelihood: l(µ 0,µ 1,σ) = i log Pr(y i marker data,µ 0,µ 1,σ) Maximum likelihood estimates (MLEs) of µ 0, µ 1, σ: values for which l(µ 0,µ 1,σ) is maximized. Dempster et al. (1977) E step: Let M step: Let EM algorithm w (k+1) = Pr(z i = 1 y i, marker data, ˆµ (k) 0, ˆµ(k) 1, ˆσ(k) ) ˆµ (k+1) = p i f(y i ;ˆµ (k) 1,ˆσ(k) ) p i f(y i ;ˆµ (k) 1,ˆσ(k) )+(1 p i )f(y i ;ˆµ (k) 0,ˆσ(k) ) 1 = i y iw (k+1) / i w(k+1) ˆµ (k+1) The algorithm: i 0 = i y i(1 w (k+1) i )/ i (1 w(k+1) ˆσ (k+1) = [not worth writing down] Start with w (1) i i i ) = p i ; iterate the E & M steps until convergence.

Example Iteration ˆµ 0 ˆµ 1 ˆσ log likelihood 1 5.903 6.492 1.668 770.752 2 5.835 6.562 1.654 770.291 3 5.818 6.579 1.651 770.264 4 5.815 6.583 1.650 770.262..... 5.813 6.584 1.649 770.262 LOD scores The LOD score is a measure of the strength of evidence for the presence of a QTL at a particular location. LOD(z) = log 10 likelihood ratio comparing the hypothesis of a QTL at position z versus that of no QTL { } Pr(y QTL at z,ˆµ0z,ˆµ = log 1z,ˆσ z ) 10 Pr(y no QTL,ˆµ,ˆσ) ˆµ 0z, ˆµ 1z, ˆσ z are the MLEs, assuming a single QTL at position z. No QTL model: The phenotypes are independent and identically distributed (iid) N(µ,σ 2 ).

An example LOD curve 12 10 8 LOD 6 4 2 0 0 20 40 60 80 100 Chromosome position (cm) LOD curves 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 3.0 2.5 2.0 lod 1.5 1.0 0.5 0.0 0 500 1000 1500 2000 2500 Map position (cm)

Interval mapping Advantages Takes proper account of missing data. Allows examination of positions between markers. Gives improved estimates of QTL effects. Provides pretty graphs. Disadvantages Increased computation time. Requires specialized software. Difficult to generalize. Only considers one QTL at a time. Multiple QTL methods Why consider multiple QTLs at once? Reduce residual variation. Separate linked QTLs. Investigate interactions between QTLs (epistasis).

Epistasis in a backcross Additive QTLs Ave. phenotype 100 80 60 40 20 H A QTL 2 0 A QTL 1 H Interacting QTLs Ave. phenotype 100 80 60 40 20 H A QTL 2 0 A QTL 1 H Epistasis in an intercross Additive QTLs Ave. phenotype 100 80 60 40 20 B H A QTL 2 0 A H B QTL 1 Interacting QTLs Ave. phenotype 100 80 60 40 20 B H QTL 2 0 A H B QTL 1 A

LOD thresholds Large LOD scores indicate evidence for the presence of a QTL. Q: How large is large? We consider the distribution of the LOD score under the null hypothesis of no QTL. Key point: We must make some adjustment for our examination of multiple putative QTL locations. We seek the distribution of the maximum LOD score, genomewide. The 95th %ile of this distribution serves as a genome-wide LOD threshold. Estimating the threshold: simulations, analytical calculations, permutation (randomization) tests. Null distribution of the LOD score Null distribution derived by computer simulation of backcross with genome of typical size. Solid curve: distribution of LOD score at any one point. Dashed curve: distribution of maximum LOD score, genome-wide. 0 1 2 3 4 LOD score

Permutation tests markers phenotypes mice genotype data LOD(z) (a set of curves) M = max z LOD(z) Permute/shuffle the phenotypes; keep the genotype data intact. Calculate LOD (z) M = max z LOD (z) We wish to compare the observed M to the distribution of M. Pr(M M) is a genome-wide P-value. The 95th %ile of M is a genome-wide LOD threshold. We can t look at all n! possible permutations, but a random set of 1000 is feasible and provides reasonable estimates of P-values and thresholds. Value: conditions on observed phenotypes, marker density, and pattern of missing data; doesn t rely on normality assumptions or asymptotics. Permutation distribution 95th percentile 0 1 2 3 4 5 6 7 maximum LOD score

Confidence intervals for QTL location Confidence interval indicates plausible location of a QTL. Methods: LOD support intervals (I prefer 1.5-LOD intervals.) Bootstrap (Resample individuals with replacement) 1.5-LOD support interval 6 1.5 5 4 lod 3 2 1 0 1.5 LOD support interval 0 5 10 15 20 25 30 35 Map position (cm)

Bootstrap-based confidence intervals Consider a set of n individuals (e.g., n = 124) Sample, with replacement, n individuals Some individuals duplicated Some individuals missing Using the sampled individuals, re-calculate the LOD curve for the chromosome of interest Identify the location of the maximum LOD score Repeat many times (e.g., 250) Bootstrap CI = (2.5 %ile to 97.5 %ile) of locations of maximum LOD. Bootstrap confidence interval 1.5 LOD support interval Bootstrap CI 0 5 10 15 20 25 30 35

Selection bias The estimated effect of a QTL will vary somewhat from its true effect. Only when the estimated effect is large will the QTL be detected. Among those experiments in which the QTL is detected, the estimated QTL effect will be, on average, larger than its true effect. This is selection bias. Selection bias is largest in QTLs with small or moderate effects. The true effects of QTLs that we identify are likely smaller than was observed. QTL effect = 5 Bias = 79% 0 5 10 15 QTL effect = 8 Bias = 18% Estimated QTL effect 0 5 10 15 QTL effect = 11 Bias = 1% Estimated QTL effect 0 5 10 15 Estimated QTL effect Implications of selection bias Estimated % variance explained by identified QTLs Repeating an experiment Congenics Marker-assisted selection

The X chromosome In a backcross, the X chromosome may or may not be segregating. (A B) A Females: Males: X A B X A X A B Y A A (A B) Females: Males: X A X A X A Y B The X chromosome In an intercross, one must pay attention to the paternal grandmother s genotype. (A B) (A B) or (B A) (A B) Females: Males: X A B X A X A B Y B (A B) (B A) or (B A) (B A) Females: Males: X A B X B X A B Y A

Selective genotyping Save effort by only typing the most informative individuals (say, top & bottom 10%). 80 Useful in context of a single, inexpensive trait. Tricky to estimate the effects of QTLs: use IM with all phenotypes. Phenotype 70 60 50 40 Can t get at interactions. BB AB BB AB Likely better to also genotype some random portion of the rest of the individuals. All genotypes Top and botton 10% Covariates Examples: treatment, sex, litter, lab, age. 4 3 Intercross, split by sex 7 Control residual variation. Avoid confounding. lod 2 1 Look for QTL environ t interactions Adjust before interval mapping (IM) versus adjust within IM. 0 4 3 0 20 40 60 80 100 120 Map position (cm) Reverse intercross 7 lod 2 1 0 0 20 40 60 80 100 120 Map position (cm)

Non-normal traits Standard interval mapping assumes normally distributed residual variation. (Thus the phenotype distribution is a mixture of normals.) In reality: we see dichotomous traits, counts, skewed distributions, outliers, and all sorts of odd things. Interval mapping, with LOD thresholds derived from permutation tests, generally performs just fine anyway. Alternatives to consider: Nonparametric approaches (Kruglyak & Lander 1995) Transformations (e.g., log, square root) Specially-tailored models (e.g., a generalized linear model, the Cox proportional hazard model, and the model in Broman et al. 2000) Check data integrity The success of QTL mapping depends crucially on the integrity of the data. Segregation distortion Genetic maps / marker positions Genotyping errors (tight double crossovers) Phenotype distribution / outliers Residual analysis

Summary I ANOVA at marker loci (aka marker regression) is simple and easily extended to include covariates or accommodate complex models. Interval mapping improves on ANOVA by allowing inference of QTLs to positions between markers and taking proper account of missing genotype data. ANOVA and IM consider only single-qtl models. Multiple QTL methods allow the better separation of linked QTLs and are necessary for the investigation of epistasis. Statistical significance of LOD peaks requires consideration of the maximum LOD score, genome-wide, under the null hypothesis of no QTLs. Permutation tests are extremely useful for this. 1.5-LOD support intervals indicate the plausible location of a QTL. Alternatively, perform a bootstrap. Estimates of QTL effects are subject to selection bias. Such estimated effects are often too large. Summary II The X chromosome must be dealt with specially, and can be tricky. Study your data. Look for errors in the genetic map, genotyping errors and phenotype outliers. But don t worry about them too much. Selective genotyping can save you time and money, but proceed with caution. Study your data. The consideration of covariates may reveal extremely interesting phenomena. Interval mapping works reasonably well even with non-normal traits. But consider transformations or specially-tailored models. If interval mapping software is not available for your preferred model, start with some version of ANOVA.

References Broman KW (2001) Review of statistical methods for QTL mapping in experimental crosses. Lab Animal 30:44 52 A review for non-statisticians. Doerge RW (2002) Mapping and analysis of quantitative trait loci in experimental populations. Nat Rev Genet 3:43 52 A very recent review. Doerge RW, Zeng Z-B, Weir BS (1997) Statistical issues in the search for genes affecting quantitative traits in experimental populations. Statistical Science 12:195 219 Review paper. Jansen RC (2001) Quantitative trait loci in inbred lines. In Balding DJ et al., Handbook of statistical genetics, John Wiley & Sons, New York, chapter 21 Review in an expensive but rather comprehensive and likely useful book. Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer Associates, Sunderland, MA, chapter 15 Chapter on QTL mapping. Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185 199 The seminal paper. Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138:963 971 LOD thresholds by permutation tests. Kruglyak L, Lander ES (1995) A nonparametric approach for mapping quantitative trait loci. Genetics 139:1421 1428 Non-parameteric interval mapping. Broman KW (2003) Mapping quantitative trait loci in the case of a spike in the phenotype distribution. Genetics 163:1169 1175 QTL mapping with a special model for a non-normal phenotype. Visscher PM, Thompson R, Haley CS (1996) Confidence intervals in QTL mapping by bootstrapping. Genetics 143:1013 1020 Bootstrap confidence intervals. Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter 11. An old but excellent general genetics textbook with a very interesting discussion of epistasis.