QTL Mapping I: Overview and using Inbred Lines

Similar documents
Lecture 8. QTL Mapping 1: Overview and Using Inbred Lines

Lecture 6. QTL Mapping

Multiple QTL mapping

Mapping multiple QTL in experimental crosses

Mapping multiple QTL in experimental crosses

QTL Model Search. Brian S. Yandell, UW-Madison January 2017

Overview. Background

Statistical issues in QTL mapping in mice

R/qtl workshop. (part 2) Karl Broman. Biostatistics and Medical Informatics University of Wisconsin Madison. kbroman.org

Introduction to QTL mapping in model organisms

Introduction to QTL mapping in model organisms

Gene mapping in model organisms

QTL model selection: key players

Model Selection for Multiple QTL

Use of hidden Markov models for QTL mapping

Methods for QTL analysis

Introduction to QTL mapping in model organisms

Lecture 9. QTL Mapping 2: Outbred Populations

Introduction to QTL mapping in model organisms

Major Genes, Polygenes, and

QTL model selection: key players

Mapping QTL to a phylogenetic tree

I Have the Power in QTL linkage: single and multilocus analysis

Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees:

Introduction to QTL mapping in model organisms

Prediction of the Confidence Interval of Quantitative Trait Loci Location

BAYESIAN MAPPING OF MULTIPLE QUANTITATIVE TRAIT LOCI

Lecture 28: BLUP and Genomic Selection. Bruce Walsh lecture notes Synbreed course version 11 July 2013

MULTIPLE-TRAIT MULTIPLE-INTERVAL MAPPING OF QUANTITATIVE-TRAIT LOCI ROBY JOEHANES

Lecture 11: Multiple trait models for QTL analysis

Single Marker Analysis and Interval Mapping

One-week Course on Genetic Analysis and Plant Breeding January 2013, CIMMYT, Mexico LOD Threshold and QTL Detection Power Simulation

Model Selection. Frank Wood. December 10, 2009

Selecting explanatory variables with the modified version of Bayesian Information Criterion

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Linkage Mapping. Reading: Mather K (1951) The measurement of linkage in heredity. 2nd Ed. John Wiley and Sons, New York. Chapters 5 and 6.

Causal Model Selection Hypothesis Tests in Systems Genetics

Inferring Genetic Architecture of Complex Biological Processes

Regression, Ridge Regression, Lasso

THE data in the QTL mapping study are usually composed. A New Simple Method for Improving QTL Mapping Under Selective Genotyping INVESTIGATION

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

A Statistical Framework for Expression Trait Loci (ETL) Mapping. Meng Chen

Variance Component Models for Quantitative Traits. Biostatistics 666

A new simple method for improving QTL mapping under selective genotyping

Lecture 8 Genomic Selection

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Quantitative genetics

Lecture 5: BLUP (Best Linear Unbiased Predictors) of genetic values. Bruce Walsh lecture notes Tucson Winter Institute 9-11 Jan 2013

STAT 100C: Linear models

(Genome-wide) association analysis

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

Users Manual of QTL IciMapping v3.1

MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES

Affected Sibling Pairs. Biostatistics 666

Lecture 2. Basic Population and Quantitative Genetics

Sparse Linear Models (10/7/13)

Hierarchical Generalized Linear Models for Multiple QTL Mapping

Multiple interval mapping for ordinal traits

Calculation of IBD probabilities

Supplementary Materials for Molecular QTL Discovery Incorporating Genomic Annotations using Bayesian False Discovery Rate Control

A Frequentist Assessment of Bayesian Inclusion Probabilities

what is Bayes theorem? posterior = likelihood * prior / C

Linear model selection and regularization

Calculation of IBD probabilities

Tuning Parameter Selection in L1 Regularized Logistic Regression

Mixture Models. Pr(i) p i (z) i=1. It is usually assumed that the underlying distributions are normals, so this becomes. 2 i

Module 4: Bayesian Methods Lecture 9 A: Default prior selection. Outline

Expression QTLs and Mapping of Complex Trait Loci. Paul Schliekelman Statistics Department University of Georgia

BAYESIAN STATISTICAL METHODS IN GENE-ENVIRONMENT AND GENE-GENE INTERACTION STUDIES

Locating multiple interacting quantitative trait. loci using rank-based model selection

Fast Bayesian Methods for Genetic Mapping Applicable for High-Throughput Datasets

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Feature selection with high-dimensional data: criteria and Proc. Procedures

Anumber of statistical methods are available for map- 1995) in some standard designs. For backcross populations,

MS-C1620 Statistical inference

Eiji Yamamoto 1,2, Hiroyoshi Iwata 3, Takanari Tanabata 4, Ritsuko Mizobuchi 1, Jun-ichi Yonemaru 1,ToshioYamamoto 1* and Masahiro Yano 5,6

How the mean changes depends on the other variable. Plots can show what s happening...

2. Map genetic distance between markers

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

The Admixture Model in Linkage Analysis

Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping

Lecture 9 GxE Mixed Models. Lucia Gutierrez Tucson Winter Institute

Causal Graphical Models in Systems Genetics

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Machine Learning for OR & FE

Chapter 4: Factor Analysis

Generalized Linear Models

Bayesian reversible-jump for epistasis analysis in genomic studies

Genome-wide Multiple Loci Mapping in Experimental Crosses by the Iterative Adaptive Penalized Regression

Multiple-Interval Mapping for Quantitative Trait Loci Controlling Endosperm Traits. Chen-Hung Kao 1

Variable Selection in Predictive Regressions

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Statistics 262: Intermediate Biostatistics Model selection

MOST complex traits are influenced by interacting

Linear Regression (1/1/17)

Consistent high-dimensional Bayesian variable selection via penalized credible regions

High-dimensional regression modeling

MCMC IN THE ANALYSIS OF GENETIC DATA ON PEDIGREES

QUANTITATIVE traits are characterized by continuous

1.5 Testing and Model Selection

Transcription:

QTL Mapping I: Overview and using Inbred Lines

Key idea: Looking for marker-trait associations in collections of relatives If (say) the mean trait value for marker genotype MM is statisically different from that for genotype mm, then the M/m marker is linked to a QTL One can use a random collection of such markers spanning a genome (a genomic scan) to search for QTLs

Experimental Design: Crosses P 1 x P 2 B 1 Backcross design F 1 B 2 F 1 x F 1 Backcross F 2 design design RILs = Recombinant inbred lines (selfed F 1 s) F 2 F k F 2 Advanced intercross Design (AIC, AIC k )

Experimental Designs: Marker Analysis Single marker analysis Flanking marker analysis (interval mapping) Composite interval mapping Interval mapping plus additional markers Multipoint mapping Uses all markers on a chromosome simultaneously

Conditional Probabilities of QTL Genotypes The basic building block for all QTL methods is Pr(Q k M j ) --- the probability of QTL genotype Q k given the marker genotype is M j. Pr(Q k M j ) = Pr(Q km j ) Pr(M j ) Consider a QTL linked to a marker (recombination Fraction = c). Cross MMQQ x mmqq. In the F1, all gametes are MQ and mq In the F2, freq(mq) = freq(mq) = (1-c)/2, freq(mq) = freq(mq) = c/2

Hence, Pr(MMQQ) = Pr(MQ)Pr(MQ) = (1-c) 2 /4 Pr(MMQq) = 2Pr(MQ)Pr(Mq) = 2c(1-c) /4 Pr(MMqq) = Pr(Mq)Pr(Mq) = c 2 /4 Why the 2? MQ from father, Mq from mother, OR MQ from mother, Mq from father Since Pr(MM) = 1/4, the conditional probabilities become Pr(QQ MM) = Pr(MMQQ)/Pr(MM) = (1-c) 2 Pr(Qq MM) = Pr(MMQq)/Pr(MM) = 2c(1-c) Pr(qq MM) = Pr(MMqq)/Pr(MM) = c 2 How do we use these?

Expected Marker Means The expected trait mean for marker genotype M j is just N µ Mj = µ Qk Pr( Q k M j ) k=1 For example, if QQ = 2a, Qq = a(1+k), qq = 0, then in the F2 of an MMQQ/mmqq cross, (µ M M µ mm )/2 = a(1 2c) If the trait mean is significantly different for the genotypes at a marker locus, it is linked to a QTL A small MM-mm difference could be (i) a tightly-linked QTL of small effect or (ii) loose linkage to a large QTL

2 Marker loci Suppose the cross is M 1 M 1 QQM 2 M 2 x m 1 m 1 qqm 2 m 2 Genetic map M 1 Q M 2 c 1 c 2 c 12 No interference: c 12 = c 1 + c 2-2c 1 c 2 Complete interference: c 12 = c 1 + c 2 In F2, Pr(M 1 QM 2 ) = (1-c 1 )(1-c 2 ) Pr(M 1 Qm 2 ) = (1-c 1 ) c 2 Pr(m 1 QM 2 ) = (1-c 1 ) c 2 Likewise, Pr(M 1 M 2 ) = 1-c 12 = 1- c 1 + c 2

A little bookkeeping gives Pr(QQ M 1 M 1 M 2 M 2 ) = (1 c 1 ) 2 (1 c 2 ) 2 (1 c 12 ) 2 Pr(Qq M 1 M 1 M 2 M 2 ) = 2c 1c 2 (1 c 1 )(1 c 2 ) (1 c 12 ) 2 Pr(qq M 1 M 1 M 2 M 2 ) = c2 1 c2 2 (1 c 12 ) 2 µ M1M1M2M2 µ m1m1m2m2 2 ( 1 c = a 1 c - 2 1 c 1 c 2 + 2c 1 c 2 a (1 2c 1 c 2 ) ) This is essentially a for even modest linkage

c 1 = 1 2 1 2 ( ) 1 µ M 1 M 1 µ m1 m 1 2a ( µ 1 M1M1 µ m1m1 µ M1 M 1 M 2 M 2 µ m1 m 1 m 2 m 2 ) Hence, a and c can be estimated from the mean values of flanking marker genotypes

Linear Models for QTL Detection The use of differences in the mean trait value for different marker genotypes to detect a QTL and estimate its effects is a use of linear models. One-way ANOVA. Value of trait in kth individual of marker genotype type i z ik = µ + b i + e ik Effect of marker genotype i on trait value

z ik = µ + b i + e ik Detection: a QTL is linked to the marker if at least one of the b i is significantly different from zero Estimation: (QTL effect and position): This requires relating the b i to the QTL effects and map position

Detecting epistasis One major advantage of linear models is their flexibility. To test for epistasis between two QTLs, used an ANOVA with an interaction term z = µ + ai + bk + dik + e Effect from marker genotype at first marker set (can be > 1 loci) Effect from marker genotype at second marker set Interaction between marker genotypes i in 1st marker set and k in 2nd marker set

Detecting epistasis z = µ + a i + b k + d ik + e At least one of the a i significantly different from 0 ---- QTL linked to first marker set At least one of the b k significantly different from 0 ---- QTL linked to second marker set At least one of the d ik significantly different from 0 ---- interactions between QTL in sets 1 and two Problem: Huge number of potential interaction terms (order m 2, where m = number of markers)

Maximum Likelihood Methods ML methods use the entire distribution of the data, not just the marker genotype means. More powerful that linear models, but not as flexible in extending solutions (new analysis required for each model) Basic likelihood function: Trait value given marker genotype is type j l(z Mj ) = N k=1 ϕ(z, µq k,σ 2 ) Pr( Qk Mj ) This is a mixture model

Maximum Likelihood Methods Sum over the N possible linked QTL genotypes Probability of QTL genotype k given marker genotype j --- genetic map and linkage phase entire here l(z Mj ) = N k=1 ϕ(z, µq k,σ 2 ) Pr( Qk Mj ) Distribution of trait value given QTL genotype is k is normal with mean µ Qk. (QTL effects enter here)

ML methods combine both detection and estimation of QTL effects/position. Test for a linked QTL given from the LR test LR = 2 ln The LR score is often plotted by trying different locations for the QTL (i.e., values of c) and computing a LOD score for each max l r(z) max l(z) LOD(c) = log 10 [ max lr (z) max l(z, c ) Maximum of the likelihood under a no-linked QTL model Maximum of the full likelihood ] = LR(c) 2 ln 10 LR(c) 4.61

A typical QTL map from a likelihood analysis Estimated QTL location Support interval Significance Threshold

Interval Mapping with Marker Cofactors Consider interval mapping using the markers i and i+1. QTLs linked to these markers, but outside this interval, can contribute (falsely) to estimation of QTL position and effect i-1 i i+1 i+2 Interval being mapped Now suppose we also add the two markers flanking the interval (i-1 and i+2)

i-1 i i+1 i+2 Inclusion of markers i-1 and i+2 fully account for any linked QTLs to the left of i-1 and the right of i+2 Interval mapping + marker cofactors is called Composite Interval Mapping (CIM) CIM works by adding an additional term to the linear model, k i,i+1 b k x kj CIM also (potentially) includes unlinked markers to account for QTL on other chromosomes.

Power and Precision While modest sample sizes are sufficient to detect a QTL of modest effect (power), large sample sizes are required to map it with some precision With 200-300 F 2, a QTL accounting for 5% of total variation can be mapped to a 40cM interval Over 10,000 F 2 individuals are required to map this QTL to a 1cM interval

Power and Repeatability: The Beavis Effect QTLs with low power of detection tend to have their effects overestimated, often very dramatically As power of detection increases, the overestimation of detected QTLs becomes far less serious This is often called the Beavis Effect, after Bill Beavis who first noticed this in simulation studies

Model selection With (say) 300 markers, we have (potentially) 300 single-marker terms and 300*299/2 = 44,850 epistatic terms Hence, a model with up to p= 45,150 possible parameters 2 p possible submodels = 10 13,600 ouch! The issue of Model selection becomes very important. How do we find the best model? Stepwise regression approaches Forward selection (add terms one at a time) Backwards selection (delete terms one at a time) Try all models, assess best fit Mixed-model approaches (SSVS)

Model Selection Model Selection: Use some criteria to chose among a number of candidate models. Weight goodness-of-fit (L, value of the likelihood at the MLEs) vs. number of estimated parameters (k) AIC = Akaike s information criterion AIC = 2k - 2 Ln(L) BIC = Bayesian information criterion (Schwarz criterion) BIC = k*ln(n)/n - 2 Ln(L)/n BIC penalizes free parameters more strongly than AIC Other measures. Smaller is better

Model averaging Model averaging: Generate a composite model by weighting (averaging) the various models, using AIC, BIC, or other Idea: Perhaps no best model, but several models all extremely close. Better to report this distribution rather than the best one One approach is to average the coefficients on the best-fitting models using some scheme to return a composite model

Stochastic search variable selection (SSVS) A Bayesian approach approach to search through a space of possible models Fit the model y i = µ + x i T! + e, including ALL possible covariates of the full model, X = (x i T.. x n T ) Idea: Assume model parameters fall into two classes: those with values very near zero and those with larger values We use the latent (unobservable) variable " i, which determines into which class the covariate falls! i ~ (1- " i )N(0,# i2 ) + " i N(0, c i2 # i2 ) # i 2 small, hence values near zero c i2 # i 2 large, hence values can deviate substantially from zero Posterior probability of (" i ) is the probability that the parameter is included in the final model

! i ~ (1- " i )N(0,# i2 ) + " i N(0, c i2 # i2 ) While # and c can vary over covariates (model variables), typically they are assigned the same values over all i (e.g., # i 2 = 0.001, c i2 # i 2 = 10) Let " be the vector of indicator random variables, with a value of one if in the model, zero otherwise Given a current " vector, the conditional prior of the! values is MVN with mean zero and covariance matrix D " RD " Where R is the prior correlation matrix, either taken as R = I or R = (X T X) -1 And D is a diagonal matrix whose ith element is 1 if " i = 0 and c i if " i = 1.

The idea is that with a current estimate of µ, $ e2, and " in hand, we can update!, which is drawn from a MVN distribution Mean = (X T X + $ e2 (D " RD " ) -1 ) -1 X T (y- µ I) Variance = (X T X + $ e2 (D " RD " ) -1 ) -1 With an updated! vector in hand, we update the " vector Usually, the prior on the " i is independent of i and taken to be p i = 1/2

Supersaturated Models A problem with many QTL approaches is that there are far more parameters (p) to estimate than there are independent samples (n). Case in point: epistasis Such supersaturated models arise commonly in Genomics. How do we deal with them? Again, an approach like SSVS, where all parameters are included, but some are shrunk back (regressed) towards zero by assigning them a very small posterior variance

Shrinkage estimators Shrinkage estimates: Rather than adding interaction terms one at a time, a shrinkage method starts with all interactions included, and then shrinks most back to zero. Under a Bayesian analysis, any effect is random. One can assume the effect for (say) interaction ij is drawn from a normal with mean zero and variance $ 2 ij Further, the interaction-specific variances are themselves random variables drawn from a hyperparameter distribution, such as an inverse chi-square. One then estimates the hyperparameters and uses these to predict the variances, with effects with small variances shrinking back to zero, and effects with large variances remaining in the model.

Key ideas in QTL mapping Look for marker-trait associations Many difference crossing designs (F2, BC, RIL) Many difference methods of analysis: Linear models, MLE, Bayesian Most studies UNDERPOWERED, esp. for fine mapping

What is a QTL A detected QTL in a mapping experiment is a region of a chromosome detected by linkage. Usually large (typically 10-40 cm) When further examined, most large QTLs turn out to be a linked collection of locations with increasingly smaller effects The more one localizes, the more subregions that are found, and the smaller their effect