Tutorial Session 2. MCMC for the analysis of genetic data on pedigrees

MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2
Elizabeth Thompson, University of Washington

Genetic mapping and linkage lod scores
Monte Carlo likelihood and likelihood ratio estimation
Monte Carlo estimation of linkage lod scores

GENETIC MARKERS

Human genome: $3 \times 10^9$ bp of DNA.
Markers: DNA variants that can be typed in individuals.
Allele: the type of the DNA at a position on a chromosome.
Markers have been mapped: they have known locations on the genome.
Locus: a position on a chromosome, or the DNA at that position.
Idea: map genes for traits relative to these markers.
Microsatellites: lots of alleles; 350 in a genome scan; one every $10^7$ bp.
SNPs: typically only two alleles; lots more exist; 1 per 1000 bp.

THE STRUCTURE OF A GENETIC MODEL

Population model: parameters $q$ provide probabilities for the latent allelic types $A$ of the FGL at each locus $j$.
Inheritance model: parameters $\rho$ provide probabilities for the latent inheritance $S$ of the FGL at locus $j$, jointly over $j$.
Individual genotypes $G$ are a deterministic function of $(S, A)$.
Penetrance model: parameters $\beta$ relate $G$ (and perhaps observable covariates) to the observable data $Y$.
$\xi = (q, \rho, \beta)$.

FROM RECOMBINATION TO LOCATION

Recall the model for $S_{i,\bullet} = (S_{i,1}, \ldots, S_{i,L})$: $\Pr(S_{i,j} \neq S_{i,j+1}) = \rho_j$, assumed the same for all $i$ (for convenience).
$S_{i,j}$ is assumed Markov in $j$: no genetic interference.
Genetic distance $d$ is the expected number of crossover events on the underlying chromosome: an additive measure.
Crossovers arise as a Poisson process of rate 1 (per Morgan).
There is a recombination between two loci if there is an odd number $W$ of crossovers between them: $W(d) \sim \mathcal{P}(d)$, Poisson with mean $d$.
Hence the Haldane map function: $\rho(d) = \frac{1}{2}\left(1 - \exp(-2d)\right)$.
The key thing is the model: the map function just puts loci onto a linear location map. (See later: MCMC under interference.)
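As a quick check of the step from the Poisson crossover model to the Haldane map function, the sketch below compares the closed form with a direct sum of the Poisson probabilities of odd crossover counts. The function names are illustrative and not from the slides; distances are assumed to be in Morgans.

```python
import math

def haldane_rho(d):
    """Haldane recombination fraction for genetic distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def prob_odd_poisson(d, terms=200):
    """P(W is odd) for W ~ Poisson(d), summed directly from the pmf."""
    total = 0.0
    pmf = math.exp(-d)            # P(W = 0)
    for w in range(terms):
        if w % 2 == 1:
            total += pmf          # add P(W = w) for odd w
        pmf *= d / (w + 1)        # step to P(W = w + 1)
    return total

def haldane_distance(rho):
    """Inverse map: distance in Morgans for a recombination fraction rho < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * rho)

if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 1.0):
        print(d, haldane_rho(d), prob_odd_poisson(d))
```

The two columns agree to rounding error, and the recombination fraction approaches 1/2 as the distance grows.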

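WHAT AND WHY: THE LOCATION LOD SCORE

[Diagram: markers M1, ..., M5 along a chromosome with marker model $\Lambda_M = ((q_i, \lambda_i);\ i = 1, \ldots, L)$, and a trait locus at position $\gamma$ with trait model $\beta$ and trait data $Y_T$.]

Parameter $\xi = (\beta, \gamma, \Lambda_M)$. Data $Y = (Y_M, Y_T)$.

$$\mathrm{lod}(\gamma) = \log_{10}\left(\frac{\Pr(Y; \Lambda_M, \beta, \gamma)}{\Pr(Y; \Lambda_M, \beta, \gamma = \infty)}\right)$$

The trait locus location $\gamma$ is the parameter of interest: $\gamma = \infty$ is no linkage.
Exact computation is infeasible.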

AN EXAMPLE PEDIGREE: APPROXIMATED SIMPED

[Figure: pedigree showing disease status and marker availability.]
Marker data are SIMULATED at 10 linked markers on Chr 1. The trait is close to M6.

AN EXAMPLE MULTIPOINT LOD SCORE

[Figure: lod score plotted against chromosome position (cM).]

MONTE CARLO LIKELIHOODS ON PEDIGREES

Monte Carlo estimates expectations.

$$L(\xi) = P_\xi(Y) = \sum_S P_\xi(S, Y) = \sum_S P_\xi(Y \mid S)\, P_\xi(S)$$

for parameters $\xi$ and latent variables $S$.
Simple (but not useful) example: $L(\xi) = E_\xi\left(P_\xi(Y \mid S)\right)$.
More generally

$$L(\xi) = \sum_S \frac{P_\xi(S, Y)}{P^*(S)}\, P^*(S) = E_{P^*}\left(\frac{P_\xi(S, Y)}{P^*(S)}\right)$$

provided $P^*(S) > 0$ if $P_\xi(S, Y) > 0$.
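The importance-sampling identity can be illustrated on a toy problem small enough to sum exactly. Everything in the sketch below is a hypothetical stand-in, not the pedigree model of the slides: three binary latent indicators with a uniform prior, a 10% error rate linking each to an observed value, and a proposal $P^*$ that favours agreement with the data.

```python
import itertools
import random

Y = (1, 0, 1)  # hypothetical observed data, one value per latent indicator

def joint(s):
    """Toy P_xi(S, Y): uniform prior on S, 10% chance each Y_i differs from S_i."""
    p = 1.0
    for si, yi in zip(s, Y):
        p *= 0.5 * (0.9 if si == yi else 0.1)
    return p

def pstar(s):
    """Toy proposal P*(S): each S_i equals Y_i with probability 0.8."""
    p = 1.0
    for si, yi in zip(s, Y):
        p *= 0.8 if si == yi else 0.2
    return p

def sample_pstar(rng):
    """Draw one S from the proposal."""
    return tuple(yi if rng.random() < 0.8 else 1 - yi for yi in Y)

def exact_likelihood():
    """L = sum over all S of P(S, Y)."""
    return sum(joint(s) for s in itertools.product((0, 1), repeat=len(Y)))

def is_likelihood(n, seed=0):
    """Monte Carlo estimate of L as the mean of P(S, Y) / P*(S) under P*."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        s = sample_pstar(rng)
        total += joint(s) / pstar(s)
    return total / n

if __name__ == "__main__":
    print(exact_likelihood(), is_likelihood(20000))
```

The proposal puts positive probability on every configuration, so the support condition holds and the estimate is unbiased for the exact sum.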

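SEQUENTIAL IMPUTATION OVER LOCI

Choose the sampling distribution, locus by locus. Write $S^{(j)} = (S_{\bullet,1}, \ldots, S_{\bullet,j})$ and $Y^{(j)} = (Y_{\bullet,1}, \ldots, Y_{\bullet,j})$.

[Diagram: array of meioses $i$ by loci $j$, with data $Y_{\bullet,j}$ at locus $j$.]

$$P^*(S_{\bullet,j}) = P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = P_{\xi_0}(S_{\bullet,j} \mid S_{\bullet,1}, \ldots, S_{\bullet,j-1}, Y_{\bullet,1}, \ldots, Y_{\bullet,j-1}, Y_{\bullet,j}) = P_{\xi_0}(S_{\bullet,j} \mid S_{\bullet,j-1}, Y_{\bullet,j})$$

$$P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = \frac{P_{\xi_0}(S_{\bullet,j}, Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})}{P_{\xi_0}(Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})} = \frac{P_{\xi_0}(S_{\bullet,j}, Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})}{w_j}$$

where, by pedigree-peeling, we can compute

$$w_j = P_{\xi_0}(Y_{\bullet,j} \mid Y^{(j-1)}, S^{(j-1)}) = P_{\xi_0}(Y_{\bullet,j} \mid S_{\bullet,j-1}).$$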

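MONTE CARLO LIKELIHOOD ESTIMATE

Thus the sequential imputation distribution is

$$P^*(S) = \prod_{j=1}^{L} P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = \frac{P_{\xi_0}(S, Y)}{W_L(S)} \quad \text{where} \quad W_L(S) = \prod_{j=1}^{L} w_j.$$

Now

$$L(\xi_0) = P_{\xi_0}(Y) = E_{P^*}\left(\frac{P_{\xi_0}(S, Y)}{P^*(S)}\right) = E_{P^*}\left(W_L(S)\right).$$

Given $N$ realizations $S^{(\tau)}$, the estimate of $L(\xi_0)$ is $N^{-1} \sum_\tau W_L(S^{(\tau)})$.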
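A minimal sketch of sequential imputation follows, assuming a toy model in place of a pedigree: a single two-state inheritance chain over five loci with recombination fraction `RHO` and an error rate `ERR` linking each observation to the latent state. All names and values are hypothetical. In this toy the weight $w_j$ is a sum over two states; on a real pedigree it would come from peeling.

```python
import random

RHO = 0.1            # hypothetical recombination fraction between adjacent loci
ERR = 0.2            # hypothetical P(Y_j != S_j)
Y = (0, 0, 1, 1, 0)  # hypothetical observations, one per locus

def emit(y, s):
    """P(Y_j = y | S_j = s)."""
    return 1.0 - ERR if y == s else ERR

def trans(s_prev, s):
    """P(S_j = s | S_{j-1} = s_prev); uniform at the first locus."""
    if s_prev is None:
        return 0.5
    return RHO if s != s_prev else 1.0 - RHO

def exact_likelihood():
    """P(Y) by the forward algorithm, for comparison."""
    alpha = [0.5 * emit(Y[0], s) for s in (0, 1)]
    for y in Y[1:]:
        alpha = [sum(alpha[sp] * trans(sp, s) for sp in (0, 1)) * emit(y, s)
                 for s in (0, 1)]
    return sum(alpha)

def one_weight(rng):
    """Impute S locus by locus and return W_L = product of the w_j."""
    s_prev, weight = None, 1.0
    for y in Y:
        terms = [trans(s_prev, s) * emit(y, s) for s in (0, 1)]
        w_j = terms[0] + terms[1]          # w_j = P(Y_j | S_{j-1})
        weight *= w_j
        # sample S_j from P(S_j | S_{j-1}, Y_j), proportional to terms
        s_prev = 0 if rng.random() * w_j < terms[0] else 1
    return weight

def sequential_imputation(n, seed=0):
    """Estimate P(Y) as the average of W_L over n imputed realizations."""
    rng = random.Random(seed)
    return sum(one_weight(rng) for _ in range(n)) / n

if __name__ == "__main__":
    print(exact_likelihood(), sequential_imputation(20000))
```

Because each $w_j$ depends on the sampled $S_{j-1}$, the weights vary across realizations; their average is unbiased for the likelihood.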

THE IDEAL SAMPLING DISTRIBUTION

We want $P^*(S)$ close to proportional to $P_{\xi_0}(Y, S)$, that is $P^*(S) \approx P_{\xi_0}(S \mid Y)$.
Of course we cannot achieve this exactly, else Monte Carlo would be unnecessary.
Suppose we use MCMC to sample $S$ from $P_{\xi_0}(S \mid Y)$. Then

$$P_\xi(Y) = \sum_S P_\xi(Y, S) = \sum_S \frac{P_\xi(Y, S)}{P_{\xi_0}(S \mid Y)}\, P_{\xi_0}(S \mid Y) = E_{\xi_0}\left(\frac{P_\xi(Y, S)}{P_{\xi_0}(S \mid Y)} \,\Big|\, Y\right) = P_{\xi_0}(Y)\, E_{\xi_0}\left(\frac{P_\xi(Y, S)}{P_{\xi_0}(Y, S)} \,\Big|\, Y\right).$$

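LIKELIHOOD RATIO ESTIMATION

Thus we have

$$\frac{L(\xi)}{L(\xi_0)} = \frac{P_\xi(Y)}{P_{\xi_0}(Y)} = E_{\xi_0}\left(\frac{P_\xi(Y, S)}{P_{\xi_0}(Y, S)} \,\Big|\, Y\right).$$

$S$ is the random variable, $Y$ is fixed: $S \sim P_{\xi_0}(\cdot \mid Y)$.
If $S^{(\tau)}$, $\tau = 1, \ldots, N$, are realized from $P_{\xi_0}(\cdot \mid Y)$ then the likelihood ratio can be estimated by

$$\frac{1}{N} \sum_{\tau=1}^{N} \frac{P_\xi(Y, S^{(\tau)})}{P_{\xi_0}(Y, S^{(\tau)})}.$$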
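The estimator can be tried on a toy model where the exact ratio is available by enumeration. The sketch below assumes a single two-state chain over five loci, with the recombination fraction playing the role of $\xi$; a single-site Gibbs sampler stands in for the pedigree MCMC sampler. The model, the data, and all names are hypothetical.

```python
import itertools
import random

Y = (0, 0, 1, 1, 0)  # hypothetical observations, one per locus
ERR = 0.2            # hypothetical P(Y_j != S_j)

def joint(s, rho):
    """Toy P_rho(Y, S): uniform start, recombination fraction rho, error ERR."""
    p = 0.5
    for j, (sj, yj) in enumerate(zip(s, Y)):
        p *= (1.0 - ERR) if sj == yj else ERR
        if j > 0:
            p *= rho if sj != s[j - 1] else 1.0 - rho
    return p

def likelihood(rho):
    """Exact P_rho(Y) by summing over all 2^L configurations."""
    return sum(joint(s, rho) for s in itertools.product((0, 1), repeat=len(Y)))

def gibbs_sweep(s, rho, rng):
    """Update each S_j in turn from its full conditional under P_rho(S | Y)."""
    for j in range(len(s)):
        s[j] = 0
        p0 = joint(s, rho)
        s[j] = 1
        p1 = joint(s, rho)
        s[j] = 1 if rng.random() * (p0 + p1) < p1 else 0

def lr_estimate(rho1, rho0, n=20000, burn_in=1000, seed=0):
    """Average P_rho1(Y, S) / P_rho0(Y, S) over S sampled from P_rho0(S | Y)."""
    rng = random.Random(seed)
    s = list(Y)
    total = 0.0
    for t in range(n + burn_in):
        gibbs_sweep(s, rho0, rng)
        if t >= burn_in:
            total += joint(s, rho1) / joint(s, rho0)
    return total / n

if __name__ == "__main__":
    print(likelihood(0.3) / likelihood(0.2), lr_estimate(0.3, 0.2))
```

The estimate degrades as `rho1` moves away from `rho0`, since the ratio inside the average becomes more variable under the sampling distribution.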

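LINKAGE LOCATION LIKELIHOOD RATIO

The form for the linkage lod that follows directly from this is

$$\frac{L(\beta, \gamma_1, \Lambda_M)}{L(\beta, \gamma_0, \Lambda_M)} = E_{\xi_0}\left(\frac{P_{\xi_1}(Y_T, Y_M, S_T, S_M)}{P_{\xi_0}(Y_T, Y_M, S_T, S_M)} \,\Big|\, Y_T, Y_M\right)$$

for two hypothesized trait locus positions $\gamma_1$ and $\gamma_0$, with $\xi_k = (\beta, \gamma_k, \Lambda_M)$. Now

$$P_\xi(Y, S) = P_\beta(Y_T \mid S_T)\, P_{\Lambda_M}(Y_M, S_M)\, P_\gamma(S_T \mid S_M)$$

so the ratio reduces to

$$\frac{L(\beta, \gamma_1, \Lambda_M)}{L(\beta, \gamma_0, \Lambda_M)} = E_{\xi_0}\left(\frac{P_{\gamma_1}(S_T \mid S_M)}{P_{\gamma_0}(S_T \mid S_M)} \,\Big|\, Y_T, Y_M\right).$$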

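LOCAL ESTIMATE IS VERY SIMPLE: GLOBAL IS HARD

[Diagram: meiosis $i$ across the loci $\ldots, l, T, r, \ldots$, where $l$ and $r$ are the markers flanking the trait locus $T$.]

$$\frac{P_{\gamma_1}(S_T \mid S_M)}{P_{\gamma_0}(S_T \mid S_M)} = \prod_i \left(\frac{\rho_{1l}}{\rho_{0l}}\right)^{|S_{i,T} - S_{i,l}|} \left(\frac{1 - \rho_{1l}}{1 - \rho_{0l}}\right)^{1 - |S_{i,T} - S_{i,l}|} \left(\frac{\rho_{1r}}{\rho_{0r}}\right)^{|S_{i,T} - S_{i,r}|} \left(\frac{1 - \rho_{1r}}{1 - \rho_{0r}}\right)^{1 - |S_{i,T} - S_{i,r}|}$$

The above works well only for $\gamma_1 \approx \gamma_0$, and for $\gamma_0$, $\gamma_1$ with the same $l$ and $r$.
When likelihoods are not smooth, combining LR estimates does not work well, especially across markers.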
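The local ratio is a product over meioses of simple recombination terms, which the sketch below evaluates and checks against a direct computation of $P_\gamma(S_T \mid S_l, S_r)$ under the Haldane model. The positions, indicator values, and function names are hypothetical, and both trait positions are assumed to lie between the same flanking markers.

```python
import math

def haldane(d):
    """Haldane recombination fraction for distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def flanking_rhos(gamma, left, right):
    """(rho_l, rho_r) for a trait locus at gamma between markers at left, right."""
    return haldane(gamma - left), haldane(right - gamma)

def local_lr(s_t, s_l, s_r, rho1, rho0):
    """Product over meioses of the recombination-fraction ratios.

    rho1 = (rho_1l, rho_1r) and rho0 = (rho_0l, rho_0r).
    """
    ratio = 1.0
    for t, l, r in zip(s_t, s_l, s_r):
        for flank, r1, r0 in ((l, rho1[0], rho0[0]), (r, rho1[1], rho0[1])):
            if t != flank:                 # recombination: |S_T - S_flank| = 1
                ratio *= r1 / r0
            else:                          # no recombination
                ratio *= (1.0 - r1) / (1.0 - r0)
    return ratio

def _tr(a, b, rho):
    """Transition probability between two inheritance indicators."""
    return rho if a != b else 1.0 - rho

def cond_prob(s_t, s_l, s_r, rho_l, rho_r):
    """Direct P(S_T | S_l, S_r), as a product over independent meioses."""
    p = 1.0
    for t, l, r in zip(s_t, s_l, s_r):
        num = _tr(l, t, rho_l) * _tr(t, r, rho_r)
        den = sum(_tr(l, u, rho_l) * _tr(u, r, rho_r) for u in (0, 1))
        p *= num / den
    return p

if __name__ == "__main__":
    s_t, s_l, s_r = (0, 1, 1, 0), (0, 1, 0, 0), (0, 0, 1, 0)
    rho0 = flanking_rhos(0.10, 0.0, 0.2)
    rho1 = flanking_rhos(0.12, 0.0, 0.2)
    print(local_lr(s_t, s_l, s_r, rho1, rho0))
    print(cond_prob(s_t, s_l, s_r, *rho1) / cond_prob(s_t, s_l, s_r, *rho0))
```

The two printed values agree because, under the Haldane model, the normalizing term depends only on the distance between the flanking markers and so cancels in the ratio. That cancellation fails once $\gamma_0$ and $\gamma_1$ fall in different marker intervals.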

AN MCMC IMPORTANCE SAMPLING ESTIMATE

Lange and Sobel (1996) write the likelihood in the form

$$L(\beta, \gamma, \Lambda_M) = P_{\beta,\gamma,\Lambda_M}(Y_M, Y_T) \propto P_{\beta,\gamma,\Lambda_M}(Y_T \mid Y_M) = \sum_{S_M} P_{\beta,\gamma}(Y_T \mid S_M)\, P_{\Lambda_M}(S_M \mid Y_M) = E_{\Lambda_M}\left(P_{\beta,\gamma}(Y_T \mid S_M) \mid Y_M\right).$$

Sample $S_M$ given $Y_M$ and compute $P_{\beta,\gamma}(Y_T \mid S_M)$: a form of Rao-Blackwellization, integrating over $S_T$.
It is also importance sampling: maybe $P(S_M \mid Y_M) \approx P(S_M \mid Y_M, Y_T)$.
For fuzzy traits it works quite well.

METROPOLIS HASTINGS FOR INTERFERENCE

Suppose we have an interference model $P^{(I)}(S)$ in place of the Haldane model $P^{(H)}(S)$ we have used so far.
Use a block-Gibbs update of meiosis $i$ ($S_{i,\bullet}$) under the Haldane model to propose $S^*$.
The Hastings ratio for current $S$ and proposed $S^*$ is

$$h(S^*; S) = \frac{P^{(I)}(S^*, Y)}{P^{(I)}(S, Y)} \cdot \frac{P^{(H)}(S_{i,\bullet} \mid S_{k,\bullet}, k \neq i, Y)}{P^{(H)}(S^*_{i,\bullet} \mid S_{k,\bullet}, k \neq i, Y)} = \frac{P^{(I)}(S^*, Y)\, P^{(H)}(S, Y)}{P^{(I)}(S, Y)\, P^{(H)}(S^*, Y)} = \frac{P(Y \mid S^*)\, P^{(I)}(S^*)\, P(Y \mid S)\, P^{(H)}(S)}{P(Y \mid S)\, P^{(I)}(S)\, P(Y \mid S^*)\, P^{(H)}(S^*)}$$

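INTERFERENCE ctd.

$$h(S^*; S) = \prod_{k=1}^{m} \frac{P^{(I)}(S^*_{k,\bullet})\, P^{(H)}(S_{k,\bullet})}{P^{(I)}(S_{k,\bullet})\, P^{(H)}(S^*_{k,\bullet})} = \frac{P^{(I)}(S^*_{i,\bullet})}{P^{(I)}(S_{i,\bullet})} \cdot \frac{P^{(H)}(S_{i,\bullet})}{P^{(H)}(S^*_{i,\bullet})}.$$

The next state is $S^*$ with probability $a = \min(1, h)$, and remains $S$ with probability $1 - a$.
Question: is it better to sample under H and reweight, or to use M-H to sample under model I?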
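The acceptance step can be sketched for a single meiosis over four loci. The interference model below is a hypothetical stand-in, not the one intended in the slides: it multiplies the Haldane probability by `PENALTY` for each pair of adjacent intervals that both recombine, and is left unnormalized since only ratios enter $h$. In this toy, the block-Gibbs proposal under H is an exact draw from $P^{(H)}(S \mid Y)$ by enumeration.

```python
import itertools
import random

Y = (0, 1, 0, 0)  # hypothetical observations, one per locus
ERR = 0.2         # hypothetical P(Y_j != S_j)
RHO = 0.2         # hypothetical recombination fraction between adjacent loci
PENALTY = 0.1     # hypothetical down-weighting of adjacent double recombinants

STATES = list(itertools.product((0, 1), repeat=len(Y)))

def n_rec(s):
    """Number of recombinations along the meiosis."""
    return sum(1 for a, b in zip(s, s[1:]) if a != b)

def p_h(s):
    """Haldane (no interference) probability of the inheritance vector."""
    k = n_rec(s)
    return 0.5 * RHO ** k * (1.0 - RHO) ** (len(s) - 1 - k)

def p_i(s):
    """Unnormalized toy interference model (an assumption for illustration)."""
    rec = [a != b for a, b in zip(s, s[1:])]
    doubles = sum(1 for x, y in zip(rec, rec[1:]) if x and y)
    return p_h(s) * PENALTY ** doubles

def p_y_given_s(s):
    """P(Y | S) under the toy error model."""
    p = 1.0
    for sj, yj in zip(s, Y):
        p *= (1.0 - ERR) if sj == yj else ERR
    return p

def hastings_ratio(s_new, s_old):
    """h(S*; S) = P_I(S*) P_H(S) / (P_I(S) P_H(S*)); the data terms cancel."""
    return (p_i(s_new) * p_h(s_old)) / (p_i(s_old) * p_h(s_new))

def mh_mean_recombinations(n=50000, seed=0):
    """M-H under model I with proposals from P_H(S | Y); mean of n_rec(S)."""
    rng = random.Random(seed)
    weights_h = [p_h(s) * p_y_given_s(s) for s in STATES]
    s = STATES[0]
    total = 0
    for _ in range(n):
        s_new = rng.choices(STATES, weights=weights_h)[0]
        if rng.random() < min(1.0, hastings_ratio(s_new, s)):
            s = s_new
        total += n_rec(s)
    return total / n

def exact_mean_recombinations():
    """Posterior mean of n_rec(S) under model I, by enumeration."""
    w = [p_i(s) * p_y_given_s(s) for s in STATES]
    return sum(wi * n_rec(s) for wi, s in zip(w, STATES)) / sum(w)

if __name__ == "__main__":
    print(exact_mean_recombinations(), mh_mean_recombinations())
```

The same ratio `p_i(s) / p_h(s)` would serve as the importance weight if one instead sampled under H and reweighted, which is the alternative raised in the closing question.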