MCMC for the analysis of genetic data on pedigrees: Tutorial Session 2
Elizabeth Thompson, University of Washington

Genetic mapping and linkage lod scores
Monte Carlo likelihood and likelihood ratio estimation
Monte Carlo estimation of linkage lod scores

GENETIC MARKERS
Human genome: $3 \times 10^9$ bp of DNA.
Markers: DNA variants that can be typed in individuals. They have been mapped: known locations on the genome.
Allele: type of the DNA at a position on a chromosome.
Locus: position on a chromosome, or the DNA at that position.
Idea: map genes for traits relative to these markers.
Microsatellites: lots of alleles; 350 in a genome scan, one every $10^7$ bp.
SNPs: typically only two alleles; lots more exist, 1 per 1000 bp.

THE STRUCTURE OF A GENETIC MODEL
Population model: parameters $q$ provide probabilities for the latent $A$, the allelic types of the FGL at each locus $j$.
Inheritance model: parameters $\rho$ provide probabilities for the latent $S$, the inheritance of the FGL at locus $j$, jointly over $j$.
Individual genotypes $G$ are a deterministic function of $(S, A)$.
Penetrance model: parameters $\beta$ relate $G$ (and perhaps observable covariates) to the observable data $Y$.
$\xi = (q, \rho, \beta)$.

FROM RECOMBINATION TO LOCATION
Recall the model for $S_{i,\bullet} = (S_{i,1}, \ldots, S_{i,l})$:
$\Pr(S_{i,j} \neq S_{i,j+1}) = \rho_j$, assumed the same for all $i$ (convenience).
$S_{i,j}$ is assumed Markov in $j$: no genetic interference.
Genetic distance $d$ is the expected number of crossover events on the underlying chromosome: an additive measure.
Crossovers arise as a Poisson process of rate 1 (per Morgan).
There is a recombination between two loci if there is an odd number $W$ of crossovers between them, where $W(d) \sim \mathcal{P}(d)$, Poisson with mean $d$.
Hence the Haldane map function:
$$\rho(d) = \tfrac{1}{2}\,\bigl(1 - \exp(-2d)\bigr).$$
The key thing is the model: the map function just puts loci onto a linear location map. (See later: MCMC under interference.)
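
The step from "odd number of Poisson crossovers" to the Haldane formula can be checked numerically. The following is a minimal sketch; the function names are illustrative and not taken from any package, and distances are assumed to be in Morgans.

```python
import math

def haldane_rho(d_morgans):
    """Recombination fraction for genetic distance d (in Morgans), Haldane map."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_morgans))

def haldane_distance(rho):
    """Inverse map: genetic distance (Morgans) for a recombination fraction rho < 1/2."""
    return -0.5 * math.log(1.0 - 2.0 * rho)

def prob_odd_poisson(d_morgans, terms=60):
    """P(W is odd) for W ~ Poisson(d), summed term by term as a check on the closed form."""
    total = 0.0
    term = math.exp(-d_morgans)      # P(W = 0)
    for w in range(terms):
        if w % 2 == 1:
            total += term            # add P(W = w) for odd w
        term *= d_morgans / (w + 1)  # advance to P(W = w + 1)
    return total
```

Because distance is additive and there is no interference, recombination fractions over adjacent intervals combine as $\rho_{13} = \rho_{12}(1-\rho_{23}) + (1-\rho_{12})\rho_{23}$, which `haldane_rho` reproduces when the two distances are summed.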
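
WHAT AND WHY THE LOCATION LOD SCORE
[Figure: marker loci M1, M2, M3, M4, M5 along a chromosome with marker model $\Lambda_M = ((q_i, \lambda_i);\ i = 1, \ldots, L)$; trait locus at location $\gamma$ with trait model $\beta$ and trait data $Y_T$]
Parameter $\xi = (\beta, \gamma, \Lambda_M)$. Data $Y = (Y_M, Y_T)$.
$$\mathrm{lod}(\gamma) = \log_{10}\!\left( \frac{\Pr(Y;\ \Lambda_M, \beta, \gamma)}{\Pr(Y;\ \Lambda_M, \beta, \gamma = \infty)} \right)$$
Trait locus location $\gamma$ is the parameter of interest: $\gamma = \infty$ is no linkage.
Exact computation is infeasible.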

AN EXAMPLE PEDIGREE: APPROXIMATED SIMPED
[Figure: pedigree showing disease status and marker availability]
Marker data are SIMULATED at 10 linked markers on Chr 1.
Trait is close to M6.

AN EXAMPLE MULTIPOINT LOD SCORE
[Figure: lod score plotted against chromosome position (cM)]

MONTE CARLO LIKELIHOODS ON PEDIGREES
Monte Carlo estimates expectations.
$$L(\xi) = P_\xi(Y) = \sum_S P_\xi(S, Y) = \sum_S P_\xi(Y \mid S)\, P_\xi(S)$$
for parameters $\xi$ and latent variables $S$.
Simple (but not useful) example: $L(\xi) = E_\xi\bigl(P_\xi(Y \mid S)\bigr)$.
More generally
$$L(\xi) = \sum_S \left( \frac{P_\xi(S, Y)}{P^*(S)} \right) P^*(S) = E_{P^*}\!\left( \frac{P_\xi(S, Y)}{P^*(S)} \right)$$
provided $P^*(S) > 0$ if $P_\xi(S, Y) > 0$.
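
The identity $L(\xi) = E_{P^*}(P_\xi(S,Y)/P^*(S))$ can be illustrated on a toy problem small enough to enumerate. The sketch below is not a pedigree computation: the latent $S$ is assumed to be three independent bits, and the prior `Q`, the error rate `ERR`, the data `Y`, and the uniform proposal are all hypothetical choices made for illustration.

```python
import random
from itertools import product

Q = 0.3          # hypothetical prior probability that a latent bit equals 1
ERR = 0.1        # hypothetical probability that an observed bit differs from its latent bit
Y = (1, 0, 1)    # hypothetical observed data

def prior(s):
    """P_xi(S): independent bits, each equal to 1 with probability Q."""
    p = 1.0
    for bit in s:
        p *= Q if bit == 1 else 1.0 - Q
    return p

def penetrance(y, s):
    """P_xi(Y | S): each observed bit matches its latent bit with probability 1 - ERR."""
    p = 1.0
    for yk, sk in zip(y, s):
        p *= (1.0 - ERR) if yk == sk else ERR
    return p

def exact_likelihood(y):
    """L(xi) by summing P(Y | S) P(S) over all latent configurations."""
    return sum(penetrance(y, s) * prior(s) for s in product((0, 1), repeat=len(y)))

def importance_estimate(y, n_draws, rng, p_star=0.5):
    """Average of P_xi(S, Y) / P*(S) over draws S ~ P*, with P* independent bits."""
    total = 0.0
    for _ in range(n_draws):
        s = tuple(1 if rng.random() < p_star else 0 for _ in y)
        proposal = 1.0
        for bit in s:
            proposal *= p_star if bit == 1 else 1.0 - p_star
        total += penetrance(y, s) * prior(s) / proposal
    return total / n_draws
```

Calling `importance_estimate(Y, 20000, random.Random(1))` and comparing with `exact_likelihood(Y)` shows the estimate settling near the exact sum. The proposal here ignores the data, which is exactly why such a choice is inefficient on real pedigrees.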
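
SEQUENTIAL IMPUTATION OVER LOCI
[Figure: array of meioses $i$ by loci $j$, with data $Y_{\bullet,j}$ at locus $j$]
Choose the sampling distribution:
$$P^*(S_{\bullet,j}) = P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = P_{\xi_0}(S_{\bullet,j} \mid S_{\bullet,1}, \ldots, S_{\bullet,j-1},\ Y_{\bullet,1}, \ldots, Y_{\bullet,j-1}, Y_{\bullet,j}) = P_{\xi_0}(S_{\bullet,j} \mid S_{\bullet,j-1}, Y_{\bullet,j})$$
Now:
$$P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = \frac{P_{\xi_0}(S_{\bullet,j}, Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})}{P_{\xi_0}(Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})} = \frac{P_{\xi_0}(S_{\bullet,j}, Y_{\bullet,j} \mid S^{(j-1)}, Y^{(j-1)})}{w_j}$$
where, by pedigree-peeling, we can compute
$$w_j = P_{\xi_0}(Y_{\bullet,j} \mid Y^{(j-1)}, S^{(j-1)}) = P_{\xi_0}(Y_{\bullet,j} \mid S_{\bullet,j-1}).$$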
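
MONTE CARLO LIKELIHOOD ESTIMATE
Thus the sequential imputation distribution is
$$P^*(S) = \prod_{j=1}^{L} P_{\xi_0}(S_{\bullet,j} \mid S^{(j-1)}, Y^{(j)}) = \frac{P_{\xi_0}(S, Y)}{W_L(S)} \quad\text{where}\quad W_L(S) = \prod_{j=1}^{L} w_j.$$
Now
$$L(\xi_0) = P_{\xi_0}(Y) = E_{P^*}\!\left( \frac{P_{\xi_0}(S, Y)}{P^*(S)} \right) = E_{P^*}\bigl(W_L(S)\bigr).$$
Given $N$ realizations $S^{(\tau)}$, the estimate of $L(\xi_0)$ is $N^{-1} \sum_{\tau} W_L(S^{(\tau)})$.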
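
Sequential imputation can be sketched on a deliberately reduced problem: a single meiosis, so that $S_{\bullet,j}$ is one bit per locus and $w_j$ is a two-term sum rather than a pedigree-peeling computation. The recombination fraction, the per-locus values `emis[j][s]` standing in for $P(Y_{\bullet,j} \mid S_{\bullet,j} = s)$, and all function names are assumptions made for illustration.

```python
import random

def transition(prev, cur, rho):
    """P(S_j = cur | S_{j-1} = prev) with recombination fraction rho."""
    return rho if prev != cur else 1.0 - rho

def exact_likelihood(emis, rho):
    """P(Y) by the forward recursion over loci; emis[j][s] plays the role of P(Y_j | S_j = s)."""
    alpha = [0.5 * emis[0][s] for s in (0, 1)]
    for e in emis[1:]:
        alpha = [e[s] * sum(alpha[p] * transition(p, s, rho) for p in (0, 1))
                 for s in (0, 1)]
    return sum(alpha)

def sequential_imputation(emis, rho, n_draws, rng):
    """Average of W_L = prod_j w_j over n_draws sequentially imputed realizations of S."""
    total = 0.0
    for _ in range(n_draws):
        weight = 1.0
        prev = None
        for e in emis:
            if prev is None:
                joint = [0.5 * e[s] for s in (0, 1)]                      # first locus
            else:
                joint = [transition(prev, s, rho) * e[s] for s in (0, 1)]
            w_j = joint[0] + joint[1]          # w_j = P(Y_j | S_{j-1})
            weight *= w_j
            # draw S_j from P(S_j | S_{j-1}, Y_j), which is proportional to joint
            prev = 0 if rng.random() * w_j < joint[0] else 1
        total += weight
    return total / n_draws

# Hypothetical per-locus data likelihoods for five loci.
EMIS = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8), (0.9, 0.1), (0.6, 0.4)]
```

In this toy case `exact_likelihood` is available, so the Monte Carlo average from `sequential_imputation(EMIS, 0.1, 20000, random.Random(2))` can be compared against it directly. On a real pedigree only the weights are computable.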
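
THE IDEAL SAMPLING DISTRIBUTION
We want $P^*(S)$ close to proportional to $P_{\xi_0}(Y, S)$, that is, $P^*(S) \approx P_{\xi_0}(S \mid Y)$.
Of course we cannot achieve this, else Monte Carlo would be unnecessary.
Suppose we use MCMC to sample $S$ from $P_{\xi_0}(S \mid Y)$.
$$\begin{aligned}
P_\xi(Y) &= \sum_S P_\xi(Y, S) = \sum_S \frac{P_\xi(Y, S)}{P_{\xi_0}(S \mid Y)}\, P_{\xi_0}(S \mid Y) \\
&= E_{\xi_0}\!\left( \frac{P_\xi(Y, S)}{P_{\xi_0}(S \mid Y)} \,\Big|\, Y \right) \\
&= P_{\xi_0}(Y)\; E_{\xi_0}\!\left( \frac{P_\xi(Y, S)}{P_{\xi_0}(Y, S)} \,\Big|\, Y \right)
\end{aligned}$$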
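
LIKELIHOOD RATIO ESTIMATION
Thus we have
$$\frac{L(\xi)}{L(\xi_0)} = \frac{P_\xi(Y)}{P_{\xi_0}(Y)} = E_{\xi_0}\!\left( \frac{P_\xi(Y, S)}{P_{\xi_0}(Y, S)} \,\Big|\, Y \right)$$
$S$ is the random variable, $Y$ is fixed: $S \sim P_{\xi_0}(\cdot \mid Y)$.
If $S^{(\tau)}$, $\tau = 1, \ldots, N$, are realized from $P_{\xi_0}(\cdot \mid Y)$ then the likelihood ratio can be estimated by
$$\frac{1}{N} \sum_{\tau=1}^{N} \frac{P_\xi(Y, S^{(\tau)})}{P_{\xi_0}(Y, S^{(\tau)})}$$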
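
The estimator above can be tried on a small chain where the exact ratio is computable by enumeration. This is a sketch under strong simplifying assumptions: one meiosis, five loci, hypothetical data likelihoods `EMIS`, the two parameter values differing only in the recombination fraction, and a plain single-site Gibbs sampler standing in for the pedigree MCMC samplers. All names are illustrative.

```python
import random
from itertools import product

def trans(a, b, rho):
    """Transition probability between adjacent loci with recombination fraction rho."""
    return rho if a != b else 1.0 - rho

def joint(s, emis, rho):
    """P_rho(Y, S) for one meiosis; emis[j][v] plays the role of P(Y_j | S_j = v)."""
    p = 0.5 * emis[0][s[0]]
    for j in range(1, len(s)):
        p *= trans(s[j - 1], s[j], rho) * emis[j][s[j]]
    return p

def exact_likelihood(emis, rho):
    """P_rho(Y) by brute-force enumeration of S (feasible only for tiny examples)."""
    return sum(joint(s, emis, rho) for s in product((0, 1), repeat=len(emis)))

def gibbs_sweep(s, emis, rho0, rng):
    """One single-site Gibbs sweep targeting P_rho0(S | Y); updates s in place."""
    n = len(s)
    for j in range(n):
        p = []
        for v in (0, 1):
            q = emis[j][v]
            if j > 0:
                q *= trans(s[j - 1], v, rho0)
            if j < n - 1:
                q *= trans(v, s[j + 1], rho0)
            p.append(q)
        s[j] = 0 if rng.random() * (p[0] + p[1]) < p[0] else 1

def lr_estimate(emis, rho, rho0, n_sweeps, rng, burn_in=500):
    """Estimate L(rho) / L(rho0) by averaging joint ratios over realizations from rho0."""
    s = [0] * len(emis)
    total = 0.0
    for t in range(burn_in + n_sweeps):
        gibbs_sweep(s, emis, rho0, rng)
        if t >= burn_in:
            total += joint(s, emis, rho) / joint(s, emis, rho0)
    return total / n_sweeps

# Hypothetical per-locus data likelihoods for five loci.
EMIS = [(0.9, 0.1), (0.5, 0.5), (0.2, 0.8), (0.9, 0.1), (0.6, 0.4)]
```

With `rho` close to `rho0` the ratios stay near 1 and the average is stable. Moving `rho` far from `rho0` makes a few realizations dominate the sum, which is the practical reason the estimator is described as local.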
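
LINKAGE LOCATION LIKELIHOOD RATIO
The form for the linkage lod that follows directly from this is
$$\frac{L(\beta, \gamma_1, \Lambda_M)}{L(\beta, \gamma_0, \Lambda_M)} = E_{\xi_0}\!\left( \frac{P_{\xi_1}(Y_T, Y_M, S_T, S_M)}{P_{\xi_0}(Y_T, Y_M, S_T, S_M)} \,\Big|\, Y_T, Y_M \right)$$
for two hypothesized trait locus positions $\gamma_1$ and $\gamma_0$.
Now
$$P_\xi(Y, S) = P_\beta(Y_T \mid S_T)\; P_{\Lambda_M}(Y_M, S_M)\; P_\gamma(S_T \mid S_M)$$
so the ratio reduces to
$$\frac{L(\beta, \gamma_1, \Lambda_M)}{L(\beta, \gamma_0, \Lambda_M)} = E_{\xi_0}\!\left( \frac{P_{\gamma_1}(S_T \mid S_M)}{P_{\gamma_0}(S_T \mid S_M)} \,\Big|\, Y_T, Y_M \right)$$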
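
LOCAL ESTIMATE IS VERY SIMPLE: GLOBAL IS HARD
[Figure: meiosis $i$ across loci, with the trait locus $T$ between its left flanking marker $l$ and right flanking marker $r$]
$$\frac{P_{\gamma_1}(S_T \mid S_M)}{P_{\gamma_0}(S_T \mid S_M)} = \prod_i \left( \frac{\rho_{1l}}{\rho_{0l}} \right)^{|S_{i,T} - S_{i,l}|} \left( \frac{1 - \rho_{1l}}{1 - \rho_{0l}} \right)^{1 - |S_{i,T} - S_{i,l}|} \left( \frac{\rho_{1r}}{\rho_{0r}} \right)^{|S_{i,T} - S_{i,r}|} \left( \frac{1 - \rho_{1r}}{1 - \rho_{0r}} \right)^{1 - |S_{i,T} - S_{i,r}|}$$
The above works well only for $\gamma_1 \approx \gamma_0$, and for $\gamma_0$, $\gamma_1$ with the same $l$ and $r$.
When likelihoods are not smooth, combining LR estimates does not work well, especially across markers.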
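
For one realization of the meiosis indicators, the local ratio is a product of simple factors over meioses. The sketch below assumes the Haldane map, positions in Morgans, and both trait positions lying strictly between the same flanking markers; the function and argument names are hypothetical.

```python
import math

def haldane(d):
    """Haldane recombination fraction for distance d in Morgans."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def flank_rhos(gamma, left, right):
    """Recombination fractions from a trait position to its left and right flanking markers."""
    return haldane(gamma - left), haldane(right - gamma)

def local_lr(s_trait, s_left, s_right, gamma1, gamma0, left, right):
    """P_gamma1(S_T | S_M) / P_gamma0(S_T | S_M) for one realization.

    s_trait, s_left, s_right hold, per meiosis, the 0/1 inheritance indicators at the
    trait locus and at the two flanking markers.
    """
    r1l, r1r = flank_rhos(gamma1, left, right)
    r0l, r0r = flank_rhos(gamma0, left, right)
    ratio = 1.0
    for st, sl, sr in zip(s_trait, s_left, s_right):
        ratio *= (r1l / r0l) if st != sl else (1.0 - r1l) / (1.0 - r0l)
        ratio *= (r1r / r0r) if st != sr else (1.0 - r1r) / (1.0 - r0r)
    return ratio
```

The term $P(S_{i,r} \mid S_{i,l})$ that normalizes each conditional probability does not appear because, with $l$ and $r$ fixed, it is the same under $\gamma_0$ and $\gamma_1$ and cancels. That cancellation fails once the two positions fall in different marker intervals.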
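
AN MCMC IMPORTANCE SAMPLING ESTIMATE
Lange and Sobel (1996) write the likelihood in the form
$$\begin{aligned}
L(\beta, \gamma, \Lambda_M) &= P_{\beta,\gamma,\Lambda_M}(Y_M, Y_T) \;\propto\; P_{\beta,\gamma,\Lambda_M}(Y_T \mid Y_M) \\
&= \sum_{S_M} P_{\beta,\gamma}(Y_T \mid S_M)\, P_{\Lambda_M}(S_M \mid Y_M) \\
&= E_{\Lambda_M}\bigl( P_{\beta,\gamma}(Y_T \mid S_M) \mid Y_M \bigr).
\end{aligned}$$
Sample $S_M$ given $Y_M$ and compute $P_{\beta,\gamma}(Y_T \mid S_M)$: a form of Rao-Blackwellization (integrate over $S_T$).
Also importance sampling: maybe $P(S_M \mid Y_M) \approx P(S_M \mid Y_M, Y_T)$.
For fuzzy traits it works quite well.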
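
METROPOLIS HASTINGS FOR INTERFERENCE
Suppose we have an interference model $P^{(I)}(S)$ in place of the Haldane model $P^{(H)}(S)$ we have used so far.
Use the block-Gibbs update of meiosis $i$ ($S_{i,\bullet}$) to propose $S^*$.
The Hastings ratio for current $S$ and proposed $S^*$ is
$$\begin{aligned}
h(S^*; S) &= \frac{P^{(I)}(S^*, Y)}{P^{(I)}(S, Y)} \cdot \frac{P^{(H)}(S_{i,\bullet} \mid S_{k,\bullet},\, k \neq i,\, Y)}{P^{(H)}(S^*_{i,\bullet} \mid S_{k,\bullet},\, k \neq i,\, Y)} \\
&= \frac{P^{(I)}(S^*, Y)\, P^{(H)}(S, Y)}{P^{(I)}(S, Y)\, P^{(H)}(S^*, Y)} \\
&= \frac{P(Y \mid S^*)\, P^{(I)}(S^*)\, P(Y \mid S)\, P^{(H)}(S)}{P(Y \mid S)\, P^{(I)}(S)\, P(Y \mid S^*)\, P^{(H)}(S^*)}
\end{aligned}$$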
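
INTERFERENCE ctd.
$$h(S^*; S) = \prod_{k=1}^{m} \frac{P^{(I)}(S^*_{k,\bullet})\, P^{(H)}(S_{k,\bullet})}{P^{(I)}(S_{k,\bullet})\, P^{(H)}(S^*_{k,\bullet})} = \frac{P^{(I)}(S^*_{i,\bullet})}{P^{(I)}(S_{i,\bullet})} \cdot \frac{P^{(H)}(S_{i,\bullet})}{P^{(H)}(S^*_{i,\bullet})}.$$
Writing $S'$ for the next state of the chain: $\Pr(S' = S^*) = a = \min(1, h)$ and $\Pr(S' = S) = 1 - a$.
Question: better to sample under H and reweight, or use M-H to sample under model I?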
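
The accept/reject step needs only the two model probabilities for the single updated meiosis. The sketch below assumes the proposal $S^*_{i,\bullet}$ has already been drawn by the Haldane-model block-Gibbs update, which is not implemented here. The slides do not specify an interference model, so `toy_interference_logprob` is a purely hypothetical, unnormalized stand-in that downweights recombinations in adjacent intervals; the normalizing constant cancels in $h$.

```python
import math
import random

def haldane_logprob(s_row, rhos):
    """log P^(H)(S_i) for one meiosis: independent recombinations between adjacent loci."""
    lp = math.log(0.5)
    for j, rho in enumerate(rhos):
        lp += math.log(rho if s_row[j] != s_row[j + 1] else 1.0 - rho)
    return lp

def toy_interference_logprob(s_row, rhos, penalty=0.2):
    """Hypothetical unnormalized log P^(I)(S_i): Haldane probability times `penalty`
    for every pair of recombinations that fall in adjacent intervals."""
    lp = haldane_logprob(s_row, rhos)
    rec = [s_row[j] != s_row[j + 1] for j in range(len(rhos))]
    for j in range(len(rec) - 1):
        if rec[j] and rec[j + 1]:
            lp += math.log(penalty)
    return lp

def hastings_ratio(s_cur, s_prop, rhos, interference_logprob):
    """h(S*; S) = [P^(I)(S*_i) / P^(I)(S_i)] * [P^(H)(S_i) / P^(H)(S*_i)]."""
    log_h = (interference_logprob(s_prop, rhos) - interference_logprob(s_cur, rhos)
             + haldane_logprob(s_cur, rhos) - haldane_logprob(s_prop, rhos))
    return math.exp(log_h)

def mh_step(s_cur, s_prop, rhos, interference_logprob, rng):
    """Accept the proposal with probability a = min(1, h); return (next state, accepted)."""
    a = min(1.0, hastings_ratio(s_cur, s_prop, rhos, interference_logprob))
    if rng.random() < a:
        return s_prop, True
    return s_cur, False
```

If the interference model coincides with the Haldane model, $h = 1$ and every proposal is accepted, so the step reduces to the ordinary block-Gibbs update.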