Stationary Distribution of the Linkage Disequilibrium Coefficient r 2

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2 Wei Zhang, Jing Liu, Rachel Fewster and Jesse Goodman Department of Statistics, The University of Auckland December 1, 2015

Overview 1 Introduction 2 Background 3 Methodology 4 Results 5 Acknowledgements 6 References

Introduction 1 Linkage disequilibrium (LD) indicates the statistical dependence of alleles at different loci in population genetics. Introduction December 1, 2015 3 / 35

Introduction 1 Linkage disequilibrium (LD) indicates the statistical dependence of alleles at different loci in population genetics. LD can be used to find genetic markers that might be linked to genes for diseases and understand the evolutionary history. Introduction December 1, 2015 3 / 35

Introduction 1 Linkage disequilibrium (LD) indicates the statistical dependence of alleles at different loci in population genetics. 2 r 2 is a quantitative measure of LD and we are aiming to find its stationary distribution under models for genetic drift. 3 Given some moments of the unknown distribution, the maximum entropy (Maxent) principle can be used to approximate the density function of r 2. Introduction December 1, 2015 4 / 35

Background The TLD model (Liu, 2012) Diffusion approximation Maximum entropy principle Background December 1, 2015 5 / 35

Some terminologies in genetics 1 One locus is a position of a gene or significant DNA sequence on a chromosome. 2 An allele is a variant form of a gene. 3 Diploid describes a cell or an organism that has paired chromosomes, one from each parent. 4 A mutation is a permanent change in the DNA sequence. 5 Recombination is the production of offspring with combinations of traits that differ from those found in either parent. Background The TLD model December 1, 2015 6 / 35

Some terminologies in genetics Background The TLD model December 1, 2015 7 / 35

Some terminologies in genetics Background The TLD model December 1, 2015 8 / 35

The TLD model TLD is short for two-locus diallelic model with mutation and recombination. Background The TLD model December 1, 2015 9 / 35

The TLD model TLD is short for two-locus diallelic model with mutation and recombination. Model assumptions and notations: The two possible alleles on each of the two loci are assumed to be A 1, A 2 and B 1, B 2, thus the four possible types of gamete are A 1 B 1, A 1 B 2, A 2 B 1 and A 2 B 2. Recombination rate C: A i B j + A m B n A i B n /A m B j. Equal mutation rate µ for both loci: A 1 A 2 and B 1 B 2. Background The TLD model December 1, 2015 9 / 35

The TLD model Background The TLD model December 1, 2015 10 / 35

The TLD model Table 1: The proportions of gametes in generation T and the expected proportions in generation T + 1 in the population. N is the population size. Generation Gamete A 1 B 1 A 1 B 2 A 2 B 1 A 2 B 2 T x 1 2N x 2 2N x 3 2N x 4 2N T + 1 φ 1 φ 2 φ 3 φ 4 Background The TLD model December 1, 2015 11 / 35

The TLD model Suppose We have φ 1 (x) = x 1 2N (1 µ)2 + φ 2 (x) = x 2 2N (1 µ)2 + φ 3 (x) = x 3 2N (1 µ)2 + φ 4 (x) = x 4 2N (1 µ)2 + D(x) = x 1 2N ( x2 2N + x 3 2N ( x1 2N + x 4 2N ( x1 2N + x 4 2N ( x2 2N + x 3 2N x 4 2N x 2 2N x 3 2N. ) µ(1 µ) + x 4 ) µ(1 µ) + x 3 ) µ(1 µ) + x 2 2N µ2 CD(x)(1 2µ) 2 2N µ2 + CD(x)(1 2µ) 2 2N µ2 + CD(x)(1 2µ) 2 ) µ(1 µ) + x 1 2N µ2 CD(x)(1 2µ) 2. Background The TLD model December 1, 2015 12 / 35

The TLD model The transition probability of going from x = (x 1, x 2, x 3, x 4 ) to y = (y 1, y 2, y 3, y 4 ) is: p xy = P (y x) = (2N)! y 1!y 2!y 3!y 4! (φ 1 (x)) y 1 (φ 2 (x)) y 2 (φ 3 (x)) y 3 (φ 4 (x)) y 4. Background The TLD model December 1, 2015 13 / 35

The TLD model The transition probability of going from x = (x 1, x 2, x 3, x 4 ) to y = (y 1, y 2, y 3, y 4 ) is: p xy = P (y x) = (2N)! y 1!y 2!y 3!y 4! (φ 1 (x)) y 1 (φ 2 (x)) y 2 (φ 3 (x)) y 3 (φ 4 (x)) y 4. The TLD model is an irreducible aperiodic Markov chain, thus there exists a unique stationary distribution. Background The TLD model December 1, 2015 13 / 35

Diffusion approximation The main idea is to rescale the discrete state space and time space by a factor related to the population size N, so that the gap between two successive states in the new space is infinitesimal when N is large enough. Example State space {0, 1, 2,, 2N} (2N) 1 {0, 1 2N, 2 2N,, 1}. Time space {0, 1, 2, } (2N) 1 {0δt, 1δt, 2δt, }, where δt = 1 2N. Background Diffusion approximation December 1, 2015 14 / 35

Diffusion approximation When N is large enough, the new chain is approximately continuous. In the derivation process, Taylor series expansion and the definition of derivative are used to get two important results for the TLD model. f f(x+ x) f(x) (x) = lim x 0 x Background Diffusion approximation December 1, 2015 15 / 35

The diffusion operator and master equation The diffusion operator for the TLD model is L = 1 2 p (1 p) 2 p 2 + 1 2 q (1 q) 2 q 2 + 1 { p (1 p) q (1 q) + D (1 2p) (1 2q) D 2 } 2 2 D 2 + D 2 p q + D (1 2p) 2 p D + D (1 2q) 2 q D + θ 4 (1 2p) p + θ 4 (1 2q) q ( D 1 + ρ ) 2 + θ D and the master equation at stationarity is E {Lf (p, q, D)} = 0. The master equation means the expected evolution over time of any nice function of p, q and D is zero at stationarity. Background Diffusion approximation December 1, 2015 16 / 35

The diffusion operator and master equation The diffusion operator for the TLD model is L = 1 2 p (1 p) 2 p 2 + 1 2 q (1 q) 2 q 2 + 1 { p (1 p) q (1 q) + D (1 2p) (1 2q) D 2 } 2 2 D 2 + D 2 p q + D (1 2p) 2 p D + D (1 2q) 2 q D + θ 4 (1 2p) p + θ 4 (1 2q) q ( D 1 + ρ ) 2 + θ D and the master equation at stationarity is E {Lf (p, q, D)} = 0. Here p and q are the frequencies of A 1 and B 1, D = f 11 pq, f 11 is the frequency of gamete A 1 B 1, ρ = 2NC, θ = 2Nµ and f is any twice continuously differentiable function with compact support. Background Diffusion approximation December 1, 2015 17 / 35

A simple example If letting f (p, q, D) = D, we can get that Lf (p, q, D) = D (1 + ρ ) 2 + θ and so E {Lf (p, q, D)} = (1 + ρ ) 2 + θ E (D) = 0 E (D) = 0. Background Diffusion approximation December 1, 2015 18 / 35

Maximum entropy principle Definition Entropy is the quantitative measure of disorder in a system. Suppose a random variable Z has K possible outcomes with probabilities p 1, p 2, p 3,, p K, the entropy is: I (Z) = K p i log K (p i ). i=1 Background Maximum entropy principle December 1, 2015 19 / 35

Maximum entropy principle Definition Entropy is the quantitative measure of disorder in a system. Suppose a random variable Z has K possible outcomes with probabilities p 1, p 2, p 3,, p K, the entropy is: I (Z) = K p i log K (p i ). i=1 The maximum entropy principle states that the solution that maximises the entropy is the most honest one. Background Maximum entropy principle December 1, 2015 19 / 35

An example In a coin toss experiment, suppose Pr(head)=p and Pr(tail)=1 p, then the entropy is: I = p log 2 (p) (1 p) log 2 (1 p). Background Maximum entropy principle December 1, 2015 20 / 35

An example Relationship of the entropy and p Entropy 0.2 0.4 0.6 0.8 1.0 (0.5, 1.0) 0.0 0.2 0.4 0.6 0.8 1.0 p Background Maximum entropy principle December 1, 2015 21 / 35

Maximum entropy principle The Maxent solution of an unknown probability density function π (p) given knowledge of n moments m i = E ( p i), i = 1,, n is the solution π n (p) that maximizes: I = π n (p) log { π n (p)} dp subject to m i = where Ω is the support of π. Ω Ω p i π n (p) dp for i = 0, 1,, n Background Maximum entropy principle December 1, 2015 22 / 35

Maximum entropy principle Considering the Lagrange function and Euler-Lagrange equation, then the Maxent solution is: π n (p) = exp ( λ 0 + λ 1 p + λ 2 p 2 + + λ n p n) where λ i, i = 0,, n are the solutions of { arg min λ exp ( λ 0 + λ 1 p + λ 2 p 2 + + λ n p n) dp Ω } n λ i m i. i=0 Background Maximum entropy principle December 1, 2015 23 / 35

Linkage disequilibrium coefficient r 2 The definition of r 2 is: r 2 = D 2 p (1 p) q (1 q) where p and q are the frequencies of A 1 and B 1, D = f 11 pq and f 11 is the frequency of gamete A 1 B 1. Methodology Analytic computation of the moments December 1, 2015 24 / 35

Reformulation of the problem Let u = 1 2p and v = 1 2q. Then the diffusion generator can be rewritten as L = 1 2 ( 1 u 2 ) 2 u 2 + 1 2 ( 1 v 2 ) 2 v 2 + 1 2 + 4D 2 u v 2Du 2 D u 2Dv 2 D v 1 2 θu { 1 ( 1 u 2 ) ( 1 v 2) } + Duv D 2 2 16 D 2 u 1 2 θv ( v D 1 + 1 2 ρ + θ ) D. This reparameterization yields r 2 = D 2 p (1 p) q (1 q) = 16D 2 (1 u 2 ) (1 v 2 ). Methodology Analytic computation of the moments December 1, 2015 25 / 35

Analytic computation of the moments Note that when 0 u 2, v 2 < 1: 1 1 u 2 = k=0 u 2k and 1 1 v 2 = v 2l. l=0 Considering the new form of r 2, it follows that when M = 1, 2 E (r ) 2M = 16 M k 1 =0 k M =0 l 1 =0 which can be simplified to E ( r 2M) = 16 M K=0 L=0 l M =0 { } E D 2M u 2(k 1+k 2 + +k M ) v 2(l 1+l 2 + +l M ), ( )( K + M 1 L + M 1 M 1 M 1 ) E ( D 2M u 2K v 2L). Methodology Analytic computation of the moments December 1, 2015 26 / 35

Analytic computation of the moments Our problem now is how to compute E ( D 2M u 2K v 2L) for all possible M, K and L. Step 1: Let f in the master equation be some specific forms of u n, uv, u 2 v and Du n, we can get the results of E (u n ), E (uv), E ( u 2 v ) etc. Step 2: Given a value of (m, n), apply the function f = D k u m+2 k v n+2 k into the master equation with k {0, 1, 2,, n + 2}. A system of n + 3 linear equations is generated, whose solutions are: E ( u m+2 v n+2), E ( Du m+1 v n+1), E ( D 2 u m v n) Methodology Analytic computation of the moments December 1, 2015 27 / 35

Analytic computation of the moments Given a M N and a truncation level l max N, we use E ( r 2M ) l max = 16 M 2K+2L=l max K,L 0 ( )( ) K + M 1 L + M 1 E ( D 2M u 2K v 2L) M 1 M 1 to approximate the M-th moment of r 2. Methodology Analytic computation of the moments December 1, 2015 28 / 35

Variance of r 2 Table 2: V ( r 2) computed by our method (l max = 700) ρ θ 0.0125 0.0500 0.1000 0.2500 0.7500 1.2500 0.00 0.006119 0.03109 0.03806 0.02433 0.00685 0.003239 0.25 0.003412 0.01950 0.02614 0.01891 0.00608 0.002928 0.50 0.002271 0.01372 0.01932 0.01517 0.00538 0.002739 1.25 0.001062 0.00666 0.00991 0.00892 0.00403 0.002197 2.50 0.000526 0.00327 0.00487 0.00476 0.00263 0.001590 5.00 0.000244 0.00142 0.00206 0.00202 0.00141 0.000961 Results Variance of r 2 December 1, 2015 29 / 35

Variance of r 2 V(r 2 ) 0.005 0.015 0.025 Variance of r 2 with ρ=0.25 Liu (2012) Analytic method 0.0 0.2 0.4 0.6 0.8 1.0 1.2 θ V(r 2 ) 0.001 0.003 0.005 Variance of r 2 with ρ=2.50 Liu (2012) Analytic method 0.0 0.2 0.4 0.6 0.8 1.0 1.2 θ Figure 1: Comparison of V ( r 2) between our analytic method and the method in Liu(2012) Results Variance of r 2 December 1, 2015 30 / 35

Probability density function of r 2 We can compute 50 moments in 2.5 hours with l max = 2000 on a laptop. Use n = 18 moments to calculate Maxent π ( r 2). Compare moments 19, 20,..., 50 using π ( r 2) vs analytic method. 2 π~ ) 0 50 100 n(r 150 200 250 300 Maxent density of r 2 using moments up to order n ( θ = 0.125 ; ρ = 5 ) ( No.Nodes = 1000 ) n = 18 Moment 5.0e 07 1.5e 06 2.5e 06 Analytic moments compared with the results of Maxent ( θ = 0.125 ; ρ = 5 ) ( No.Nodes = 1000 ; No.Moments = 18 ) Analytic Maxent 0.00 0.02 0.04 0.06 0.08 0.10 r 2 20 25 30 35 40 45 50 Order of moment Results Probability distribution of r 2 December 1, 2015 31 / 35

Probability density function of r 2 2 π~ ) 0 2000 n(r 4000 6000 8000 Maxent density of r 2 using moments up to order n ( θ = 0.0125 ; ρ = 0 ) ( No.Nodes = 1000 ) n = 16 Moment 0.0036 0.0038 0.0040 0.0042 Analytic moments compared with the results of Maxent ( θ = 0.0125 ; ρ = 0 ) ( No.Nodes = 1000 ; No.Moments = 16 ) Analytic Maxent 0e+00 2e 04 4e 04 6e 04 8e 04 1e 03 r 2 20 25 30 35 40 45 50 Order of moment Figure 2: Stationary density functions of r 2 for two pairs of θ and ρ approximated by the numerical univariate Maxent method. Results Probability distribution of r 2 December 1, 2015 32 / 35

Acknowledgements I would like to sincerely and gratefully thank my supervisor Rachel Fewster for her guidance, Dr. Jesse Goodman and Dr. Jing Liu for their useful ideas. Acknowledgements December 1, 2015 33 / 35

References Y. S. Song and J. S. Song, Analytic computation of the expectation of the linkage disequilibrium coefficient r 2, Theoretical population biology, vol. 71, no. 1, pp. 49 60, 2007. J. Liu, Reconstruction of probability distributions in population genetics. PhD thesis, The University of Auckland, 2012. X. Wu, Calculation of maximum entropy densities with application to income distribution, Journal of Econometrics, vol. 115, no. 2, pp. 347 354, 2003. M. Slatkin, Linkage disequilibrium understanding the evolutionary past and mapping the medical future, Nature Reviews Genetics, vol. 9, no. 6, pp. 477 485, 2008. W. G. Hill and A. Robertson, Linkage disequilibrium in finite populations, Theoretical and Applied Genetics, vol. 38, no. 6, pp. 226 231, 1968. L. R. Mead and N. Papanicolaou, Maximum entropy in the problem of moments, Journal of Mathematical Physics, vol. 25, no. 8, pp. 2404 2417, 1984. References December 1, 2015 34 / 35

Thanks!