Closed-form sampling formulas for the coalescent with recombination

Size: px

Start display at page:

Download "Closed-form sampling formulas for the coalescent with recombination"

Paula Warner
5 years ago
Views:

1 0 / 21 Closed-form sampling formulas for the coalescent with recombination Yun S. Song CS Division and Department of Statistics University of California, Berkeley September 7, 2009 Joint work with Paul Jenkins

2 0 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

3 1 / 21 Motivation Sample 1 = AACTAGG...CCGTGACC...ACAGCTAT Sample 2 = AACTAGG...CCGTAACC...ACAGCTAT Sample 3 = AACTGGG...CCGTGACC...ACAGCTAT Sample 4 = AACTGGG...CCGTAACC...ACAGTTAT Sample 5 = AACTAGG...CCGTGACC...ACAGTTAT The basic problem What is the probability of observing a sample of DNA sequences for a given population genetics model? This is behind many problems in population genetics, e.g. Estimating parameters: L(θ, ρ) = P(D θ, ρ) Ancestral inference Detecting departures from neutrality

4 1 / 21 Motivation Sample 1 = AACTAGG...CCGTGACC...ACAGCTAT Sample 2 = AACTAGG...CCGTAACC...ACAGCTAT Sample 3 = AACTGGG...CCGTGACC...ACAGCTAT Sample 4 = AACTGGG...CCGTAACC...ACAGTTAT Sample 5 = AACTAGG...CCGTGACC...ACAGTTAT The basic problem What is the probability of observing a sample of DNA sequences for a given population genetics model? But, except in the one-locus case with a special model of mutation, closed-form sampling formulas are generally unknown. When recombination is involved, obtaining an analytic formula for the sampling distribution has so far remained an intractable problem.

5 2 / 21 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms

6 2 / 21 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination

7 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. 2 / 21

8 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. However, we in fact expect the dynamics to be easier to study for large recombination rates, since the loci under consideration would then be less dependent. 2 / 21

9 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. However, we in fact expect the dynamics to be easier to study for large recombination rates, since the loci under consideration would then be less dependent. It seems plausible that there exists a stochastic process simpler than the standard coalescent with recombination, that describes the dynamics of the relevant degrees of freedom for large recombination rates. 2 / 21

10 3 / 21 Motivation Closed-form sampling formula for large recombination rates?

11 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate 3 / 21

12 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n 3 / 21

13 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n Asymptotic Series As ρ, find q(n ρ) = q 0 (n)+ q 1(n) + q ( ) 2(n) 1 ρ ρ 2 +O ρ 3, where q 0 (n), q 1 (n), q 2 (n),... are independent of ρ. 3 / 21

14 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n θ A, θ B, the population-scaled mutation rates P A = (Pij A), P B = (Pij A ), transition matrices for the mutation process in the case of a finite-alleles model Asymptotic Series As ρ, find q(n ρ) = q 0 (n)+ q 1(n) + q ( ) 2(n) 1 ρ ρ 2 +O ρ 3, where q 0 (n), q 1 (n), q 2 (n),... are independent of ρ. 3 / 21

15 3 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

16 Closed-form one-locus sampling formulas n = (n 1,..., n K ), where n i denotes the number of gametes with allele i at the locus. q(n), probability of an ordered sample with configuration n. (x) n := x(x + 1) (x + n 1) Infinite-alleles model Ewens sampling formula (1972): q ESF (n) = θk (θ) n K (n i 1)! i=1 Finite-alleles Parent-Independent Mutation (PIM) model Mutation transition matrix satisfies P ij = P j. Wright s sampling formula (1949): q WSF (n) = 1 (θ) n K (θp i ) ni Any diallelic recurrent mutation model can be transformed into a PIM model. i=1 4 / 21

17 Two-locus sample configuration Two-locus sample configuration n = (a, b, c) a = (a i ), K -dim vector b = (b j ), L-dim vector c = (c ij ), K -by-l matrix A sample of 10 gametes c = (c ij ), a K -by-l matrix j {1,..., L} i {1,..., K } Sample size is c = c = / 21

18 6 / 21 Two-locus sample configuration Time Coalescence Mutation Recombination a = (a i ) i {1,...,K }, multiplicity of left half-fragments. b = (b j ) j {1,...,L}, multiplicity of right half-fragments. The complete sampling distribution is then q(a, b, c).

19 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3

20 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c

21 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c A = (c i ) c B = (c j )

22 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c A = (c i ) c B = (c j ) add to a to get add to b to get q A (a + c A ) q B (b + c B ).

23 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B )

24 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B ) Finite-alleles PIM: q 0 (a, b, c) = q A WSF(a + c A )q B WSF(b + c B )

25 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B ) Finite-alleles PIM: q 0 (a, b, c) = q A WSF(a + c A )q B WSF(b + c B ) In fact, for any model of mutation, q 0 (a, b, c) = q A (a + c A )q B (b + c B ) where q A, q B are (possibly unknown) one-locus sampling distributions.

26 8 / 21 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Universality For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B )

27 8 / 21 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Universality For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B ) Theorem 1 q 1 (n) also exhibits universality: q 1 (a, b, c) = G(q A, q B ; a, b, c)

28 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Universality + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B ) Theorem 1 q 1 (n) also exhibits universality: q 1 (a, b, c) = G(q A, q B ; a, b, c) q 1 (a, b, c) = ( c 2) q B (b + c B ) K q A (a + c A ) L + K L ( cij i=1 j=1 2 q A (a + c A )q B (b + c B ) ( ci i=1 ( 2 c j j=1 2 ) ) q A (a + c A e i ) ) q B (b + c B e j ) q A (a + c A e i )q B (b + c B e j ), where e i is a unit vector with a 1 at entry i and 0 otherwise. 8 / 21

29 9 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Derivation of q 1 (a, b, c) 1 The full sampling distribution q(a, b, c) satisfies a recursion relation.

30 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. Infinite-alleles case: [n(n 1) + θ A (a + c) + θ B (b + c) + ρc]q(a, b, c) = P K i=1 a i (a i 1 + 2c i )q(a e i, b, c) + P L j=1 b j (b j 1 + 2c j )q(a, b e j, c) + P K P L i=1 j=1 ˆcij (c ij 1)q(a, b, c e ij ) + 2a i b j q(a e i, b e j, c + e ij ) P h +θ K PL i A i=1 j=1 δ a i +c i,1 δ cij,1 q(a, b + e j, c e ij ) + δ ai,1δ ci,0 q(a e i, b, c) P h +θ L PK i B j=1 i=1 δ b j +c j,1 δ cij,1 q(a + e i, b, c e ij ) + δ bj,1δ c j,0 q(a, b e j, c) +ρ P K i=1 P L j=1 c ij q(a + e i, b + e j, c e ij ), with boundary conditions are q(e i, 0, 0) = q(0, e j, 0) = 1 for all i {1,..., K } and j {1,..., L}. 9 / 21

31 9 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. 2 Using that recursion, one can show that q 1 (a, b, c) satisfies K L c ij q 1 (a, b, c) = f 1 (a, b, c)+ c q 1(a +e i, b +e j, c e ij ), i=1 j=1 where f 1 (a, b, c) is a function of the zeroth-order term q 0.

32 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. 2 Using that recursion, one can show that q 1 (a, b, c) satisfies K L c ij q 1 (a, b, c) = f 1 (a, b, c)+ c q 1(a +e i, b +e j, c e ij ), i=1 j=1 where f 1 (a, b, c) is a function of the zeroth-order term q 0. 3 This equation admits a probabilistic interpretation: c q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + E[f 1 (A (m), B (m), C (m) )], m=1 where C (m) = (C (m) ij ) is a multivariate hypergeometric(c, c, m) random variable, and A (m), B (m) depend on C (m). 9 / 21

33 10 / 21 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B P (i,j) [ C (m) ij and B (m) = b + c B C (m) B = c (m) ij = (C (m) j ) ] = ( 1 c ) m (i,j) ( cij c (m) ij )., where

34 10 / 21 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B and B (m) = b + c B C (m) B = (C (m) j ), where f 1 is a deg-2 polynomial in the entries C ij of C (m), so the above expectation can be easily computed.

35 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B and B (m) = b + c B C (m) B = (C (m) j ), where f 1 is a deg-2 polynomial in the entries C ij of C (m), so the above expectation can be easily computed. Lemma For all a and b, q 1 (a, b, 0) = 0. That q 1 (a, b, 0) vanishes is not expected a priori. 10 / 21

36 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + Theorem 1 c E[f 1 (A (m), B (m), C (m) )], m=1 For either an infinite-alleles or an arbitrary finite-alleles model, ( ) c q 1 (a, b, c) = q A (a + c A )q B (b + c B ) 2 K ( ) q B (b + c B ) ci q A (a + c A e i ) q A (a + c A ) + K L i=1 j=1 ( cij 2 i=1 L j=1 2 ( c j 2 ) q B (b + c B e j ) ) q A (a + c A e i )q B (b + c B e j ), where e i is a unit vector with a 1 at entry i and 0 otherwise. 11 / 21

37 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m).

38 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )].

39 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it.

40 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it. However, it satisfies a simple recursion relation that can be easily solved numerically using dynamic programming.

41 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it. However, it satisfies a simple recursion relation that can be easily solved numerically using dynamic programming. The contribution of q 2 (a, b, 0) 0 is negligibly small compared to that of the other terms. 12 / 21

42 13 / 21 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite.

43 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Is the converse true? If q 1 (a, b, c) < 0, then is MLE infinite? 13 / 21

44 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Counter-example to the converse log likelihood ! Finite alleles PIM model, θ = a = b = 0, c = ( ). q 1 (a, b, c) < 0. q q q(n) 13 / 21

45 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Counter-example to the converse log likelihood ! Finite alleles PIM model, θ = a = b = 0, c = ( ). q 1 (a, b, c) < 0. q q q(n) 13 / 21

46 13 / 21 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, Example b, c) > 0, then the MLE of ρ is finite Counter-example to the converse log likelihood ! log likelihood Finite alleles PIM model, θ = a = b = 0, c = ( ) ! q 1 (a, b, c) < 0. q q q(n)

47 13 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

48 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ.

49 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ. Example 1 Likelihood (10 15 ) ρ a = b = 0, c = ( Finite alleles PIM model, θ = 0.01 per locus. q q q q q(n) q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 )

50 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ. Example 2 Likeihood (10 16 ) ! a = b = 0, c = ( Finite alleles PIM model, θ = 0.01 per locus. q q q q q(n) q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 )

51 Distribution of errors A measure for accuracy: unsigned relative error q ASF (n) q(n) q(n) 100% q q q q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 Accuracy across all dimorphic (0, 0, c) samples of size 20, when ρ = 50 in the symmetric PIM model Density Density Unsigned relative error (%) θ = Unsigned relative error (%) θ = 1 15 / 21

52 Distribution of errors Accuracy: effect of recombination rate Accuracy of q 0 (0, 0, c) + q 1(0,0,c) ρ Accuracy across all dimorphic samples. + q 2(0,0,c) ρ 2 : Fix c = 20, vary ρ Density Unsigned relative error (%) θ = 0.01 Density ρ = ρ = 50 ρ = 100 ρ = Unsigned relative error (%) θ = 1 16 / 21

53 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) θ = 0.01

54 17 / 21 Distribution of errors For which samples is the ASF most accurate? Monomorphic at both loci. log 10 p(c) Unsigned relative error (%) θ = 0.01 None of the above.

55 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Monomorphic at both loci. Monomorphic at one locus Unsigned relative error (%) θ = 0.01 None of the above.

56 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one Unsigned relative error (%) θ = 0.01 None of the above.

57 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r θ = 0.01 None of the above.

58 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r Perfect LD. θ = 0.01 None of the above.

59 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) θ = 0.01 Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r Perfect LD. Perfect LD except for one haplotype. None of the above.

60 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) 6 log 10 p(c) Unsigned relative error (%) Unsigned relative error (%) θ = 0.01 θ = 1

61 17 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

62 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n).

63 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models.

64 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1

65 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1 4 The extra bit is negligibly small and can be easily evaluated using dynamic programming if one wishes.

66 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1 4 The extra bit is negligibly small and can be easily evaluated using dynamic programming if one wishes. 5 Found a simple sufficient condition for a given two-locus sample configuration to have a finite MLE of ρ.

67 19 / 21 Summary Applications Papers Composite-likelihood method such as LDhat. (Monte Carlo methods become substantially less efficient as ρ increases, so our analytic work will be of practical utility.) Test for epistasis. Sample-wise LD: P[r 2 ] and E[r 2 ] as ρ. Infinite-alleles case: Jenkins, P.A. and Song, Y.S. An asymptotic sampling formula for the coalescent with recombination. Annals of Applied Probability, under minor revision. Arbitrary finite-alleles recurrent mutation case: Jenkins, P.A. and Song, Y.S. Closed-form two-locus sampling distributions: accuracy and universality. Genetics, in press.

68 Further work Future work 1 Does there exist a stochastic process dual to the coalescent with recombination? 2 Is there a combinatorial interpretation for q 1 (a, b, c)? cf. Interpretation of Ewens sampling formula as the joint distribution of cycle counts in a random permutation 3 Does the universality property extend to q k (c) for k > 1? 4 For all k > 0, we think the following is true: c q k (a, b, c) = q k (a + c A, b + c B, 0) + E[f k (A (m), B (m), C (m) )] m=1 where f k is a deg-2k polynomial in the entries C ij of C (m). Can we automate the computation of the expectation? q 5 Does k (a, b, c) ρ k converge? k=0 6 Extend to multiple loci? 20 / 21

69 Further work Acknowledgments Thank you for your attention! Useful discussions Bob Griffiths Chuck Langley Rasmus Nielsen Josh Paul Monty Slatkin Programming Fulton Wang Research supported in part by NIH, Alfred P. Sloan Research Fellowship, and Packard Fellowship for Science and Engineering 21 / 21

AN ASYMPTOTIC SAMPLING FORMULA FOR THE COALESCENT WITH RECOMBINATION. By Paul A. Jenkins and Yun S. Song, University of California, Berkeley

AN ASYMPTOTIC SAMPLING FORMULA FOR THE COALESCENT WITH RECOMBINATION By Paul A. Jenkins and Yun S. Song, University of California, Berkeley Ewens sampling formula (ESF) is a one-parameter family of probability