Closed-form sampling formulas for the coalescent with recombination

Size: px
Start display at page:

Download "Closed-form sampling formulas for the coalescent with recombination"

Transcription

1 0 / 21 Closed-form sampling formulas for the coalescent with recombination Yun S. Song CS Division and Department of Statistics University of California, Berkeley September 7, 2009 Joint work with Paul Jenkins

2 0 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

3 1 / 21 Motivation Sample 1 = AACTAGG...CCGTGACC...ACAGCTAT Sample 2 = AACTAGG...CCGTAACC...ACAGCTAT Sample 3 = AACTGGG...CCGTGACC...ACAGCTAT Sample 4 = AACTGGG...CCGTAACC...ACAGTTAT Sample 5 = AACTAGG...CCGTGACC...ACAGTTAT The basic problem What is the probability of observing a sample of DNA sequences for a given population genetics model? This is behind many problems in population genetics, e.g. Estimating parameters: L(θ, ρ) = P(D θ, ρ) Ancestral inference Detecting departures from neutrality

4 1 / 21 Motivation Sample 1 = AACTAGG...CCGTGACC...ACAGCTAT Sample 2 = AACTAGG...CCGTAACC...ACAGCTAT Sample 3 = AACTGGG...CCGTGACC...ACAGCTAT Sample 4 = AACTGGG...CCGTAACC...ACAGTTAT Sample 5 = AACTAGG...CCGTGACC...ACAGTTAT The basic problem What is the probability of observing a sample of DNA sequences for a given population genetics model? But, except in the one-locus case with a special model of mutation, closed-form sampling formulas are generally unknown. When recombination is involved, obtaining an analytic formula for the sampling distribution has so far remained an intractable problem.

5 2 / 21 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms

6 2 / 21 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination

7 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. 2 / 21

8 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. However, we in fact expect the dynamics to be easier to study for large recombination rates, since the loci under consideration would then be less dependent. 2 / 21

9 Motivation Monte Carlo Approaches Importance Sampling MCMC Rejection algorithms The coalescent with recombination For large recombination rates, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. However, we in fact expect the dynamics to be easier to study for large recombination rates, since the loci under consideration would then be less dependent. It seems plausible that there exists a stochastic process simpler than the standard coalescent with recombination, that describes the dynamics of the relevant degrees of freedom for large recombination rates. 2 / 21

10 3 / 21 Motivation Closed-form sampling formula for large recombination rates?

11 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate 3 / 21

12 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n 3 / 21

13 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n Asymptotic Series As ρ, find q(n ρ) = q 0 (n)+ q 1(n) + q ( ) 2(n) 1 ρ ρ 2 +O ρ 3, where q 0 (n), q 1 (n), q 2 (n),... are independent of ρ. 3 / 21

14 Motivation Closed-form sampling formula for large recombination rates? A two-locus model The loci are labeled A and B. ρ, the population-scaled recombination rate n, sample configuration (defined later) q(n), the sampling distribution of n θ A, θ B, the population-scaled mutation rates P A = (Pij A), P B = (Pij A ), transition matrices for the mutation process in the case of a finite-alleles model Asymptotic Series As ρ, find q(n ρ) = q 0 (n)+ q 1(n) + q ( ) 2(n) 1 ρ ρ 2 +O ρ 3, where q 0 (n), q 1 (n), q 2 (n),... are independent of ρ. 3 / 21

15 3 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

16 Closed-form one-locus sampling formulas n = (n 1,..., n K ), where n i denotes the number of gametes with allele i at the locus. q(n), probability of an ordered sample with configuration n. (x) n := x(x + 1) (x + n 1) Infinite-alleles model Ewens sampling formula (1972): q ESF (n) = θk (θ) n K (n i 1)! i=1 Finite-alleles Parent-Independent Mutation (PIM) model Mutation transition matrix satisfies P ij = P j. Wright s sampling formula (1949): q WSF (n) = 1 (θ) n K (θp i ) ni Any diallelic recurrent mutation model can be transformed into a PIM model. i=1 4 / 21

17 Two-locus sample configuration Two-locus sample configuration n = (a, b, c) a = (a i ), K -dim vector b = (b j ), L-dim vector c = (c ij ), K -by-l matrix A sample of 10 gametes c = (c ij ), a K -by-l matrix j {1,..., L} i {1,..., K } Sample size is c = c = / 21

18 6 / 21 Two-locus sample configuration Time Coalescence Mutation Recombination a = (a i ) i {1,...,K }, multiplicity of left half-fragments. b = (b j ) j {1,...,L}, multiplicity of right half-fragments. The complete sampling distribution is then q(a, b, c).

19 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3

20 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c

21 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c A = (c i ) c B = (c j )

22 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). c A = (c i ) c B = (c j ) add to a to get add to b to get q A (a + c A ) q B (b + c B ).

23 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B )

24 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B ) Finite-alleles PIM: q 0 (a, b, c) = q A WSF(a + c A )q B WSF(b + c B )

25 7 / 21 Two-locus sample configuration q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 q 0 (a, b, c) q 0 (a, b, c) is the exact sampling distribution when the two loci are unlinked (ρ = ). Infinite alleles: q 0 (a, b, c) = q A ESF(a + c A )q B ESF(b + c B ) Finite-alleles PIM: q 0 (a, b, c) = q A WSF(a + c A )q B WSF(b + c B ) In fact, for any model of mutation, q 0 (a, b, c) = q A (a + c A )q B (b + c B ) where q A, q B are (possibly unknown) one-locus sampling distributions.

26 8 / 21 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Universality For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B )

27 8 / 21 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Universality For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B ) Theorem 1 q 1 (n) also exhibits universality: q 1 (a, b, c) = G(q A, q B ; a, b, c)

28 Universality q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Universality + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For all mutation models, the functional form of q 0 (a, b, c) is q 0 (a, b, c) = F (q A, q B ; a, b, c) := q A (a + c A )q B (b + c B ) Theorem 1 q 1 (n) also exhibits universality: q 1 (a, b, c) = G(q A, q B ; a, b, c) q 1 (a, b, c) = ( c 2) q B (b + c B ) K q A (a + c A ) L + K L ( cij i=1 j=1 2 q A (a + c A )q B (b + c B ) ( ci i=1 ( 2 c j j=1 2 ) ) q A (a + c A e i ) ) q B (b + c B e j ) q A (a + c A e i )q B (b + c B e j ), where e i is a unit vector with a 1 at entry i and 0 otherwise. 8 / 21

29 9 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 Derivation of q 1 (a, b, c) 1 The full sampling distribution q(a, b, c) satisfies a recursion relation.

30 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. Infinite-alleles case: [n(n 1) + θ A (a + c) + θ B (b + c) + ρc]q(a, b, c) = P K i=1 a i (a i 1 + 2c i )q(a e i, b, c) + P L j=1 b j (b j 1 + 2c j )q(a, b e j, c) + P K P L i=1 j=1 ˆcij (c ij 1)q(a, b, c e ij ) + 2a i b j q(a e i, b e j, c + e ij ) P h +θ K PL i A i=1 j=1 δ a i +c i,1 δ cij,1 q(a, b + e j, c e ij ) + δ ai,1δ ci,0 q(a e i, b, c) P h +θ L PK i B j=1 i=1 δ b j +c j,1 δ cij,1 q(a + e i, b, c e ij ) + δ bj,1δ c j,0 q(a, b e j, c) +ρ P K i=1 P L j=1 c ij q(a + e i, b + e j, c e ij ), with boundary conditions are q(e i, 0, 0) = q(0, e j, 0) = 1 for all i {1,..., K } and j {1,..., L}. 9 / 21

31 9 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. 2 Using that recursion, one can show that q 1 (a, b, c) satisfies K L c ij q 1 (a, b, c) = f 1 (a, b, c)+ c q 1(a +e i, b +e j, c e ij ), i=1 j=1 where f 1 (a, b, c) is a function of the zeroth-order term q 0.

32 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Derivation of q 1 (a, b, c) + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 1 The full sampling distribution q(a, b, c) satisfies a recursion relation. 2 Using that recursion, one can show that q 1 (a, b, c) satisfies K L c ij q 1 (a, b, c) = f 1 (a, b, c)+ c q 1(a +e i, b +e j, c e ij ), i=1 j=1 where f 1 (a, b, c) is a function of the zeroth-order term q 0. 3 This equation admits a probabilistic interpretation: c q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + E[f 1 (A (m), B (m), C (m) )], m=1 where C (m) = (C (m) ij ) is a multivariate hypergeometric(c, c, m) random variable, and A (m), B (m) depend on C (m). 9 / 21

33 10 / 21 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B P (i,j) [ C (m) ij and B (m) = b + c B C (m) B = c (m) ij = (C (m) j ) ] = ( 1 c ) m (i,j) ( cij c (m) ij )., where

34 10 / 21 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B and B (m) = b + c B C (m) B = (C (m) j ), where f 1 is a deg-2 polynomial in the entries C ij of C (m), so the above expectation can be easily computed.

35 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + c E[f 1 (A (m), B (m), C (m) )] m=1 Random matrix C (m) = (C (m) ij ) corresponds to a random subsample obtained by sampling without replacement m gametes from c A (m) = a + c A C (m) A C (m) A = (C (m) i ) and C (m) B and B (m) = b + c B C (m) B = (C (m) j ), where f 1 is a deg-2 polynomial in the entries C ij of C (m), so the above expectation can be easily computed. Lemma For all a and b, q 1 (a, b, 0) = 0. That q 1 (a, b, 0) vanishes is not expected a priori. 10 / 21

36 Details q 1 (a, b, c) = q 1 (a + c A, b + c B, 0) + Theorem 1 c E[f 1 (A (m), B (m), C (m) )], m=1 For either an infinite-alleles or an arbitrary finite-alleles model, ( ) c q 1 (a, b, c) = q A (a + c A )q B (b + c B ) 2 K ( ) q B (b + c B ) ci q A (a + c A e i ) q A (a + c A ) + K L i=1 j=1 ( cij 2 i=1 L j=1 2 ( c j 2 ) q B (b + c B e j ) ) q A (a + c A e i )q B (b + c B e j ), where e i is a unit vector with a 1 at entry i and 0 otherwise. 11 / 21

37 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m).

38 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )].

39 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it.

40 12 / 21 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it. However, it satisfies a simple recursion relation that can be easily solved numerically using dynamic programming.

41 Details q(a, b, c ρ) = q 0 (a, b, c) + q 1(a, b, c) ρ Theorem 2 + q 2(a, b, c) ρ 2 ( ) 1 + O ρ 3 For either an infinite-alleles or an arbitrary finite-alleles model, c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )], m=1 where f 2 is a deg-4 polynomial in the entries C ij of C (m). Remarks Found a closed-form formula for c m=1 E[f 2(A (m), B (m), C (m) )]. Unfortunately, q 2 (a, b, 0) 0 in general and we don t have a closed-form expression for it. However, it satisfies a simple recursion relation that can be easily solved numerically using dynamic programming. The contribution of q 2 (a, b, 0) 0 is negligibly small compared to that of the other terms. 12 / 21

42 13 / 21 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite.

43 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Is the converse true? If q 1 (a, b, c) < 0, then is MLE infinite? 13 / 21

44 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Counter-example to the converse log likelihood ! Finite alleles PIM model, θ = a = b = 0, c = ( ). q 1 (a, b, c) < 0. q q q(n) 13 / 21

45 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, b, c) > 0, then the MLE of ρ is finite. Counter-example to the converse log likelihood ! Finite alleles PIM model, θ = a = b = 0, c = ( ). q 1 (a, b, c) < 0. q q q(n) 13 / 21

46 13 / 21 Classifying the maximum likelihood estimate (MLE) Application: Classification of the MLE of ρ Theorem 3 (A sufficient condition for finite MLE) For a given sample configuration (a, b, c), if q 1 (a, Example b, c) > 0, then the MLE of ρ is finite Counter-example to the converse log likelihood ! log likelihood Finite alleles PIM model, θ = a = b = 0, c = ( ) ! q 1 (a, b, c) < 0. q q q(n)

47 13 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

48 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ.

49 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ. Example 1 Likelihood (10 15 ) ρ a = b = 0, c = ( Finite alleles PIM model, θ = 0.01 per locus. q q q q q(n) q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 )

50 14 / 21 Detailed examples Truncate at order 2 q ASF (a, b, c) = q 0 (a, b, c) + q 1(a, b, c) ρ + q 2(a, b, c) ρ 2 should approximate q(a, b, c) accurately for sufficiently large ρ. Example 2 Likeihood (10 16 ) ! a = b = 0, c = ( Finite alleles PIM model, θ = 0.01 per locus. q q q q q(n) q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 )

51 Distribution of errors A measure for accuracy: unsigned relative error q ASF (n) q(n) q(n) 100% q q q q 0 (n) q 0 (n) + q 1(n) ρ q0 (n) + q 1(n) ρ + q 2(n) ρ 2 Accuracy across all dimorphic (0, 0, c) samples of size 20, when ρ = 50 in the symmetric PIM model Density Density Unsigned relative error (%) θ = Unsigned relative error (%) θ = 1 15 / 21

52 Distribution of errors Accuracy: effect of recombination rate Accuracy of q 0 (0, 0, c) + q 1(0,0,c) ρ Accuracy across all dimorphic samples. + q 2(0,0,c) ρ 2 : Fix c = 20, vary ρ Density Unsigned relative error (%) θ = 0.01 Density ρ = ρ = 50 ρ = 100 ρ = Unsigned relative error (%) θ = 1 16 / 21

53 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) θ = 0.01

54 17 / 21 Distribution of errors For which samples is the ASF most accurate? Monomorphic at both loci. log 10 p(c) Unsigned relative error (%) θ = 0.01 None of the above.

55 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Monomorphic at both loci. Monomorphic at one locus Unsigned relative error (%) θ = 0.01 None of the above.

56 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one Unsigned relative error (%) θ = 0.01 None of the above.

57 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r θ = 0.01 None of the above.

58 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r Perfect LD. θ = 0.01 None of the above.

59 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) Unsigned relative error (%) θ = 0.01 Monomorphic at both loci. Monomorphic at one locus. At least one allele has multiplicity one. + Very low LD: r Perfect LD. Perfect LD except for one haplotype. None of the above.

60 17 / 21 Distribution of errors For which samples is the ASF most accurate? log 10 p(c) 6 log 10 p(c) Unsigned relative error (%) Unsigned relative error (%) θ = 0.01 θ = 1

61 17 / 21 Outline 1 Introduction Motivation 2 Asymptotic Sampling Formula Closed-form one-locus sampling formulas Two-locus sample configuration Universality Details Classifying the maximum likelihood estimate (MLE) 3 Accuracy Results Detailed examples Distribution of errors 4 Conclusion Summary Further work

62 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n).

63 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models.

64 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1

65 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1 4 The extra bit is negligibly small and can be easily evaluated using dynamic programming if one wishes.

66 18 / 21 Summary Asymptotic series: q(n ρ) = q 0 (n)+ q 1(n) + q 2(n) ρ ρ 2 +O ( ) 1 ρ 3 Summary of results 1 Found a closed-form formula for the first-order term q 1 (n). 2 This formula is universal to all mutation models. 3 Found a closed-form formula + (extra bit) for q 2 (n). c q 2 (a, b, c) = q 2 (a + c A, b + c B, 0) + E[f 2 (A (m), B (m), C (m) )]. m=1 4 The extra bit is negligibly small and can be easily evaluated using dynamic programming if one wishes. 5 Found a simple sufficient condition for a given two-locus sample configuration to have a finite MLE of ρ.

67 19 / 21 Summary Applications Papers Composite-likelihood method such as LDhat. (Monte Carlo methods become substantially less efficient as ρ increases, so our analytic work will be of practical utility.) Test for epistasis. Sample-wise LD: P[r 2 ] and E[r 2 ] as ρ. Infinite-alleles case: Jenkins, P.A. and Song, Y.S. An asymptotic sampling formula for the coalescent with recombination. Annals of Applied Probability, under minor revision. Arbitrary finite-alleles recurrent mutation case: Jenkins, P.A. and Song, Y.S. Closed-form two-locus sampling distributions: accuracy and universality. Genetics, in press.

68 Further work Future work 1 Does there exist a stochastic process dual to the coalescent with recombination? 2 Is there a combinatorial interpretation for q 1 (a, b, c)? cf. Interpretation of Ewens sampling formula as the joint distribution of cycle counts in a random permutation 3 Does the universality property extend to q k (c) for k > 1? 4 For all k > 0, we think the following is true: c q k (a, b, c) = q k (a + c A, b + c B, 0) + E[f k (A (m), B (m), C (m) )] m=1 where f k is a deg-2k polynomial in the entries C ij of C (m). Can we automate the computation of the expectation? q 5 Does k (a, b, c) ρ k converge? k=0 6 Extend to multiple loci? 20 / 21

69 Further work Acknowledgments Thank you for your attention! Useful discussions Bob Griffiths Chuck Langley Rasmus Nielsen Josh Paul Monty Slatkin Programming Fulton Wang Research supported in part by NIH, Alfred P. Sloan Research Fellowship, and Packard Fellowship for Science and Engineering 21 / 21

AN ASYMPTOTIC SAMPLING FORMULA FOR THE COALESCENT WITH RECOMBINATION. By Paul A. Jenkins and Yun S. Song, University of California, Berkeley

AN ASYMPTOTIC SAMPLING FORMULA FOR THE COALESCENT WITH RECOMBINATION. By Paul A. Jenkins and Yun S. Song, University of California, Berkeley AN ASYMPTOTIC SAMPLING FORMULA FOR THE COALESCENT WITH RECOMBINATION By Paul A. Jenkins and Yun S. Song, University of California, Berkeley Ewens sampling formula (ESF) is a one-parameter family of probability

More information

CLOSED-FORM ASYMPTOTIC SAMPLING DISTRIBUTIONS UNDER THE COALESCENT WITH RECOMBINATION FOR AN ARBITRARY NUMBER OF LOCI

CLOSED-FORM ASYMPTOTIC SAMPLING DISTRIBUTIONS UNDER THE COALESCENT WITH RECOMBINATION FOR AN ARBITRARY NUMBER OF LOCI Adv. Appl. Prob. 44, 391 407 (01) Printed in Northern Ireland Applied Probability Trust 01 CLOSED-FORM ASYMPTOTIC SAMPLING DISTRIBUTIONS UNDER THE COALESCENT WITH RECOMBINATION FOR AN ARBITRARY NUMBER

More information

By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010

By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010 PADÉ APPROXIMANTS AND EXACT TWO-LOCUS SAMPLING DISTRIBUTIONS By Paul A. Jenkins and Yun S. Song, University of California, Berkeley July 26, 2010 For population genetics models with recombination, obtaining

More information

A Principled Approach to Deriving Approximate Conditional Sampling. Distributions in Population Genetics Models with Recombination

A Principled Approach to Deriving Approximate Conditional Sampling. Distributions in Population Genetics Models with Recombination Genetics: Published Articles Ahead of Print, published on June 30, 2010 as 10.1534/genetics.110.117986 A Principled Approach to Deriving Approximate Conditional Sampling Distributions in Population Genetics

More information

Frequency Spectra and Inference in Population Genetics

Frequency Spectra and Inference in Population Genetics Frequency Spectra and Inference in Population Genetics Although coalescent models have come to play a central role in population genetics, there are some situations where genealogies may not lead to efficient

More information

Analytic Computation of the Expectation of the Linkage Disequilibrium Coefficient r 2

Analytic Computation of the Expectation of the Linkage Disequilibrium Coefficient r 2 Analytic Computation of the Expectation of the Linkage Disequilibrium Coefficient r 2 Yun S. Song a and Jun S. Song b a Department of Computer Science, University of California, Davis, CA 95616, USA b

More information

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES:

1.5.1 ESTIMATION OF HAPLOTYPE FREQUENCIES: .5. ESTIMATION OF HAPLOTYPE FREQUENCIES: Chapter - 8 For SNPs, alleles A j,b j at locus j there are 4 haplotypes: A A, A B, B A and B B frequencies q,q,q 3,q 4. Assume HWE at haplotype level. Only the

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination)

Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination) 12/5/14 Chapter 6 Linkage Disequilibrium & Gene Mapping (Recombination) Linkage Disequilibrium Genealogical Interpretation of LD Association Mapping 1 Linkage and Recombination v linkage equilibrium ²

More information

Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics

Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics 1 Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics Yun S. Song, Rune Lyngsø, and Jotun Hein Abstract Given a set D of input sequences, a genealogy for D can be

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

2. Map genetic distance between markers

2. Map genetic distance between markers Chapter 5. Linkage Analysis Linkage is an important tool for the mapping of genetic loci and a method for mapping disease loci. With the availability of numerous DNA markers throughout the human genome,

More information

The Wright-Fisher Model and Genetic Drift

The Wright-Fisher Model and Genetic Drift The Wright-Fisher Model and Genetic Drift January 22, 2015 1 1 Hardy-Weinberg Equilibrium Our goal is to understand the dynamics of allele and genotype frequencies in an infinite, randomlymating population

More information

Match probabilities in a finite, subdivided population

Match probabilities in a finite, subdivided population Match probabilities in a finite, subdivided population Anna-Sapfo Malaspinas a, Montgomery Slatkin a, Yun S. Song b, a Department of Integrative Biology, University of California, Berkeley, CA 94720, USA

More information

p(d g A,g B )p(g B ), g B

p(d g A,g B )p(g B ), g B Supplementary Note Marginal effects for two-locus models Here we derive the marginal effect size of the three models given in Figure 1 of the main text. For each model we assume the two loci (A and B)

More information

I of a gene sampled from a randomly mating popdation,

I of a gene sampled from a randomly mating popdation, Copyright 0 1987 by the Genetics Society of America Average Number of Nucleotide Differences in a From a Single Subpopulation: A Test for Population Subdivision Curtis Strobeck Department of Zoology, University

More information

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M.

For 5% confidence χ 2 with 1 degree of freedom should exceed 3.841, so there is clear evidence for disequilibrium between S and M. STAT 550 Howework 6 Anton Amirov 1. This question relates to the same study you saw in Homework-4, by Dr. Arno Motulsky and coworkers, and published in Thompson et al. (1988; Am.J.Hum.Genet, 42, 113-124).

More information

Mathematical models in population genetics II

Mathematical models in population genetics II Mathematical models in population genetics II Anand Bhaskar Evolutionary Biology and Theory of Computing Bootcamp January 1, 014 Quick recap Large discrete-time randomly mating Wright-Fisher population

More information

Population Genetics I. Bio

Population Genetics I. Bio Population Genetics I. Bio5488-2018 Don Conrad dconrad@genetics.wustl.edu Why study population genetics? Functional Inference Demographic inference: History of mankind is written in our DNA. We can learn

More information

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch.

Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Oliver Schulte - CMPT 419/726. Bishop PRML Ch. Sampling Methods Oliver Schulte - CMP 419/726 Bishop PRML Ch. 11 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure

More information

Statistical population genetics

Statistical population genetics Statistical population genetics Lecture 7: Infinite alleles model Xavier Didelot Dept of Statistics, Univ of Oxford didelot@stats.ox.ac.uk Slide 111 of 161 Infinite alleles model We now discuss the effect

More information

URN MODELS: the Ewens Sampling Lemma

URN MODELS: the Ewens Sampling Lemma Department of Computer Science Brown University, Providence sorin@cs.brown.edu October 3, 2014 1 2 3 4 Mutation Mutation: typical values for parameters Equilibrium Probability of fixation 5 6 Ewens Sampling

More information

Bayesian Methods with Monte Carlo Markov Chains II

Bayesian Methods with Monte Carlo Markov Chains II Bayesian Methods with Monte Carlo Markov Chains II Henry Horng-Shing Lu Institute of Statistics National Chiao Tung University hslu@stat.nctu.edu.tw http://tigpbp.iis.sinica.edu.tw/courses.htm 1 Part 3

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Phasing via the Expectation Maximization (EM) Algorithm

Phasing via the Expectation Maximization (EM) Algorithm Computing Haplotype Frequencies and Haplotype Phasing via the Expectation Maximization (EM) Algorithm Department of Computer Science Brown University, Providence sorin@cs.brown.edu September 14, 2010 Outline

More information

The Combinatorial Interpretation of Formulas in Coalescent Theory

The Combinatorial Interpretation of Formulas in Coalescent Theory The Combinatorial Interpretation of Formulas in Coalescent Theory John L. Spouge National Center for Biotechnology Information NLM, NIH, DHHS spouge@ncbi.nlm.nih.gov Bldg. A, Rm. N 0 NCBI, NLM, NIH Bethesda

More information

Bayesian Inference of Interactions and Associations

Bayesian Inference of Interactions and Associations Bayesian Inference of Interactions and Associations Jun Liu Department of Statistics Harvard University http://www.fas.harvard.edu/~junliu Based on collaborations with Yu Zhang, Jing Zhang, Yuan Yuan,

More information

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda

Population Genetics. with implications for Linkage Disequilibrium. Chiara Sabatti, Human Genetics 6357a Gonda 1 Population Genetics with implications for Linkage Disequilibrium Chiara Sabatti, Human Genetics 6357a Gonda csabatti@mednet.ucla.edu 2 Hardy-Weinberg Hypotheses: infinite populations; no inbreeding;

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Lecture 9. QTL Mapping 2: Outbred Populations

Lecture 9. QTL Mapping 2: Outbred Populations Lecture 9 QTL Mapping 2: Outbred Populations Bruce Walsh. Aug 2004. Royal Veterinary and Agricultural University, Denmark The major difference between QTL analysis using inbred-line crosses vs. outbred

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

CS294: Pseudorandomness and Combinatorial Constructions September 13, Notes for Lecture 5

CS294: Pseudorandomness and Combinatorial Constructions September 13, Notes for Lecture 5 UC Berkeley Handout N5 CS94: Pseudorandomness and Combinatorial Constructions September 3, 005 Professor Luca Trevisan Scribe: Gatis Midrijanis Notes for Lecture 5 In the few lectures we are going to look

More information

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity,

AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity, AEC 550 Conservation Genetics Lecture #2 Probability, Random mating, HW Expectations, & Genetic Diversity, Today: Review Probability in Populatin Genetics Review basic statistics Population Definition

More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Bayes Nets: Sampling Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley.

More information

Genetic hitch-hiking in a subdivided population

Genetic hitch-hiking in a subdivided population Genet. Res., Camb. (1998), 71, pp. 155 160. With 3 figures. Printed in the United Kingdom 1998 Cambridge University Press 155 Genetic hitch-hiking in a subdivided population MONTGOMERY SLATKIN* AND THOMAS

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Frequency Estimation Karin S. Dorman Department of Statistics Iowa State University August 28, 2006 Fundamental rules of genetics Law of Segregation a diploid parent is equally

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012

Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium. November 12, 2012 Lecture 22: Signatures of Selection and Introduction to Linkage Disequilibrium November 12, 2012 Last Time Sequence data and quantification of variation Infinite sites model Nucleotide diversity (π) Sequence-based

More information

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data

Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Modelling Linkage Disequilibrium, And Identifying Recombination Hotspots Using SNP Data Na Li and Matthew Stephens July 25, 2003 Department of Biostatistics, University of Washington, Seattle, WA 98195

More information

MARKOV CHAIN MONTE CARLO

MARKOV CHAIN MONTE CARLO MARKOV CHAIN MONTE CARLO RYAN WANG Abstract. This paper gives a brief introduction to Markov Chain Monte Carlo methods, which offer a general framework for calculating difficult integrals. We start with

More information

Modelling of Dependent Credit Rating Transitions

Modelling of Dependent Credit Rating Transitions ling of (Joint work with Uwe Schmock) Financial and Actuarial Mathematics Vienna University of Technology Wien, 15.07.2010 Introduction Motivation: Volcano on Iceland erupted and caused that most of the

More information

ON THE DIFFUSION OPERATOR IN POPULATION GENETICS

ON THE DIFFUSION OPERATOR IN POPULATION GENETICS J. Appl. Math. & Informatics Vol. 30(2012), No. 3-4, pp. 677-683 Website: http://www.kcam.biz ON THE DIFFUSION OPERATOR IN POPULATION GENETICS WON CHOI Abstract. W.Choi([1]) obtains a complete description

More information

The Moran Process as a Markov Chain on Leaf-labeled Trees

The Moran Process as a Markov Chain on Leaf-labeled Trees The Moran Process as a Markov Chain on Leaf-labeled Trees David J. Aldous University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860 aldous@stat.berkeley.edu http://www.stat.berkeley.edu/users/aldous

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2

Stationary Distribution of the Linkage Disequilibrium Coefficient r 2 Stationary Distribution of the Linkage Disequilibrium Coefficient r 2 Wei Zhang, Jing Liu, Rachel Fewster and Jesse Goodman Department of Statistics, The University of Auckland December 1, 2015 Overview

More information

The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion. Bob Griffiths University of Oxford

The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion. Bob Griffiths University of Oxford The Λ-Fleming-Viot process and a connection with Wright-Fisher diffusion Bob Griffiths University of Oxford A d-dimensional Λ-Fleming-Viot process {X(t)} t 0 representing frequencies of d types of individuals

More information

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates

Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates Estimating effective population size from samples of sequences: inefficiency of pairwise and segregating sites as compared to phylogenetic estimates JOSEPH FELSENSTEIN Department of Genetics SK-50, University

More information

STAT 536: Genetic Statistics

STAT 536: Genetic Statistics STAT 536: Genetic Statistics Tests for Hardy Weinberg Equilibrium Karin S. Dorman Department of Statistics Iowa State University September 7, 2006 Statistical Hypothesis Testing Identify a hypothesis,

More information

Note Set 5: Hidden Markov Models

Note Set 5: Hidden Markov Models Note Set 5: Hidden Markov Models Probabilistic Learning: Theory and Algorithms, CS 274A, Winter 2016 1 Hidden Markov Models (HMMs) 1.1 Introduction Consider observed data vectors x t that are d-dimensional

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Combining the cycle index and the Tutte polynomial?

Combining the cycle index and the Tutte polynomial? Combining the cycle index and the Tutte polynomial? Peter J. Cameron University of St Andrews Combinatorics Seminar University of Vienna 23 March 2017 Selections Students often meet the following table

More information

Problems for 3505 (2011)

Problems for 3505 (2011) Problems for 505 (2011) 1. In the simplex of genotype distributions x + y + z = 1, for two alleles, the Hardy- Weinberg distributions x = p 2, y = 2pq, z = q 2 (p + q = 1) are characterized by y 2 = 4xz.

More information

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.! Integer Programming in Computational Biology D. Gusfield University of California, Davis Presented December 12, 2016. There are many important phylogeny problems that depart from simple tree models: Missing

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

CS1820 Notes. hgupta1, kjline, smechery. April 3-April 5. output: plausible Ancestral Recombination Graph (ARG)

CS1820 Notes. hgupta1, kjline, smechery. April 3-April 5. output: plausible Ancestral Recombination Graph (ARG) CS1820 Notes hgupta1, kjline, smechery April 3-April 5 April 3 Notes 1 Minichiello-Durbin Algorithm input: set of sequences output: plausible Ancestral Recombination Graph (ARG) note: the optimal ARG is

More information

The Quantitative TDT

The Quantitative TDT The Quantitative TDT (Quantitative Transmission Disequilibrium Test) Warren J. Ewens NUS, Singapore 10 June, 2009 The initial aim of the (QUALITATIVE) TDT was to test for linkage between a marker locus

More information

6 Introduction to Population Genetics

6 Introduction to Population Genetics 70 Grundlagen der Bioinformatik, SoSe 11, D. Huson, May 19, 2011 6 Introduction to Population Genetics This chapter is based on: J. Hein, M.H. Schierup and C. Wuif, Gene genealogies, variation and evolution,

More information

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G.

26 : Spectral GMs. Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 10-708: Probabilistic Graphical Models, Spring 2015 26 : Spectral GMs Lecturer: Eric P. Xing Scribes: Guillermo A Cidre, Abelino Jimenez G. 1 Introduction A common task in machine learning is to work with

More information

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

122 9 NEUTRALITY TESTS

122 9 NEUTRALITY TESTS 122 9 NEUTRALITY TESTS 9 Neutrality Tests Up to now, we calculated different things from various models and compared our findings with data. But to be able to state, with some quantifiable certainty, that

More information

Bayesian Estimation of Input Output Tables for Russia

Bayesian Estimation of Input Output Tables for Russia Bayesian Estimation of Input Output Tables for Russia Oleg Lugovoy (EDF, RANE) Andrey Polbin (RANE) Vladimir Potashnikov (RANE) WIOD Conference April 24, 2012 Groningen Outline Motivation Objectives Bayesian

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Basic Sampling Methods

Basic Sampling Methods Basic Sampling Methods Sargur Srihari srihari@cedar.buffalo.edu 1 1. Motivation Topics Intractability in ML How sampling can help 2. Ancestral Sampling Using BNs 3. Transforming a Uniform Distribution

More information

Population genetics snippets for genepop

Population genetics snippets for genepop Population genetics snippets for genepop Peter Beerli August 0, 205 Contents 0.Basics 0.2Exact test 2 0.Fixation indices 4 0.4Isolation by Distance 5 0.5Further Reading 8 0.6References 8 0.7Disclaimer

More information

BIRS workshop Sept 6 11, 2009

BIRS workshop Sept 6 11, 2009 Diploid biparental Moran model with large offspring numbers and recombination Bjarki Eldon New mathematical challenges from molecular biology and genetics BIRS workshop Sept 6, 2009 Mendel s Laws The First

More information

Linkage and Linkage Disequilibrium

Linkage and Linkage Disequilibrium Linkage and Linkage Disequilibrium Summer Institute in Statistical Genetics 2014 Module 10 Topic 3 Linkage in a simple genetic cross Linkage In the early 1900 s Bateson and Punnet conducted genetic studies

More information

6 Introduction to Population Genetics

6 Introduction to Population Genetics Grundlagen der Bioinformatik, SoSe 14, D. Huson, May 18, 2014 67 6 Introduction to Population Genetics This chapter is based on: J. Hein, M.H. Schierup and C. Wuif, Gene genealogies, variation and evolution,

More information

ABCME: Summary statistics selection for ABC inference in R

ABCME: Summary statistics selection for ABC inference in R ABCME: Summary statistics selection for ABC inference in R Matt Nunes and David Balding Lancaster University UCL Genetics Institute Outline Motivation: why the ABCME package? Description of the package

More information

Lecture 18 : Ewens sampling formula

Lecture 18 : Ewens sampling formula Lecture 8 : Ewens sampling formula MATH85K - Spring 00 Lecturer: Sebastien Roch References: [Dur08, Chapter.3]. Previous class In the previous lecture, we introduced Kingman s coalescent as a limit of

More information

Mathematical Biology. Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model. Matthias Birkner Jochen Blath

Mathematical Biology. Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model. Matthias Birkner Jochen Blath J. Math. Biol. DOI 0.007/s00285-008-070-6 Mathematical Biology Computing likelihoods for coalescents with multiple collisions in the infinitely many sites model Matthias Birkner Jochen Blath Received:

More information

Bayes Nets: Sampling

Bayes Nets: Sampling Bayes Nets: Sampling [These slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.] Approximate Inference:

More information

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important?

EXERCISES FOR CHAPTER 3. Exercise 3.2. Why is the random mating theorem so important? Statistical Genetics Agronomy 65 W. E. Nyquist March 004 EXERCISES FOR CHAPTER 3 Exercise 3.. a. Define random mating. b. Discuss what random mating as defined in (a) above means in a single infinite population

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic

Announcements. CS 188: Artificial Intelligence Fall Causality? Example: Traffic. Topology Limits Distributions. Example: Reverse Traffic CS 188: Artificial Intelligence Fall 2008 Lecture 16: Bayes Nets III 10/23/2008 Announcements Midterms graded, up on glookup, back Tuesday W4 also graded, back in sections / box Past homeworks in return

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning More Approximate Inference Mark Schmidt University of British Columbia Winter 2018 Last Time: Approximate Inference We ve been discussing graphical models for density estimation,

More information

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation

CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation CSci 8980: Advanced Topics in Graphical Models Analysis of Genetic Variation Instructor: Arindam Banerjee November 26, 2007 Genetic Polymorphism Single nucleotide polymorphism (SNP) Genetic Polymorphism

More information

Lecture 8 - Algebraic Methods for Matching 1

Lecture 8 - Algebraic Methods for Matching 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) February 1, 2018 Lecture 8 - Algebraic Methods for Matching 1 In the last lecture we showed that

More information

Compute the Fourier transform on the first register to get x {0,1} n x 0.

Compute the Fourier transform on the first register to get x {0,1} n x 0. CS 94 Recursive Fourier Sampling, Simon s Algorithm /5/009 Spring 009 Lecture 3 1 Review Recall that we can write any classical circuit x f(x) as a reversible circuit R f. We can view R f as a unitary

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture 20: Epistasis and Alternative Tests in GWAS Jason Mezey jgm45@cornell.edu April 16, 2016 (Th) 8:40-9:55 None Announcements Summary

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB

Quantitative Genomics and Genetics BTRY 4830/6830; PBSB Quantitative Genomics and Genetics BTRY 4830/6830; PBSB.5201.01 Lecture16: Population structure and logistic regression I Jason Mezey jgm45@cornell.edu April 11, 2017 (T) 8:40-9:55 Announcements I April

More information

Quality Control Using Inferential Statistics In Weibull Based Reliability Analyses S. F. Duffy 1 and A. Parikh 2

Quality Control Using Inferential Statistics In Weibull Based Reliability Analyses S. F. Duffy 1 and A. Parikh 2 Quality Control Using Inferential Statistics In Weibull Based Reliability Analyses S. F. Duffy 1 and A. Parikh 2 1 Cleveland State University 2 N & R Engineering www.inl.gov ASTM Symposium on Graphite

More information

BTRY 4830/6830: Quantitative Genomics and Genetics

BTRY 4830/6830: Quantitative Genomics and Genetics BTRY 4830/6830: Quantitative Genomics and Genetics Lecture 23: Alternative tests in GWAS / (Brief) Introduction to Bayesian Inference Jason Mezey jgm45@cornell.edu Nov. 13, 2014 (Th) 8:40-9:55 Announcements

More information

1.5 MM and EM Algorithms

1.5 MM and EM Algorithms 72 CHAPTER 1. SEQUENCE MODELS 1.5 MM and EM Algorithms The MM algorithm [1] is an iterative algorithm that can be used to minimize or maximize a function. We focus on using it to maximize the log likelihood

More information

An introduction to mathematical modeling of the genealogical process of genes

An introduction to mathematical modeling of the genealogical process of genes An introduction to mathematical modeling of the genealogical process of genes Rikard Hellman Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats 2009:3 Matematisk

More information

Coalescent based demographic inference. Daniel Wegmann University of Fribourg

Coalescent based demographic inference. Daniel Wegmann University of Fribourg Coalescent based demographic inference Daniel Wegmann University of Fribourg Introduction The current genetic diversity is the outcome of past evolutionary processes. Hence, we can use genetic diversity

More information

A Frequentist Assessment of Bayesian Inclusion Probabilities

A Frequentist Assessment of Bayesian Inclusion Probabilities A Frequentist Assessment of Bayesian Inclusion Probabilities Department of Statistical Sciences and Operations Research October 13, 2008 Outline 1 Quantitative Traits Genetic Map The model 2 Bayesian Model

More information

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction Sankhyā : The Indian Journal of Statistics 2007, Volume 69, Part 4, pp. 700-716 c 2007, Indian Statistical Institute More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order

More information

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination

A sequentially Markov conditional sampling distribution for structured populations with migration and recombination A sequentially Markov conditional sampling distribution for structured populations with migration and recombination Matthias Steinrücken a, Joshua S. Paul b, Yun S. Song a,b, a Department of Statistics,

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

I Have the Power in QTL linkage: single and multilocus analysis

I Have the Power in QTL linkage: single and multilocus analysis I Have the Power in QTL linkage: single and multilocus analysis Benjamin Neale 1, Sir Shaun Purcell 2 & Pak Sham 13 1 SGDP, IoP, London, UK 2 Harvard School of Public Health, Cambridge, MA, USA 3 Department

More information

Introduction to Linkage Disequilibrium

Introduction to Linkage Disequilibrium Introduction to September 10, 2014 Suppose we have two genes on a single chromosome gene A and gene B such that each gene has only two alleles Aalleles : A 1 and A 2 Balleles : B 1 and B 2 Suppose we have

More information

Assessing Regime Uncertainty Through Reversible Jump McMC

Assessing Regime Uncertainty Through Reversible Jump McMC Assessing Regime Uncertainty Through Reversible Jump McMC August 14, 2008 1 Introduction Background Research Question 2 The RJMcMC Method McMC RJMcMC Algorithm Dependent Proposals Independent Proposals

More information

Joyce, Krone, and Kurtz

Joyce, Krone, and Kurtz Statistical Inference for Population Genetics Models by Paul Joyce University of Idaho A large body of mathematical population genetics was developed by the three main speakers in this symposium. As a

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Two sample T 2 test 1 Two sample T 2 test 2 Analogous to the univariate context, we

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j. Chapter 9 Pearson s chi-square test 9. Null hypothesis asymptotics Let X, X 2, be independent from a multinomial(, p) distribution, where p is a k-vector with nonnegative entries that sum to one. That

More information

Linear Regression (1/1/17)

Linear Regression (1/1/17) STA613/CBB540: Statistical methods in computational biology Linear Regression (1/1/17) Lecturer: Barbara Engelhardt Scribe: Ethan Hada 1. Linear regression 1.1. Linear regression basics. Linear regression

More information

ESTIMATION of recombination fractions using ped- ber of pairwise differences (Hudson 1987; Wakeley

ESTIMATION of recombination fractions using ped- ber of pairwise differences (Hudson 1987; Wakeley Copyright 2001 by the Genetics Society of America Estimating Recombination Rates From Population Genetic Data Paul Fearnhead and Peter Donnelly Department of Statistics, University of Oxford, Oxford, OX1

More information