Intro Likelihood & coa MCMC IS Sim tests Conclusions

1 To do
RL: harmonize the notation with FR's macros; replace IS by SIS everywhere; stress that "sequential" means along the construction of the tree; better separate the derivation of the recursion, the exact π's and the π-hats; put key words in blue; add the term p(H_τ) = stationary distribution of allelic states wherever necessary.
FR: 1) Present "backward" in an earlier lecture... (and other things about my presentation of backward). 2) The differential operator φ_j is not made explicit = more complicated sentences.

2 Master 2 Biostatistics module: population genetics models
Likelihood-based demographic inference using the coalescent
Raphaël Leblois & François Rousset
Centre de Biologie pour la Gestion des Populations (CBGP, Montpellier); Institut des Sciences de l'Évolution (ISEM, Montpellier)
January 2017

3 Outline
- Introduction
- Likelihoods under the coalescent
- Felsenstein et al.'s MCMC
  - Metropolis-Hastings
  - MsVar example
  - Conclusions on MCMC
- Griffiths et al.'s IS
  - Griffiths et al.'s recursion
  - Old IS scheme
  - New IS scheme
  - A general method based on diffusion
  - Approximations of π
- Simulation tests
  - Precision
  - Validation
  - Robustness
  - MCMC vs. IS
- Conclusions

4 Introduction / Likelihoods under the coalescent / Felsenstein et al.'s MCMC / Griffiths et al.'s IS / Simulation tests / Conclusions

5 Typical biological question: there is demographic evidence that orang-utan population sizes have collapsed, but what is the major cause of the decline, when did it start, and how strong is it? Can population genetics help?
- Can we infer the time of the event?
- Can we infer the strength of the population size decrease?

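6 Methods based on coalescence simulations (Reminder...)
[Figure: genealogy of the population and genealogy of the sample (coalescent tree), read forward in time and backward in time]
Coalescence time while $k$ lineages remain:
$$P(T_k = t) \approx \frac{k(k-1)}{2N}\, e^{-\frac{k(k-1)}{2N}\, t}$$
Number of mutations $m$ on a branch of length $t$:
$$P(m \mid t) = \frac{(\mu t)^m e^{-\mu t}}{m!}$$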
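The two distributions recalled on this slide can be illustrated with a short simulation. This is only a sketch under stated assumptions: a constant population of N gene copies, time in generations, an exponential waiting time with rate k(k-1)/(2N) while k lineages remain, and helper names (`simulate_coalescence_times`, `poisson_pmf`) that are hypothetical and not taken from the course.

```python
import math
import random

def simulate_coalescence_times(n, N, rng):
    """Draw the waiting times T_n, ..., T_2 of a coalescent tree.

    Assumption: T_k is exponential with rate k(k-1)/(2N).
    Returns a dict {k: T_k}.
    """
    times = {}
    for k in range(n, 1, -1):
        rate = k * (k - 1) / (2 * N)
        times[k] = rng.expovariate(rate)
    return times

def poisson_pmf(m, mu, t):
    """P(m | t) = (mu t)^m exp(-mu t) / m! for a branch of length t."""
    return (mu * t) ** m * math.exp(-mu * t) / math.factorial(m)

rng = random.Random(1)
N = 1000
reps = 20000
# Average time to the common ancestor of two lineages; its expectation is N here.
mean_t2 = sum(simulate_coalescence_times(2, N, rng)[2] for _ in range(reps)) / reps
```

With this parametrization the mean of T_2 should be close to N, and `poisson_pmf(m, mu, t)` summed over m should be close to 1.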

7 Two different ways to use the coalescent theory
Exploratory approaches & simulation tests
- The coalescent allows efficient simulation of genetic variability under various demo-genetic models (sample vs. population).
- Specify the model and parameter values → coalescent process → simulated data sets.
Inferential approach
- The coalescent allows the inference of population-level evolutionary parameters (genetic, demographic, reproductive, ...); some of these methods use all the information contained in the genetic data (likelihood-based methods).
- A real data set → coalescent process → infer the model parameters.

9 Likelihood-based inference under the coalescent
- Inferential approaches are based on the modeling of population genetic processes. Each population genetic model is characterized by a set of demographic and genetic parameters $\mathcal{P}$.
- The aim is to infer those parameters from a polymorphism data set (i.e. a genetic sample).
- The genetic sample is then considered as the realization ("output") of a stochastic process defined by the demo-genetic model.

10 Likelihood-based inference under the coalescent
- First, compute or estimate the likelihood $L(\mathcal{P}; D)$, i.e. the probability $P(D; \mathcal{P})$ of observing the data $D$ for some parameter values $\mathcal{P}$.
- Second, either infer the likelihood surface over all parameter values, find the set of parameter values that maximizes it, and compute confidence intervals (maximum likelihood method), or compute posterior distributions and compare them with priors (Bayesian approach).

12 Likelihood computations under the coalescent
Problem: most of the time, the likelihood $P(D; \mathcal{P})$ of a genetic sample cannot be computed because there is no explicit mathematical expression. However, the probability $P(D; \mathcal{P} \mid G_k)$ of observing the data $D$ given a specific genealogy $G_k$ can be computed for given parameter values $\mathcal{P}$. We then sum the genealogy-specific likelihoods over the whole genealogical space, each weighted by the probability of the genealogy given the parameters:
$$L(\mathcal{P}; D) = \int_G P(D; \mathcal{P} \mid G)\, P(G; \mathcal{P})\, dG$$

13 Likelihood computations under the coalescent
The likelihood can be written as the sum of $P(D; \mathcal{P} \mid G_k)$ over the genealogical space (all possible genealogies):
$$L(\mathcal{P}; D) = \int_G \underbrace{P(D; \mathcal{P} \mid G)}_{\text{Mutation}}\; \underbrace{P(G; \mathcal{P})}_{\text{Demography (Coalescent)}}\, dG$$
Genealogies are missing data: they are important for the computation of the likelihood, but there is no interest in estimating them. This is very different from phylogenetic approaches.

14 Likelihood computations under the coalescent
The likelihood can be written as the sum of $P(D; \mathcal{P} \mid G_k)$ over the genealogical space (all possible genealogies):
$$L(\mathcal{P}; D) = \int_G P(D; \mathcal{P} \mid G)\, P(G; \mathcal{P})\, dG$$
...but it is usually impossible to sum over all possible genealogies...
Monte Carlo simulations are used: a large number $K$ of genealogies are simulated according to $P(G; \mathcal{P})$, and the mean over those simulations is taken as the expectation of $P(D; \mathcal{P} \mid G)$:
$$L(\mathcal{P}; D) = E_{P(G;\mathcal{P})}\big[P(D; \mathcal{P} \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; \mathcal{P} \mid G_k)$$

15 Likelihood computations under the coalescent
The likelihood can be written as the sum of $P(D; \mathcal{P} \mid G_k)$ over the genealogical space (all possible genealogies):
$$L(\mathcal{P}; D) = \int_G P(D; \mathcal{P} \mid G)\, P(G; \mathcal{P})\, dG$$
...but it is usually impossible to sum over all possible genealogies...
Monte Carlo simulations are used:
$$L(\mathcal{P}; D) = E_{P(G;\mathcal{P})}\big[P(D; \mathcal{P} \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; \mathcal{P} \mid G_k)$$
A very large number of genealogies is necessary for a good estimation of the likelihood...

16 Likelihood computations under the coalescent
Monte Carlo simulations are used:
$$L(\mathcal{P}; D) = E_{P(G;\mathcal{P})}\big[P(D; \mathcal{P} \mid G)\big] \approx \frac{1}{K} \sum_{k=1}^{K} P(D; \mathcal{P} \mid G_k)$$
Plain Monte Carlo simulation is often inefficient because too many genealogies give extremely low probabilities of observing the data. More efficient algorithms are therefore used to explore the genealogical space and focus on genealogies well supported by the data.
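The plain Monte Carlo estimator above can be tried on a toy case where the answer is known. The sketch below is an illustration only and rests on assumptions that are not in the slides: a sample of two gene copies, a coalescence time T drawn from an exponential distribution with mean 1 (scaled time), and a number of differences S that is Poisson with mean θT given T. Under these assumptions the exact likelihood is θ^s / (1+θ)^(s+1), which gives a reference value. All function names are hypothetical.

```python
import math
import random

def poisson_pmf(s, mean):
    """Poisson probability of s events for a given mean."""
    return mean ** s * math.exp(-mean) / math.factorial(s)

def naive_mc_likelihood(s, theta, K, rng):
    """Average P(D | G_k) over K genealogies drawn from P(G; P)."""
    total = 0.0
    for _ in range(K):
        t = rng.expovariate(1.0)            # genealogy = one coalescence time
        total += poisson_pmf(s, theta * t)  # P(D | G_k) for this genealogy
    return total / K

def exact_likelihood(s, theta):
    """Closed form for the toy model: theta^s / (1 + theta)^(s + 1)."""
    return theta ** s / (1.0 + theta) ** (s + 1)

rng = random.Random(42)
estimate = naive_mc_likelihood(s=1, theta=1.0, K=100000, rng=rng)
reference = exact_likelihood(s=1, theta=1.0)
```

Here `estimate` should be close to `reference` (0.25). For larger samples most simulated genealogies contribute almost nothing to the average, which is the inefficiency the slide points at.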

17 Likelihood computations under the coalescent
Two main approaches have been developed, using more efficient algorithms that explore the genealogies in proportion to their probability of explaining the data, $P(D; \mathcal{P} \mid G)$.
- MCMC: Monte Carlo Markov chains on the genealogical and the parameter space, based on Felsenstein's pruning algorithm (1973, 1981).
  Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J. of Mol. Evol. 17 (6).
- IS: Importance Sampling on genealogies, based on the work of Griffiths & Tavaré.
  Griffiths, R.C. and S. Tavaré (1994). Simulating probability distributions in the coalescent. Theor. Pop. Biol., 46.

18 Likelihood computations under the coalescent
More efficient algorithms that explore the genealogies in proportion to their probability of explaining the data, $P(D; \mathcal{P} \mid G)$:
MCMC: Felsenstein's pruning algorithm
- Easier to implement, can easily consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

19 Likelihood computations under the coalescent
More efficient algorithms that explore the genealogies in proportion to their probability of explaining the data, $P(D; \mathcal{P} \mid G)$:
MCMC: Felsenstein's pruning algorithm (quick overview)
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

21 The approach of Felsenstein et al.
Based on (1) the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) the separation of demographic and mutational processes:
1. The probability of a genealogy given the parameters of the demographic model, $P(G_k; \mathcal{P}_{demo})$, can be computed from the distributions of the times between events.
2. The probability of the data given a genealogy and the mutational parameters, $P(D; \mathcal{P}_{mut} \mid G_k)$, can be computed from the mutation model parameters, the mutation rate, the tree topology and the branch lengths.

22 The approach of Felsenstein et al.
Based on (1) the availability of approximate exponential distributions for the time intervals between events (coalescence, migration and recombination) and (2) the separation of demographic and mutational processes:
1. $P(G_k; \mathcal{P}_{demo})$ is computed from the distributions of the times between events.
2. $P(D; \mathcal{P}_{mut} \mid G_k)$ is computed from the mutation parameters, the tree topology and the branch lengths.
From this, an efficient algorithm exploring both the genealogical and the parameter spaces should allow the inference of the likelihood over the two spaces → MCMC.

23 Felsenstein et al.'s MCMC: Metropolis-Hastings / MsVar example / Conclusions on MCMC

24 MCMC with Metropolis-Hastings sampler
Full conditional distributions cannot be computed, so classical MCMC samplers (e.g. Gibbs) cannot be used.
→ Monte Carlo Markov Chain (MCMC) simulations using the Metropolis-Hastings (MH) algorithm
- to explore the genealogy space ($G$)
- and the parameter space ($\mathcal{P} = \mathcal{P}_{demo} + \mathcal{P}_{mut}$).
All algorithms based on the Felsenstein et al. approach use similar MH/MCMC algorithms, with slight differences in the MCMC update steps.

25 Metropolis-Hastings sampling for the coalescent
For the Metropolis-Hastings algorithm, we need to compute the ratio of the probability of the proposed state to that of the current state.
1. Computation of $P(G_k; \mathcal{P}_{demo})$:
$$P(G_k; \mathcal{P}_{demo}) = \prod_{i=0}^{\mathrm{MRCA}-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$
- Example for a stable WF population (coalescence only, time homogeneous):
$$P(G_k; \mathcal{P}_{demo}) = \prod_{i=0}^{\mathrm{MRCA}-1} \frac{k_{i+1}(k_{i+1}-1)}{2}\, e^{-(t_{i+1}-t_i)\, \frac{k_{i+1}(k_{i+1}-1)}{2}}$$
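The stable-population example above translates directly into code. The following is a minimal sketch, assuming scaled time, a sample of n lineages at time 0, and exactly one coalescence per event so that the number of lineages drops from n to 2; the function name and argument layout are hypothetical. It returns the log of the product, which is numerically safer than the product itself.

```python
import math

def log_prob_genealogy(n, event_times):
    """Log-density of the coalescence times of a sample of n lineages.

    event_times: increasing times t_1 < ... < t_{n-1} of the successive
    coalescences (t_0 = 0 is the sampling time). While k lineages remain,
    the total coalescence rate is assumed to be k(k-1)/2.
    """
    if len(event_times) != n - 1:
        raise ValueError("a sample of n lineages needs n - 1 coalescence times")
    logp = 0.0
    t_prev = 0.0
    k = n
    for t in event_times:
        rate = k * (k - 1) / 2.0
        # density of the event at t, given no event since t_prev
        logp += math.log(rate) - rate * (t - t_prev)
        t_prev = t
        k -= 1
    return logp

example = log_prob_genealogy(3, [0.1, 0.5])
```

For three lineages coalescing at times 0.1 and 0.5, the result is log(3) - 0.7: a rate of 3 over an interval of 0.1, then a rate of 1 over an interval of 0.4.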

26 Metropolis-Hastings sampling for the coalescent
1. Computation of $P(G_k; \mathcal{P}_{demo})$:
$$P(G_k; \mathcal{P}_{demo}) = \prod_{i=0}^{\mathrm{MRCA}-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$
2. Then compute the probability $P(D; \mathcal{P}_{mut} \mid G_k)$ of the data $D$ given the genealogy $G_k$, by going from the MRCA to the leaves and considering the probability of occurrence of all mutations on each branch of length $t_b$ and their effects (i.e. transitions between genetic states $x \to y$).

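27 Metropolis-Hastings sampling for the coalescent
2. Then compute the probability $P(D; \mathcal{P}_{mut} \mid G_k)$:
$$P(D; \mathcal{P}_{mut} \mid G_k) = \prod_{b=1}^{\text{nb branch}} \underbrace{P(y \mid x, m_b)}_{\text{effect of mutations}}\; \underbrace{P(m_b \mid t_b)}_{\text{number of mutations}} = \prod_{b=1}^{2(n-1)} \big((Mat_{mut})^{m_b}\big)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$
where $Mat_{mut}$ is the mutation matrix (transition probabilities between genetic states $x$ and $y$) and the second factor is the Poisson probability of the $m_b$ mutations on branch $b$.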
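A per-branch term of this product can be sketched as follows. This is an illustration under assumptions: states are integer indices, the mutation matrix is a list of rows, the number of mutations on each branch is given, and all names (`mat_power`, `branch_prob`, `data_prob_given_genealogy`) are hypothetical.

```python
import math

def mat_mult(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_power(m, p):
    """m raised to the integer power p (p = 0 gives the identity)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(p):
        result = mat_mult(result, m)
    return result

def branch_prob(x, y, m_b, t_b, mu, mut_matrix):
    """((Mat_mut)^m_b)[x][y] times the Poisson probability of m_b mutations."""
    effect = mat_power(mut_matrix, m_b)[x][y]
    poisson = (mu * t_b) ** m_b * math.exp(-mu * t_b) / math.factorial(m_b)
    return effect * poisson

def data_prob_given_genealogy(branches, mu, mut_matrix):
    """Product of branch terms; branches is a list of (x, y, m_b, t_b)."""
    prob = 1.0
    for x, y, m_b, t_b in branches:
        prob *= branch_prob(x, y, m_b, t_b, mu, mut_matrix)
    return prob

flip = [[0.0, 1.0], [1.0, 0.0]]   # two states, every mutation switches state
p_same = branch_prob(0, 0, 2, 1.0, 1.0, flip)
```

With the two-state "flip" matrix, two mutations bring the lineage back to its starting state, so `p_same` is just the Poisson probability of two mutations, exp(-1)/2.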

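28 Metropolis-Hastings sampling for the coalescent
1. Computation of $P(G_k; \mathcal{P}_{demo})$:
$$P(G_k; \mathcal{P}_{demo}) = \prod_{i=0}^{\mathrm{MRCA}-1} \gamma(t_{i+1})\, e^{-\int_{t_i}^{t_{i+1}} \gamma(t)\, dt}$$
2. Then compute $P(D; \mathcal{P}_{mut} \mid G_k)$:
$$P(D; \mathcal{P}_{mut} \mid G_k) = \prod_{b=1}^{2(n-1)} \big((Mat_{mut})^{m_b}\big)_{x,y}\, \frac{(\mu t_b)^{m_b} e^{-\mu t_b}}{m_b!}$$
3. These probabilities are plugged into the MH formula for the acceptance probabilities of candidate changes for the next state of the Markov chain.
Reminder: $P(D; \mathcal{P} \mid G_k) = P(D; \mathcal{P}_{mut} \mid G_k)\, P(G_k; \mathcal{P}_{demo})$ (in this shorthand, the left-hand side is the joint probability of the data and the genealogy).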

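29 Metropolis-Hastings sampling for the coalescent
For each update, the new state ($\mathcal{P}'$ or $G'$) is accepted or rejected according to the Metropolis-Hastings ratio. The MH ratio is chosen so that the chain converges towards the correct stationary distribution P(D; P), e.g.
$$r_{MH} = \frac{P(D; \mathcal{P}' \mid G')\, \mathrm{Prior}(\mathcal{P}')\, P(\mathcal{P}' \to \mathcal{P})}{P(D; \mathcal{P} \mid G)\, \mathrm{Prior}(\mathcal{P})\, P(\mathcal{P} \to \mathcal{P}')}$$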
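The accept/reject step itself is short once the terms of the ratio are available on the log scale. The sketch below is generic and hypothetical (it is not the MsVar code): `log_post_*` stands for the log of the probability times the prior at a state, and `log_q_*` for the log proposal probabilities of the forward and reverse moves.

```python
import math
import random

def mh_accept(log_post_new, log_post_cur, log_q_new_to_cur, log_q_cur_to_new, rng):
    """Return True if the proposed state is accepted.

    log r_MH = [log_post_new + log q(new -> cur)]
             - [log_post_cur + log q(cur -> new)]
    The move is accepted with probability min(1, r_MH).
    """
    log_r = (log_post_new + log_q_new_to_cur) - (log_post_cur + log_q_cur_to_new)
    return log_r >= 0.0 or rng.random() < math.exp(log_r)

rng = random.Random(0)
# A proposal with r_MH = 0.3 and a symmetric proposal distribution
accepted = sum(mh_accept(math.log(0.3), 0.0, 0.0, 0.0, rng) for _ in range(20000))
acceptance_rate = accepted / 20000
```

With a ratio of 0.3, the observed `acceptance_rate` should be close to 0.3; a proposal that increases the target is always accepted.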

31 Coalescent-based MCMC example: MsVar
One example of a coalescent-based MCMC algorithm: MsVar
Beaumont, M. Detecting Population Expansion and Decline Using Microsatellites. Genetics.
Biological context: past changes in population sizes (cf. orang-utans)
- Details of the demographic and mutation models
- A few results on the orang-utan data set

32 Coalescent-based MCMC example: MsVar
Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.

33 Coalescent-based MCMC example: MsVar
Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size (population contraction or expansion).

34 Coalescent-based MCMC example: MsVar
Demographic model: a single isolated panmictic (WF) population with an exponential past change in population size.
- 3 demographic parameters ($N$, $T$, $N_{anc}$) + 1 mutation parameter ($\mu$)
- 3 scaled parameters (diffusion approx.): $\theta$, $D$, $\theta_{anc}$

35 Coalescent-based MCMC example : MsVar Mutation model : Stepwise Mutation Model (SMM)

36 Coalescent-based MCMC example: MsVar
$\mathcal{P} = (N, T, N_{anc}, \mu)$; $\mathcal{P}_{scaled} = (\theta, D, \theta_{anc})$
Aim: infer those parameters ($\mathcal{P}$ or $\mathcal{P}_{scaled}$) from a single present-day genetic sample using coalescent-based MCMC algorithms.

37 MH/MCMC of MsVar
1. Initialization step: build a genealogy that is compatible with the data. Starting with the sample, choose a set of events depending on the starting values of the parameters; the events are also chosen to be compatible with the data.
2. MCMC steps: explore the parameter and the genealogical spaces.
- Update the parameters for population sizes ($\theta_{act}$, $D$, $\theta_{anc}$), or
- update the genealogy (sequence and times $T_i$ of coalescence and mutation events).
Both updates are made using the Metropolis-Hastings algorithm.

38 MCMC updates in MsVar
Notation: $T_i$ = times of coalescence and mutation events; $r = \theta_{act}/\theta_{anc}$ = population size ratio; $t_f = D$ = time of the population size change.
M. Beaumont: "This scheme was devised by trial and error to obtain good rates of convergence."

39 Analyses of MsVar results
First check that the chains mixed and converged properly:
- Visual check (very useful): traces of likelihood and parameters, autocorrelation
- Compute convergence criteria among chains (GR, ...): not always useful...
- Run different chains and check the concordance between results
Problem: convergence is often rather poor with such coalescent-based MCMC algorithms... but simulation tests show that posterior distributions are generally correct (at least the mode as a point estimate) even when convergence indices are not clear-cut...

40 Analyses of MsVar results
Bayesian method: compare posteriors (solid lines) and priors (dashed lines), and test different priors.

41 Analyses of MsVar results
Bayesian method: compute a Bayes factor to check for a contraction or expansion signal:
$$BF = \frac{(\text{Posterior prob. model 1})\,(\text{Prior prob. model 2})}{(\text{Posterior prob. model 2})\,(\text{Prior prob. model 1})}$$
With equal priors for models 1 and 2, the Bayes factor for a contraction is thus
$$BF = \frac{\text{Posterior } P(N_{anc}/N_{act} > 1)}{\text{Posterior } P(N_{anc}/N_{act} < 1)} = \frac{\#\,\text{MCMC steps where } N_{anc}/N_{act} > 1}{\#\,\text{MCMC steps where } N_{anc}/N_{act} < 1}$$
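The step-counting form of the Bayes factor is easy to compute from the sampled values of the ratio N_anc/N_act. A minimal sketch, assuming the MCMC output has already been reduced to a list of sampled ratios and that the two models have equal prior probabilities (the function name is hypothetical):

```python
def bayes_factor_contraction(ratios):
    """Bayes factor for a contraction from sampled N_anc / N_act values.

    Counts the MCMC steps with ratio > 1 (contraction) and divides by the
    number of steps with ratio < 1 (expansion). Returns infinity when no
    sampled step supports an expansion.
    """
    n_contraction = sum(1 for r in ratios if r > 1)
    n_expansion = sum(1 for r in ratios if r < 1)
    if n_expansion == 0:
        return float("inf")
    return n_contraction / n_expansion

bf = bayes_factor_contraction([2.0, 3.5, 0.5, 4.0])
```

In this made-up list, three steps support a contraction and one an expansion, so `bf` is 3.0.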

42 An application of MsVar: orang-utans and the deforestation of Borneo
Does the genome of orang-utans carry the signature of population bottlenecks? (Goossens et al., PLoS Biology)

43 An application of MsVar : Orang-Utans and the deforestation of Borneo Population sizes have collapsed : what is the cause? Can population genetics help? (Delgado & Van Schaik, 2001 Evol. Anthropology)

44 An application of MsVar : Orang-Utans and the deforestation of Borneo The data

45 An application of MsVar : Orang-Utans and the deforestation of Borneo MsVar results MsVar efficiently detects a past decrease in population size

46 An application of MsVar: Orang-Utans and the deforestation of Borneo
MsVar results
[Figure legend: FE = beginning of massive forest exploitation; F = first farmers; HG = first hunter-gatherers]
MsVar efficiently detects a past decrease in population size and allows the beginning of the decrease to be dated: massive forest exploitation seems to be the most likely cause.

48 Conclusions about MsVar / MCMC approaches
- Coalescent theory provides a powerful framework for statistical inference.
- It allows past history to be inferred from a single present-day sample (this was impossible with moment-based methods).
- Gene genealogies are missing data (but important...).
- MCMC with coalescent simulations is difficult to run.
- But how robust is it to model assumptions?
  - Mutational processes (e.g. large mutation steps → long branches)
  - Population structure (e.g. immigrants → long branches)

50 Likelihood computations under the coalescent
More efficient algorithms that allow a better exploration of the genealogies (i.e. in proportion to $P(D; \mathcal{P} \mid G)$):
MCMC: Felsenstein's pruning algorithm
- Easier to implement, can consider various models
- Implemented in many software packages (LAMARC, Batwing, MsVar, MIGRATE, IM)
IS: Griffiths & Tavaré's coalescent recursion (cf. Ewens' recursion)
- Extension to different models may be difficult
- Implemented in fewer software packages (Genetree, Migraine)

51 The approach of Griffiths et al.
- The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) leading to the present genetic sample.
- A Monte Carlo scheme is used to compute this integral.
- Histories are built backward in time, event by event, starting from the present sample.
- But the computation of exact backward transition probabilities is often too difficult → an IS scheme is used to compute the likelihoods by simulation.

52 Griffiths et al.'s IS: Griffiths et al.'s recursion / Old IS scheme / New IS scheme / A general method based on diffusion / Approximations of π

53 Recursions for sampling distributions
Ewens 1972: a Wright-Fisher, infinite-allele model. General recursion at stationarity [with $e_j = (0, \dots, 0, 1, 0, \dots, 0)$, the 1 being in the $j$th position]:
$$P_n(a) = \frac{\theta}{n-1+\theta}\, P_{n-1}(a - e_1) + \frac{n-1}{n-1+\theta} \sum_{j:\, a_{j+1}>0} \frac{j(a_j+1)}{n-1}\, P_{n-1}(a + e_j - e_{j+1})$$
where, given that a coalescence occurs and that the descendant sample has $(\dots, a_j, a_{j+1}, \dots)$, the ancestral one has $(\dots, a_j + 1, a_{j+1} - 1, \dots)$, and the probability that one of the $a_j + 1$ alleles with $j$ gene copies is chosen to duplicate is $j(a_j + 1)/(n - 1)$.
Reminder: the sample counts (the vector $a$) are ordered by allele count, i.e. $a$ is the allele frequency spectrum: $a_j$ is the number of alleles having $j$ copies in the sample.
Griffiths and Tavaré: recursion for mutation models defined by a matrix $(p_{ij})$ of mutation rates from $i$ to $j$.
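The recursion can be checked numerically against the closed-form Ewens sampling formula, $P_n(a) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_j \frac{(\theta/j)^{a_j}}{a_j!}$. The sketch below is an illustration with hypothetical function names; `a` is a tuple whose entry at index j-1 is $a_j$, and the base case $P_1 = 1$ for a single gene copy is an assumption of the sketch.

```python
from functools import lru_cache
from math import factorial, prod

def ewens_recursion(a, theta):
    """Sample probability P_n(a) computed with the recursion."""
    @lru_cache(maxsize=None)
    def p(state):
        n = sum(j * aj for j, aj in enumerate(state, start=1))
        if n == 1:
            return 1.0
        total = 0.0
        # Last event was a mutation: a singleton allele is removed.
        if state[0] > 0:
            total += theta / (n - 1 + theta) * p((state[0] - 1,) + state[1:])
        # Last event was a coalescence: an allele with j+1 copies had j copies.
        for j in range(1, len(state)):       # state[j] is a_{j+1}
            if state[j] > 0:
                anc = list(state)
                anc[j - 1] += 1
                anc[j] -= 1
                weight = j * (state[j - 1] + 1) / (n - 1)
                total += (n - 1) / (n - 1 + theta) * weight * p(tuple(anc))
        return total
    return p(tuple(a))

def ewens_formula(a, theta):
    """Closed-form Ewens sampling formula."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    rising = prod(theta + i for i in range(n))
    return factorial(n) / rising * prod(
        (theta / j) ** aj / factorial(aj) for j, aj in enumerate(a, start=1))

a = (2, 1, 0, 0)    # n = 4: two singleton alleles and one allele with 2 copies
by_recursion = ewens_recursion(a, 1.5)
by_formula = ewens_formula(a, 1.5)
```

The two values should agree up to rounding error for any small configuration, which is a convenient sanity check before moving to models where no closed form exists.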

54 The recursion of Griffiths et al.
The coalescent-based likelihood at a given point of the parameter space is an integral over all possible histories (genealogies with mutations) $H = \{H_k;\ k = 0, \dots, \tau\}$, corresponding to all coalescence or mutation events that occurred from $H_0$, the current sample state, to $H_\tau$, the allelic state of the most recent common ancestor (MRCA) of the sample.

55 The recursion of Griffiths et al.
Then, for any given state $H_k$ of the history (cf. Ewens):
$$p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'})$$
where $H_{k'}$ is the ancestral sample state (i.e. the state before the last event) and the $p(H_k \mid H_{k'})$ are the forward transition probabilities (i.e. from the ancestral to the current state).

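56 The recursion of Griffiths et al.
Griffiths & Tavaré 1994: example for a single population
$$p(H_k = \eta) = \Big(\frac{n(n-1)}{2N} + n\mu\Big)^{-1} \Big( n\mu \sum_i \sum_{j:\, n_j>0,\, j\neq i} \frac{n_i+1}{n}\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + \frac{n(n-1)}{2N} \sum_{j:\, n_j>1} \frac{n_j-1}{n-1}\, p(H_{k'} = \eta - e_j) \Big)$$
- Setting $\theta = 2N\mu$ and $\beta = n(n-1+\theta)$, we have
$$p(H_k = \eta) = \frac{1}{\beta} \Big( \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k'} = \eta - e_j) \Big)$$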

57 The recursion of Griffiths et al.
Griffiths & Tavaré 1994: example for a single population
$$p(H_k = \eta) = \frac{1}{\beta} \Big( \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k'} = \eta - e_j) \Big)$$
Such recursions are too difficult to solve except for very simple models (WF + IAM, cf. Ewens).
Griffiths & Tavaré (1994) proposed a Monte Carlo approach, using sequential importance sampling on past histories, to solve the recursion.

59 Inference of the likelihood by simulation
Griffiths & Tavaré 1994:
$$p(H_k = \eta) = \frac{1}{\beta} \Big( \theta \sum_i \sum_{j:\, n_j>0,\, j\neq i} (n_i+1)\, p_{ij}\, p(H_{k'} = \eta - e_j + e_i) + n \sum_{j:\, n_j>1} (n_j-1)\, p(H_{k'} = \eta - e_j) \Big)$$
or equivalently
$$p(H_k) = w_{GT}(H_k) \Big( \sum_{i,j:\, n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \Big)$$

60 Inference of the likelihood by simulation
Griffiths & Tavaré 1994: backward absorbing Markov chain based on forward transition probabilities
$$p(H_k) = w_{GT}(H_k) \Big( \sum_{i,j:\, n_j>0,\, j\neq i} M_{ij}(H_k)\, p(H_k - e_j + e_i) + \sum_{j:\, n_j>1} C_j(H_k)\, p(H_k - e_j) \Big)$$
Histories are built backward, event by event, using an absorbing Markov chain (absorbing state = MRCA) based on forward transition probabilities ("uniform" sampling among all possible events, based on $M_{ij}(H_k)$ and $C_j(H_k)$). $w_{GT}(H_k)$ is the weight associated with the IS proposal.

61 Inference of the likelihood by simulation
Expanding the recursion $p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'})$ over all possible ancestral histories of a current sample leads to
$$p(H_0) = E\big[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)\big]$$
Then
$$L(\mathcal{P}; D) = p(H_0) = \sum_H W_{GT}(H)\, f_{GT}(H) \approx \frac{1}{L} \sum_{h=1}^{L} W_{GT}(H^h) = \frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H^h)_k\big)$$
This IS scheme $f_{GT}(H)$ is not very efficient because it does not appropriately account for the fact that some backward transitions are more likely than others given the current state (example: SMM mutation).

63 Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
A better Importance Sampling (IS) scheme should be used. Let $Q(H_{k'})$ be a given distribution over the possible ancestral states $H_{k'}$ of $H_k$; then
$$p(H_k) = \sum_{\{H_{k'}\}} \frac{p(H_k \mid H_{k'})}{Q(H_{k'})}\, Q(H_{k'})\, p(H_{k'})$$
and, expanding down to the MRCA,
$$p(H_0) = E_Q\Big[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\Big]$$
where $E_Q$ is the expectation over the distribution of full histories induced by $Q$. This means that $Q$ may be viewed as a proposal distribution in a sequential IS algorithm, with matching weights $p(H_k \mid H_{k'})/Q(H_{k'})$.

64 Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
A better Importance Sampling (IS) scheme should be used. Let $Q(H_{k'})$ be a given distribution over the possible ancestral states $H_{k'}$ of $H_k$; then
$$p(H_0) = E_Q\Big[\frac{p(H_0 \mid H_1)}{Q(H_1)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{Q(H_\tau)}\, p(H_\tau)\Big]$$
where $E_Q$ is the expectation over the distribution of full histories induced by $Q$. This means that $Q$ may be viewed as a proposal distribution in a sequential IS algorithm, with matching weights $p(H_k \mid H_{k'})/Q(H_{k'})$.
The problem is then to find the proposal distribution that minimizes the variance of the likelihood estimates
$$\frac{1}{L} \sum_{h=1}^{L} \prod_{k=0}^{\tau} w_{GT}\big((H^h)_k\big)$$

65 Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
The ideal proposal is the backward transition probability $p(H_{k'} \mid H_k)$, because the IS weights are then
$$\frac{p(H_k \mid H_{k'})}{Q(H_{k'})} = \frac{p(H_k \mid H_{k'})}{p(H_{k'} \mid H_k)} = \frac{p(H_k)}{p(H_{k'})}$$
and thus their product along a history, multiplied by $p(H_\tau)$, telescopes to the sample likelihood $p(H_0)$, whatever the history. In other words, a single tree reconstruction allows exact likelihood computation (null variance).
However, the backward transition probabilities $p(H_{k'} \mid H_k)$ are generally unknown.
Aim: find good approximations $\hat p(H_{k'} \mid H_k)$ of $p(H_{k'} \mid H_k)$.
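The "null variance" property of an ideal proposal can be seen on a one-step toy problem. The sketch below is not the sequential scheme on histories: it assumes, for illustration only, two gene copies with a coalescence time T that is exponential with mean 1 and a number of differences that is Poisson with mean θT. In that toy model the ideal proposal for T is a Gamma distribution with shape s+1 and rate 1+θ, and every importance weight equals the exact likelihood θ^s / (1+θ)^(s+1). Names are hypothetical.

```python
import math
import random

def log_weight(t, s, theta):
    """log of prior(t) * likelihood(s | t) / proposal(t) for the toy model."""
    log_prior = -t                                    # Exp(1) density
    log_lik = s * math.log(theta * t) - theta * t - math.lgamma(s + 1)
    # Gamma(shape = s + 1, rate = 1 + theta) proposal density
    log_q = ((s + 1) * math.log(1.0 + theta) + s * math.log(t)
             - (1.0 + theta) * t - math.lgamma(s + 1))
    return log_prior + log_lik - log_q

rng = random.Random(7)
s, theta = 3, 2.0
# random.gammavariate takes a shape and a scale (= 1 / rate)
draws = [rng.gammavariate(s + 1, 1.0 / (1.0 + theta)) for _ in range(5)]
weights = [math.exp(log_weight(t, s, theta)) for t in draws]
exact = theta ** s / (1.0 + theta) ** (s + 1)
```

All entries of `weights` coincide with `exact` (8/81 here), so a single draw already gives the likelihood; with any other proposal the weights would vary from draw to draw.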

66 Towards a better IS scheme (Stephens & Donnelly 2000, de Iorio & Griffiths 2004)
The likelihood at a given point is an integral over all possible histories $H = \{H_k;\ k = 0, \dots, \tau\}$.
Markov coalescent process:
$$p(H_k) = \sum_{\{H_{k'}\}} p(H_k \mid H_{k'})\, p(H_{k'}) \quad\text{and}\quad p(H_0) = E\big[p(H_0 \mid H_1) \cdots p(H_{\tau-1} \mid H_\tau)\, p(H_\tau)\big]$$
However, forward transition probabilities $p(H_k \mid H_{k'})$ are not efficient in a backward process.
→ Importance sampling techniques based on an approximation $\hat p(H_{k'} \mid H_k)$ of $p(H_{k'} \mid H_k)$ are used to build more likely histories:
$$p(H_0) = E_{\hat p}\Big[\frac{p(H_0 \mid H_1)}{\hat p(H_1 \mid H_0)} \cdots \frac{p(H_{\tau-1} \mid H_\tau)}{\hat p(H_\tau \mid H_{\tau-1})}\, p(H_\tau)\Big]$$

67 Linking optimal weights to addition of a gene to a sample
Represent the sample probability $p(\mathbf{n})$ as an integral over the joint distribution $f(x)$ of allele frequencies in the population:
$$p(\mathbf{n}) = \int_x p(\mathbf{n} \mid x)\, f(x)\, dx = E_f\Big[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\Big]$$
where $\binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}$ is the multinomial coefficient.

68 Linking optimal weights to addition of a gene to a sample
Represent the sample probability $p(\mathbf{n})$ as an integral over the joint distribution $f(x)$ of allele frequencies in the population:
$$p(\mathbf{n}) = \int_x p(\mathbf{n} \mid x)\, f(x)\, dx = E_f\Big[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\Big]$$
Then the joint probability that we have a sample $\mathbf{n}$ and that an additional gene copy is of type $j$ is
$$E_f\Big[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\Big] = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j)$$

69 Linking optimal weights to addition of a gene to a sample
Then the joint probability that we have a sample $\mathbf{n}$ and that an additional gene copy is of type $j$ is
$$E_f\Big[X_j \binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\Big] = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j)$$
We write this joint probability as $p(\mathbf{n})$ times $\pi(j \mid \mathbf{n})$, where $\pi$ is thus the probability that an additional gene is of type $j$, given that we have already drawn the sample $\mathbf{n}$ from the population.
Thus, if $H_k$ and $H_{k'}$ differ by the addition of one gene of type $j$, the ratio of sample probabilities that determines the optimal IS weight can be written as
$$\frac{p(H_{k'} = \mathbf{n})}{p(H_k = \mathbf{n} + e_j)} = \frac{n_j+1}{(n+1)\, \pi(j \mid \mathbf{n})}$$
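The identity $\pi(j \mid \mathbf{n})\, p(\mathbf{n}) = \frac{n_j+1}{n+1}\, p(\mathbf{n} + e_j)$ can be checked in a case where $p(\mathbf{n})$ has a closed form. The sketch below assumes, for illustration, that population allele frequencies follow a Dirichlet distribution with parameters α (the classical result for parent-independent mutation), so that $p(\mathbf{n})$ is a Dirichlet-multinomial probability and $\pi(j \mid \mathbf{n})$ should equal $(n_j + \alpha_j)/(n + \sum_i \alpha_i)$. Function names are hypothetical.

```python
import math

def log_dirichlet_multinomial(counts, alpha):
    """log p(n) when allele frequencies are Dirichlet(alpha) distributed."""
    n = sum(counts)
    a = sum(alpha)
    logp = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(a + n)
    for c, al in zip(counts, alpha):
        logp += math.lgamma(al + c) - math.lgamma(al) - math.lgamma(c + 1)
    return logp

def pi_from_sample_probs(j, counts, alpha):
    """pi(j | n) = (n_j + 1) / (n + 1) * p(n + e_j) / p(n)."""
    n = sum(counts)
    plus = list(counts)
    plus[j] += 1
    ratio = math.exp(log_dirichlet_multinomial(plus, alpha)
                     - log_dirichlet_multinomial(counts, alpha))
    return (counts[j] + 1) / (n + 1) * ratio

counts = [3, 0, 2]
alpha = [0.4, 0.9, 0.2]
pis = [pi_from_sample_probs(j, counts, alpha) for j in range(3)]
expected = [(c + al) / (sum(counts) + sum(alpha)) for c, al in zip(counts, alpha)]
```

The values in `pis` match `expected` and sum to 1, as a conditional distribution over the type of the additional gene should.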

70 Towards a better IS scheme: the $\pi$'s
Let $\pi(\cdot \mid H_k)$ be the conditional distribution of the allelic type of an $(n+1)$th gene, given $H_k$, the configuration (i.e. allelic types) of the first $n$ genes of the sample. Then the optimal IS distribution (exact backward transition probabilities) is, for a single population:
$$p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\pi(i \mid H_k - e_j)}{\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k'} = H_k - e_j + e_i$$
$$p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\pi(j \mid H_k - e_j)} \quad \text{for } H_{k'} = H_k - e_j$$

71 Towards a better IS scheme: the $\hat\pi$'s
Unfortunately, the $\pi$'s are generally unknown.
- Stephens & Donnelly (2000) proposed a good approximation $\hat\pi$ of the $\pi$'s for a single WF population.
- de Iorio & Griffiths (2004) proposed a general method for approximating the $\pi$'s under different mutational and demographic models.
Approximate backward transition probabilities based on the $\hat\pi$'s are then used:
$$\hat p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \theta\, n_j\, \frac{\hat\pi(i \mid H_k - e_j)}{\hat\pi(j \mid H_k - e_j)}\, P_{ij} \quad \text{for } H_{k'} = H_k - e_j + e_i$$
$$\hat p(H_{k'} \mid H_k) = \frac{1}{\beta}\, \frac{n_j(n_j-1)}{\hat\pi(j \mid H_k - e_j)} \quad \text{for } H_{k'} = H_k - e_j$$

73 The backward equation for $f(X_t \mid X_0 = x)$
For a diffusion process, the probability density $f$ of allele frequencies satisfies the Kolmogorov backward equation, which describes the changes of $f$ over time in the form
$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$
where $\Phi$ is a differential operator that here takes the form
$$\Phi = \frac{1}{2} \sum_{i \in E} \sum_{j \in E} x_i(\delta_{ij} - x_j)\, \frac{\partial^2}{\partial x_i\, \partial x_j} + \sum_{j \in E} \Big(\sum_{i \in E} x_i r_{ij}\Big) \frac{\partial}{\partial x_j} = \sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}$$
with $R = \{r_{ij}\} = \frac{\theta}{2}(P - I)$, where $P = \{p_{ij}\}$ is the mutation matrix and $I$ the identity matrix.

74 The backward equation for $E[g(X_t) \mid X_0 = x]$
In the same way as
$$\frac{d f(X_t \mid X_0 = x)}{dt} = \Phi(f(x)),$$
the following generator equation (Karlin and Taylor, 1981, p. 215) holds for any function $g(x)$ with bounded second derivatives:
$$\lim_{t \to 0} \frac{E[g(X_t) \mid X_0 = x] - g(x)}{t} = \Phi(g(x)).$$
We will apply this result with $g$ the sample probability given the population allele frequencies $x$.

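75 $\hat\pi$'s computation
To obtain a recursion on the probabilities $p(\mathbf{n})$ of the sample, with $\mathbf{n} = H_0$, we write $p(\mathbf{n})$ in the form $E[g(X)]$:
$$p(\mathbf{n}) = E\Big[\binom{n}{\mathbf{n}} \prod_i X_i^{n_i}\Big] \quad \text{where } \binom{n}{\mathbf{n}} = \frac{n!}{\prod_i n_i!}.$$
We therefore have
$$\frac{d\, p(\mathbf{n})}{dt} = \Phi[p(\mathbf{n})].$$
At stationary equilibrium, $d\, p(\mathbf{n})/dt$ is zero. Expanding the expression for $\Phi[p(\mathbf{n})]$, we then recover the recursion between the $p(\mathbf{n})$.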

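76 Explicit recursions in terms of $\pi$
Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence or mutation event:
$$N\, n \Big(\frac{n-1}{N} + \mu\Big) p(\mathbf{n}) = N \sum_j n\, \frac{n_j - 1}{N}\, p(\mathbf{n} - e_j) + N\mu \sum_j \sum_i P_{ij}\, (n_i + 1 - \delta_{ij})\, p(\mathbf{n} - e_j + e_i).$$
Expressing all $p(\cdot)$ in terms of the $p(\mathbf{n} - e_j)$'s for distinct $j$'s:
$$N \sum_j \Big(\frac{n-1}{N} + \mu\Big) \pi(j \mid \mathbf{n} - e_j)\, n\, p(\mathbf{n} - e_j) = N \sum_j n\, \frac{n_j - 1}{N}\, p(\mathbf{n} - e_j) + N\mu \sum_j \sum_i P_{ij}\, n\, \pi(i \mid \mathbf{n} - e_j)\, p(\mathbf{n} - e_j)$$
...a huge system of linear equations, not easier to solve in this form.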

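78 $\hat\pi$'s computation
Note that $\Phi[p(\mathbf{n})]$ can be written in the form
$$\sum_{j \in E} \Phi_j\, \frac{\partial}{\partial x_j}[p(\mathbf{n})].$$
The approximation technique developed by de Iorio & Griffiths is to approximate the $\pi$ (previously derived from $p(\mathbf{n})$, the solution of $\Phi[p(\mathbf{n})] = 0$) by $\hat\pi$ derived from the solutions of
$$E\Big[\Phi_j\, \frac{\partial}{\partial x_j}\, p(\mathbf{n})\Big] = E\Big[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\Big] = 0, \quad \text{for each } j \in E.$$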

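79 $\hat\pi$'s computation
The approximation technique developed by de Iorio & Griffiths is to approximate the $\pi$ (previously derived from $p(\mathbf{n})$, the solution of $\Phi[p(\mathbf{n})] = 0$) by $\hat\pi$ derived from the solutions of
$$E\Big[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\Big] = 0, \quad \text{for each } j \in E,$$
which gives, for a panmictic population and for each $j \in E$:
$$n_j (n - 1 + \theta)\, \hat p(\mathbf{n}) = n (n_j - 1)\, \hat p(\mathbf{n} - e_j) + \theta \sum_{i \in E} P_{ij}\, (n_i + 1 - \delta_{ij})\, \hat p(\mathbf{n} - e_j + e_i)$$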

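80 $\hat\pi$'s computation
Reminder: $\pi(j \mid \mathbf{n})$ can be expressed as a function of $p(\mathbf{n})$ and $p(\mathbf{n} + e_j)$:
$$\pi(j \mid \mathbf{n})\, p(\mathbf{n}) = \frac{n_j + 1}{n + 1}\, p(\mathbf{n} + e_j).$$
If we assume that this relation also holds for the $\hat\pi$ and $\hat p$, which will generally not be the case, we have
$$\hat\pi(j \mid \mathbf{n})\, \hat p(\mathbf{n}) = \frac{n_j + 1}{n + 1}\, \hat p(\mathbf{n} + e_j)$$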

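81 $\hat\pi$'s computation
Approximate the $p(\mathbf{n})$, solutions of $\Phi[p(\mathbf{n})] = 0$, by the $\hat p(\mathbf{n})$, solutions of
$$E\Big[\Phi_j\, \frac{\partial}{\partial x_j} \binom{n}{\mathbf{n}} \prod_i x_i^{n_i}\Big] = 0, \quad \text{for each } j \in E,$$
which gives, for a panmictic population and for each $j \in E$:
$$n_j (n - 1 + \theta)\, \hat p(\mathbf{n}) = n (n_j - 1)\, \hat p(\mathbf{n} - e_j) + \theta \sum_{i \in E} P_{ij}\, (n_i + 1 - \delta_{ij})\, \hat p(\mathbf{n} - e_j + e_i)$$
Using $\hat\pi(j \mid \mathbf{n})\, \hat p(\mathbf{n}) = \frac{n_j+1}{n+1}\, \hat p(\mathbf{n} + e_j)$ to express every term through $\hat p(\mathbf{n} - e_j)$, and then replacing $\mathbf{n}$ by $\mathbf{n} + e_j$ (so that the conditioning sample has size $n$), we obtain for each $j \in E$:
$$(n + \theta)\, \hat\pi(j \mid \mathbf{n}) = n_j + \theta \sum_{i \in E} P_{ij}\, \hat\pi(i \mid \mathbf{n})$$
This is the linear system that allows the computation of the $\hat\pi(j \mid \mathbf{n})$ for a Wright-Fisher model.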
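For a small number of allelic states the linear system can be solved directly. The sketch below uses the normalized form $(n + \theta)\, \hat\pi(j \mid \mathbf{n}) = n_j + \theta \sum_i P_{ij}\, \hat\pi(i \mid \mathbf{n})$, with $n$ the size of the conditioning sample, and a plain Gaussian elimination written for the illustration; function names and the example mutation matrix are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    size = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, size + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][size] / m[i][i] for i in range(size)]

def pi_hat(counts, theta, P):
    """Solve (n + theta) pi_j - theta * sum_i P[i][j] pi_i = n_j for all j."""
    n = sum(counts)
    k = len(counts)
    a = [[(n + theta) * (1.0 if i == j else 0.0) - theta * P[i][j]
          for i in range(k)] for j in range(k)]
    return solve_linear(a, [float(c) for c in counts])

# Stepwise-like mutation on three states (rows sum to 1)
P_step = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.5],
          [0.0, 0.5, 0.5]]
pis = pi_hat([4, 1, 0], theta=2.0, P=P_step)

# Parent-independent mutation: every row of P equals the same distribution
freqs = [0.2, 0.3, 0.5]
pis_pim = pi_hat([4, 1, 0], theta=2.0, P=[freqs[:] for _ in range(3)])
```

Because the rows of P sum to 1, the solution sums to 1. With parent-independent mutation the solution reduces to $(n_j + \theta p_j)/(n + \theta)$, the case where the proposal is exact.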

82 New IS scheme with the $\hat\pi$'s
To do: make two summary slides of the new IS scheme, reusing parts of slides 63, 64, 65, 66, 70 and 71.

83 A much better IS scheme based on the $\hat\pi$'s
Drastic gain in efficiency with this new IS scheme (old IS: millions of trees):
- exact backward transition probabilities for a WF model with parent-independent mutation (i.e. KAM);
- only 30 histories are necessary for a good estimation of the likelihood for more complex models (structured populations & KAM);
- but efficiency decreases slightly with non-parent-independent mutation models, e.g. the stepwise mutation model (200 histories for structured populations & SMM);
- and efficiency is still limited for time-inhomogeneous demographic models, e.g. one population with a past size change (cf. orang-utan example): up to 20,000 histories are necessary for strong disequilibrium scenarios (e.g. a quick change in population size).

84 Implementations of IS: Genetree and Migraine
Genetree (Bahlo & Griffiths 2000, old IS algorithm)
- 2 to 4 populations with migration (ISM)
Migraine (Rousset & Leblois, new IS algorithms)
- One single stable population (KAM, SMM, GSM, ISM)
- One population with past size variation (KAM, SMM, GSM, ISM)
- 2 populations with migration (KAM, SMM, ISM)
- Isolation By Distance in 1D and 2D (KAM)

85 Implementation of IS in Migraine
1. C++ core: IS computations
- Stratified random sampling of parameter points
- Estimation of the likelihood at each point using IS
2. R code for post-treatment
- Likelihood surface interpolation by Kriging
- Inference of MLEs and CIs
- Plots of 1D and 2D likelihood profiles

89 Simulation tests
Can we trust the demographic / historical inferences made with those methods?
Aim: assess the validity and robustness of the methods:
- bias, RMSE, coverage properties of confidence intervals
- robustness to realistic but uninteresting mis-specifications
To this aim, we tested by simulation:
- the performance of Migraine in inferring dispersal under IBD
- the performance of MsVar and Migraine in detecting and measuring past population size changes
A few interesting results...

90 Simulation tests Precision Validation Robustness MCMC vs. IS

91 Simulation tests (MsVar Girod et al. 2011)
Strong correlations between some pairs of natural parameters, but this is expected given coalescent theory...

92 Simulation tests (MsVar Girod et al. 2011)
There is no information in the genetic data to infer µ, N and T separately, because coalescent histories (H, genealogies with mutations) generated under the usual diffusion/coalescent approximations (large N, small µ) depend only on the scaled parameters Nµ and T/N.
Constant Nµ product: same unscaled history and same polymorphism.
Two indistinguishable situations under the coalescent approximations!
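A small simulation can illustrate why only the product Nµ matters. The sketch below assumes a population of 2N gene copies, so that two lineages coalesce after an exponential time with mean 2N generations, with mutations accumulating at rate 2µ per generation along the pair; the function name and the parameter values are hypothetical. Two settings with the same Nµ but a 100-fold difference in N give the same expected number of differences, 4Nµ.

```python
import random


def mean_pairwise_differences(N, mu, n_rep, rng):
    """Average number of mutations separating two sampled genes."""
    total = 0
    for _ in range(n_rep):
        # Time to coalescence of the pair, in generations (mean 2N).
        t_coal = rng.expovariate(1.0 / (2 * N))
        # Mutations occur along both lineages at total rate 2 * mu.
        t_mut = rng.expovariate(2 * mu)
        while t_mut < t_coal:
            total += 1
            t_mut += rng.expovariate(2 * mu)
    return total / n_rep


# Same product N * mu = 0.4, very different N and mu.
small_pop = mean_pairwise_differences(1_000, 4e-4, 20_000, random.Random(1))
large_pop = mean_pairwise_differences(100_000, 4e-6, 20_000, random.Random(2))
print(small_pop, large_pop)  # both close to 4 * N * mu = 1.6
```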

93 Simulation tests (MsVar Girod et al. 2011) Much better results by rescaling parameters as in the coalescent approximations

94 Simulation tests (Migraine)
[Figure: rel. bias and rel. RMSE of Nµ, D and 2N_anc µ, with BDR and FEDR; axis label: D]
Good reliability of the estimates for population declines, provided they are neither too recent nor too weak.
Why does the method's performance strongly depend upon the time of the event and its intensity?

95 Simulation tests (MsVar & Migraine)
How are genealogies affected by demographic parameters? Predicting the quantity of information present in the data.
The information in the data strongly depends on the number of mutations and coalescent events during the different demographic phases.

99 Simulation tests (Migraine)
Beyond biases, RMSE and bottleneck detection rates...
Testing CI coverage properties using LRT P-value distributions
[Figure: ECDF of P-values, one panel per parameter (Nmu = 0.4, Nancmu = 400, D = 1.25, Nratio), each annotated with a KS test result, rel. bias and rel. RMSE; DR: 1 (0)]
(usually) GOOD
Extremely recent and strong: 10 Generations, D = N ratio = (θ anc = 400.0)
(very rarely) BAD
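The idea behind this check is that, if confidence intervals have correct coverage, the LRT P-values computed at the true parameter values over simulated data sets are uniform on [0, 1], so their ECDF follows the diagonal. The sketch below computes the Kolmogorov-Smirnov distance between a set of P-values and the uniform distribution; the function name `ks_uniform` and the example P-values are hypothetical, and the conversion of the distance into a KS P-value is not shown.

```python
def ks_uniform(p_values):
    """Kolmogorov-Smirnov distance between the ECDF of p_values and Uniform(0, 1)."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap with the ECDF evaluated just after / just before each point.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)


# Evenly spread P-values stay close to the diagonal ...
print(ks_uniform([(i + 0.5) / 10 for i in range(10)]))
# ... whereas P-values piled up at one value do not.
print(ks_uniform([0.5, 0.5, 0.5, 0.5]))
```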

103 Simulation tests (Migraine)
Microsatellite markers show complex mutation processes:
- mutations do not fit the SMM; indels of more than one repeat often occur
- better mutation model = Generalized Stepwise Model (GSM): indels of X (geometric) repeats at each mutation event
- commonly found value in natura: pgsm 0.22
Problem: analyses under the SMM of data simulated under a GSM in a stable population often show false signs of bottleneck (57% of false detection with pgsm = 0.22)
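As a minimal sketch of a GSM mutation event, the function below draws a change in repeat number whose size is geometric and whose direction is symmetric. It assumes that pgsm is the parameter of the geometric distribution, with P(X = k) = (1 - pgsm) pgsm^(k-1) for k ≥ 1, so that pgsm = 0 recovers the SMM; this parametrization, the function name `gsm_step` and the absence of bounds on allele size are assumptions, since the slides only give the value of pgsm.

```python
import random


def gsm_step(p_gsm, rng):
    """Signed change in repeat number for one mutation event under a GSM."""
    size = 1
    # Geometric step size: keep adding a repeat with probability p_gsm.
    while rng.random() < p_gsm:
        size += 1
    # Gains and losses of repeats are assumed equally likely.
    return size if rng.random() < 0.5 else -size


rng = random.Random(42)
steps = [gsm_step(0.22, rng) for _ in range(10)]
print(steps)
```

Under this parametrization the mean step size is 1 / (1 - pgsm), about 1.28 repeats for pgsm = 0.22, compared with exactly 1 under the SMM.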

105 Simulation tests (MsVar vs. Migraine)
Some comparison with MsVar:
- Similar performances for good scenarios
- Better bottleneck detection rate for non-optimal scenarios
- Parameter inference and CIs are slightly more accurate
But comparison is not easy:
- Frequentist vs. Bayesian approaches
- very long computation times for MCMC (and sometimes for IS)

106 Conclusions from the simulation tests (MCMC & IS)
- Very efficient for bottleneck detection
- Accurate inferences for most demographic scenarios
- IS faster and sometimes more accurate than MCMC
But:
- Not robust to mutational processes
- Not robust to immigration (structured populations)
- Inaccurate for extremely strong and recent population size changes
- and... very long computation times for large data sets with many loci (i.e. NGS >> 100-1,000 loci)

107 Introduction Likelihoods under the coalescent Felsenstein et al. s MCMC Griffiths et al. s IS Simulation tests Conclusions

108 Conclusions
- Coalescent theory and ML-based approaches provide a powerful framework for statistical inference in population genetics. They extract much more information from the data than moment-based methods.
- In these methods, gene genealogies are missing data.
- Coalescent theory may also help to understand the limits of these methods (the reliability of a method also depends upon the quantity of information available in the data).
- Testing methods by simulation greatly helps to clearly understand real data analyses.

109 Books

110 (Written for subdivided populations, but not useful here?)
$$p(\mathbf{n}) = E\left[\prod_{d} \binom{n_d}{(n_{di})_i} \prod_{i} X_{di}^{n_{di}}\right]$$

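111 (Written for subdivided populations, but not useful here?)
The backward diffusion equation holds with an operator which is a sum over different demes:
$$\Phi = \frac{1}{2} \sum_{\text{demes } d} \frac{N_{tot}}{N_d} \sum_{\text{allele pairs } i,j} x_{di}\,(\delta_{ij} - x_{dj})\,\frac{\partial^2}{\partial x_{di}\,\partial x_{dj}} + \sum_{d} \sum_{i} M_{di}\,\frac{\partial}{\partial x_{di}}$$
where x_di is the frequency of allele i in deme d.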

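112 (Written for subdivided populations, but not useful here?)
Applying the previous arguments then leads to a relation between probabilities of samples that differ by one coalescence/mutation/migration event:
$$N_{tot} \sum_{d} n_d \left(\frac{n_d - 1}{N_d} + m_d + \mu\right) p(\mathbf{n}) = N_{tot} \sum_{d,j} \frac{n_d\,(n_{dj} - 1)}{N_d}\, p(\mathbf{n} - \mathbf{e}_{dj}) + N_{tot}\,\mu \sum_{d,j} \sum_{i} P_{ij}\,(n_{di} + 1 - \delta_{ij})\, p(\mathbf{n} - \mathbf{e}_{dj} + \mathbf{e}_{di}) + N_{tot} \sum_{d,j} \sum_{d' \neq d} n_d\, m_{dd'}\, \frac{n_{d'j} + 1}{n_{d'} + 1}\, p(\mathbf{n} - \mathbf{e}_{dj} + \mathbf{e}_{d'j}).$$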

114 (Written for subdivided populations, but not useful here?)
Expressing all p(.) in terms of the p(n - e_dj) for distinct d, j:
$$N_{tot} \sum_{d,j} \left(\frac{n_d - 1}{N_d} + m_d + \mu\right) \pi(j \mid d, \mathbf{n} - \mathbf{e}_{dj})\, n_d\, p(\mathbf{n} - \mathbf{e}_{dj}) = N_{tot} \sum_{d,j} \frac{n_d\,(n_{dj} - 1)}{N_d}\, p(\mathbf{n} - \mathbf{e}_{dj}) + N_{tot}\,\mu \sum_{d,j} \sum_{i} P_{ij}\, n_d\, \pi(i \mid d, \mathbf{n} - \mathbf{e}_{dj})\, p(\mathbf{n} - \mathbf{e}_{dj}) + N_{tot} \sum_{d,j} \sum_{d' \neq d} n_d\, m_{dd'}\, \pi(j \mid d', \mathbf{n} - \mathbf{e}_{dj})\, p(\mathbf{n} - \mathbf{e}_{dj})$$
Huge system of linear equations, not easier to solve in this form.


More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

On a multivariate implementation of the Gibbs sampler

On a multivariate implementation of the Gibbs sampler Note On a multivariate implementation of the Gibbs sampler LA García-Cortés, D Sorensen* National Institute of Animal Science, Research Center Foulum, PB 39, DK-8830 Tjele, Denmark (Received 2 August 1995;

More information

Discrete & continuous characters: The threshold model

Discrete & continuous characters: The threshold model Discrete & continuous characters: The threshold model Discrete & continuous characters: the threshold model So far we have discussed continuous & discrete character models separately for estimating ancestral

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

QTL model selection: key players

QTL model selection: key players Bayesian Interval Mapping. Bayesian strategy -9. Markov chain sampling 0-7. sampling genetic architectures 8-5 4. criteria for model selection 6-44 QTL : Bayes Seattle SISG: Yandell 008 QTL model selection:

More information

Taming the Beast Workshop

Taming the Beast Workshop Workshop and Chi Zhang June 28, 2016 1 / 19 Species tree Species tree the phylogeny representing the relationships among a group of species Figure adapted from [Rogers and Gibbs, 2014] Gene tree the phylogeny

More information

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin

Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin Solutions to Even-Numbered Exercises to accompany An Introduction to Population Genetics: Theory and Applications Rasmus Nielsen Montgomery Slatkin CHAPTER 1 1.2 The expected homozygosity, given allele

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

New Bayesian methods for model comparison

New Bayesian methods for model comparison Back to the future New Bayesian methods for model comparison Murray Aitkin murray.aitkin@unimelb.edu.au Department of Mathematics and Statistics The University of Melbourne Australia Bayesian Model Comparison

More information

URN MODELS: the Ewens Sampling Lemma

URN MODELS: the Ewens Sampling Lemma Department of Computer Science Brown University, Providence sorin@cs.brown.edu October 3, 2014 1 2 3 4 Mutation Mutation: typical values for parameters Equilibrium Probability of fixation 5 6 Ewens Sampling

More information

DNA-based species delimitation

DNA-based species delimitation DNA-based species delimitation Phylogenetic species concept based on tree topologies Ø How to set species boundaries? Ø Automatic species delimitation? druhů? DNA barcoding Species boundaries recognized

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

AARMS Homework Exercises

AARMS Homework Exercises 1 For the gamma distribution, AARMS Homework Exercises (a) Show that the mgf is M(t) = (1 βt) α for t < 1/β (b) Use the mgf to find the mean and variance of the gamma distribution 2 A well-known inequality

More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

Statistical Inference for Stochastic Epidemic Models

Statistical Inference for Stochastic Epidemic Models Statistical Inference for Stochastic Epidemic Models George Streftaris 1 and Gavin J. Gibson 1 1 Department of Actuarial Mathematics & Statistics, Heriot-Watt University, Riccarton, Edinburgh EH14 4AS,

More information

Monte Carlo Methods. Geoff Gordon February 9, 2006

Monte Carlo Methods. Geoff Gordon February 9, 2006 Monte Carlo Methods Geoff Gordon ggordon@cs.cmu.edu February 9, 2006 Numerical integration problem 5 4 3 f(x,y) 2 1 1 0 0.5 0 X 0.5 1 1 0.8 0.6 0.4 Y 0.2 0 0.2 0.4 0.6 0.8 1 x X f(x)dx Used for: function

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Peter Beerli October 10, 2005 [this chapter is highly influenced by chapter 1 in Markov chain Monte Carlo in Practice, eds Gilks W. R. et al. Chapman and Hall/CRC, 1996] 1 Short

More information

Populations in statistical genetics

Populations in statistical genetics Populations in statistical genetics What are they, and how can we infer them from whole genome data? Daniel Lawson Heilbronn Institute, University of Bristol www.paintmychromosomes.com Work with: January

More information

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral

Mean field simulation for Monte Carlo integration. Part II : Feynman-Kac models. P. Del Moral Mean field simulation for Monte Carlo integration Part II : Feynman-Kac models P. Del Moral INRIA Bordeaux & Inst. Maths. Bordeaux & CMAP Polytechnique Lectures, INLN CNRS & Nice Sophia Antipolis Univ.

More information

Population Genetics I. Bio

Population Genetics I. Bio Population Genetics I. Bio5488-2018 Don Conrad dconrad@genetics.wustl.edu Why study population genetics? Functional Inference Demographic inference: History of mankind is written in our DNA. We can learn

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information