Reconstructing the biological past: models, methods, performance, limits

Olivier Gascuel
Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), USR 3756, Institut Pasteur & CNRS

References
O. Gascuel, M. Steel. 2010. Inferring ancestral sequences in taxon-rich phylogenies. Mathematical Biosciences 227(2):125-135.
O. Gascuel, M. Steel. 2014. Predicting the ancestral character changes in a tree is typically easier than predicting the root state. Systematic Biology 63(3):421-435.

Outline
Focus on characters rather than trees
An introduction to phylogenetics
* Motivations
* Tree models
* Sequence and character evolution models
* Ancestral inference methods
Inferring the tree root
Inferring character changes
Uncertainty principle, perspectives

Darwin (1837)

Haeckel (1875)

The Tree of Life

A growing impact. "Nothing in Biology Makes Sense Except in the Light of Evolution" (T. Dobzhansky, 1973). [Figure labels: 2016, 28,000]

Inferring the root character? A C A T C A G C

Broadcasting on trees? Ising model 0 1 0 1 0 1 0 1

(Adv. App. Prob. 2000)

(Adv. App. Prob. 2000)
* Uniform error probability
* No branch length (time duration)
* The tree is fixed, not random

Inferring all character changes? A C A T C A G C

Parallel Adaptations to High Temperatures in the Archean Eon Boussau*, Blanquart* et al Nature 2008

HIV1 subtype A Eastern & Southern Europe (Chevenet et al. Bioinformatics 2013)

HIV1 subtype C

Phylogenetic tree models: unoriented, labelled, binary tree. Mathematical expression. Search tree. Leaves: Orangutan, Gorilla, Bonobo, Chimpanzee, Human

Phylogenetic tree models: unoriented, labelled, binary tree. Mathematical expression. Search tree. Leaves: Orangutan, Gorilla, Human, Chimpanzee, Bonobo

Phylogenetic tree models: rooted tree. Time dimension (difficult to infer). Leaves: Orangutan, Gorilla, Bonobo, Chimpanzee, Human

Phylogenetic tree models: $O(n^n)$ topologies. Leaves: Orangutan, Gorilla, Bonobo, Chimpanzee, Human
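
To make the growth in the number of topologies concrete, the sketch below uses the standard double-factorial counts for leaf-labelled binary trees: (2n-5)!! unrooted and (2n-3)!! rooted trees on n leaves. The function names are illustrative and do not come from the slides.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_binary_trees(n: int) -> int:
    """Number of unrooted, leaf-labelled binary trees with n >= 3 leaves: (2n-5)!!"""
    if n < 3:
        raise ValueError("n must be at least 3")
    return double_factorial(2 * n - 5)


def count_rooted_binary_trees(n: int) -> int:
    """Number of rooted, leaf-labelled binary trees with n >= 2 leaves: (2n-3)!!"""
    if n < 2:
        raise ValueError("n must be at least 2")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    # Five leaves, as in the great-ape example on the slides.
    print(count_unrooted_binary_trees(5), count_rooted_binary_trees(5))
    for n in (10, 20, 50):
        print(n, count_unrooted_binary_trees(n))
```

With the five species of the example there are only 15 unrooted topologies, but the count grows faster than exponentially, which is why exhaustive search over trees quickly becomes impossible.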

Yule-Harding (YH) speciation model - Topology (1924, 1971)
* Start with an initial species (a single leaf)
* Until we obtain n extant species: randomly select a leaf in the growing tree, and speciate that (ancestral) species into 2 new species
* Labels are uniformly assigned to the tree leaves
* Robust to extinction and sampling
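
The growth procedure can be simulated directly. The sketch below is a minimal illustration, not the authors' code: it grows a YH topology leaf by leaf and counts cherries (internal nodes whose two children are both leaves), so that the known n/3 expectation for YH trees can be checked empirically. The function name and the bookkeeping are assumptions made for this example.

```python
import random


def simulate_yh_cherries(n: int, rng: random.Random) -> int:
    """Grow a Yule-Harding topology to n >= 2 leaves and return its number of cherries."""
    if n < 2:
        raise ValueError("n must be at least 2")
    parent_of_leaf = [0, 0]   # start from a root (node 0) with two leaves
    leaf_children = {0: 2}    # internal node id -> number of children that are leaves
    next_id = 1
    while len(parent_of_leaf) < n:
        i = rng.randrange(len(parent_of_leaf))  # pick a leaf uniformly at random
        leaf_children[parent_of_leaf[i]] -= 1   # that leaf becomes an internal node
        leaf_children[next_id] = 2              # ... with two new leaf children
        parent_of_leaf[i] = next_id
        parent_of_leaf.append(next_id)
        next_id += 1
    return sum(1 for count in leaf_children.values() if count == 2)


if __name__ == "__main__":
    rng = random.Random(0)
    n, replicates = 30, 2000
    mean = sum(simulate_yh_cherries(n, rng) for _ in range(replicates)) / replicates
    print(f"mean number of cherries for n={n}: {mean:.2f} (n/3 = {n / 3:.2f})")
```

Leaf labels are not tracked here because the cherry count does not depend on them.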

YH distribution is not uniform
* Expected number of cherries: n/3 (YH) versus n/4 (uniform, PDA)
* Expected diameter: O(log(n)) (YH) versus O(sqrt(n)) (uniform, PDA)
[Figure: distribution of the tree diameter for n = 95, YH versus uniform (PDA) trees]

Yule-Harding model with time-valued edges
The speciation time on a given branch follows an exponential law (memoryless) with parameter λ (expectation = 1/λ)
[Figure: tree whose branches have expected length 1/λ]

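Yule-Harding model with time-valued edges
The minimum of k independent exponential laws with parameter λ is an exponential law with parameter kλ
(Many other, more sophisticated models exist.)
[Figure: expected waiting times between successive speciation events: 1/λ, 1/(2λ), 1/(3λ), ..., 1/(9λ), i.e. O(1/(nλ)) with n lineages]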
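
The "minimum of exponentials" property gives a direct way to simulate the age of a time-valued YH tree: with k lineages, the waiting time to the next speciation is exponential with rate kλ. The sketch below is an illustration under that assumption only; the function names and the parameter name `lam` (for λ) are hypothetical.

```python
import random


def expected_time_to_n_species(n: int, lam: float) -> float:
    """Expected time to grow from 1 to n lineages: sum over k = 1..n-1 of 1 / (k * lam)."""
    return sum(1.0 / (k * lam) for k in range(1, n))


def simulate_time_to_n_species(n: int, lam: float, rng: random.Random) -> float:
    """Draw one random time to reach n lineages, starting from a single lineage."""
    elapsed = 0.0
    for k in range(1, n):
        # The minimum of k independent Exp(lam) waiting times is Exp(k * lam).
        elapsed += rng.expovariate(k * lam)
    return elapsed


if __name__ == "__main__":
    rng = random.Random(0)
    n, lam, replicates = 10, 2.0, 5000
    mean = sum(simulate_time_to_n_species(n, lam, rng) for _ in range(replicates)) / replicates
    print(f"simulated mean: {mean:.3f}, expected: {expected_time_to_n_species(n, lam):.3f}")
```

The expected time is a harmonic sum divided by λ, so it grows like log(n)/λ: the most recent speciation events are packed close to the present.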

Modeling sequence (and character) evolution
We aim to explain the data (alignment) using a probabilistic scenario of the evolution of the sites along a phylogeny

Modeling sequence (character) evolution A A C

Modeling sequence (character) evolution A A A C

Modeling sequence (character) evolution A or C A A A C Parsimony

Modeling sequence (character) evolution
[Figure: tips A, A, C. Parsimony: ancestral states "A" and "A or C". Probabilistic modelling: any of A, C, G, T at each ancestral node]

Modeling sequence evolution: standard assumptions
* Evolution is independent among lineages
* Evolution is memory-less (Markov model)
* The sites evolve independently and identically
* Models are time reversible
* Models are time homogeneous and stationary

The simplest RY (0,1) symmetrical, continuous-time Markov model
R ⇄ Y, each change occurring at rate μ
Expected number of mutations over a time t: μt

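The simplest RY (0,1) symmetrical, continuous-time Markov model
The rate matrix: $Q = \begin{pmatrix} -\mu & \mu \\ \mu & -\mu \end{pmatrix}$
The matrix of probability changes: $P(t) = e^{Qt} = \begin{pmatrix} \frac{1}{2} + \frac{1}{2} e^{-2\mu t} & \frac{1}{2} - \frac{1}{2} e^{-2\mu t} \\ \frac{1}{2} - \frac{1}{2} e^{-2\mu t} & \frac{1}{2} + \frac{1}{2} e^{-2\mu t} \end{pmatrix}$
The equilibrium distribution: $\pi_R = \pi_Y = \frac{1}{2}$
This model is time-reversible: $\pi_X P_{XY}(t) = \pi_Y P_{YX}(t)$
We assume stationarity (frequencies of R and Y are nearly equal)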
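
As a quick sanity check of the closed form for the two-state symmetric model, the sketch below computes P(t) and verifies the Markov (Chapman-Kolmogorov) property P(s)P(t) = P(s + t). It is a minimal illustration; the function names are assumptions made for this example.

```python
import math


def ry_transition_matrix(mu: float, t: float) -> list[list[float]]:
    """P(t) for the symmetric two-state (R/Y) model with substitution rate mu."""
    decay = math.exp(-2.0 * mu * t)
    same = 0.5 + 0.5 * decay   # probability of observing the same state after time t
    diff = 0.5 - 0.5 * decay   # probability of observing the other state after time t
    return [[same, diff], [diff, same]]


def matmul2(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Product of two 2 x 2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


if __name__ == "__main__":
    mu = 1.0
    for t in (0.0, 0.1, 1.0, 10.0):
        print(t, ry_transition_matrix(mu, t)[0])
    product = matmul2(ry_transition_matrix(mu, 0.3), ry_transition_matrix(mu, 0.7))
    print(product[0], ry_transition_matrix(mu, 1.0)[0])
```

At t = 0 the matrix is the identity, and for large t every row tends to the equilibrium distribution (1/2, 1/2): the signal about the starting state decays as exp(-2μt).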

Jukes and Cantor model (JC69) for DNA
M: 4 x 4 rate matrix over (A, T, C, G) in which all substitution rates are equal
Eq. (1/4, 1/4, 1/4, 1/4)

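Felsenstein 1981 (F81) model for DNA
$M = \begin{pmatrix} \cdot & \pi_T & \pi_C & \pi_G \\ \pi_A & \cdot & \pi_C & \pi_G \\ \pi_A & \pi_T & \cdot & \pi_G \\ \pi_A & \pi_T & \pi_C & \cdot \end{pmatrix}$ (rows and columns ordered A, T, C, G)
Eq. $(\pi_A, \pi_T, \pi_C, \pi_G)$
Felsenstein's 1981 model allows for any arbitrary set of equilibrium frequencies.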

Kimura 2-parameter (K2P) model for DNA
M: 4 x 4 rate matrix over (A, T, C, G) with one rate for transitions and another for transversions
Eq. (1/4, 1/4, 1/4, 1/4)
Kimura's 2-parameter model aims at reflecting the fact that transitions are more frequent than transversions.

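Hasegawa, Kishino, Yano (HKY) model for DNA
$M = \begin{pmatrix} \cdot & \pi_T & \pi_C & \kappa\pi_G \\ \pi_A & \cdot & \kappa\pi_C & \pi_G \\ \pi_A & \kappa\pi_T & \cdot & \pi_G \\ \kappa\pi_A & \pi_T & \pi_C & \cdot \end{pmatrix}$ (rows and columns ordered A, T, C, G; $\kappa$: transition/transversion rate ratio)
Eq. $(\pi_A, \pi_T, \pi_C, \pi_G)$
The HKY model is a way to incorporate both transition/transversion bias and an arbitrary set of equilibrium frequencies. It captures the two main aspects of DNA evolution.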
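
The four DNA models above can be seen as special cases of one construction. The sketch below builds an unnormalised HKY rate matrix in the usual parameterisation (rate from x to y proportional to the frequency of y, multiplied by a ratio κ for transitions) and checks reversibility. The symbol `kappa`, the function names and the example frequencies are assumptions for illustration; the matrix is not rescaled to one expected substitution per unit time.

```python
STATES = "ATCG"
PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def hky_rate_matrix(freqs: dict[str, float], kappa: float) -> dict[str, dict[str, float]]:
    """Unnormalised HKY rates: q[x][y] = kappa * freqs[y] for transitions, freqs[y] otherwise."""
    q: dict[str, dict[str, float]] = {x: {} for x in STATES}
    for x in STATES:
        for y in STATES:
            if x != y:
                q[x][y] = freqs[y] * (kappa if is_transition(x, y) else 1.0)
        q[x][x] = -sum(q[x][y] for y in STATES if y != x)  # each row sums to zero
    return q


if __name__ == "__main__":
    uniform = {s: 0.25 for s in STATES}
    skewed = {"A": 0.1, "T": 0.2, "C": 0.3, "G": 0.4}   # made-up frequencies
    jc69 = hky_rate_matrix(uniform, kappa=1.0)   # equal rates, equal frequencies
    f81 = hky_rate_matrix(skewed, kappa=1.0)     # arbitrary frequencies only
    k2p = hky_rate_matrix(uniform, kappa=2.0)    # transition bias only
    hky = hky_rate_matrix(skewed, kappa=2.0)     # both
    print(hky["A"])
```

Reversibility holds by construction, since freqs[x] * q[x][y] and freqs[y] * q[y][x] are the same product.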

Reconstruction methods: Majority
The majority state at the tree tips is predicted (no knowledge of the tree or the model)
[Figure: example with 0/1 states: 1 0 1 0 1 0 1 1 1]

Reconstruction methods: Parsimony
We minimize the number of changes along tree branches (no knowledge of the model and time duration)
1st: recursive postorder calculation (bottom-up)
[Figure: 0/1 tip states and the state sets (0, 1 or 01) obtained at internal nodes after the bottom-up pass]

Reconstruction methods: Parsimony
We minimize the number of changes along tree branches (no knowledge of the model and time duration)
2nd: recursive preorder calculation (top-down)
[Figure: same tree after the top-down pass; remaining ambiguous nodes are marked 01]
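
The two passes can be sketched with Fitch's classical algorithm for binary trees. This is an illustrative implementation under stated assumptions, not the code behind the slides: a tree is a nested pair of subtrees, a leaf is its observed state, and ties in the top-down pass are broken by taking the smallest state.

```python
def fitch_sets(tree):
    """Bottom-up (postorder) pass.

    Returns (state_set, n_changes, children), where children is None for a leaf
    or a pair of annotated subtrees for an internal node.
    """
    if not isinstance(tree, tuple):
        return {tree}, 0, None
    left, right = (fitch_sets(child) for child in tree)
    common = left[0] & right[0]
    if common:
        # The children agree on at least one state: no extra change is needed.
        return common, left[1] + right[1], (left, right)
    # Disjoint sets: keep the union and count one change.
    return left[0] | right[0], left[1] + right[1] + 1, (left, right)


def fitch_assign(annotated, parent_state=None):
    """Top-down (preorder) pass: returns one most parsimonious assignment.

    Leaves are returned as their state; internal nodes as (state, left, right).
    """
    states, _, children = annotated
    state = parent_state if parent_state in states else min(states)
    if children is None:
        return state
    return (state, fitch_assign(children[0], state), fitch_assign(children[1], state))


if __name__ == "__main__":
    tree = ((0, 1), (1, 1))
    annotated = fitch_sets(tree)
    print("root set:", annotated[0], "changes:", annotated[1])
    print("assignment:", fitch_assign(annotated))
```

When the root set contains several states (for example with the tree `((0, 0), (1, 1))`), parsimony alone cannot decide the root state, which is the "A or C" situation shown earlier.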

Maximum Likelihood
* Requires knowing (or estimating) the tree, branch lengths, and model parameter values
* One predicts the state with maximum posterior probability (MAP)
* Best possible method!

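Maximum Likelihood
Recursive postorder calculation of the marginal likelihood of the root states. For a tree T whose root has subtrees U and V, attached by branches u and v:
$L(h \mid T, M) = \Big[ \sum_{h' \in \{A,C,G,T\}} P(h' \mid h, u, M) \, L(h' \mid U, M) \Big] \times \Big[ \sum_{h' \in \{A,C,G,T\}} P(h' \mid h, v, M) \, L(h' \mid V, M) \Big]$
For a tip: $L(h \mid U, M) = 1$ if $U = h$, else 0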

Maximum Likelihood
Computation of the best scenario:
* We apply (independently!) the same algorithm to every internal node: marginal posterior for each node
* We use a dynamic programming approach (Pupko et al. 2000) to compute the scenario with maximal joint probability (even though the number of scenarios is exponential)
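
The postorder likelihood recursion can be sketched for the two-state symmetric (RY) model, where the transition probabilities have a closed form. This is a minimal illustration under stated assumptions: a leaf is its observed state (0 or 1), an internal node is a pair of `(child, branch_length)` entries, the root prior is uniform, and the function names are hypothetical. It computes the marginal posterior of the root only, not the joint scenario of Pupko et al.

```python
import math


def partial_likelihood(tree, mu: float) -> list[float]:
    """Return [L(0), L(1)]: probability of the tip data below this node, given its state."""
    if not isinstance(tree, tuple):
        # Tip: likelihood 1 for the observed state, 0 for the other.
        return [1.0 if tree == h else 0.0 for h in (0, 1)]
    result = [1.0, 1.0]
    for child, length in tree:
        child_lik = partial_likelihood(child, mu)
        diff = 0.5 - 0.5 * math.exp(-2.0 * mu * length)  # P(state changes along the branch)
        for h in (0, 1):
            result[h] *= (1.0 - diff) * child_lik[h] + diff * child_lik[1 - h]
    return result


def root_posterior(tree, mu: float) -> list[float]:
    """Marginal posterior of the root state under a uniform (1/2, 1/2) prior."""
    lik = partial_likelihood(tree, mu)
    total = lik[0] + lik[1]   # the uniform prior cancels in the ratio
    return [lik[0] / total, lik[1] / total]


if __name__ == "__main__":
    mu = 1.0
    short = ((0, 0.1), (0, 0.1))   # two tips in state 0 on short branches
    long = ((0, 5.0), (0, 5.0))    # same tips on long branches
    print(root_posterior(short, mu))
    print(root_posterior(long, mu))
```

The MAP prediction is the state with the larger posterior. With long branches the posterior tends to (1/2, 1/2): the tips no longer carry information about the root.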

Results (OG & Steel 2010-2014)
* What part of the past is reconstructible? (PAC etc.)
* Can we compare the different methods? (simulations)
In this presentation:
* RY (0/1) symmetric model (μ)
* Yule-Harding model with t → ∞ (or n → ∞) (λ is key) (under which conditions does the past disappear?)
* Yule-Harding trees with t fixed and λ → ∞ (impact of the sample size?)
* Extreme trees (examples and counter-examples)

Root state: fundamental limitation
Yule-Harding trees (YH) with $t \to \infty$ (or $n \to \infty$)
For any root prediction method M: $PA_M \to \frac{1}{2}$ if $\lambda < 4\mu$
$PA_M$: predictive accuracy of root reconstruction with M; $\lambda$: speciation rate; $\mu$: substitution rate
Olivier Gascuel - Reconstructing the biological past - Polytechnique, November 2017

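Root state: fundamental limitation
Yule-Harding trees (YH) with $t \to \infty$ (or $n \to \infty$): $PA_M \to \frac{1}{2}$ if $\lambda < 4\mu$
Mutual information $I(\rho; L)$ and accuracy PA ($\rho$: root state, L: leaf set): $PA_M$ is bounded above by a function of $I(\rho; L)$
Mutual information erosion with time: $I(\rho; L) \le n \exp(-4\mu t) \approx \exp(\lambda t) \exp(-4\mu t) = e^{(\lambda - 4\mu) t}$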
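
The erosion argument can be written out as a short worked bound. This is a sketch under the assumption that the bound on this slide has the form $n\,e^{-4\mu t}$ and that n is replaced by its order of magnitude in a Yule tree of age t; the symbols $\rho$ (root state) and $L$ (leaf states) are a reading of the garbled notation, not necessarily the original one.

```latex
\begin{align*}
I(\rho; L) &\le n \, e^{-4\mu t}
  && \text{(each root-to-leaf path carries correlation } e^{-2\mu t}\text{, squared)} \\
&\approx e^{\lambda t} \, e^{-4\mu t}
  && \text{(the expected number of lineages at time } t \text{ grows like } e^{\lambda t}\text{)} \\
&= e^{(\lambda - 4\mu)\, t} \;\longrightarrow\; 0
  && \text{as } t \to \infty \text{ when } \lambda < 4\mu .
\end{align*}
```

In words: new leaves appear at rate λ, but each of them forgets the root at rate 4μ, so when λ < 4μ the number of leaves cannot compensate for the loss of signal.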

Root state: parsimony limitation
Yule-Harding trees (YH) with $t \to \infty$: $PA_{Parsimony} \to \frac{1}{2}$ if $\lambda < 6\mu$

Root state: Majority rule and MAP (Mossel and Steel 2014)
YH trees with $t \to \infty$
Majority has the best possible bound: $\lambda / \mu = 4$
Thus, the same holds for MAP

Root state: Majority rule
YH tree, fixed t and $\lambda \to \infty$: $PA_{Majority} \xrightarrow{p} 1$
With a conservative model: $P_{ii}(t) > P_{ij}(t)$
If i is the root state, we expect a majority of i among tips.
But the tree paths are not independent!

Root state: Majority rule
Star tree: independent paths, law of large numbers?
[Figure: star tree with root state A and paths A→A, A→G, A→C, A→T; tip states A C A T C A G A]
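
For a star tree the paths are independent, so the accuracy of the majority rule can be computed exactly. The sketch below does this for the two-state symmetric model, where each tip keeps the root state with probability 1/2 + 1/2 exp(-2μt); it is an illustration with an odd number of tips (to avoid ties), and the function name is an assumption.

```python
import math


def majority_accuracy_star_tree(n_tips: int, mu: float, t: float) -> float:
    """Exact probability that the majority state among n_tips (odd) equals the root state."""
    if n_tips < 1 or n_tips % 2 == 0:
        raise ValueError("n_tips must be a positive odd number")
    p_same = 0.5 + 0.5 * math.exp(-2.0 * mu * t)   # P_ii(t) > 1/2 for any finite t
    # Binomial tail: more than half of the tips show the root state.
    return sum(
        math.comb(n_tips, k) * p_same**k * (1.0 - p_same) ** (n_tips - k)
        for k in range(n_tips // 2 + 1, n_tips + 1)
    )


if __name__ == "__main__":
    for n in (1, 11, 101, 501):
        print(n, round(majority_accuracy_star_tree(n, mu=1.0, t=1.0), 4))
```

Because P_ii(t) exceeds 1/2, the accuracy tends to 1 as the number of tips grows, which is the law-of-large-numbers intuition; the difficulty with real trees is that their paths share branches.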

Root state : Majority rule What is the spread of Yule-Harding trees?

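Root state: Majority rule
Spread index: $S(T) = \frac{\sum_{x, y} l_{xy}}{n(n-1)\, t}$ [Figure: tree of height t with two tips X and Y and the length $l_{XY}$]
Theorem: for any fixed $\varepsilon$ and t, the probability that S(T) is larger than $\varepsilon$ tends to 0 when $\lambda \to \infty$ (speciation rate)
Then, T becomes close to a star tree, and the accuracy of the majority rule converges to 1.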

Root state: Parsimony and MAP
With YH trees, fixed t and $\lambda \to \infty$: $PA_{Parsimony} \xrightarrow{p} 1$ and $PA_{MAP} \xrightarrow{p} 1$
Realistic simulations:
* MAP > Majority > Parsimony
* Majority is affected by potential sampling biases
* MAP and Parsimony are surprisingly robust

Root/Internal nodes: not that simple!
[Figure: two example trees. First tree: root: yes, nodes: no. Second tree: root: no, nodes: yes]

Internal nodes and YH trees
Yule-Harding trees with $t \to \infty$ (or $n \to \infty$): for an internal node v, $PA_v \ge \frac{1}{2} + \varepsilon$, with $\varepsilon > 0$ fixed
* At least half of the nodes are connected to a tip.
* In a time-conditioned YH tree, the expected length of pending branches is $1/(2\lambda)$.
* Thus, the mutual information is > 0 and the predictability is > 1/2.

Internal nodes and YH trees
Yule-Harding trees with $t \to \infty$ (or $n \to \infty$): for an internal node v, $PA_v \ge \frac{1}{2} + \varepsilon$, with $\varepsilon > 0$ fixed
Strong contrast with the tree root, where $PA \to \frac{1}{2}$ if $\lambda < 4\mu$ (but no quantitative input)

Realistic simulation results
[Figure: predictive accuracy (PA) under an HKY+ model; axis label n]

Discussion
* Internal nodes are much easier to predict than the tree root
* Method robustness: model violation (ML, MP, MAJ), sampling bias (ML and MP), tree uncertainty (all)
* In phylogeography we predict the flows among countries well (but not the tree root)
* Lakner et al. (2010) results demonstrate that these methods predict stable/credible ancestral proteins (but not the future ;))