Reconstructing the biological past: models, methods, performance, limits (original title: Reconstruire le passé biologique)

Olivier Gascuel, Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), USR 3756, Institut Pasteur & CNRS

O. Gascuel, M. Steel. 2010. Inferring ancestral sequences in taxon-rich phylogenies. Mathematical Biosciences 227(2):125-135.
O. Gascuel, M. Steel. 2014. Predicting the ancestral character changes in a tree is typically easier than predicting the root state. Systematic Biology 63(3):421-435.

Focus on characters rather than trees
An introduction to phylogenetics
* Motivations
* Tree models
* Sequence and character evolution models
* Ancestral inference methods
Inferring the tree root
Inferring character changes
Uncertainty principle, perspectives

Darwin (1837)

Haeckel (1875)

The Tree of Life

A growing impact. "Nothing in Biology Makes Sense Except in the Light of Evolution" (T. Dobzhansky, 1973). [Figure: 1973 to 2016; 28,000.]

Inferring the root character? [Figure: tree with tip states A C A T C A G C.]

Broadcasting on trees? Ising model [Figure: tree with binary states 0 1 0 1 0 1 0 1.]

(Adv. App. Prob. 2000)
* Uniform error probability
* No branch length (time duration)
* The tree is fixed, not random

Inferring all character changes? [Figure: tree with tip states A C A T C A G C and unknown states (?) at the internal nodes.]

Parallel Adaptations to High Temperatures in the Archean Eon Boussau*, Blanquart* et al Nature 2008

HIV1 subtype A Eastern & Southern Europe (Chevenet et al. Bioinformatics 2013)

HIV1 subtype C

Phylogenetic tree models
Unoriented, labelled, binary tree. Mathematical expression, search tree.
[Figure: unoriented tree with tips Orangutan, Gorilla, Bonobo, Chimpanzee, Human, shown with two different tip orderings.]

Phylogenetic tree models
Rooted tree. Time dimension (difficult to infer).
[Figure: rooted tree with tips Orangutan, Gorilla, Bonobo, Chimpanzee, Human.]

Phylogenetic tree models
$O(n^n)$ topologies
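To give a feel for how quickly the number of topologies grows, here is a minimal Python sketch. It relies on the standard counting result (not stated on the slide) that there are (2n-5)!! unrooted and (2n-3)!! rooted binary labelled trees on n tips; the function names are illustrative only.

```python
def odd_double_factorial(m):
    """Return m * (m - 2) * ... * 3 * 1 for odd m; the empty product is 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_unrooted_topologies(n_tips):
    """Number of unrooted, labelled, binary trees on n_tips >= 3 tips."""
    if n_tips < 3:
        raise ValueError("need at least 3 tips")
    return odd_double_factorial(2 * n_tips - 5)


def n_rooted_topologies(n_tips):
    """Number of rooted, labelled, binary trees on n_tips >= 2 tips."""
    if n_tips < 2:
        raise ValueError("need at least 2 tips")
    return odd_double_factorial(2 * n_tips - 3)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, n_unrooted_topologies(n), n_rooted_topologies(n))
```

For the five primates of the example, this gives 15 unrooted and 105 rooted topologies; exhaustive search is already hopeless for a few dozen tips.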

Yule-Harding (YH) speciation model: topology (1924, 1971)
* Start with an initial species (leaf).
* Until we obtain n extant species, randomly select a leaf in the growing tree, and speciate that (ancestral) species into 2 new species.
* Labels are uniformly assigned to the tree leaves.
* Robust to extinction and sampling.

YH distribution is not uniform
Expected number of cherries: n/3 (YH) versus n/4 (uniform, PDA)
Expected diameter: O(log(n)) (YH) versus O(sqrt(n)) (uniform, PDA)
[Figure: distribution of the diameter for n = 95, YH versus Uniform (PDA).]
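The n/3 figure for cherries can be checked by simulating the Yule-Harding growth process. The sketch below is only an illustration: it tracks, for each leaf, whether its sibling is also a leaf, which is all that is needed to count cherries. The function name and the bookkeeping are assumptions of this sketch, not taken from the slides.

```python
import random


def yule_harding_cherries(n_leaves, rng):
    """Grow a Yule-Harding tree to n_leaves leaves; return its number of cherries.

    sibling[i] is the index of leaf i's sibling when that sibling is also a
    leaf (the two then form a cherry), and None otherwise.
    """
    if n_leaves < 2:
        raise ValueError("need at least 2 leaves")
    sibling = [1, 0]  # two-leaf tree: a single cherry
    while len(sibling) < n_leaves:
        i = rng.randrange(len(sibling))  # leaf chosen uniformly at random
        j = sibling[i]
        if j is not None:
            sibling[j] = None  # the former sibling now sits next to an internal node
        new_leaf = len(sibling)
        sibling[i] = new_leaf  # leaf i is reused as one of the two new species
        sibling.append(i)      # the other new species
    return sum(1 for s in sibling if s is not None) // 2


if __name__ == "__main__":
    rng = random.Random(2017)
    n = 90
    runs = 2000
    mean = sum(yule_harding_cherries(n, rng) for _ in range(runs)) / runs
    print(f"mean number of cherries for n={n}: {mean:.2f} (n/3 = {n / 3:.2f})")
```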

Yule-Harding model with time-valued edges
The speciation time on a given branch follows an exponential law (without memory) of parameter λ (expectation = 1/λ).
[Figure: tree whose branches are each labelled 1/λ.]

Yule-Harding model with time-valued edges
The minimum of k independent exponential laws of parameter λ is an exponential law with parameter kλ (many other, more sophisticated models exist).
[Figure: expected waiting times between successive speciations: 1/λ, 1/(2λ), 1/(3λ), ..., 1/(9λ), down to O(1/(nλ)).]
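As a small illustration of this property, the following sketch sums the expected waiting times 1/(kλ) for k = 1, ..., n-1 lineages and compares the result with a direct simulation that draws the minimum of k exponential times at each step. Starting from a single lineage and the function names are assumptions of the sketch.

```python
import random


def expected_time_to_n_species(n_species, lam):
    """Expected time to grow from 1 to n_species lineages: sum of 1 / (k * lam)."""
    return sum(1.0 / (k * lam) for k in range(1, n_species))


def simulate_time_to_n_species(n_species, lam, rng):
    """Simulate the same time by drawing, for k lineages, the minimum of
    k independent exponential waiting times of parameter lam."""
    total = 0.0
    for k in range(1, n_species):
        total += min(rng.expovariate(lam) for _ in range(k))
    return total


if __name__ == "__main__":
    rng = random.Random(1)
    n, lam, runs = 20, 2.0, 3000
    mean = sum(simulate_time_to_n_species(n, lam, rng) for _ in range(runs)) / runs
    print(f"simulated: {mean:.3f}, expected: {expected_time_to_n_species(n, lam):.3f}")
```

The expected total grows like log(n)/λ, which is why the most recent branches of a large Yule-Harding tree are very short.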

Modeling sequence (and character) evolution
We aim at explaining the data (alignment) using a probabilistic scenario of the evolution of the sites along a phylogeny.

Modeling sequence (character) evolution
[Figure: tree with tip states A, A, C. Parsimony assigns discrete ancestral states (A, or "A or C"); probabilistic modelling assigns a distribution over A, C, G, T to each ancestral node.]

Modeling sequence evolution: standard assumptions
* Evolution is independent among lineages
* Evolution is memory-less (Markov model)
* The sites evolve independently and identically
* Models are time reversible
* Models are time homogeneous and stationary

The simplest RY (0,1) symmetrical Markov, time continuous model
R ↔ Y with rate μ in both directions; expected number of mutations = μt

The simplest RY (0,1) symmetrical Markov, time continuous model
The rate matrix:
$$Q = \begin{pmatrix} -\mu & \mu \\ \mu & -\mu \end{pmatrix}$$
The matrix of probability changes:
$$P(t) = e^{Qt} = \begin{pmatrix} \frac{1}{2} + \frac{1}{2} e^{-2\mu t} & \frac{1}{2} - \frac{1}{2} e^{-2\mu t} \\ \frac{1}{2} - \frac{1}{2} e^{-2\mu t} & \frac{1}{2} + \frac{1}{2} e^{-2\mu t} \end{pmatrix}$$
The equilibrium distribution: $\pi_R = \pi_Y = \frac{1}{2}$
This model is time-reversible: $\pi_X P_{XY}(t) = \pi_Y P_{YX}(t)$
We assume stationarity (frequencies of R and Y are nearly equal)
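The closed form of P(t) for the symmetric two-state model, with entries 1/2 ± 1/2·exp(-2μt), can be sanity-checked numerically. The sketch below (function names are illustrative) builds P(t) and verifies the Markov property P(s)P(t) = P(s+t) and the convergence to the equilibrium (1/2, 1/2).

```python
import math


def ry_transition_matrix(mu, t):
    """P(t) for the symmetric R/Y model: rows = start state, columns = end state."""
    same = 0.5 + 0.5 * math.exp(-2.0 * mu * t)
    diff = 1.0 - same
    return [[same, diff], [diff, same]]


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


if __name__ == "__main__":
    mu = 1.5
    composed = matmul2(ry_transition_matrix(mu, 0.3), ry_transition_matrix(mu, 0.5))
    direct = ry_transition_matrix(mu, 0.8)
    print(composed)
    print(direct)
    print(ry_transition_matrix(mu, 50.0))  # close to the equilibrium (1/2, 1/2)
```

Because the matrix is symmetric and the equilibrium is uniform, the reversibility condition holds trivially for this model.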

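Jukes and Cantor model (JC69) for DNA
Rate matrix, up to a global rate factor (rows and columns ordered A, T, C, G; each diagonal entry, written $\cdot$, makes its row sum to zero):
$$M = \begin{pmatrix} \cdot & 1 & 1 & 1 \\ 1 & \cdot & 1 & 1 \\ 1 & 1 & \cdot & 1 \\ 1 & 1 & 1 & \cdot \end{pmatrix}$$
Equilibrium frequencies: (1/4, 1/4, 1/4, 1/4)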

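Felsenstein 1981 (F81) model for DNA
Rate matrix, up to a global rate factor (rows and columns ordered A, T, C, G; each diagonal entry, written $\cdot$, makes its row sum to zero):
$$M = \begin{pmatrix} \cdot & \pi_T & \pi_C & \pi_G \\ \pi_A & \cdot & \pi_C & \pi_G \\ \pi_A & \pi_T & \cdot & \pi_G \\ \pi_A & \pi_T & \pi_C & \cdot \end{pmatrix}$$
Equilibrium frequencies: $(\pi_A, \pi_T, \pi_C, \pi_G)$
Felsenstein's 1981 model allows for any arbitrary set of equilibrium frequencies.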

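Kimura 2-parameter (K2P) model for DNA
Rate matrix, up to a global rate factor (rows and columns ordered A, T, C, G; $\kappa$ denotes the transition/transversion rate ratio; each diagonal entry, written $\cdot$, makes its row sum to zero):
$$M = \begin{pmatrix} \cdot & 1 & 1 & \kappa \\ 1 & \cdot & \kappa & 1 \\ 1 & \kappa & \cdot & 1 \\ \kappa & 1 & 1 & \cdot \end{pmatrix}$$
Equilibrium frequencies: (1/4, 1/4, 1/4, 1/4)
Kimura's 2-parameter model aims at reflecting the fact that transitions (A↔G, C↔T) are more frequent than transversions.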

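Hasegawa, Kishino, Yano (HKY) model for DNA
Rate matrix, up to a global rate factor (rows and columns ordered A, T, C, G; $\kappa$ denotes the transition/transversion rate ratio; each diagonal entry, written $\cdot$, makes its row sum to zero):
$$M = \begin{pmatrix} \cdot & \pi_T & \pi_C & \kappa\pi_G \\ \pi_A & \cdot & \kappa\pi_C & \pi_G \\ \pi_A & \kappa\pi_T & \cdot & \pi_G \\ \kappa\pi_A & \pi_T & \pi_C & \cdot \end{pmatrix}$$
Equilibrium frequencies: $(\pi_A, \pi_T, \pi_C, \pi_G)$
The HKY model is a way to incorporate both transition/transversion bias and an arbitrary set of equilibrium frequencies. It captures the two main aspects of DNA evolution.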
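To make the relationship between these four DNA models concrete, here is a hedged Python sketch that builds an HKY-style rate matrix in its usual textbook form: the rate from base i to base j is proportional to the equilibrium frequency of j, multiplied by a factor kappa when the change is a transition (A↔G or C↔T). The name `kappa`, the dictionary layout and the absence of rate normalisation are assumptions of the sketch. Setting kappa = 1 gives F81, uniform frequencies give K2P, and both together give JC69.

```python
BASES = "ATCG"
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine-purine, pyrimidine-pyrimidine


def hky_rate_matrix(freqs, kappa):
    """Return q[i][j], the (unnormalised) HKY rate from base i to base j.

    freqs maps each base to its equilibrium frequency; kappa is the
    transition/transversion rate ratio. Diagonal entries make rows sum to 0.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                factor = kappa if frozenset((i, j)) in TRANSITIONS else 1.0
                row[j] = factor * freqs[j]
        row[i] = -sum(row.values())
        q[i] = row
    return q


if __name__ == "__main__":
    freqs = {"A": 0.3, "T": 0.3, "C": 0.2, "G": 0.2}
    q = hky_rate_matrix(freqs, kappa=4.0)
    for i in BASES:
        print(i, [round(q[i][j], 3) for j in BASES])
    # Time reversibility (detailed balance): pi_i * q_ij == pi_j * q_ji
    print(freqs["A"] * q["A"]["G"], freqs["G"] * q["G"]["A"])
```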

Reconstruction methods: Majority
The majority state at the tree tips is predicted (no knowledge of the tree or the model).
[Figure: example tree with binary tip states.]

Reconstruction Methods: Parsimony
We minimize the number of changes along tree branches (no knowledge of the model and time duration).
1st: recursive postorder calculation (bottom-up). [Figure: example tree with binary tip states; ambiguous internal nodes are labelled 01.]
2nd: recursive preorder calculation (top-down). [Figure: the same tree after the second pass.]
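The two passes can be sketched with a Fitch-style procedure. The slides do not name the algorithm, so treating it as Fitch parsimony is an assumption, as are the nested-tuple tree encoding and the arbitrary tie-breaking rule (the smallest state) used here.

```python
def fitch_bottom_up(node):
    """Postorder pass. A tip is a bare state; an internal node is a pair (left, right).

    Returns (annotated, n_changes), where annotated is (state_set,) for a tip
    and (state_set, left_annotated, right_annotated) for an internal node.
    """
    if not isinstance(node, tuple):
        return ({node},), 0
    left, left_cost = fitch_bottom_up(node[0])
    right, right_cost = fitch_bottom_up(node[1])
    common = left[0] & right[0]
    if common:  # children agree on at least one state: no extra change
        return (common, left, right), left_cost + right_cost
    return (left[0] | right[0], left, right), left_cost + right_cost + 1


def fitch_top_down(annotated, parent_state=None):
    """Preorder pass: keep the parent's state when allowed, else pick one arbitrarily.

    Returns a bare state for a tip and (state, left, right) for an internal node.
    """
    states = annotated[0]
    state = parent_state if parent_state in states else min(states)
    if len(annotated) == 1:
        return state
    return (state,
            fitch_top_down(annotated[1], state),
            fitch_top_down(annotated[2], state))


if __name__ == "__main__":
    tree = (((0, 1), (0, 1)), ((0, 1), (1, 1)))  # eight binary tips
    annotated, n_changes = fitch_bottom_up(tree)
    print("root state set:", annotated[0], "minimum number of changes:", n_changes)
    print("one most parsimonious scenario:", fitch_top_down(annotated))
```

The top-down pass returns only one of the possibly many most parsimonious scenarios.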

Maximum Likelihood
* Requires knowing (or estimating) the tree, branch lengths, and model parameter values.
* One predicts the state with maximum posterior probability (MAP).
* Best possible method!

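Maximum Likelihood
Recursive postorder calculation: marginal likelihood of the root states.
[Figure: tree T with root state h and two subtrees U and V, attached by branches of lengths u and v.]
$$L(h \mid T, M) = \Big( \sum_{h' \in \{A,C,G,T\}} P(h \to h' \mid u, M)\, L(h' \mid U, M) \Big) \Big( \sum_{h' \in \{A,C,G,T\}} P(h \to h' \mid v, M)\, L(h' \mid V, M) \Big)$$
For a tip: $L(h \mid U, M) = 1$ if $U = h$, else 0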
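A minimal sketch of this postorder recursion, restricted for brevity to the two-state symmetric model (states 0 and 1, substitution rate mu) rather than the four DNA states. The tree encoding, with an internal node written as two (subtree, branch length) pairs, and the function names are assumptions of the sketch; a uniform prior on the root state is used to turn likelihoods into posteriors.

```python
import math


def p_change(mu, t):
    """Probability that the two ends of a branch of length t differ."""
    return 0.5 - 0.5 * math.exp(-2.0 * mu * t)


def partial_likelihoods(node, mu):
    """Return [L(0), L(1)]: likelihood of the tips below node given its state.

    A tip is a bare state (0 or 1); an internal node is a pair of
    (subtree, branch_length) pairs.
    """
    if not isinstance(node, tuple):
        return [1.0 if node == h else 0.0 for h in (0, 1)]
    result = [1.0, 1.0]
    for child, length in node:
        child_l = partial_likelihoods(child, mu)
        diff = p_change(mu, length)
        for h in (0, 1):
            result[h] *= sum((diff if h != h2 else 1.0 - diff) * child_l[h2]
                             for h2 in (0, 1))
    return result


def root_posterior(tree, mu):
    """Posterior probabilities of root states 0 and 1 under a uniform prior."""
    l0, l1 = partial_likelihoods(tree, mu)
    return [l0 / (l0 + l1), l1 / (l0 + l1)]


if __name__ == "__main__":
    tree = ((((0, 0.1), (0, 0.1)), 0.2), (1, 0.2))
    print(root_posterior(tree, mu=1.0))
```

The MAP prediction is the state with the larger posterior; applying the same computation with each internal node taken in turn as the root gives the marginal posterior of that node.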

Maximum Likelihood
Computation of the best scenario:
* We apply (independently!) the same algorithm to every internal node: marginal posterior for each.
* We use a dynamic programming approach (Pupko et al. 2000) to compute the scenario with maximal joint probability (but the number of scenarios is exponential).

Results (OG & Steel 2010-2014)
What part of the past is reconstructible? (PAC etc.)
Can we compare the different methods? (simulations)
In this presentation:
* RY (0/1) symmetric model (μ)
* Yule-Harding model with t → ∞ (or n → ∞) (λ/μ is key) (in which conditions does the past disappear?)
* Yule-Harding trees with t fixed and λ → ∞ (impact of the sample size?)
* Extreme trees (examples and counter-examples)

Root state: fundamental limitation
Yule-Harding trees (YH) with t → ∞ (or n → ∞)
For any root prediction method M: $PA_M \to \frac{1}{2}$ if $\lambda < 4\mu$
$PA_M$: predictive accuracy of root reconstruction with M; λ: speciation rate; μ: substitution rate
Olivier Gascuel, Reconstruire le passé biologique, Polytechnique, November 2017

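Root state: fundamental limitation
Yule-Harding trees (YH) with t → ∞ (or n → ∞): $PA_M \to \frac{1}{2}$ if $\lambda < 4\mu$
Mutual information $I(\rho; L)$ between the root state ρ and the leaf set L, and accuracy $PA_M$: [bound on $PA_M$ in terms of $I(\rho; L)$; equation not recoverable from the transcript]
Mutual information erosion with time: [bound on $I(\rho; L)$ involving n and $e^{-4\mu t}$; equation not recoverable from the transcript]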
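The erosion of mutual information with time can be illustrated on the simplest possible case: the root and a single leaf separated by a path of duration t under the symmetric two-state model. This is only a heuristic sketch, not the argument of the talk, which concerns the whole leaf set; the function name and the use of nats are assumptions. With θ = exp(-2μt), the information equals (1+θ)/2·ln(1+θ) + (1-θ)/2·ln(1-θ), which behaves like exp(-4μt)/2 for large t. Since the expected number of leaves of a Yule tree grows like exp(λt), this gives an intuition for why the comparison between λ and 4μ matters.

```python
import math


def root_leaf_mutual_information(mu, t):
    """Mutual information (in nats) between a uniform root state and one leaf
    at distance t, under the symmetric two-state model with rate mu."""
    theta = math.exp(-2.0 * mu * t)
    if theta == 1.0:  # t = 0: the leaf reveals the root exactly
        return math.log(2.0)
    return (0.5 * (1.0 + theta) * math.log1p(theta)
            + 0.5 * (1.0 - theta) * math.log1p(-theta))


if __name__ == "__main__":
    mu = 1.0
    for t in (0.0, 0.5, 1.0, 2.0, 3.0):
        info = root_leaf_mutual_information(mu, t)
        print(f"t={t:.1f}  I={info:.3e}  I / exp(-4 mu t)={info / math.exp(-4 * mu * t):.4f}")
```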

Root state: parsimony limitation
Yule-Harding trees (YH) with t → ∞: $PA_{\text{Parsimony}} \to \frac{1}{2}$ if $\lambda < 6\mu$

Root state: Majority rule and MAP (Mossel and Steel 2014)
YH trees with t → ∞
Majority has best possible bound: λ/μ = 4
Thus, the same holds for MAP

Root state: Majority rule
YH tree, fixed t and λ → ∞: $PA_{\text{Majority}} \xrightarrow{p} 1$
With a conservative model: $P_{ii}(t) > P_{ij}(t)$
If i is the root state, we expect a majority of i among tips.
But the tree paths are not independent!

Root state: Majority rule
Star tree: independent paths, law of large numbers?
[Figure: star tree with root state A and paths A→A, A→G, A→C, A→T; tip states A C A T C A G A.]
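For a star tree the law-of-large-numbers intuition can be made exact. The sketch below assumes the two-state symmetric model (rather than the four DNA states of the figure) with substitution rate mu, n independent paths of duration t > 0, and ties broken by a fair coin; these choices and the function name are assumptions of the sketch. The number of tips agreeing with the root is then binomial, and the accuracy of the majority rule is the probability that this number exceeds n/2.

```python
import math


def majority_accuracy_star_tree(n_tips, mu, t):
    """Exact accuracy of the majority rule on a star tree with n_tips tips,
    two-state symmetric model, rate mu, path duration t > 0."""
    p_same = 0.5 + 0.5 * math.exp(-2.0 * mu * t)  # tip equals root
    log_same, log_diff = math.log(p_same), math.log(1.0 - p_same)
    accuracy = 0.0
    for k in range(n_tips + 1):  # k = number of tips equal to the root state
        log_prob = (math.lgamma(n_tips + 1) - math.lgamma(k + 1)
                    - math.lgamma(n_tips - k + 1)
                    + k * log_same + (n_tips - k) * log_diff)
        if 2 * k > n_tips:
            accuracy += math.exp(log_prob)
        elif 2 * k == n_tips:
            accuracy += 0.5 * math.exp(log_prob)  # tie: fair coin
    return accuracy


if __name__ == "__main__":
    for n in (1, 11, 101, 1001):
        print(n, round(majority_accuracy_star_tree(n, mu=1.0, t=0.8), 4))
```

Because P_ii(t) > P_ij(t), each independent tip is more likely than not to agree with the root, so the accuracy increases towards 1 as the number of tips grows.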

Root state : Majority rule What is the spread of Yule-Harding trees?

Root state: Majority rule
Spread index: $S(T) = \dfrac{\sum_{x,y} l_{xy}}{n(n-1)\,t}$
[Figure: tree of height t showing the length $l_{XY}$ for two tips X and Y.]
Theorem: for any fixed μ and t, the probability that S(T) is larger than ε tends to 0 when λ → ∞ (λ: speciation rate).
Then, T becomes close to a star tree, and the accuracy of the majority rule converges to 1.

Root state: Parsimony and MAP
With YH trees, fixed t and λ → ∞: $PA_{\text{Parsimony}} \xrightarrow{p} 1$ and $PA_{\text{MAP}} \xrightarrow{p} 1$
Realistic simulations: MAP > Majority > Parsimony
Majority is affected by potential sampling biases
MAP and Parsimony are surprisingly robust

Root/Internal nodes: not that simple! Root: Yes Nodes: No Root: No Nodes: Yes

Internal nodes and YH trees
Yule-Harding trees with t → ∞ (or n → ∞): $PA(v) > \frac{1}{2}$ (λ, μ fixed)
* At least half of the nodes are connected to a tip.
* In a time-conditioned YH tree, the expected length of pending branches is 1/(2λ).
* Thus, the mutual information is > 0 and the predictability > 1/2.
Strong contrast with the tree root, where $PA \to \frac{1}{2}$ if $\lambda < 4\mu$ (but no quantitative input)

Realistic simulation results: HKY+Γ [Figure: PA as a function of n.]

Discussion
* Internal nodes are much easier than the tree root.
* Method robustness: model violation (ML, MP, MAJ), sampling bias (ML and MP), tree uncertainty (all).
* In phylogeography we predict well the flows among countries (but not the tree root).
* Lakner et al. (2010) results demonstrate that these methods predict stable/credible ancestral proteins (but not the future ;)).