Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Similar documents
"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Lecture Notes: Markov chains

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Dr. Amira A. AL-Hosary

C.DARWIN ( )

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

BINF6201/8201. Molecular phylogenetic methods

Chapter 26 Phylogeny and the Tree of Life

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Reading for Lecture 13 Release v10

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic inference

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetic Tree Reconstruction

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Quantifying sequence similarity

Microbial Diversity and Assessment (II) Spring, 2007 Guangyi Wang, Ph.D. POST103B

Phylogeny and the Tree of Life

Understanding relationship between homologous sequences

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

What is Phylogenetics

PHYLOGENY AND SYSTEMATICS

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Molecular Evolution & Phylogenetics

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

Constructing Evolutionary/Phylogenetic Trees

Genomics and Bioinformatics

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

8/23/2014. Phylogeny and the Tree of Life

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University.

Cladistics and Bioinformatics Questions 2013

Mutation models I: basic nucleotide sequence mutation models

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogeny and the Tree of Life

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Theory of Evolution Charles Darwin

Chapter 26: Phylogeny and the Tree of Life

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Algorithms in Bioinformatics

Phylogeny and the Tree of Life

A (short) introduction to phylogenetics

EVOLUTIONARY DISTANCES

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Evolutionary Models. Evolutionary Models

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline


Lecture 4. Models of DNA and protein change. Likelihood methods

Organizing Life s Diversity

BIOLOGICAL SCIENCE. Lecture Presentation by Cindy S. Malone, PhD, California State University Northridge. FIFTH EDITION Freeman Quillin Allison

Constructing Evolutionary/Phylogenetic Trees

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Multiple Sequence Alignment. Sequences

AP Biology. Cladistics

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetic Networks, Trees, and Clusters

Phylogenetics: Building Phylogenetic Trees

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Processes of Evolution

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

The Phylo- HMM approach to problems in comparative genomics, with examples.

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

How to read and make phylogenetic trees Zuzana Starostová

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

PHYLOGENY & THE TREE OF LIFE

Rooting Major Cellular Radiations using Statistical Phylogenetics

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Molecular Evolution, course # Final Exam, May 3, 2006

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

The practice of naming and classifying organisms is called taxonomy.

Reconstruire le passé biologique modèles, méthodes, performances, limites

Evolutionary Tree Analysis. Overview

Evolutionary Analysis of Viral Genomes

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Evolution Problem Drill 09: The Tree of Life

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Phylogeny & Systematics

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogeny and the Tree of Life

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Transcription:

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences Basic Bioinformatics Workshop, ILRI Addis Ababa, 12 December 2017 1

Learning Objectives understand the concepts and the vocabulary pertaining to phylogenetic trees know from what type of biological data it is possible to build phylogenies understand the concept of evolutionary divergence between sequences understand what evolutionary models are and know how to use them 2

Learning Outcomes be comfortable in discussing with a colleague about the parameters and details involved in running a phylogenetic analysis be able to prepare a phylogenetic analysis (choice of the data and of an appropriate model of evolution), upstream of the phylogenetic inference process itself 3

Molecular Evolution & Phylogenetics Traits, taxa, phylogenetic trees 4

Some vocabulary a phylogeny, a.k.a. phylogenetic tree, is a tree (connex acyclic graph) whose edges represent direct evolutionary links and nodes represent past or present taxa a taxon (pl. taxa) is a member of a taxonomic representation (species, virus strain, variety, identified subpopulation, etc) terminal nodes (leaves of a tree) are the extant taxa, a.k.a OTU (Operational Taxonomic Unit) a trait or character is any biological feature that can be compared across taxa traits can be qualitative/categorical variables (e.g. aligned nucleotides or aminoacids) or quantitative, in which case they can be discrete, semi-continuous or continuous (e.g. number of repeats of a microsatellite, frequency of an allele, diameter of the skull, etc) 5

A mathematician s definition: Phylogenetic trees tree topology: connex acyclic graph G = (V,E) V : set of vertices or nodes (e.g. species, virus strains, genes) E : set of edges or branches (materialising evolution) acyclicity and connexity impose that there is exactly one path between any two nodes of the tree the degree of a node is the number of edges it is connected to branch lengths are not always present. If present, they correlate with the amount of time elapsed (branch length unit: expected number of character substitutions per site) tree is rooted if we know where is the most ancient node (evolutionary origin), unrooted otherwise. 6

Some more vocabulary and remarks in a binary tree, all non-terminal nodes have 3 neighbours (or one parent node and two children in case the tree is rooted) a node with degree > 3 is called a multifurcation multifurcations represent well geological moments of sudden adaptive radiation, e.g. the Cambrian explosion trees can be built based on a single trait (e.g. one phenotypic characteristic) or (most commonly) on a set of characters (e.g. 1000 aligned sites) 7

A rooted binary tree in a rooted binary tree, every internal node (except the root) has one father and two descendants (sons) n taxa 2n-1 vertices 2n-2 edges Source: J. Felsenstein, Inferring Phylogenies (Sinauer Associates, 2004) 8

An unrooted binary tree in an unrooted binary tree, every internal node has three neighbours n taxa 2n-2 vertices 2n-3 edges this tree has branch lengths Source: J. Felsenstein, Inferring Phylogenies (Sinauer Associates, 2004) 9

Circular trees: a fancy representation Animals Fungi Slime molds Plants Algae Eukaryota Protozoa Gram-positives Chlamydiae Green nonsulfur bacteria Actinobacteria Planctomycetes Spirochaetes Crenarchaeota Nanoarchaeota Archea Bacteria Fusobacteria Cyanobacteria (blue-green algae) Euryarchaeota Thermophilic sulfate-reducers Protoeobacteria Acidobacteria 10

The tree inference problem Problem: Assuming common descent, how to derive the most probably correct tree from the knowledge of the traits in the extant taxa (leaves)? tax1: ACGG tax2: AAGG tax3: AAGT tax4: GAGG phylogenetic inference tax2 tax4 tax3 tax1? input data = aligned nucl. tax2 tax4 tax1 tax3 11

Simpler problem: evolutionary distance between two sequences Let us consider two aligned sequences and the elementary tree linking them: mrca seq1: ACGGGTATTG seq2: ACGATTATTT t seq1 t seq2 We want to know the evolutionary time (= 2t) separating seq1 and seq2 from their MRCA (Most Recent Common Ancestor). Naive distance = edit distance = 3/10 = 0.3 12

Shortcomings of the edit distance between two sequences all mutations (transitions or transversions) yield same distance, while from a biochemical standpoint, transitions (within purines, A G, or within pyrimidines, C T) are less costly than transversions. temporal chains of successive mutations on the same site result in underestimation of the true evolutionary distance by the observed (edit) distance, e.g. A G T (edit distance 1, true distance 2) or A G A (edit distance 0, true distance 2) we have no means to translate edit distances into actual biological time, and date the MRCA in Mya (= Million years ago ): problem of calibration 13

Molecular Evolution & Phylogenetics Modeling evolution 14

What is a model, in general? simplification of the reality based on current human understanding must be simple enough to be computationally tractable must be complex enough to lead to measures, findings or simulations in approximate agreement with our observations (empirical truth) tradeoff between the complexity of a model and its tractability (a.k.a. computability, which implies practical usefulness) Einstein: It can scarcely be denied that the supreme goal of all theory is to make the irreducible basic elements as simple and as few as possible without having to surrender the adequate representation of a single datum of experience. or: Everything should be made as simple as possible, but not simpler. 15

What is a model of evolution, specifically? deals with aligned sequences (nucleotides or amino acids) all evolutionary models are stochastic: they predict probabilities of change, without yielding any certainty models evolution in terms of character substitutions: enables to calculate e.g. P t ( A C) is defined with a certain number of parameters to be estimated from some training data essentially all models in use are Markovian (memory-less): the fate of a character depends only on its present state, not on its previous history of mutations. some models are time-reversible, some others are not. 16

Stochasticity Stochastic (probabilistic) models are based on stochastic processes, i.e. processes defining the trajectory of some random variables. Examples of models: non stochastic: nucleotides mutate once every thousand years according to the cycle: A C, C G, G T, T A stochastic: nucleotide substitutions occur randomly at a constant rate α (expected number of substitutions per site per time unit), equal for all substitutions. (JC69) 17

Markovian property The evolutionary history of a character X doesn t further impact its future trajectory, only its present state matters to determine its future (memory-less property): t 0 t 1 t 2 evolution Pr( X (t 2 ) {X (t 0 ), X (t 1 )})=Pr( X (t 2 ) X (t 1 )) 18

Stationarity Stationary models accept limits for the frequencies of the characters, after the stochastic model has run for an infinite amount of time. For instance, lim t lim t Pr t ( A C) = π C Pr t (G C) = π C probability to have a C after an infinite amount of time does not depend on the original nucleotide t=0 t {Pr (A)=πA Pr (C)=π C Pr (G)=π G Pr (T )=π T 19

Time-reversibility Time-reversible models give no preference to evolution in one direction compared to in the opposite direction, e.g. on some branch of a tree: they give no indication about the orientation of the time arrow. A C A C t Pr( A) Pr t ( A C) = Pr(C) Pr t (C A) t Consequence: time-reversible models cannot lead by themselves to a rooted tree. Unrooted trees are built first, and then rooted by some extra information (outgroup). 20

Exchangeability rates When dealing with time-reversible models, one has: π A Pr t ( A C) = π C Pr t (C A) And using instantaneous rates: π A q AC = π C q CA so we can define symmetric exchangeabilities: q AC π C = q CA π A = s AC = s CA 21

Molecular Evolution & Phylogenetics Simplest evolutionary model on nucleotide data: JC69 22

single constant rate for all substitutions: model has only one parameter α JC69: Jukes & Cantor, 1969 R + (odd situation for α = 0: no evolution). only substitutions, no insertions and deletions model explains or creates only ungapped alignments model can be trained on relatively small datasets - oversimplification of the reality: observations give credit to varying substitution rates depending on the position in the genome or in the Tree of Life + very simple maths to calculate the probability of one sequence having evolved into another one 23

JC69 instantaneous rate matrix (Q) to: from: A C G T A -3α α α α q AT C α -3α α α G α α -3α α q GC T α α α -3α matrix Q of instantaneous rates: over a very short duration dt, probability α*dt that e.g. a nucleotide A mutates into a C 24

Dynamics of nucleotide evolution under JC69 We consider a population of N nucleotides evolving independently from t = 0 under the JC69 model. n A (t): count of adenines at time t Calculus leads to: n A (t) = [ n A (0) N 4 ] e 4 α t + N 4 (and symmetric expressions for C, G and T) JC69 is a stationary model: when t, n A (t) N/4 25

From instantaneous rates to probabilities of change we want to calculate from Q the probability of having two aligned nucleotides e.g. A and C separated by an evolutionary time t fundamental relation: Pr t ( A C)=[e Qt ] A,C equilibrium probabilities for the JC69 model: π A = π C = π G = π T = 0.25 26

Estimation of divergence times under JC69 if p = proportion of observed differences between two nucleotide sequences (edit distance) and d is the expected number of mutations per site between the two sequences, we have: d = 3 4 log(1 4 3 p) > p in order to get to proper geological durations (millions of years), one needs to calibrate the rate of mutations (order of magnitude: 10-9 mutations per site and per year) using reliably dated data (e.g. fossils) 27

More complex models for nucleotide data K2P (2 parameters): equal equilibrium frequencies but two exchangeabilities (one for transitions and one for transversions) F81 (4 parameters): one single exchangeability rate but four different equilibrium frequencies π A, π C, π G and π T F84/HKY (5 parameters): different equilibrium frequencies and two exchangeabilities (one for transitions and one for transversions) GTR (9 parameters): general time-reversible, most complex timereversible model on nucleotides Beware! Using the most complex model on a small dataset is not the best strategy because of overfitting in parameter estimation! 28

most complex model Hierarchy of models for nucleotide data image credit to: Guy Perrière (LBBE, France) number of free parameters simplest model 29

Substitution models for amino acid data Matrices of amino-acid substitution rates have been developed empirically: JTT (Jones, Taylor, Thornton 1992): first matrix built from a large number of pairwise alignments from the Swissprot databank WAG (Whelan and Goldman 2001): derived from 3905 sequences in 182 protein families LG (Le & Gascuel 2008): estimated on 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall mtrev: for mitochondrial protein data 30

Amino acid exchangeabilities derive from their physico-chemical properties amino acids grouped according to physicochemical properties (Esquivel & al. 2013) symmetric matrix comparing WAG & LG exchangeabilities (Le & Gascuel 2008) 31