Phylogenetics: Parsimony

Similar documents
Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetics: Likelihood

Evolutionary Tree Analysis. Overview

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Phylogenetic Tree Reconstruction

Constructing Evolutionary/Phylogenetic Trees

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Dr. Amira A. AL-Hosary

Theory of Evolution Charles Darwin


Consistency Index (CI)

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Is the equal branch length model a parsimony model?

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Constructing Evolutionary/Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees

Molecular Evolution & Phylogenetics

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

A Phylogenetic Network Construction due to Constrained Recombination

Intraspecific gene genealogies: trees grafting into networks

arxiv: v5 [q-bio.pe] 24 Oct 2016

Pinvar approach. Remarks: invariable sites (evolve at relative rate 0) variable sites (evolves at relative rate r)

Properties of normal phylogenetic networks

Reconstruire le passé biologique modèles, méthodes, performances, limites

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Using algebraic geometry for phylogenetic reconstruction

What is Phylogenetics

Questions we can ask. Recall. Accuracy and Precision. Systematics - Bio 615. Outline

Reconstructing Phylogenetic Networks Using Maximum Parsimony

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

A Fitness Distance Correlation Measure for Evolutionary Trees

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

Phylogeny: building the tree of life

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Introduction to characters and parsimony analysis

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Phylogenetic analyses. Kirsi Kostamo

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Algorithms in Bioinformatics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Efficient Parsimony-based Methods for Phylogenetic Network Reconstruction

EVOLUTIONARY DISTANCES

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Theory of Evolution. Charles Darwin

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

arxiv: v1 [q-bio.pe] 1 Jun 2014

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

TheDisk-Covering MethodforTree Reconstruction

66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

Phylogenetic inference: from sequences to trees

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Finding a Maximum Likelihood Tree Is Hard

C.DARWIN ( )

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Phylogeny: traditional and Bayesian approaches

X X (2) X Pr(X = x θ) (3)

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Phylogenetic Networks, Trees, and Clusters

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

BINF6201/8201. Molecular phylogenetic methods

Phylogeny Tree Algorithms

Evolutionary Models. Evolutionary Models

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

A Minimum Spanning Tree Framework for Inferring Phylogenies

arxiv: v1 [q-bio.pe] 6 Jun 2013

Phylogenetic invariants versus classical phylogenetics

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Comparative Bioinformatics Midterm II Fall 2004

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Anatomy of a tree. clade is group of organisms with a shared ancestor. a monophyletic group shares a single common ancestor = tapirs-rhinos-horses

Beyond Galled Trees Decomposition and Computation of Galled Networks

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Effects of Gap Open and Gap Extension Penalties

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Ratio of explanatory power (REP): A new measure of group support

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

CSCE 478/878 Lecture 9: Hidden. Markov. Models. Stephen Scott. Introduction. Outline. Markov. Chains. Hidden Markov Models. CSCE 478/878 Lecture 9:

Stephen Scott.

A (short) introduction to phylogenetics

Phylogeny of Mixture Models

Improved maximum parsimony models for phylogenetic networks

Phylogenetic inference

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Transcription:

1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent Following a speciation event, characters continue to evolve independently

4 In parsimony-based methods, the inferred tree is fully labeled. 5-1 ACC GGA ACG GAA 5-2 ACC GGA ACC GAA ACG GAA

A Simple Solution: ry All rees 6 Problem: (2n-)!! rooted trees (2m-5)!! unrooted trees A Simple Solution: ry All rees 7 Number of axa Number of unrooted trees Number of rooted trees 1 4 15 5 15 105 6 105 945 7 945 1095 8 1095 1515 9 1515 2027025 10 2027025 4459425 20 2.22E+20 8.20E+21 0 8.69E+6 4.95E+8 40 1.1E+55 1.01E+57 50 2.84E+74 2.75E+76 60 5.01E+94 5.86E+96 70 5.00E+115 6.85E+117 80 2.18E+17.4E+19 Solution 8 Define an optimization criterion Find the tree (or, set of trees) that optimizes the criterion wo common criteria: parsimony and likelihood

9 Parsimony 10 he parsimony of a fully-labeled unrooted tree, is the sum of lengths of all the edges in Length of an edge is the Hamming distance between the sequences at its two endpoints PS() 11-1 ACC GGA ACC GAA ACG GAA

11-2 ACC 0 1 ACC GAA 1 0 GGA ACG GAA 11- ACC 0 1 ACC GAA 1 0 GGA ACG GAA Parsimony score = 5 Maximum Parsimony (MP) 12 Input: a multiple alignment S of n sequences Output: tree with n leaves, each leaf labeled by a unique sequence from S, internal nodes labeled by sequences, and PS() is minimized

1 AGC 14-1 AGC AGC AGC 14-2 AGC AGC AGC

14- AGC AGC AGC 14-4 AGC AGC AGC he three trees are equally good MP trees 14-5 AGC AGC AGC

15 16-1 16-2 5

16-5 6 16-4 5 6 4 16-5 5 4 MP tree 6

Weighted Parsimony 17 Each transition from one character state to another is given a weight Each character is given a weight See a tree that minimizes the weighted parsimony 18 Both the MP and weighted MP problems are NP-hard A Heuristic For Solving the MP Problem 19 Starting with a random tree, move through the tree space while computing the parsimony of trees, and keeping those with optimal score (among the ones encountered) Usually, the search time is the stopping factor

wo Issues 20 How do we move through the tree search space? Can we compute the parsimony of a given leaf-labeled tree efficiently? Searching hrough the ree Space Use tree transformation operations (NNI, BR, and SPR) 21 Searching hrough the ree Space 22 Use tree transformation operations (NNI, BR, and SPR) global maximum local maximum

Computing the Parsimony Length of a Given ree 2 Fitch s algorithm Computes the parsimony score of a given leaf-labeled rooted tree Polynomial time Fitch s Algorithm 24 Alphabet Σ Character c takes states from Σ vc denotes the state of character c at node v Fitch s Algorithm 25 Bottom-up phase: For each node v and each character c, compute the set Sc,v as follows: If v is a leaf, then Sc,v={vc} If v is an internal node whose two children are x and y, then { Sc,x S S c,v = c,y S c,x S c,y S c,x S c,y otherwise

26 Fitch s Algorithm 27 op-down phase: For the root r, let rc=a for some arbitrary a in the set Sc,r For internal node v whose parent is u, { uc u v c = c S c,v arbitrary α S c,v otherwise 28-1

28-2 28-28-4

28-5 28-6 mutations Fitch s Algorithm 29 akes time O(nkm), where n is the number of leaves in the tree, m is the number of sites, and k is the maximum number of states per site (for DNA, k=4)

Informative Sites and Homoplasy 0 Invariable sites: In the search for MP trees, sites that exhibit exactly one state for all taxa are eliminated from the analysis Only variable sites are used Informative Sites and Homoplasy 1 However, not all variable sites are useful for finding an MP tree topology Singleton sites: any nucleotide site at which only unique nucleotides (singletons) exist is not informative, because the nucleotide variation at the site can always be explained by the same number of substitutions in all topologies 2 C,,G are three singleton substitutions non-informative site All trees have parsimony score

Informative Sites and Homoplasy For a site to be informative for constructing an MP tree, it must exhibit at least two different states, each represented in at least two taxa hese sites are called informative sites For constructing MP trees, it is sufficient to consider only informative sites Informative Sites and Homoplasy 4 Because only informative sites contribute to finding MP trees, it is important to have many informative sites to obtain reliable MP trees However, when the extent of homoplasy (backward and parallel substitutions) is high, MP trees would not be reliable even if there are many informative sites available Measuring the Extent of Homoplasy 5 he consistency index (Kluge and Farris, 1969) for a single nucleotide site (i-th site) is given by c i =m i /s i, where m i is the minimum possible number of substitutions at the site for any conceivable topology (= one fewer than the number of different kinds of nucleotides at that site, assuming that one of the observed nucleotides is ancestral) s i is the minimum number of substitutions required for the topology under consideration

Measuring the Extent of Homoplasy 6 he lower bound of the consistency index is not 0 he consistency index varies with the topology herefore, Farris (1989) proposed two more quantities: the retention index and the rescaled consistency index he Retention Index 7 he retention index, ri, is given by (gi-si)/(gi-mi), where gi is the maximum possible number of substitutions at the i-th site for any conceivable tree under the parsimony criterion and is equal to the number of substitutions required for a star topology when the most frequent nucleotide is placed at the central node he Retention Index 8 he retention index becomes 0 when the site is least informative for MP tree construction, that is, si=gi

he Rescaled Consistency Index 9 rc i = g i s i g i m i m i s i Ensemble Indices 40 he three values are often computed for all informative sites, and the ensemble or overall consistency index (CI), overall retention index (RI), and overall rescaled index (RC) for all sites are considered Ensemble Indices RI = CI = i m i i s i i g i i s i i g i i m i RC = CI RI 41 hese indices should be computed only for informative sites, because for uninformative sites they are undefined

Homoplasy Index 42 he homoplasy index is HI =1 CI When there are no backward or parallel substitutions, we have HI =0. In this case, the topology is uniquely determined A Major Caveat 4 Maximum parsimony is not statistically consistent! 44 Questions?