The Disk-Covering Method for Tree Reconstruction

The Disk-Covering Method for Tree Reconstruction
Daniel Huson, PACM, Princeton University
Bonn, 1998

Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license can be found at http://www.gnu.org/copyleft/fdl.html

Report on joint work with:
- Scott Nettles, CIS, University of Pennsylvania
- Kenneth Rice, Smith-Kline Beecham
- Tandy Warnow, CIS, University of Pennsylvania
- Shibu Yooseph, DIMACS, Rutgers University

References
- D.H. Huson, S. Nettles, L. Parida, T.J. Warnow and S. Yooseph. The Disk-Covering Method for Tree Reconstruction. In: R. Battiti and A.A. Bertossi, eds., Proceedings of Algorithms and Experiments (ALEX 98) (Trento, Italy, Feb. 9-11, 1998), 62-75, 1998.
- D.H. Huson, K. Rice, S. Nettles, T.J. Warnow and S. Yooseph. Hybrid Tree Construction Methods. Submitted to: Workshop on Algorithms Engineering (WAE 98).
- D.H. Huson, S. Nettles and T.J. Warnow. Obtaining highly accurate topology and evolutionary estimates of evolutionary trees from very short sequences. Submitted to: Recomb 99.

The Tree of Life
[Figure: the first phylogenetic tree of life, Ernst Haeckel (1866)]

The Tree of Life
[Figure: a modern view of the tree of life]

Tree Reconstruction Methods
- Maximum Parsimony: popular sequence method; NP-hard (Hamming distance Steiner tree problem, most parsimonious tree); use heuristics
- Neighbor-Joining: popular distance method; successively join close pairs of taxa
- Buneman Tree: distance-based method with nice mathematical properties, but low resolution
- 3-Approximation methods: Agarwala et al., SODA 96

Maximum Parsimony
Find the tree that explains the data using a minimal number of mutations.
A: 0 1 0 1 1
B: 0 0 0 1 0
C: 0 1 0 1 1
D: 1 0 0 1 0
Determine the tree topology and the labeling of internal nodes.
[Figure: the three possible tree topologies on the taxa A, B, C, D]
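To make the objective concrete, the following minimal sketch scores each of the three four-taxon topologies on the binary data from the slide. It uses Fitch's algorithm, which the slide does not name, so that choice is an assumption, and the function names are hypothetical.

```python
# Binary character data for the four taxa, copied from the slide.
SEQS = {"A": "01011", "B": "00010", "C": "01011", "D": "10010"}


def fitch_join(x, y):
    """Fitch step: intersect the state sets; a union costs one mutation."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def quartet_score(seqs, left, right):
    """Number of mutations needed on the unrooted tree left|right."""
    total = 0
    for site in range(len(seqs[left[0]])):
        l_set, c1 = fitch_join({seqs[left[0]][site]}, {seqs[left[1]][site]})
        r_set, c2 = fitch_join({seqs[right[0]][site]}, {seqs[right[1]][site]})
        _, c3 = fitch_join(l_set, r_set)  # join across the internal edge
        total += c1 + c2 + c3
    return total


TOPOLOGIES = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]
scores = {
    f"{l[0]}{l[1]}|{r[0]}{r[1]}": quartet_score(SEQS, l, r)
    for l, r in TOPOLOGIES
}
best = min(scores, key=scores.get)
print(scores, best)
```

On this data the topology grouping A with C needs three mutations, while the other two need five each. Exhaustive scoring like this is only feasible for very few taxa, which is why heuristics are used in practice.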

New Variants of Neighbor-joining
- Bio-NJ: improved version of the NJ algorithm based on a simple model of sequence data. Olivier Gascuel, 1997.
- Weighbor: Weighted Neighbor-Joining. W. Bruno, A. Halpern & N. D. Socci, 1998.

Experimental Simulation Studies
- Choose model tree T (e.g. inspired by biology)
- Choose model of evolution (e.g. Jukes-Cantor model)
- Evolve sequences along the model tree
- Apply tree reconstruction method M to the evolved sequences S
- Compare the estimation M(S) with the model tree T for topological accuracy
Software: ecat, seqgen, PAUP, Phylip. Our programs in C++, LEDA.

Jukes-Cantor Model of Evolution
Given a model tree T, with edge weights P(e).
[Figure: model tree with root r, nodes labeled A to G, and an edge labeled P(e)]
- Interpret P(e) as the probability of change at any given position in a sequence along the edge e
- 4-state characters, all transitions have equal probabilities, i.i.d.
- Fix a sequence length k.
- Choose a root r and a random start sequence S at r
- Evolve sequences along the tree T (Markov tree).
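The evolution step described above can be sketched as follows. This is a minimal illustration, not the authors' C++/LEDA code: the tree shape, the edge probabilities, and the names `evolve_edge` and `evolve_tree` are hypothetical, since the tree in the figure is not recoverable.

```python
import random

BASES = "ACGT"


def evolve_edge(seq, p_change, rng):
    """Copy seq along one edge; each site changes independently with
    probability p_change, to one of the three other bases chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def evolve_tree(children, root, k, rng):
    """children maps a node to a list of (child, P(e)) pairs.
    Returns a dict mapping every node to its sequence of length k."""
    seqs = {root: "".join(rng.choice(BASES) for _ in range(k))}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p in children.get(node, []):
            seqs[child] = evolve_edge(seqs[node], p, rng)
            stack.append(child)
    return seqs


# Hypothetical model tree: root r, internal nodes x and y, leaves A to D.
TREE = {
    "r": [("x", 0.10), ("y", 0.10)],
    "x": [("A", 0.05), ("B", 0.20)],
    "y": [("C", 0.05), ("D", 0.20)],
}
seqs = evolve_tree(TREE, "r", k=20, rng=random.Random(0))
for leaf in "ABCD":
    print(leaf, seqs[leaf])
```

Only the leaf sequences would be handed to a reconstruction method; the sequences at r, x and y are kept here purely for inspection.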

Objective: Topological Accuracy
The main goal in biology is to correctly infer the order of speciation events, hence the objective is to minimize:
- False positives: wrongly inferred edges
- False negatives: missing edges
Model tree T: [Figure: tree on S1, ..., S5]
Estimation M(T): [Figure: tree on S1, ..., S5]
One false positive: {S1, S3} vs. {S2, S4, S5}
One false negative: {S1, S2} vs. {S3, S4, S5}
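The two error counts can be computed by comparing the sets of splits (bipartitions) of the two trees. The sketch below is a minimal illustration with hypothetical function names. The split {S4, S5} vs. {S1, S2, S3}, taken here as common to both trees, is an assumption: it is the only split on five taxa compatible with both splits named on the slide, but the figures themselves are not recoverable.

```python
def canonical(side, taxa):
    """Represent a split by the side containing the smallest taxon name."""
    side = frozenset(side)
    return side if min(taxa) in side else frozenset(taxa) - side


def fp_fn(model_splits, est_splits, taxa):
    """Return (false positives, false negatives) as sets of splits."""
    model = {canonical(s, taxa) for s in model_splits}
    est = {canonical(s, taxa) for s in est_splits}
    return est - model, model - est


TAXA = ["S1", "S2", "S3", "S4", "S5"]
MODEL = [{"S1", "S2"}, {"S4", "S5"}]      # internal edges of the model tree
ESTIMATE = [{"S1", "S3"}, {"S4", "S5"}]   # internal edges of the estimate

false_pos, false_neg = fp_fn(MODEL, ESTIMATE, TAXA)
print(sorted(map(sorted, false_pos)))  # wrongly inferred edges
print(sorted(map(sorted, false_neg)))  # missing edges
```

Dividing each count by the number of internal edges gives the percentage rates plotted on the later slides.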

Statistical Consistency
Given longer and longer sequences (i.e. better and better approximations of the tree distances), does the result of the method converge to the correct tree?
- Distance methods: yes
- Maximum Parsimony: not always.
[Figure: four-taxon tree on A, B, C, D with edge probabilities p and q, illustrating the Felsenstein zone]

Comparison of False Positive Rates
[Figure: sequence length (0 to 3000) vs. false positive rate (0 to 100) for Buneman, N.J. and Parsimony]
- 93 taxon tree (from 500 taxon rbcL dataset)
- maximum substitution probability p(e) is 0.48
- 20 experiments per point

Comparison of False Negative Rates
[Figure: sequence length (0 to 3000) vs. false negative rate (0 to 100) for Buneman, N.J. and Parsimony]
- 93 taxon tree (from 500 taxon rbcL dataset)
- maximum substitution probability p(e) is 0.48
- 20 experiments per point

Performance of Neighbor-joining
[Figure: P(e) (0 to 0.6) vs. False Positives = False Negatives (0 to 100), one curve per sequence length 200, 400, 800, 1600, 3200]
- 93 taxon tree (from 500 taxon rbcL dataset)
- maximum mutation probability p(e) vs. FP (= FN) rate
- 20 experiments per point

Big Trees are Hard to Infer
- Distance-based methods (e.g. Neighbor-Joining, 3-approximation, Buneman tree, split decomposition) are fast, but degrade in accuracy with high evolutionary divergence.
- Sequence-based methods (e.g. Maximum Parsimony and maximum likelihood) do not degrade, but are computationally expensive.
- Parsimony does best if all branches are short, so large numbers of taxa may be needed for accurate tree reconstruction using Parsimony.
- Year-long Parsimony analyses (Rice et al.) of large divergent datasets are infeasible for most researchers.

The Disk-Covering Method (DCM)
A divide-and-conquer approach based on the idea of covering given sequence data with small overlapping disks.
- Each disk contains a small number of taxa.
- Taxa within a disk are very similar.
- Apply the given base-method to the subproblems.
- Use the overlap to merge subtrees to obtain the final tree.

The DCM Algorithm
Input: distances and sequences
- Choose base-method (e.g. Parsimony or NJ)
- For a given threshold w:
  - Compute threshold graph G: vertices are taxa; join two vertices if their distance is ≤ w
  - Compute triangulation G' of the threshold graph
  - Produce perfect elimination scheme. Makes the following step easy:
  - Apply base-method to all maximal cliques in G'
  - Merge trees guided by the perfect elimination scheme
- Infer consensus of {T_w}.
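The decomposition part of the algorithm (threshold graph, triangulation, maximal cliques) can be sketched as below. The slide does not say how the triangulation is computed, so the greedy minimum-degree elimination used here is an assumed stand-in, and the distance matrix and function names are hypothetical.

```python
def threshold_graph(dist, w):
    """Join two taxa whenever their distance is at most the threshold w."""
    taxa = sorted(dist)
    return {a: {b for b in taxa if b != a and dist[a][b] <= w} for a in taxa}


def triangulate(adj):
    """Greedy minimum-degree elimination (an assumed heuristic).

    Returns the filled (chordal) graph, the elimination order, which is a
    perfect elimination scheme for the filled graph, and its maximal cliques.
    """
    work = {v: set(nb) for v, nb in adj.items()}
    filled = {v: set(nb) for v, nb in adj.items()}
    order, candidates = [], []
    while work:
        v = min(work, key=lambda u: (len(work[u]), u))
        nbrs = work.pop(v)
        for a in nbrs:
            work[a].discard(v)
            for b in nbrs:
                if a != b:  # make the remaining neighbours of v a clique
                    work[a].add(b)
                    filled[a].add(b)
        order.append(v)
        candidates.append(frozenset(nbrs | {v}))
    cliques = [c for c in candidates if not any(c < o for o in candidates)]
    return filled, order, cliques


# Hypothetical example: five taxa at positions 0..4 on a line.
pos = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
dist = {x: {y: abs(pos[x] - pos[y]) for y in pos} for x in pos}
G = threshold_graph(dist, w=2)
filled, order, cliques = triangulate(G)
for clique in cliques:
    print(sorted(clique))  # each clique is one subproblem for the base method
```

Each printed clique would be passed to the base method, and the resulting subtrees merged in an order derived from `order`. The merge itself is not shown here.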

Merging Two Trees
Given trees on two overlapping sets of taxa, e.g. {1,2,3,4,5,6} and {1,2,3,4,7}.
[Figure: the two input trees, their contracted versions, and the merged tree on taxa 1 to 7]
To merge the two trees together, first transform them (through edge contractions) so that they induce the same subtrees on their shared leaves and then combine them.
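One way to read the contraction step is that only those splits survive which both trees induce on the shared leaves. The sketch below computes that common backbone. It covers only this first step, not the final combination, and the tree topologies and function names are hypothetical because the trees in the figure are not recoverable.

```python
def restrict(splits, shared):
    """Nontrivial splits a tree induces on the shared leaves.

    Each split is given by one of its sides; the result stores, for every
    induced split, the side containing the smallest shared leaf.
    """
    ref = min(shared)
    result = set()
    for side in splits:
        inside = frozenset(side) & shared
        outside = shared - inside
        if len(inside) >= 2 and len(outside) >= 2:
            result.add(inside if ref in inside else outside)
    return result


def backbone(splits1, taxa1, splits2, taxa2):
    """Induced splits on the shared leaves that both trees agree on."""
    shared = frozenset(taxa1) & frozenset(taxa2)
    return restrict(splits1, shared) & restrict(splits2, shared)


TAXA1, TAXA2 = {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 7}
# Hypothetical tree on TAXA1, written as ((1,2),3,(4,(5,6))).
TREE1 = [{1, 2}, {5, 6}, {4, 5, 6}]
# Two hypothetical trees on TAXA2: ((1,3),7,(2,4)) and ((1,2),7,(3,4)).
TREE2_CONFLICT = [{1, 3}, {2, 4}]
TREE2_AGREE = [{1, 2}, {3, 4}]

print(backbone(TREE1, TAXA1, TREE2_CONFLICT, TAXA2))  # nothing survives
print(backbone(TREE1, TAXA1, TREE2_AGREE, TAXA2))     # split 12|34 survives
```

Edges whose induced split is missing from the backbone are the ones to contract. In the conflicting case both trees collapse to a star on the leaves 1 to 4 before they are combined.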

Neighbor-Joining vs. DCM-NJ
[Figure: False Positives (0 to 100) vs. Sequence Length (0 to 10000); curves: NJ, disc-nj]
- 93 taxon tree
- maximum mutation probability p(e) = 0.48
- 10 experiments per point
- Greedy asymmetric median tree, i.e. consensus over all trees {T_w}.

Neighbor-Joining vs. DCM-NJ
[Figure: False Negatives (0 to 100) vs. Sequence Length (0 to 10000); curves: NJ, disc-nj]
- 93 taxon tree
- maximum mutation probability p(e) = 0.48
- 10 experiments per point
- Greedy asymmetric median tree, i.e. consensus over all trees {T_w}.

NJ, Bio-NJ, Weighbor & DCM-NJ
[Figure: "Performance on 135 taxon tree, p(e)=0.64"; False Positives (0 to 50) vs. Sequence Length (0 to 3000); curves: NJ, Bio-NJ, Weighbor, DCM-NJ]
- 135 taxon tree
- maximum mutation probability p(e) = 0.64
- 3-5 experiments per point
- Greedy asymmetric median tree of a small subset of {T_w}.
Work in progress...

NJ, Bio-NJ, Weighbor & DCM-NJ
[Figure: "Performance on 135 taxon tree, p(e)=0.64"; False Negatives (0 to 50) vs. Sequence Length (0 to 3000); curves: NJ, Bio-NJ, Weighbor, DCM-NJ]
- 135 taxon tree
- maximum mutation probability p(e) = 0.64
- 3-5 experiments per point
- Greedy asymmetric median tree of a small subset of {T_w}.
Work in progress...

Choosing the Threshold for DCM-NJ
Choice of threshold is ruled by two factors:
- The accuracy of NJ degrades on subproblems with increasing threshold w.
- For small thresholds, the merger of subproblems is not uniquely defined.
[Figure: Error, Size (0 to 120) vs. Threshold Value (1.5 to 4.5); curves: FP, FN, MaxSize, NJ]
135 taxon tree, p(e) = 0.64, sequence length 300

Threshold and Merge Step
Let T be a model tree and d an estimated distance matrix. A short quartet around an internal edge e is a set of four taxa a, a', b, b' of minimal width that lie in the four subtrees induced by e.
[Figure: internal edge e with taxa a, a' on one side and b, b' on the other]
Theorem: If the threshold w is chosen large enough that every short quartet induces a four-clique in G, then every merger is unique and a DCM method will recover the model tree T, provided the base method is accurate on the base problems.

Sequence Lengths Required for Accuracy
- The length of biological sequences obtainable for phylogenetic analysis is bounded by a few thousand base pairs, so the question of how sequence length affects performance is critical.
- The sequence lengths that suffice for accuracy of distance methods such as Neighbor-Joining or the Buneman Tree grow exponentially in the divergence of the model tree. (Atteson 1997, Erdős et al. 1997)
- For DCM-boosted distance methods we can show: for almost all trees, polylogarithmic length suffices for accuracy with high probability, and polynomial length suffices for all trees with high probability.

DCM vs. Short Quartet Method
P. Erdős, M. Steel, L. Székely and T. Warnow (1997) introduced the Short Quartet Method (SQM), the first method known to require only polylogarithmic length sequences for complete accuracy with high probability.
Drawback: SQM returns nothing if complete accuracy is unachievable.
[Figure: False Negatives (0 to 100) vs. Sequence Length (0 to 20000); curves: DYC, DCM-Buneman]
Average performance (5 experiments per point) of the SQM compared with DCM-Buneman, on a 35 taxon tree with maximum p(e) equal to 0.04. For each dataset, SQM returns either 0% or 100% false negatives.

Other Methods with Similar Theoretical Properties
- M. Cryan, L.A. Goldberg and P.W. Goldberg, Evolutionary Trees can be Learned in Polynomial Time in the Two-State General Markov Model, FOCS 98.
- M. Csűrös and M.-Y. Kao, Recovering Evolutionary Trees Through Harmonic Greedy Triplets, Recomb 99.

Conclusion and Future Research
By reduction to small and closely related datasets, the DCM method can substantially improve the accuracy and/or time requirements of phylogenetic tree reconstruction methods for large and divergent datasets.
Future research will focus on:
- using non-iid evolutionary models
- Parsimony and MLE
- using model trees that stress the methods
- explicitly exploring the application to multiple sequence alignment
- investigating real data sets and determining how they are likely to respond to this approach
- optimal tree refinement: returning binary trees with edge weights
- reporting expected Robinson-Foulds error
- developing a public version of the software.