The Generalized Neighbor Joining method

Similar documents
Using algebraic geometry for phylogenetic reconstruction

Reconstructing Trees from Subtree Weights

PHYLOGENETIC ALGEBRAIC GEOMETRY

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic invariants versus classical phylogenetics

Constructing Evolutionary/Phylogenetic Trees

Introduction to Algebraic Statistics

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

ALGORITHMS FOR RECONSTRUCTING PHYLOGENETIC TREES FROM DISSIMILARITY MAPS

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

Metric learning for phylogenetic invariants

Phylogenetic Tree Reconstruction

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Algebraic Statistics Tutorial I

Theory of Evolution Charles Darwin

arxiv: v1 [q-bio.pe] 27 Oct 2011

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Open Problems in Algebraic Statistics

USING INVARIANTS FOR PHYLOGENETIC TREE CONSTRUCTION


Consistency Index (CI)

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Letter to the Editor. Department of Biology, Arizona State University

arxiv: v1 [q-bio.pe] 3 May 2016

Phylogeny Tree Algorithms

EVOLUTIONARY DISTANCES

Evolutionary Tree Analysis. Overview

Geometry of Phylogenetic Inference

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetic Algebraic Geometry

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

A Study on Correlations between Genes Functions and Evolutions. Natth Bejraburnin. A dissertation submitted in partial satisfaction of the

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Reconstruire le passé biologique modèles, méthodes, performances, limites

Algorithms in Bioinformatics

Theory of Evolution. Charles Darwin

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Constructing Evolutionary/Phylogenetic Trees

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

The least-squares approach to phylogenetics was first suggested

TheDisk-Covering MethodforTree Reconstruction

Phylogenetic Geometry

ALGEBRA: From Linear to Non-Linear. Bernd Sturmfels University of California at Berkeley

Phylogenetic Networks, Trees, and Clusters

Phylogeny of Mixture Models

Chapter 7: Models of discrete character evolution

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Phylogenetics: Likelihood

Introduction to Molecular Phylogeny

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites

Molecular Evolution & Phylogenetics

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Phylogenetic inference

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Phylogeny: traditional and Bayesian approaches

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Reading for Lecture 13 Release v10

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

A few logs suce to build (almost) all trees: Part II

Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction ABSTRACT

O 3 O 4 O 5. q 3. q 4. Transition

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

BINF6201/8201. Molecular phylogenetic methods

Reconstruction of certain phylogenetic networks from their tree-average distances

Reconstructibility of trees from subtree size frequencies

Phylogenetics: Building Phylogenetic Trees

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

BMI/CS 776 Lecture 4. Colin Dewey

Jed Chou. April 13, 2015

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Evolutionary Models. Evolutionary Models

Quartet Inference from SNP Data Under the Coalescent Model

66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

A Minimum Spanning Tree Framework for Inferring Phylogenies

Recent Advances in Phylogeny Reconstruction

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Lecture 9 : Identifiability of Markov Models

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

7. Tests for selection

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

Phylogeny: building the tree of life

Phylogenetics. BIOL 7711 Computational Bioscience

Indispensable monomials of toric ideals and Markov bases

Effects of Gap Open and Gap Extension Penalties

A concise proof of Kruskal s theorem on tensor decomposition

Workshop III: Evolutionary Genomics

Dr. Amira A. AL-Hosary

The Mathematics of Phylogenomics

Molecular Evolution and Phylogenetic Tree Reconstruction

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Transcription:

The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1

Challenge We would like to assemble the fungi tree of life. Francois Lutzoni and Rytas Vilgalys Department of Biology, Duke University 1500+ fungal species http://ocid.nacse.org/research/aftol/about.php data mining 2

Many problems to be solved... http://tolweb.org/tree?group=fungi Zygomycota is not monophyletic. The position of some lineages such as that of Glomales and of Engodonales-Mortierellales is unclear, but they may lie outside Zygomycota as independent lineages basal to the Ascomycota- Basidiomycota lineage (Bruns et al., 1993). data mining 3

Phylogeny Phylogenetic trees describe the evolutionary relations among groups of organisms. data mining 4

Constructing trees from sequence data Ten years ago most biologists would have agreed that all organisms evolved from a single ancestral cell that lived 3.5 billion or more years ago. More recent results, however, indicate that this family tree of life is far more complicated than was believed and may not have had a single root at all. (W. Ford Doolittle, (June 2000) Scientific American). Since the proliferation of Darwinian evolutionary biology, many scientists have sought a coherent explanation from the evolution of life and have tried to reconstruct phylogenetic trees. data mining 5

Methods to reconstruct a phylogenetic tree from DNA sequences include: The maximum likelihood estimation (MLE) methods: They describe evolution in terms of a discrete-state continuous-time Markov process. The substitution rate matrix can be estimated using the expectation maximization (EM) algorithm. (for eg. Dempster, Laird, and Rubin (1977), Felsenstein (1981)). Distance based methods: It computes pair-wise distances, which can be obtained easily, and combinatorially reconstructs a tree. The most popular method is the neighbor-joining (NJ) method. (for eg. Saito and Nei (1987), Studier and Keppler (1988)). data mining 6

However The MLE methods: An exhaustive search for the ML phylogenetic tree is computationally prohibitive for large data sets. The NJ method: The NJ phylogenetic tree for large data sets loses so much sequence information. Goal: Want an algorithm for phylogenetic tree reconstruction by combining the MLE method and the NJ method. Want to apply methods to very large datasets. Note: An algebraic view of these discrete stat problems might help solve this problem. data mining 7

The generalized neighbor-joining mathod The GNJ method: in 2005, Levy, Y., and Pachter introduced the generalized neighbor-joining (GNJ) method, which reconstructs a phylogenetic tree based on comparisons of subtrees rather than pairwise distances The GNJ method is a method combined with the MLE method and the NJ method. The GNJ method uses more sequence information: the resulting tree should be more accurate than the NJ method. The computational time: polynomial in terms of the number of DNA sequences. data mining 8

The GNJ method MJOIN is available at http://bio.math.berkeley.edu/mjoin/. data mining 9

Distance Matrix A distance matrix for a tree T is a matrix D whose entry D ij stands for the mutation distance between i and j. 1 3 4 4 1 1 a 3 1 r b 1 2 3 2 6 2 5 data mining 10

Distance Matrix 1 2 3 4 5 6 1 0 6 8 9 12 11 2 6 0 6 7 10 9 3 8 6 0 3 6 5 4 9 7 3 0 5 4 5 12 10 6 5 0 5 6 11 9 5 4 5 0 Table 1: Distance matrix D for the example. data mining 11

Definitions Def. A distance matrix D is a metric iff D satisfies: Symmetric: D ij = D ji and D ii = 0. Triangle Inequality: D ik + D jk D ij. Def. D is an additive metric iff there exists a tree T s.t. Every edge has a positive weight and every leaf is labeled by a distinct species in the given set. For every pair of i, j, D ij = the sum of the edge weights along the path from i to j. Also we call such T an additive tree. data mining 12

Neighbor Joining method Def. We call a pair of two distinct leaves {i, j} a cherry if there is exactly one intermediate node on the unique path between i and j. Thm. [Saitou-Nei, 1987 and Studier-Keppler, 1988] Let A R n n such that A ij = D(ij) (r i + r j )/(n 2), where r i := n k=1 D(ik). {i,j } is a cherry in T if A i j is a minimum for all i and j. Neighbor Joining Method: Input. A tree matric D. Output. An additive tree T. Idea. Initialize a star-like tree. Then find a cherry {i, j} and compute branch length from the interior node x to i and from x to j. Repeat this process recursively until we find all cherries. data mining 13

Neighbor Joining Method 1 2 3 4 8 7 6 5 2 5 4 1 2 1 5 1 2 2 5 2 1 8 8 8 8 8 7 7 7 7 7 6 6 6 6 6 5 5 5 5 5 4 4 4 4 4 3 3 3 3 3 2 2 2 2 2 1 1 1 1 1 Y X Y X Y X X Y Y X Z data mining 14

The GNJ method Extended the Neighbor Joining method with the total branch length of m-leaf subtrees. Increasing 2 m n 2, since there are more data, a reconstructed tree from GNJ method gets closer to the true tree than the Saito-Nei NJ method. The time complexity of GNJ method is O(n m ). Note: If m = 2, then GNJ method is the Neighbor Joining method with pairwise distances. data mining 15

Notation and definitions Notation. Let [n] denote the set {1, 2,...,n} and ( [n] m) denote the set of all m-element subsets of [n]. Def. A m-dissimilarity map is a function D : ( [n] m) R 0. In the context of phylogenetic trees, the map D(i 1, i 2,...,i m ) measures the weight of a subtree that spans the leaves i 1,i 2,...,i m. Denote D(i 1 i 2...i m ) := D(i 1,i 2,...,i m ). data mining 16

Weights of Subtrees in T k x2 i j x1 l D(ijkl) is the total branch length of the subtree in green. Also D(x 1 x 2 ) is the total branch length of the subtree in pink and it is also a pairwise distance between x 1 and x 2. data mining 17

Thm. [Levy, Y., Pachter, 2005] Let D m be an m-dissimilarity map on n leaves of a tree T, D m : ( [n] m) R 0 corresponding m-subtree weights, and define S(ij) := X ( [n]\{i,j} m 2 ) D m (ijx). Then S(ij) is a tree metric. Furthermore, if T is based on this tree metric S(ij) then T and T have the same tree topology and there is an invertible linear map between their edge weights. Note. This means that if we reconstruct T, then we can reconstruct T. data mining 18

Neighbor Joining with Subtree Weights Input: n DNA sequences and an integer 2 m n 2. Output: A phylogenetic tree T with n leaves. 1. Compute all m-subtree weights via the ML method. 2. Compute S(ij) for each pair of leaves i and j. 3. Apply Neighbor Joining method with a tree metric S(ij) and obtain additive tree T. 4. Using a one-to-one linear transformation, obtain a weight of each internal edge of T and a weight of each leaf edge of T. data mining 19

Complexity Lemma. [Levy, Pachter, Y.] If m 3, the time complexity of this algorithm is O(n m ), where n is the number of leaves of T and if m = 2, then the time complexity of this algorithm is O(n 3 ). Sketch of Proof: If m 3, the computation of S(ij) is O(n m ) (both steps are trivially parallelizable). The subsequent neighbor-joining is O(n 3 ) and edge weight reconstruction is O(n 2 ). If m = 2, then the subsequent neighbor-joining is O(n 3 ) which is greater than computing S(ij). So, the time complexity is O(n 3 ). Note: The running time complexity of the algorithm is O(n 3 ) for both m = 2 and m = 3. data mining 20

Cherry Picking Theorem Thm. [Levy, Pachter, Y.] Let T be a tree with n leaves and no nodes of degree 2 and let m be an integer satisfying 2 m n 2. Let D : ( [n] m) R 0 be the m-dissimilarity map corresponding to the weights of the subtrees of size m in T. If Q D (a b ) is a minimal element of the matrix Q D (ab) = ( ) n 2 m 1 X ( [n i j] m 2 ) D(ijX) X ( [n i] m 1) D(iX) X ( [n j] m 1) D(jX) then {a, b } is a cherry in the tree T. Note. The theorem by Saitou-Nei and Studier-Keppler is a corollary from Cherry Picking Theorem. data mining 21

Simulation Results With the Juke Cantor model. data mining 22

Consider two tree models... Modeled from Strimmer and von Haeseler. a/2 a a b b b b a/2 a b a a b b a/2 a a b b b b a/2 a b a b a b T1 T2 data mining 23

We generate 500 replications with the Jukes-Cantor model via a software evolver from PAML package. The number represents a percentage which we got the same tree topology. l a/b m=2 m=3 m=4 fastdnaml 500 0.01/0.07 68.2 76.8 80.4 74.8 0.02/0.19 54.2 61.2 73.6 55.6 0.03/0.42 10.4 12.6 23.8 12.6 1000 0.01/0.07 94.2 96 97.4 96.6 0.02/0.19 87.6 88.6 96.2 88 0.03/0.42 33.4 35 52.4 33.6 Table 2: Success Rates for the model T 1. data mining 24

l a/b m=2 m=3 m=4 fastdnaml 500 0.01/0.07 84.4 86 85.6 88.4 0.02/0.19 68.2 72 73.2 88.4 0.03/0.42 18.2 29.2 36.2 87.4 1000 0.01/0.07 95.6 97.8 97.4 99.4 0.02/0.19 88.4 89.6 93.4 99.8 0.03/0.42 40 48.2 57.6 96.6 Table 3: Success Rates for the model T 2. data mining 25

A unifying framework: Algebraic Statistics data mining 26

What is Algebraic Statistics? Algebraic Statistics is to apply computational commutative algebraic techniques to statistical problems. The algebraic view of discrete statistical models has been applied in many statistical problems, including: conditional inference [Diaconis and Sturmfels 1998] disclosure limitation [Sullivant 2005] the maximum likelihood estimation [Hosten et al 2004] parametric inference [Pachter and Sturmfels 2004] phylogenetic invariants [Allman and Rhodes 2003, Eriksson 2005, etc]. data mining 27

Algebraic statistical models An algebraic statistical model arises as the image of a polynomial map f : R d R m, θ = (θ 1,...,θ d ) ( p 1 (θ),p 2 (θ),...,p m (θ) ). The unknowns θ 1,...,θ d represent the model parameters. In the view of algebraic geometry, statistical models are algebraic varieties, sets of points where all given polynomials vanish at the same time. Note: The phylogenetic models are also algebraic varieties. Note: The MLE problem is a polynomial optimization problem over the image of f. data mining 28

Jukes-Cantor Model Consider the Jukes-Cantor (JC) model. The JC model has substitution rate matrix: Q = 3α α α α α 3α α α α α 3α α α α α 3α where α 0 is a parameter. The corresponding substitution matrix equals 1 + 3e 4αt 1 e 4αt 1 e 4αt 1 e 4αt θ(t) = 1 1 e 4αt 1 + 3e 4αt 1 e 4αt 1 e 4αt 4 1 e 4αt 1 e 4αt 1 + 3e 4αt 1 e 4αt 1 e 4αt 1 e 4αt 1 e 4αt 1 + 3e 4αt data mining 29

However, they are not polynomials... But we can do the following: Introduce the new two parameters π i = 1 4 (1 e 4α it i ) and µ i = 1 4 (1 + 3e 4α it i ). These parameters satisfy the linear constraint µ i + 3π i = 1, and the branch length t i of the ith edge can be recovered as follows: 3α i t i = 1 4 log det( θ i) = 3 4 log(1 4π i). data mining 30

The parameters are simply the entries in the substitution matrix θ i = µ i π i π i π i π i µ i π i π i π i π i µ i π i π i π i π i µ i. The Jukes Cantor model on the tree T with r edges and n leaves is the polynomial map f : R r R 4n. data mining 31

Example Suppose we have an unrooted tree T with leaves {1, 2, 3} with letters Σ = {A, C, G, T } at a single site. Want to estimate all parameters. G This model is a three-dimensional algebraic variety, given as the image of a trilinear map f : R 3 R 64. C A data mining 32

Example cont Let p 123 be the probability of observing the same letter at all three leaves, p ij the probability of observing the same letter at the leaves i,j and a different one at the third leaf, and p dis the probability of seeing three distinct letters. p 123 = µ 1 µ 2 µ 3 + 3π 1 π 2 π 3, p dis = 6µ 1 π 2 π 3 + 6π 1 µ 2 π 3 + 6π 1 π 2 µ 3 + 6π 1 π 2 π 3, p 12 = 3µ 1 µ 2 π 3 + 3π 1 π 2 µ 3 + 6π 1 π 2 π 3, p 13 = 3µ 1 π 2 µ 3 + 3π 1 µ 2 π 3 + 6π 1 π 2 π 3, p 23 = 3π 1 µ 2 µ 3 + 3µ 1 π 2 π 3 + 6π 1 π 2 π 3. data mining 33

All 64 coordinates of f are given by these five trilinear polynomials, namely, f AAA = f CCC = f GGG = f TTT = 1 4 p 123, f ACG = f ACT = = f GTC = f AAC = f AAT = = f TTG = f ACA = f ATA = = f TGT = f CAA = f TAA = = f GTT = 1 24 p dis, 1 12 p 12, 1 12 p 13, 1 12 p 23. This means that the Jukes Cantor model is the image of the simplified map f : R 3 R 5, ( (µ 1,π 1 ),(µ 2,π 2 ),(µ 3,π 3 ) ) (p 123,p dis,p 12,p 13,p 23 ). data mining 34

Characterize the image of f Do the following linear change of coordinates: q 111 = p 123 + 1 3 p dis 1 3 p 12 1 3 p 13 1 3 p 23 = (µ 1 π 1 )(µ 2 π 2 )(µ 3 π 3 ) q 110 = p 123 1 3 p dis + p 12 1 3 p 13 1 3 p 23 = (µ 1 π 1 )(µ 2 π 2 )(µ 3 + 3π 3 ) q 101 = p 123 1 3 p dis 1 3 p 12 + p 13 1 3 p 23 = (µ 1 π 1 )(µ 2 + 3π 2 )(µ 3 π 3 ) q 011 = p 123 1 3 p dis 1 3 p 12 1 3 p 13 + p 23 = (µ 1 + 3π 1 )(µ 2 π 2 )(µ 3 π 3 ) q 000 = p 123 + p dis + p 12 + p 13 + p 23 = (µ 1 + 3π 1 )(µ 2 + 3π 2 )(µ 3 + 3π 3 ). This model is the hypersurface in 4 whose ideal equals P f = q 000 q 2 111 q 011 q 101 q 110. data mining 35

MLE with the JC model Suppose (U 123, U dis, U 12,U 13,U 23 ) is the observed data. Then the MLE with the JC model for T is max p U 123 123 pu dis dis pu 12 12 pu 13 13 pu 23 23 subject to θ Θ. data mining 36

Note: The image of f is a hyper-surface over R 5. θ parameter space f p probability simplex data Figure 1: θ is the global maxima and p is an image under f. data mining 37

Commutative algebraic methods to phylogenetics. Using the algebraic techniques with the JC model with triplets, interval arithmetics, and the GNJ method, one can reconstruct a phylogenetic tree from DNA sequences (Sainudiin and Y. 2005). One can find more tree invariants with the JC model, the Kimura 2- parameter model (K80), and the Kimura 3-parameter model (K81) at http://www.math.tamu.edu/ lgp/small-trees/small-trees.html. Using these invariants and the GNJ method one can reconstruct a phylogenetic tree from DNA sequences (Contois and Levy, 2005). One can find more applications of algebra to computational biology at our new book Algebraic Statistics for Computational Biology edited by Pachter and Sturmfels, Cambridge University Press 2005. data mining 38

D. Levy (Math, Berkeley), L. Pachter (Math, Berkeley), and R. Yoshida, Beyond Pairwise Distances: Neighbor Joining with Phylogenetic Diversity Estimates the Molecular Biology and Evolution, Advanced Access, November 9, (2005). A. Hobolth (Bioinformatics, NCSU) and R. Yoshida, Maximum likelihood estimation of phylogenetic tree and substitution rates via generalized neighbor-joining and the EM algorithm, Algebraic Biology 2005, Computer Algebra in Biology, edited by H. Anai and K. Horimoto, vol. 1 (2005) p41-50, Universal Academy Press, INC.. (Also available at arxiv:q-bio.qm/0511034.) R. Sainudiin (Statistics, Oxford) and R. Yoshida, Applications of Interval Methods to Phylogenetic trees Algebraic Statistics for Computational Biology edited by Lior Pachter and Bernd Sturmfels, (2005) Cambridge University Press, p359-374. data mining 39

Thank you... data mining 40