Walks in Phylogenetic Treespace

Alan Joseph Caceres, Samantha Daley, John DeJesus, Michael Hintze, Diquan Moore, Katherine St. John

Dept. of Math & Computer Science, Lehman College, City University of New York, Bronx, New York, 10468. Corresponding author: stjohn@lehman.cuny.edu. Support provided by NSF Grant #09-20920. Supported by an undergraduate research fellowship from the Louis Stokes Alliance for Minority Participation in Research, NSF #07-03449.

Abstract

We prove that the spaces of unrooted phylogenetic trees are Hamiltonian for two popular search metrics: Subtree Prune and Regraft (SPR) and Tree Bisection and Reconnection (TBR). Further, we make progress on two conjectures of Bryant on searching phylogenetic treespace: treespace under the Nearest Neighbor Interchange (NNI) metric has a 2-walk, and there exist SPR neighborhoods without complete NNI walks.

1 Introduction

Finding the evolutionary history, or phylogeny, for a set of species is a core activity in biology and is used for classifying species, building the Tree of Life, designing the flu vaccine, and determining the origins of viruses such as HIV [?]. These phylogenies are often modelled by unrooted, leaf-labelled trees [?], and finding the optimal tree for biological data is NP-hard [?,?]. To improve these computationally expensive searches, we focus on the underlying space of trees, under metrics induced by popular operations used to search the space of all trees with n leaves.

We make progress towards solving two conjectures of Bryant [?] on the shortest walk of the full treespace and of an SPR neighborhood under the NNI metric. We show that there is a Hamiltonian circuit of treespace for the SPR and TBR metrics, and for the NNI metric for $n < 8$. Further, we show that there exists a 2-walk for treespace under the NNI metric. Towards the second conjecture of Bryant, we show that for every $n > 5$, there is an SPR neighborhood for which there is no NNI 1-walk (Hamiltonian path).

Figure 1: The two possible NNI operations on an internal edge $(u, v)$.

Prior work has focused on the complexity of calculating the metrics (all three are NP-hard to compute [?,?,?,?]) as well as calculating the diameter of the space. By work of Li et al. and DasGupta et al. [?,?], the diameter for NNI is $\Theta(n \lg n)$, while Allen and Steel showed that the diameter of the space for SPR and TBR is $\Theta(n)$. Bastert et al. [?] examined the space of trees under NNI in terms of coherent algebras and spectrum. Motivated by the computationally expensive searches of treespace, Bryant [?] posed the questions of how many NNI moves are needed to visit all trees, and similarly how many NNI moves are needed to traverse an SPR neighborhood.

2 Background

We briefly outline the background needed for this paper. For a more thorough overview, see Semple and Steel [?] and Cormen, Leiserson, and Rivest [?].

2.1 Phylogenetic Trees and Distances

The evolutionary relationship between various biological organisms can be represented as a leaf-labelled tree, where the tree is rooted if the evolutionary origin is known. We focus on binary (or fully resolved), unrooted trees. The leaf labels often represent DNA or protein sequences, and the internal nodes correspond to speciation events.

Finding the optimal evolutionary tree for a set of species is NP-hard [?,?]. Biologically inspired operations are used to heuristically search the space of all trees. These search heuristics often produce different evolutionary trees on the same set of species. In order to compare phylogenies, several metrics for measuring distance have been defined in the literature.

Figure 2: The trees on the left and center differ by a single SPR move. The tree on the right differs by a single TBR move from the center tree.

The Nearest Neighbor Interchange (NNI) is a distance metric introduced independently by DasGupta et al. [?] and Li et al. [?]. An NNI operation swaps two subtrees that are separated by an internal edge in order to generate a new tree [?] (see Figure ??).

Definition 1. The NNI distance, $d_{NNI}(T_1, T_2)$, between two trees $T_1$ and $T_2$ is defined as the minimum number of NNI operations required to change one tree into the other.

The complexity of computing the NNI distance was open for over 25 years, and was proven to be NP-complete by Allen and Steel [?]. A tree with n uniquely labeled leaves has $n - 3$ internal branches, and each internal branch admits two NNI operations. Thus, there are $2(n - 3)$ NNI rearrangements for any tree. The trees that are a single rearrangement of a given tree form its 1-neighborhood under the NNI metric.

The most popular move used to search treespace is the Subtree-Prune-and-Regraft (SPR). Roughly, an SPR move prunes a selected subtree and then reattaches it on an edge selected from the remaining tree (see Figure ??).

Definition 2. The SPR distance, $d_{SPR}(T_1, T_2)$, between two trees is the minimal number of SPR moves needed to transform the first tree into the second tree (see Figure ??).

The calculation of SPR distances has been proven NP-complete for both rooted and unrooted trees [?,?]. Approximation algorithms for SPR on rooted trees exist [?,?].

A generalization of the SPR is the Tree-Bisection-and-Reconnection (TBR) operation. Roughly, a TBR move breaks an edge of the tree and then selects two edges on the resulting subtrees and connects the selected edges by a new edge.

Definition 3. The TBR distance, $d_{TBR}(T_1, T_2)$, between two trees is the minimal number of TBR moves needed to transform the first tree into the second tree (see Figure ??).

Figure 3: Walks of Treespace. Left: The solid edges are a 1-walk (Hamiltonian path) of 5-species treespace under the NNI metric. Right: For 6-species treespace, an SPR neighborhood with the NNI connections, illustrating that no NNI 1-walk exists.

We note that NNI $\subseteq$ SPR $\subseteq$ TBR [?]. That is, every NNI move is also an SPR move, which is also a TBR move. We use this fact to extend our results from the SPR metric to the TBR metric.

Each metric induces a discrete metric space on the set of n-leaf trees. Since these spaces are defined both by the number of leaves in the underlying trees and by the metric chosen, we will refer to these spaces as the n-species treespace under the said metric. The majority of this paper concentrates on the n-species treespace under the NNI metric and the SPR metric. Figure ?? shows the 5-species treespace under the NNI metric.

2.2 Walks

We focus on the shortest paths needed to visit all trees in a given treespace. In particular, we are interested in the length of the shortest path that visits all trees, with the goal of bounding searches for the optimal phylogenetic tree.

Definition 4. An undirected graph G has a k-walk if there is a walk along its edges that visits every vertex at least once and at most k times.

The special case of $k = 1$ is often referred to as a Hamiltonian path:

Definition 5. An undirected graph G has a Hamiltonian path if a walk along the edges can visit every vertex exactly once.

A Hamiltonian cycle, also known as a Hamiltonian circuit, is a walk along the edges of an undirected graph that visits every vertex exactly once and ends at the beginning vertex. A graph with a Hamiltonian path is called traceable. A proposed Hamiltonian path can be verified in polynomial time, while deciding whether a graph has a Hamiltonian cycle is NP-complete [?]. A proposed 2-walk can likewise be verified in polynomial time. When a Hamiltonian path is unavailable, a 2-walk can be used to reduce the maximum number of times a single tree is evaluated by a search algorithm.

Figure 4: Theorem 1. Left: the circuit that visits all trees in 4-species space. Center: the expansion of each tree by a new leaf (the sets $S_i$) and the connection between them. Right: Hamiltonian circuit of 5-species space under the SPR (and TBR) metric.

3 Walks of Treespace

We show that for every n, n-species treespace under the popular SPR and TBR metrics has a Hamiltonian circuit. Further, we show that under the NNI metric, for every n, n-species treespace has a 2-walk.

3.1 Hamiltonian Circuits under the SPR and TBR Metrics

Since every SPR move is also a TBR move, it suffices to show that there is a Hamiltonian circuit for treespace under the SPR metric.
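The walk definitions of Section 2.2 translate directly into a polynomial-time check. The following is a minimal sketch in Python, not taken from the paper: the function names `is_k_walk` and `is_hamiltonian_circuit`, the adjacency-dictionary representation, and the two small example graphs are assumptions made for illustration.

```python
from collections import Counter


def is_k_walk(adjacency, walk, k):
    """Check that `walk` follows edges of the graph and visits every
    vertex at least once and at most k times (Definition 4)."""
    # Every vertex named in the walk must belong to the graph.
    if not set(walk) <= set(adjacency):
        return False
    # Consecutive vertices of the walk must be joined by an edge.
    for u, v in zip(walk, walk[1:]):
        if v not in adjacency[u]:
            return False
    visits = Counter(walk)
    return all(1 <= visits[v] <= k for v in adjacency)


def is_hamiltonian_circuit(adjacency, walk):
    """A 1-walk on at least three vertices whose last vertex is
    adjacent to its first, so the walk can be closed into a cycle."""
    return (len(walk) >= 3
            and is_k_walk(adjacency, walk, 1)
            and walk[0] in adjacency[walk[-1]])


# 4-species treespace: three trees, pairwise adjacent (a triangle).
triangle = {"t1": {"t2", "t3"}, "t2": {"t1", "t3"}, "t3": {"t1", "t2"}}

# Illustrative star graph (not a treespace): no Hamiltonian path exists,
# but a 2-walk does, by passing through the center "c" twice.
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
```

Both checks run in time linear in the length of the walk plus the number of vertices, which is the sense in which a proposed Hamiltonian path or 2-walk is easy to verify even though finding one may be hard.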

Theorem 1. For every n, there exists a Hamiltonian circuit of the n-species treespace under the SPR and TBR metrics.

Proof. Proof by induction on n. For $n = 4$, there are exactly three trees, and they form a complete graph under the SPR metric, and thus have a Hamiltonian circuit.

For $n > 4$: assume true for n-species treespace and show for $(n + 1)$-species treespace. Let $t_1, t_2, \ldots, t_{(2n-5)!!}$ be the trees of the n-species treespace. Each $(n + 1)$-species tree can be uniquely generated by adding leaf $n + 1$ to an edge of one of the n-species trees [?]. Let $S_i$ be the set of $(n + 1)$-species trees generated from $t_i$ for $i = 1, 2, \ldots, (2n-5)!!$. We note that each $S_i$ forms a complete graph under the SPR metric (see Figure 4). By the inductive hypothesis, there exists a Hamiltonian circuit, $H_n$, on the n-species treespace, which will form the backbone of a Hamiltonian circuit for the $(n + 1)$-species treespace.

To create a Hamiltonian circuit for $(n + 1)$-species treespace, we replace each $t_i$ with the set $S_i$. Since $S_i$ is a complete graph, we can find a Hamiltonian path with any starting and ending points. Define the Hamiltonian circuit $H_{n+1}$ as follows: for each $S_i$, let $h_i$ be the Hamiltonian path that begins with the tree with 1 and $n + 1$ as a sibling pair and ends with the tree with n and $n + 1$ as a sibling pair. Begin the circuit of $S_1$ by traversing the trees in $h_1$. Attach the last tree in $h_1$ to the last tree in $h_2$. This is possible since, by construction, the new leaf is attached in identical positions, and these two trees can be connected by the first edge in $H_n$. Continue by following the path $h_2$ in reverse until reaching the first tree in its path. By similar construction, $H_{n+1}$ can be extended to traverse the first $(2n-5)!! - 1$ sets of trees.

Since the number of these sets is odd, we need to do something slightly different for the last two sets. For $S_{(2n-5)!!-1}$, choose $h_{(2n-5)!!-1}$ to be the Hamiltonian path from the tree with 1 and $n + 1$ as a sibling pair to the tree with 2 and $n + 1$ as a sibling pair. For the last set, $S_{(2n-5)!!}$, choose $h_{(2n-5)!!}$ to be the Hamiltonian path from the tree with 2 and $n + 1$ as a sibling pair to the tree with n and $n + 1$ as a sibling pair. We continue the circuit $H_{n+1}$ across $h_{(2n-5)!!-1}$ and $h_{(2n-5)!!}$, connecting the last tree in $h_{(2n-5)!!}$ to the first tree in $h_1$.

It is easy to check that this circuit visits every tree and is thus a Hamiltonian circuit of $(n + 1)$-species space under the SPR metric. Since SPR $\subseteq$ TBR, the result also holds for TBR.
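The induction in the proof of Theorem 1 rests on the fact that each (n + 1)-species tree arises exactly once by adding the new leaf to an edge of an n-species tree. The following Python sketch builds treespace this way so that the counts $(2n-5)!!$ and $n - 3$ can be checked for small n. It is an illustration under stated assumptions, not the authors' code: the edge-list encoding (positive integers for leaves, negative integers for internal nodes) and all function names are hypothetical.

```python
def add_leaf(edges, index, leaf, internal):
    """Subdivide edges[index] with a new internal node and attach `leaf`."""
    u, v = edges[index]
    rest = edges[:index] + edges[index + 1:]
    return rest + [(u, internal), (internal, v), (internal, leaf)]


def expansions(edges, leaf, internal):
    """The set S_i: every tree obtained by adding `leaf` to one edge."""
    return [add_leaf(edges, i, leaf, internal) for i in range(len(edges))]


def all_trees(n):
    """All unrooted binary trees on leaves 1..n (n >= 3), built leaf by leaf."""
    trees = [[(1, -1), (2, -1), (3, -1)]]  # the single 3-leaf tree
    for leaf in range(4, n + 1):
        # Use a fresh negative label for each new internal node.
        trees = [t for tree in trees
                 for t in expansions(tree, leaf, -(leaf - 2))]
    return trees


def clades(edges):
    """Canonical form of a tree: for each edge except the one at leaf 1,
    the set of leaves on the side of that edge away from leaf 1."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    found = set()

    def below(node, parent):
        if node > 0:  # a leaf
            leaves = frozenset([node])
        else:  # an internal node: union over its other neighbors
            leaves = frozenset().union(
                *(below(c, node) for c in adjacency[node] if c != parent))
        found.add(leaves)
        return leaves

    below(adjacency[1][0], 1)
    return frozenset(found)


def internal_edges(edges):
    """Number of internal branches (both endpoints are internal nodes)."""
    return sum(1 for u, v in edges if u < 0 and v < 0)


def double_factorial(m):
    """m!! = m * (m - 2) * (m - 4) * ... down to 1 or 2."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result
```

For example, `len(all_trees(5))` can be compared with `double_factorial(5)`, and comparing the number of distinct `clades` values with the number of generated trees checks that no tree is produced twice. The sketch does not implement SPR moves, so it does not check that each $S_i$ is a complete graph.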

3.2 2-Walk under the NNI Metric

For the NNI metric, we can generalize the argument of Theorem ??. We rely on the fact that every set of $(n + 1)$-leaf trees generated from a single n-leaf tree (the sets $S_i$ above) can be traversed by a 2-walk. We note that these sets $S_i$ for NNI are not complete graphs (unlike for SPR and TBR), and further, there exist sets for $n > 6$ where no Hamiltonian path is possible. Thus, our argument for Hamiltonicity of treespaces under the SPR and TBR metrics does not extend to NNI, and the existence of a Hamiltonian walk through an n-species treespace under the NNI metric is currently unknown. We can, however, prove the existence of a 2-walk through any n-species treespace.

Theorem 2. For every n, there exists a 2-walk of the n-species treespace under the NNI metric.

Proof. Proof by induction on n. For $n = 4$, there are 3 trees connected in a triangle, and thus they have a 2-walk.

For $n > 4$: Let $W_n$ be an NNI 2-walk of the n-species treespace. As in Theorem 1, let $S_i$ be the set of $(n + 1)$-species trees generated from $t_i$ for $i = 1, 2, \ldots, (2n-5)!!$. We note that, by construction, each tree in $S_i$ corresponds to an edge of the tree $t_i$. Let P be a planar tree drawing of $t_i$ and let $C_i$ be a closed curve such that $C_i$ is within $\epsilon$ of P for sufficiently small $\epsilon > 0$. $C_i$ induces a 2-walk of the set $S_i$, by following the curve counter-clockwise around the nodes of the tree. Following a similar construction to the proof of Theorem 1, we can build a new 2-walk $W_{n+1}$ of the $(n + 1)$-species space that visits all the elements of the space.

4 NNI-Walks on SPR Neighborhoods

Bryant [?] asked what is the shortest NNI walk that visits every tree in an SPR neighborhood. We make progress on this conjecture by showing that for every $n \geq 6$, there exists an SPR neighborhood without an NNI 1-walk.

Theorem 3. For every $n \geq 6$, there exists an SPR neighborhood of the n-species treespace that does not have a Hamiltonian path.

Proof. Figure ?? shows an SPR neighborhood of 6-species treespace where no NNI 1-walk exists, since it contains 8 nodes of degree 2 (the highlighted "triangles"), only two of which could be traversed in any path that visits nodes only once. For larger n, we can find similar regions of the SPR neighborhood that preclude any Hamiltonian path of the SPR neighborhood.

For $n \geq 6$, let
$T = (\ldots(((1, 2), 3), 4), \ldots, n-1), n)$,
$T_1 = (\ldots(((n, 1), 2), \ldots, n-2), n-1)$,
$T_2 = (\ldots(((n, 2), 1), \ldots, n-2), n-1)$, and
$T_3 = (\ldots(((1, 2), n), 3), \ldots, n-2), n-1)$.

$T_1$, $T_2$, and $T_3$ are each a single SPR move from T and thus part of the SPR neighborhood of T. We note that $d_{NNI}(T_1, T_2) = d_{NNI}(T_2, T_3) = d_{NNI}(T_3, T_1) = 1$ and, by analysis by cases, no other NNI neighbors of $T_1$ and $T_2$ occur in the SPR neighborhood. So, any path from $T_1$ and $T_2$ to the rest of the space must pass through $T_3$. We refer to $T_1$, $T_2$, and $T_3$ as an isolated triangle in the NNI graph of the neighborhood.

Note that 3 other isolated triangles exist, namely the trees resulting from moving the subtree consisting of $n - 1$ to neighbor 1 and 2, those resulting from moving the subtree consisting of 1 to neighbor $n - 1$ and n, and those resulting from moving the subtree consisting of 2 to neighbor $n - 1$ and n. A path that enters an isolated triangle through its single connecting tree cannot leave again without revisiting that tree, so each isolated triangle must contain an endpoint of any Hamiltonian path. A path has only two endpoints, so visiting all four isolated triangles in the same path would require visiting nodes twice. Thus, no Hamiltonian path exists.

5 Acknowledgments

We would like to thank the members of the Treespace Working Group at CUNY for helpful insights and conversations: Ann Marie Alcocer, Kadian Brown, Eric Ford, Kaitlin Hansen, Daniele Ippolito, Jinnie Lee, Ling Li, and Oliver Mendez.