ALGORITHMS FOR RECONSTRUCTING PHYLOGENETIC TREES FROM DISSIMILARITY MAPS


DAN LEVY, FRANCIS EDWARD SU, AND RURIKO YOSHIDA

Manuscript, December 15, 2003

Abstract. In this paper we improve on an algorithm by Pachter-Speyer for reconstruction of a phylogenetic tree from its size-m subtree weights. We provide an especially efficient algorithm for reconstruction from 3-weights.

Partially supported by NSF Grant DMS-0301129. The authors thank Bernd Sturmfels and Lior Pachter for encouragement and a seminar that inspired this work.

1. Introduction

Let [n] denote the set {1, 2, ..., n} and \binom{[n]}{m} denote the set of all m-element subsets of [n]. An m-dissimilarity map is a function D : \binom{[n]}{m} \to \mathbb{R}_{\ge 0}, which may be thought of as a measure of how dissimilar a set of m elements is. Note that the map is non-negative, and we further assume it is symmetric in all of its arguments. In the context of phylogenetic trees, the map D(i_1, i_2, ..., i_m) may measure the weight of the subtree that spans the leaves i_1, i_2, ..., i_m.

In this paper, we show how to construct a tree from dissimilarity map information, assuming the map comes from a tree. This follows work of Pachter-Speyer [?], who gave theoretical conditions under which a tree may be constructed from knowledge of the dissimilarity map. They show that a tree may be reconstructed from its m-dissimilarities if and only if n > 2m - 2. Their method proceeds by first finding the splits, then using the fact that the topology can be recovered from the splits, and then finding the weights of the edges. In contrast with their paper, we construct the topology of the tree and its edge weights at the same time.

2. Notation

Instead of writing D(i_1, i_2, ..., i_m), we will often omit the commas and write D(i_1 i_2 ... i_m). Also, we will concatenate sets as well as points in this notation, so if R represents a subset of leaves of size m - 2, and i, j are other leaves, then we can write D(ijR) for the function D applied to the set {i, j} ∪ R. We write [ijR] for the subtree spanned by the leaves in {i, j} ∪ R.

We say {i, j}, or ij, is a cherry if there is exactly one intermediate node on the unique path between i and j. We say that ij is a sub-cherry in a subtree T' if ij is a cherry in the subtree T'. When T' is understood, we will just say that ij is a sub-cherry. Call the node where ij meets the rest of the tree the cherry node of ij.
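
As a concrete illustration of the objects just defined, here is a minimal Python sketch (ours, not part of the manuscript) of how the m-subtree weights D(i_1 ... i_m) can be computed from a weighted tree; the tree representation as an adjacency dictionary and all function names are our own choices.

```python
from itertools import combinations

def subtree_weight(adj, leaf_set):
    """Weight of the minimal subtree spanning `leaf_set` in a weighted tree.

    `adj` is an adjacency dict: adj[u] = {v: weight of edge uv, ...}.
    An edge belongs to the spanning subtree exactly when each of its two sides
    contains at least one leaf of `leaf_set`; a rooted DFS that counts target
    leaves below every node detects this in linear time.
    """
    targets = set(leaf_set)
    root = next(iter(adj))
    total = 0.0

    def dfs(u, parent):
        nonlocal total
        below = 1 if u in targets else 0
        for v, w in adj[u].items():
            if v == parent:
                continue
            c = dfs(v, u)
            if 0 < c < len(targets):
                total += w   # edge (u, v) separates some target leaves from the rest
            below += c
        return below

    dfs(root, None)
    return total

def m_dissimilarity_map(adj, leaves, m):
    """All size-m subtree weights, keyed by frozensets of leaves."""
    return {frozenset(S): subtree_weight(adj, S) for S in combinations(leaves, m)}
```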

Four-point condition for m-dissimilarity maps. We now show an analogue of the four-point condition for dissimilarity maps that come from trees.

Theorem 1. Given a binary tree T, let i, j, k, l be distinct leaves, R a subset of leaves of size m - 2 not containing i, j, k, l, and D an m-dissimilarity map representing m-subtree weights. Then among the three quantities

(1)    D(ijR) + D(klR),  D(ikR) + D(jlR),  D(ilR) + D(jkR)

the maximum will be achieved at least twice. The maximum will be achieved three times if and only if no pair of leaves in i, j, k, l forms a sub-cherry in [ijklR].

Proof. There are essentially six ways that the subtree [R] could relate to the quartet subtree [ijkl]:
(1) The subtree [R] could hang off a leaf of [ijkl].
(2) The subtree [R] could intersect the interior of a leaf edge of [ijkl].
(3) The subtree [R] could hang off the splitting edge of [ijkl].
(4) The subtree [R] could intersect the interior of the splitting edge of [ijkl].
(5) The subtree [R] could intersect the interiors of the splitting edge and the two leaf edges of one sub-cherry of [ijkl].
(6) The subtree [R] could intersect the interiors of all edges of [ijkl].
See Figure ?? for illustrations of these cases. Only in the final case will all three quantities in (1) be equal.

As an immediate corollary, we have a criterion that will help us determine whether a pair of leaves i, j forms a sub-cherry in a subtree of size m + 2.

Theorem 2. Given a binary tree T, let R be a subset of leaves of size m - 2, let i, j, k, l be leaves not in R, and let D represent m-subtree weights. Then

(2)    D(ijR) + D(klR) < D(ikR) + D(jlR) = D(ilR) + D(jkR)

if and only if at least one of ij or kl is a sub-cherry in [ijklR].

Proof. Consider the six cases of Figure ?? and note that, except in the final case, either ij or kl is a sub-cherry of [ijklR].

Define the difference

(3)    d(ij, kl, R) = (1/2)(D(ikR) + D(jlR) + D(ilR) + D(jkR)) - (D(ijR) + D(klR)).

Note that the order of the arguments of d matters. One can check that d(ij, kl, R) is symmetric in the first two leaves and in the next two leaves, and that d(ij, kl, R) = d(kl, ij, R). Also:

Theorem 3. For any dissimilarity map D and d defined as in (3), the following identity holds:

(4)    d(ij, kl, R) + d(ik, jl, R) + d(il, jk, R) = 0.

Proof. It is easy to check that the sum of these quantities is 0 by inspecting the right-hand sides:
d(ij, kl, R) = (1/2)(D(ikR) + D(jlR) + D(ilR) + D(jkR)) - (D(ijR) + D(klR)),
d(ik, jl, R) = (1/2)(D(ijR) + D(klR) + D(ilR) + D(jkR)) - (D(ikR) + D(jlR)),
d(il, jk, R) = (1/2)(D(ikR) + D(jlR) + D(ijR) + D(klR)) - (D(ilR) + D(jkR)).
Each of the six weights appears with total coefficient (1/2) + (1/2) - 1 = 0.

Thus the above identity holds even if the data for the dissimilarity map D does not come from a tree. If the data for D does arise from a tree, then in any identity of the form (4), either all three terms will be zero, or exactly one will be positive and the other two will be negative and equal.
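
The quantity d of equation (3) is straightforward to compute from any m-dissimilarity map; the following short sketch (ours) stores the map as a dictionary keyed by frozensets of leaves, as produced by m_dissimilarity_map above, and checks identity (4) numerically.

```python
def D_of(dmap, *leaves):
    """Look up the m-subtree weight D on the set of the given leaves."""
    return dmap[frozenset(leaves)]

def d(dmap, i, j, k, l, R=()):
    """The quantity d(ij, kl, R) of equation (3)."""
    R = tuple(R)
    half = 0.5 * (D_of(dmap, i, k, *R) + D_of(dmap, j, l, *R)
                  + D_of(dmap, i, l, *R) + D_of(dmap, j, k, *R))
    return half - (D_of(dmap, i, j, *R) + D_of(dmap, k, l, *R))

def check_identity_4(dmap, i, j, k, l, R=()):
    """Identity (4): the three pairings of {i, j, k, l} relative to R sum to 0."""
    s = d(dmap, i, j, k, l, R) + d(dmap, i, k, j, l, R) + d(dmap, i, l, j, k, R)
    return abs(s) < 1e-9
```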

For convenience, we rephrase Theorem 2 in terms of our function d:

Theorem 4. Let i, j, k, l be leaves of a tree and R a subset of leaves of size m - 2 that does not contain i, j, k, l. Then d(ij, kl, R) > 0 if and only if at least one of ij or kl is a sub-cherry in [ijklR].

We can use this to locate sub-cherries with respect to R, i.e., to test whether ij is a sub-cherry in [ijR]. First, find some pair kl that is not a sub-cherry in [klR], by finding some d(kl, pq, R) ≤ 0. Then we can use d(ij, kl, R) to determine whether ij is a sub-cherry in [ijklR], and hence in [ijR]. What is more, the quantity d(ij, kl, R) actually measures something!

Theorem 5. In a binary tree, if d(ij, kl, R) > 0 and kl is not a sub-cherry in [klR], then d(ij, kl, R) measures the distance between [klR] and the cherry node of ij in [ijklR].

Proof. This follows from considering the cases in Figure ??. Case 6 cannot occur because d(ij, kl, R) > 0. Cases 3 and 4 cannot occur because kl would then be a sub-cherry in [klR], contrary to assumption. In all the remaining cases, one can check that d(ij, kl, R) measures the distance between [klR] and the cherry node of ij in [ijklR].

3. Reconstructing trees from 3-dissimilarity maps

We consider the case of reconstructing trees from 3-weights. (In this section, the letter m will refer to a leaf.) Note that a 5-leaf tree always consists of two cherries and a middle leaf incident to the path between them. We need the following useful fact:

Theorem 6. In a binary tree, if d(ij, kl, m) > 0, then we have a split (i, j; k, l), and d(ij, kl, m) measures the distance between the cherry nodes of ij and kl in the subtree [ijklm], regardless of the choice of m.

Proof. The proof is similar to that of Theorem 5. Case 6 cannot occur because d(ij, kl, m) > 0, and Cases 2, 4 and 5 cannot occur because R = {m} consists of a single leaf. In the remaining cases, one can check that d(ij, kl, m) measures the distance between the cherry nodes of ij and kl in [ijklm].
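
Theorem 6 thus gives a direct numerical test for splits among five leaves. A minimal sketch (ours), using the function d from above:

```python
def find_split(dmap, i, j, k, l, m, tol=1e-9):
    """Among the three pairings of {i, j, k, l}, return the pairing whose d-value
    relative to the fifth leaf m is positive, together with that value; by
    Theorem 6 the pairing is a split and the value is the distance between the
    two cherry nodes in [ijklm].  Returns (None, 0.0) if no pairing is positive."""
    for (a, b, c, e) in ((i, j, k, l), (i, k, j, l), (i, l, j, k)):
        val = d(dmap, a, b, c, e, (m,))
        if val > tol:
            return ((a, b), (c, e)), val
    return None, 0.0
```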

We can use this idea to reconstruct the tree from 3-weights. Here is the algorithm.

Algorithm 7 (Binary Tree Reconstruction from Three-Weights).

(0) Input: n leaves (n > 4) and all 3-weights of a tree T. Output: a rooted tree T, as a linked list of nodes, with associated variables at each node that record important information (such as parent, children, etc.).

(1) Choose any five leaves i, j, k, l, m. Consider the fifteen quantities

d(ij, kl, m), d(ik, jl, m), d(il, kj, m),
d(mj, kl, i), d(mk, jl, i), d(ml, kj, i),
d(im, kl, j), d(ik, ml, j), d(il, km, j),
d(ij, ml, k), d(im, jl, k), d(il, mj, k),
d(ij, km, l), d(ik, jm, l), d(im, kj, l).

(All other possibilities are equal to one of these by the symmetries of the function d.) For any dissimilarity map D, the three numbers in each row sum to 0, by (4). Also, it will be the case that, for any i, j, k, l, m,

d(ij, kl, m) = (1/2)[d(ij, km, l) + d(ij, ml, k)] + (1/2)[d(mj, kl, i) + d(im, kl, j)].

If the data for D comes from a tree, then exactly one quantity in each row will be positive (the other two will be negative and equal); moreover, when these 5 positive quantities are ordered by magnitude, they will satisfy x_1 > x_2 = x_3 > x_4 = x_5, where x_1 = x_3 + x_5.

Suppose the largest positive quantity is achieved by d(ij, kl, m). Then we know that the tree [ijklm] must look like the tree in Figure 1: ij and kl are cherries in [ijklm], with cherry nodes a and b respectively, and m is incident to the path between a and b at a node r. The edge [a, r] has weight d(ij, ml, k) = d(ij, km, l), and the edge [r, b] has weight d(im, kl, j) = d(mj, kl, i). (If the data is imperfect, then use the average of d(ij, ml, k) and d(ij, km, l) for the weight of [a, r], and the average of d(im, kl, j) and d(mj, kl, i) for the weight of [r, b]. By the observation above, the sum of these averages will be d(ij, kl, m), as desired.)

We may figure out the weights of the remaining leaf edges using the following lemma.

Lemma 8. If e_i is any leaf edge, the weight of e_i can be computed by choosing leaves j, k, l such that, in the subtree [ijkl], ij is a sub-cherry and the edge e_i is precisely the path in [ijkl] from i to the cherry node of ij. Let m be any other leaf. Then the weight of e_i is given by

E(i, j, klm) = (1/3)(D(ijk) + D(ijl) + D(kli) + D(klj) - d(ij, kl, m)) - D(jkl).

Proof. The main idea here is that E(i, j, klm) = D_4(ijkl) - D(jkl), where D_4 is the 4-weight. In any subtree [ijklm], we can determine the 4-weight from the 3-weights and d by the formula

D_4(ijkl) = (1/3)(D(ijk) + D(ijl) + D(kli) + D(klj) - d(ij, kl, m)).

Note that the choice of the leaf m does not matter, because of Theorem 6.
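
A short sketch (ours) of the two weight computations used in Step (1) and Lemma 8, in terms of the helper functions above:

```python
def E(dmap, i, j, k, l, m):
    """Leaf-edge weight of Lemma 8: E(i, j, klm) = D_4(ijkl) - D(jkl), where the
    4-weight D_4(ijkl) is recovered from 3-weights and d as in the proof."""
    D4 = (D_of(dmap, i, j, k) + D_of(dmap, i, j, l)
          + D_of(dmap, k, l, i) + D_of(dmap, k, l, j)
          - d(dmap, i, j, k, l, (m,))) / 3.0
    return D4 - D_of(dmap, j, k, l)

def internal_edge_weights(dmap, i, j, k, l, m):
    """Weights of the edges [a, r] and [r, b] of Figure 1, averaging the two
    expressions that are equal for perfect data, as suggested in Step (1)."""
    w_ar = 0.5 * (d(dmap, i, j, m, l, (k,)) + d(dmap, i, j, k, m, (l,)))
    w_rb = 0.5 * (d(dmap, i, m, k, l, (j,)) + d(dmap, m, j, k, l, (i,)))
    return w_ar, w_rb
```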

(2) Now, think of r as the root of this tree and direct all the edges away from r. We can represent this tree by a data structure in which nodes point to other nodes. To each node x (except r) are associated the following variables: P(x), L(x), R(x), W(x), g_x, p_x, q_x, which are, respectively: the parent of x; the left child of x; the right child of x; the weight of the edge that points to x; the name of the descendant of x that follows the left leaves at every stage; and, for the final two variables, the names of two leaves that lie on two different branches (other than this one) emanating from the parent of x.

The node r will be represented in three different ways, depending on which tree it roots; call them r_1, r_2, r_3. Here is how we initialize them:

Let P(r_1) = a, L(r_1) = b, R(r_1) = null, W(r_1) = null, g_{r_1} = k, p_{r_1} = i, q_{r_1} = j.
Let P(r_2) = b, L(r_2) = a, R(r_2) = null, W(r_2) = null, g_{r_2} = i, p_{r_2} = j, q_{r_2} = k.
Let P(r_3) = a, L(r_3) = m, R(r_3) = null, W(r_3) = null, g_{r_3} = k, p_{r_3} = i, q_{r_3} = j.

For the nodes of our initial 5-leaf tree:

Let P(m) = r_3, L(m) = null, R(m) = null, W(m) = E(m, l, ijk), g_m = m, p_m = i, q_m = k.
Let P(a) = r_2, L(a) = i, R(a) = j, W(a) = (d(ij, ml, k) + d(ij, km, l))/2, g_a = i, p_a = l, q_a = m.
Let P(b) = r_1, L(b) = k, R(b) = l, W(b) = (d(im, kl, j) + d(mj, kl, i))/2, g_b = k, p_b = m, q_b = i.
Let P(i) = a, L(i) = null, R(i) = null, W(i) = E(i, j, klm), g_i = i, p_i = p_a, q_i = j.
Let P(j) = a, L(j) = null, R(j) = null, W(j) = E(j, i, klm), g_j = j, p_j = p_a, q_j = i.
Let P(k) = b, L(k) = null, R(k) = null, W(k) = E(k, l, ijm), g_k = k, p_k = p_b, q_k = l.
Let P(l) = b, L(l) = null, R(l) = null, W(l) = E(l, k, ijm), g_l = l, p_l = p_b, q_l = k.

This constructs an initial 5-leaf tree on the leaves i, j, k, l, m, with root at the node r. We still have to figure out where all the other leaves go.

(3) We now define a subroutine Sub(x, z, t), which takes t and determines on which edge below x, in the direction of the child z of x, it lies.
(a) Input: x, z, t.
(b) Output: places t on an edge below x and modifies the data structures appropriately.
(c) Let test = d(p_z q_z, t g_z, q_x). (For imperfect data, also let test2 = d(p_z q_z, t g_{R(z)}, q_x).) If test > 0 then:
    (i) If test < W(z) (for imperfect data, check whether W(z) >> test = test2), then insert t in the edge from x to z: create a node for t and fill in its data later. Create a new node y: let P(y) = x, L(y) = z, R(y) = t, W(y) = test, g_y = g_z, p_y = p_x, q_y = q_z. Find whichever of x's children was z and replace it with y. Then fill in the data for t: let P(t) = y, L(t) = null, R(t) = null, W(t) = E(t, g_z, q_x p_z q_z), g_t = t, p_t = p_x, q_t = g_z. Then re-assign: P(z) = y, W(z) = W(z) - test, q_z = t.
    (ii) If test = W(z) (for imperfect data, check whether test2 >> test = W(z)), then perform Sub(z, R(z), t).
    (iii) If test > W(z) (for imperfect data, check whether test >> test2 = W(z)), then perform Sub(z, L(z), t).
    (iv) Return.
(d) If test <= 0 then there is an error; this part of the subroutine should never be reached!
(e) Return (this should never be reached either).
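
The following Python sketch (ours) mirrors the node record and the subroutine Sub of Step (3), assuming exact data so that the three-way comparison of test against W(z) can be done with a small tolerance; the class, field types and tolerance are illustrative choices, not part of the manuscript.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    P: "Node" = None      # parent
    L: "Node" = None      # left child
    R: "Node" = None      # right child
    W: float = None       # weight of the edge pointing to this node
    g: str = None         # the leaf name g_x of the text
    p: str = None         # a leaf on another branch at the parent (p_x)
    q: str = None         # a second leaf on another branch at the parent (q_x)

EPS = 1e-9

def sub(dmap, x, z, t_name):
    """Sketch of Sub(x, z, t): place leaf t below node x in the direction of child z."""
    test = d(dmap, z.p, z.q, t_name, z.g, (x.q,))
    if test <= EPS:
        raise ValueError("Sub should only be called when test > 0")   # case (d)
    if test < z.W - EPS:
        # case (i): insert t in the edge from x to z via a new internal node y
        y = Node("y_" + t_name, P=x, L=z, R=None, W=test, g=z.g, p=x.p, q=z.q)
        t = Node(t_name, P=y, W=E(dmap, t_name, z.g, x.q, z.p, z.q),
                 g=t_name, p=x.p, q=z.g)
        y.R = t
        if x.L is z:
            x.L = y
        else:
            x.R = y
        z.P, z.W, z.q = y, z.W - test, t_name
    elif abs(test - z.W) <= EPS:
        sub(dmap, z, z.R, t_name)   # case (ii): recurse toward the right child
    else:
        sub(dmap, z, z.L, t_name)   # case (iii): test > W(z), recurse toward the left child
```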

(4) Now we begin to add leaves to our 5-leaf tree. For any t not yet in the current tree, we test in which of three directions from r it lies. Let test = d(it, km, j) and test2 = d(im, kt, l).
If test > 0, then perform Sub(r_2, a, t).
If test2 > 0, then perform Sub(r_1, b, t).
Otherwise (since both test and test2 are negative), perform Sub(r_3, m, t).
Repeat for each t.

(5) The above process will construct three trees with roots r_1, r_2, r_3. Identifying these three roots and pasting together just the left branches of each tree under them will yield the complete tree under the root r.

When all is said and done, this algorithm should be very fast. In fact, its time complexity is O(n^2).

Theorem 9. The time complexity of Algorithm 7 is O(n^2), where n is the number of leaves of the tree.

Proof. First we want to show that the number of interior nodes of the tree is n - 2. Let T be the tree and let L(T) be the set of leaves of T. At Step (2) we have 3 interior nodes and 5 leaves, namely i, j, k, l, m. For each t ∈ L(T)\{i, j, k, l, m}, we add one interior node when we find a place for the leaf t, and there are n - 5 such leaves. Thus we have (n - 5) + 3 = n - 2 interior nodes, and 2n - 2 nodes in total. Notice that Step (1) and Step (2) take only O(1) time, and that at Step (3) computing each edge weight takes O(1). Since T is a binary tree whose root has three children, at Step (4) we call Step (3) recursively at most 2n - 2 times to find a place to insert each t, so inserting t into the tree T takes O(n) time. Since there are n - 5 leaves t ∈ L(T)\{i, j, k, l, m}, the total time complexity to reconstruct the tree T from the data is (2n - 2)(n - 5) = O(n^2).

This avoids computing splits, Buneman indices, using SplitsTree, and then constructing weights, as the Pachter-Speyer paper would have done. We might even get some data and check how well this works, comparing reconstruction from 2-weights with deriving 3-weights from 2-weights and then reconstructing from 3-weights, as Pachter had suggested.

Example 10. Suppose we have n = 6 and a binary tree T as in Figure 2.

Step 1: First we choose the leaves 1, 2, 3, 4, 5. Then we have the 15 values

d(23, 45, 1), d(35, 24, 1), d(34, 25, 1),
d(13, 45, 2), d(15, 34, 2), d(14, 35, 2),
d(12, 45, 3), d(15, 24, 3), d(14, 25, 3),
d(12, 35, 4), d(13, 25, 4), d(15, 23, 4),
d(12, 34, 5), d(13, 24, 5), d(14, 23, 5).

We notice that d(12, 45, 3) = 4, d(12, 35, 4) = d(12, 34, 5) = 3, and d(13, 45, 2) = d(23, 45, 1) = 1, while the other values are negative. So we know that (1, 2; 4, 5) is a split, the distance between the cherry node of 12 and the cherry node of 45 is 4, the distance between the cherry node of 13 and the cherry node of 45 is 1, and the distance between the cherry node of 12 and the cherry node of 34 is 3. Let i = 1, j = 2, m = 3, k = 4, l = 5.

Figure 2. Example 10

Step 2: Initialize:
Let P(r_1) = a, L(r_1) = b, R(r_1) = null, W(r_1) = null, g_{r_1} = 4, p_{r_1} = 1, q_{r_1} = 2.
Let P(r_2) = b, L(r_2) = a, R(r_2) = null, W(r_2) = null, g_{r_2} = 1, p_{r_2} = 2, q_{r_2} = 4.
Let P(r_3) = a, L(r_3) = m, R(r_3) = null, W(r_3) = null, g_{r_3} = 4, p_{r_3} = 1, q_{r_3} = 2.
Let P(3) = r_3, L(3) = null, R(3) = null, W(3) = E(3, 5, 124) = 1, g_3 = 3, p_3 = 1, q_3 = 4.
Let P(a) = r_2, L(a) = 1, R(a) = 2, W(a) = (d(12, 35, 4) + d(12, 34, 5))/2 = 3, g_a = 1, p_a = 5, q_a = 3.
Let P(b) = r_1, L(b) = 4, R(b) = 5, W(b) = (d(13, 45, 2) + d(23, 45, 1))/2 = 1, g_b = 4, p_b = 3, q_b = 1.
Let P(1) = a, L(1) = null, R(1) = null, W(1) = E(1, 2, 453) = 4, g_1 = 1, p_1 = p_a = 5, q_1 = 2.
Let P(2) = a, L(2) = null, R(2) = null, W(2) = E(2, 1, 453) = 2, g_2 = 2, p_2 = p_a = 5, q_2 = 1.
Let P(4) = b, L(4) = null, R(4) = null, W(4) = E(4, 5, 123) = 1, g_4 = 4, p_4 = p_b = 3, q_4 = 5.
Let P(5) = b, L(5) = null, R(5) = null, W(5) = E(5, 4, 123) = 4, g_5 = 5, p_5 = p_b = 3, q_5 = 4.

Step 4: For t = 6, we have test = d(16, 43, 2) < 0 and test2 = d(13, 46, 5) = 1 > 0, so we perform Sub(r_1, b, 6).

Sub(r_1, b, 6): here x = r_1, z = b, t = 6, so test = d(31, 64, 2) = 1 and W(b) = 1. Since test = W(b), we perform Sub(b, R(b), 6) = Sub(b, 5, 6).

Sub(b, 5, 6): here x = b, z = 5, t = 6, so test = d(34, 65, 1) = 1 and W(5) = 4. Since test < W(5), we insert an interior node y together with the node t = 6. Set P(y) = b, L(y) = 5, R(y) = 6, W(y) = test = 1, g_y = g_5, p_y = p_b, q_y = q_5, and set P(6) = y, L(6) = null, R(6) = null, W(6) = E(6, g_5, q_b p_5 q_5) = E(6, 5, 134) = 2, g_6 = 6, p_6 = p_b = 3, q_6 = g_5 = 5. Then set P(5) = y, W(5) = W(5) - test = 3, q_5 = 6.

Since there are no more leaves left to place, we are done.
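
Figure 2 itself is not reproduced in this copy; the following check (ours) uses a 6-leaf tree whose edge weights are consistent with all of the values computed in Example 10, together with the helper functions sketched in the earlier sections. The node names a, r, b, y match the example.

```python
# Internal nodes a, r, b, y; edge weights inferred from the values in Example 10.
adj = {
    'a': {'1': 4, '2': 2, 'r': 3},
    'r': {'a': 3, '3': 1, 'b': 1},
    'b': {'r': 1, '4': 1, 'y': 1},
    'y': {'b': 1, '5': 3, '6': 2},
    '1': {'a': 4}, '2': {'a': 2}, '3': {'r': 1},
    '4': {'b': 1}, '5': {'y': 3}, '6': {'y': 2},
}
leaves = ['1', '2', '3', '4', '5', '6']
dmap = m_dissimilarity_map(adj, leaves, 3)

print(d(dmap, '1', '2', '4', '5', ('3',)))   # 4.0, as in Step 1
print(d(dmap, '1', '2', '3', '5', ('4',)))   # 3.0
print(d(dmap, '1', '3', '4', '5', ('2',)))   # 1.0
print(d(dmap, '3', '4', '6', '5', ('1',)))   # 1.0, the 'test' value in Sub(b, 5, 6)
print(E(dmap, '1', '2', '4', '5', '3'))      # 4.0 = W(1)
print(E(dmap, '6', '5', '1', '3', '4'))      # 2.0 = W(6)
```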

4. Reconstructing trees from m-dissimilarity maps

In this section, we sketch how to use higher-order map information to reconstruct trees. We wish to find i, j, k, l, R so that ij is a sub-cherry in [ijklR] but kl is not. We can search randomly for i, j, k, l, R with d(ij, kl, R) > 0. (Can we do this in a better way, to help in the later steps?) We can then determine which pair is the sub-cherry by taking any t ∈ R and letting R' = (R \ {t}) ∪ {j}. If d(it, kl, R') > 0, then we claim kl is a sub-cherry in [ijklR] = [itklR']. This is because either kl or it is a sub-cherry; but if it is a sub-cherry, then ij is not a sub-cherry, so that kl is a sub-cherry. If the condition does not hold, then automatically ij is a sub-cherry.

Now that we have found a sub-cherry (without loss of generality suppose it is ij), we should check whether kl is a sub-cherry. If it is not, we are done. If it is, take any t ∈ R; then kt is not a sub-cherry.

4.1. Constructing the tree. We now assume that ij is a sub-cherry but kl is not. Given t ∉ [ijklR], we wish to figure out where it goes, i.e., where the path from t to [ijklR] first touches [ijklR]; we say that t is incident to [ijklR] at that point. Note that there is an internal node c whose three edges are incident to i, to j, and to [klR]. Call these edges [ic], [jc], [xc]. We give a criterion for determining where t is incident to [ijklR]. There are four cases:

(1) If d(it, kl, R) > d(ij, kl, R), then t is incident to [ic], at distance d(it, kl, R) - d(ij, kl, R) from c.
(2) If 0 < d(it, kl, R) < d(ij, kl, R), then t is incident to [xc], at distance d(ij, kl, R) - d(it, kl, R) from c.
(3) If d(it, kl, R) = d(ij, kl, R), then t is incident to [jc], at distance d(jt, kl, R) - d(ij, kl, R) from c.
(4) If d(it, kl, R) <= 0, then t is incident to the subtree [klR].

4.2. Resolving bubbles. We perform the above step for each t not in [ijklR]. This clusters all leaves into sub-tree bubbles incident to [ijklR] at various places. If a bubble has size 1, we are done with it. Bubbles not of size one can be resolved by repeating this procedure. In particular, note that any two elements t_1 and t_2 in the same bubble form a sub-cherry in the tree spanned by [t_1 t_2 klR], so each element in the bubble may be resolved recursively.
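
A minimal sketch (ours) of the four-case incidence test of Section 4.1, assuming ij is a sub-cherry of [ijklR] and kl is not; the return convention and tolerance are illustrative.

```python
def locate(dmap, t, i, j, k, l, R, tol=1e-9):
    """Classify where leaf t attaches to [ijklR] (the four cases above).
    Returns (edge, dist): edge is 'ic', 'jc', 'xc' or 'klR', and dist is
    measured from the node c (None when t attaches inside [klR])."""
    dij = d(dmap, i, j, k, l, R)
    dit = d(dmap, i, t, k, l, R)
    if dit <= tol:
        return 'klR', None                      # case (4): t attaches inside [klR]
    if abs(dit - dij) <= tol:
        djt = d(dmap, j, t, k, l, R)
        return 'jc', djt - dij                  # case (3): t hangs off the edge toward j
    if dit > dij:
        return 'ic', dit - dij                  # case (1): t hangs off the edge toward i
    return 'xc', dij - dit                      # case (2): t hangs off the edge toward [klR]
```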

4.3. Reconstruction algorithm.

Step 1: Choose a set R' of m + 2 leaves and find a sub-cherry ij in [R']. If the data comes from a tree, such a sub-cherry must always exist. Let R = R' \ {i, j}.

Step 2: For each t ∉ R', check whether it is a sub-cherry in [Rit]. If it is, resolve its location as discussed in the previous section. If it is not a sub-cherry, set t aside. Denote the set of resolved leaves (including i and j) by A and the set of unresolved leaves (including R) by B.

Step 3: If |A| ≥ m, we may choose a subset R of A such that |R| = m and any two elements s, t ∈ B form a sub-cherry in the tree [tsR]. Hence, relative to this new R, we may resolve every element of B by repeating Step 2 with this new R.

Step 4: If |A| < m, then |B| ≥ m. We vary R of size m over B and compute the ij cherry distance to R. If it is greater than the ij cherry distance to our previous R, return to Step 2. We may now assume that |A| < m and that for all R ⊆ B the ij cherry distance is the same. From this, we may conclude that we have the situation of Figure [?], and that |B_1| < m and |B_2| < m.

Step 5: Let |A| = a < m. Choose b = m + 2 - a leaves from B and call this set of leaves R. Note that m ≥ a + 1, so b ≥ 3. Hence either B_1 or B_2 contains at least two of these leaves, so there must be a sub-cherry kl in [R ∪ A]. In particular, the sub-cherry lies either in B_1 or in B_2, so without loss of generality we assume k, l ∈ B_1.

Step 6: Place all elements of B \ R that form sub-cherries with k or l (as we did for i and j in Step 2).

Step 7: As in Step 4, vary the subset R over the leaves we have yet to resolve. If the kl cherry distance increases, return to Step 6. When we reach the point where the kl cherry distance is the same for every choice of R, we claim that R ⊆ B_2. If not, then |B_2| ≤ m - a - 1, and we already have |B_1| ≤ m - 1, so (since n > 2m - 2)

2m - 1 ≤ |A| + |B_1| + |B_2| ≤ a + (m - 1) + (m - a - 1) = 2m - 2,

a contradiction. But R ⊆ B_2 implies that we have placed every leaf in A and B_1; in particular, we have placed at least m leaves, since |A| + |B_1| ≥ m.

Step 8: Choose R ⊆ A ∪ B_1 of size m; then every pair of elements not yet accounted for, say s, t, will be a sub-cherry in [stR].

Step 9: In this way we have placed all our leaves and found the edge weights associated to each internal edge. We may then find the leaf edge weights in the same manner as described in Pachter-Speyer.

Acknowledgments.

References

[1]

Department of Mathematics, University of California, Berkeley, CA
E-mail address: levyd@math.berkeley.edu

Department of Mathematics, Harvey Mudd College, Claremont, CA 91711
E-mail address: su@math.hmc.edu

Department of Mathematics, University of California, Davis, CA
E-mail address: ruriko@math.ucdavis.edu