Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction


JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 6, Numbers 3/4, 1999. Mary Ann Liebert, Inc.

DANIEL H. HUSON,1 SCOTT M. NETTLES,2 and TANDY J. WARNOW3

ABSTRACT

The evolutionary history of a set of species is represented by a phylogenetic tree, which is a rooted, leaf-labeled tree, where internal nodes represent ancestral species and the leaves represent modern day species. Accurate (or even boundedly inaccurate) topology reconstructions of large and divergent trees from realistic length sequences have long been considered one of the major challenges in systematic biology. In this paper, we present a simple method, the Disk-Covering Method (DCM), which boosts the performance of base phylogenetic methods under various Markov models of evolution. We analyze the performance of DCM-boosted distance methods under the Jukes-Cantor Markov model of biomolecular sequence evolution, and prove that for almost all trees, polylogarithmic length sequences suffice for complete accuracy with high probability, while polynomial length sequences always suffice. We also provide an experimental study based upon simulating sequence evolution on model trees. This study confirms substantial reductions in error rates at realistic sequence lengths.

Key words: algorithms, evolution, phylogenetic trees, clustering, Jukes-Cantor, biomolecular data, chordal graphs.

1. INTRODUCTION

The evolution of biomolecular sequences is usually modeled as a Markov process operating on a rooted binary tree. A biomolecular sequence at the root of the tree evolves down the tree, each edge of the tree introducing point mutations, thereby generating sequences at the leaves of the tree. The phylogenetic tree reconstruction problem is to take the sequences that occur at the leaves of the tree, and infer, as accurately as possible, the tree that generated the sequences.
Under most Markov models of sequence evolution, locating the root is impossible. Consequently, the primary objective in phylogenetic analysis is to recover the branching process, as represented by the unrooted leaf-labeled topology of the evolutionary tree, and the secondary objective is to estimate the parameters of the evolutionary process. While inferring the parameters of the evolutionary process is of interest to both biologists and statisticians, this is relatively easy once an accurate estimate of the topology of the true tree is obtained; consequently, biologists have mostly focused upon the simpler question of inferring the tree topology.

[Author affiliations: 1 Celera Genomics, Rockville, Maryland. 2 Department of Electrical and Computer Engineering, University of Texas, Austin, Texas. 3 Department of Computer Science, University of Texas, Austin, Texas.]

Methods for inferring or estimating the evolutionary history of biomolecular sequences are evaluated according to the accuracy of this topology

estimation. Indeed, this is one of the most important problems for computational biology as a whole, because of the centrality of evolutionary studies to biology (Dobzhansky, 1993). [Evolutionary trees are often the basis of multiple sequence alignment algorithms (Gusfield, 1991; Gusfield and Wang, 1996; Hein, 1989), protein structure prediction routines (Rost and Sander, 1993), and other problems in biology.]

A central activity in systematic biology is the experimental investigation of phylogenetic methods: sequence evolution is simulated on different model trees in order to determine how the sequence length affects the accuracy of the topology prediction (see, for example, Hillis, 1996; Hillis et al., 1994; Huelsenbeck, 1995; Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Saitou and Imanishi, 1989; Schöniger and von Haeseler, 1995; Sourdis and Nei, 1996; Strimmer and von Haeseler, 1996). The focus on sequence length in these studies is primarily practical: biomolecular sequences are not particularly long (those used for phylogenetic tree reconstruction purposes are typically bounded by 1000 nucleotides, often by much smaller numbers), and sequence lengths of 5000 nucleotides are generally considered to be unusually long (Purvis and Quicke, 1997). Furthermore, experimental and anecdotal evidence suggests that methods are very sensitive to sequence lengths, and that different methods can differ greatly in their accuracy on realistic sequence lengths. In an earlier paper (Huson et al., 1998b), we showed that a critical factor affecting the accuracy of different methods is the maximum evolutionary distance in the tree, which we call the divergence of the tree.
Theoretical studies (Berry and Gascuel, 1997; Erdős et al., 1997a,b, 1999) have bounded the convergence rates of various polynomial time distance methods [Neighbor-Joining (Saitou and Nei, 1987), a simple and very popular clustering technique used by systematic biologists, and more sophisticated methods developed by the theoretical computer science community (Agarwala et al., 1996; Ambainis et al., 1997; Farach et al., 1995)], and have shown that these bounds grow exponentially in the divergence. This exponential dependency upon the divergence translates to superpolynomial convergence rates for almost all trees (Erdős et al., 1997b), suggesting that accuracy may be poor if these methods are used to infer tree topologies under conditions of high divergence. However, these are upper bounds, and may be pessimistic.

Our paper has several contributions. First, we introduce the concept of fast convergence under a model of evolution, such as the Jukes-Cantor model. By this we mean that for all model trees, a method will with high probability recover the tree from sequences whose length grows only polynomially in the number of leaves, once we bound the mutation probabilities on the edges of the tree. We present a general technique for designing methods that are fast converging, and we present a new fast-converging method. This particular method is actually a boosted version of a classical method, called the Buneman Method; thus, we also introduce a general technique for boosting the performance of phylogenetic methods. This is the Disk-Covering Method (DCM). In its exact form DCM is not polynomial time, but has provable performance guarantees when used with certain base methods. For example, we can prove that under the Jukes-Cantor Markov model (Jukes and Cantor, 1969), the DCM-Buneman method is fast converging. In its heuristic form, DCM is polynomial time, and our experimental studies indicate substantial reductions in error rates for DCM-boosted polynomial time methods.
The rest of this paper is organized as follows. In Section 2 we present definitions and some basic results about convergence rates of standard methods. We also present a general technique for establishing fast convergence. In Section 3, we describe the fast-converging method, DCM-Buneman, and provide the proof of fast convergence. In Section 4, we describe DCM more generally, and describe how it can be used with other distance-based methods. In Section 5, we present the results of a preliminary experimental study. Finally, in Section 6, we present open problems and discuss related research.

2. BASICS

Markov models of evolution

Most Markov models of evolution assume that the sites in the sequence evolve identically and independently (i.i.d.), and for the most part, the results that can be proven about any Markov model of i.i.d. site evolution can be proven about any other such model. Because of this, we will describe our results in terms of the Jukes-Cantor model. T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution. [Please see the excellent review articles by Felsenstein (1988) and Swofford et al. (1996) for more information about statistical aspects of phylogenetic inference.]

Definition 1. Let T be a fixed rooted tree with leaves labeled $1, \ldots, n$. The Jukes-Cantor Markov model of evolution makes the following assumptions:
1. The sites (positions within the sequence) evolve identically and independently (i.i.d.) down the tree from the root. The probability of an event on edge e is independent of the events outside the subtree below e.
2. The possible states for each site are A, C, T, G (denoting the four possible nucleotides), and for each site, the state at the root is drawn from a distribution (typically uniform).
3. For each edge $e = (u, v) \in E(T)$, with u the parent of v, if the state of a site is different at u than at v, then the probability that v has any particular state of the three remaining states is equal.
4. Associated to each edge e in the tree T is a Poisson random variable $X_e$ for the number of mutations of a randomly selected site on that edge. We let $\lambda_e$ denote the expectation of $X_e$.

We will sometimes refer to the pair $(T, \lambda)$ as a Jukes-Cantor tree. We are interested in estimating the topology of the Jukes-Cantor tree, and will study the accuracy of different methods for this problem. Since the number of changes on each edge influences the sequence length that is needed to estimate the tree, we will be interested in the performance of different phylogenetic estimators when $f = \min_e \lambda_e$ and $g = \max_e \lambda_e$ are fixed, but arbitrary.

Topological accuracy

While complete accuracy is the objective, partial accuracy is the rule, especially when trees are large and/or highly divergent (that is, contain evolutionarily distant pairs of taxa). For this reason, quantifications of degrees of accuracy have been proposed. The most favored such quantification is as follows.

Definition 2. Let T be the (unrooted) true tree, and let $T'$ be the (unrooted) inferred tree, both leaf-labeled by a set S of taxa. Let e be an edge in T, and let $\pi_e$ denote the bipartition induced on S by the removal of e from T.
Let $C(T) = \{\pi_e : e \in E(T)\}$ and let $C(T')$ be equivalently defined. Then $T = T'$ if and only if $C(T) = C(T')$. Any bipartition $\pi \in C(T) - C(T')$ is called a false negative (FN) and any bipartition in $C(T') - C(T)$ is said to be a false positive (FP). Thus, a false negative is an edge in the true tree that is missing from the inferred tree, while a false positive is an edge in the inferred tree that is not in the true tree. We will say that an edge $e \in E(T)$ is recovered in $T'$ (or by the method M which produces $T'$) if the bipartition in $C(T)$ associated to e appears in $C(T')$. The FN rate is the ratio of FN and the number of edges in T, and similarly the FP rate is the ratio of FP and the number of edges in $T'$. Usually the model (true) tree is binary, and hence has $n - 3$ internal edges, where n is the number of leaves. Also, many distance methods, such as Neighbor-Joining (Saitou and Nei, 1987) and the method of Agarwala et al. (1996) (which obtains a 3-approximation to the $L_\infty$-nearest tree problem), always produce binary trees. However, the Buneman Tree method (Buneman, 1971) often produces unresolved (i.e., not binary) trees.

Performance criteria

The assumption of Markov models of evolution allows the performance of phylogenetic tree reconstruction methods to be assessed, either through analytical means or by experiments.
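The FN and FP rates of Definition 2 can be illustrated with a minimal Python sketch. It is not taken from the paper: the adjacency-dictionary tree representation and the function names `bipartitions` and `error_rates` are assumptions made for illustration, and the rates are computed over nontrivial bipartitions only (those induced by internal edges), since every tree on the same leaf set shares the trivial ones.

```python
def bipartitions(adj, leaves):
    """Nontrivial bipartitions C(T) of an unrooted tree.

    adj maps each node to the set of its neighbors (hypothetical format);
    leaves is the collection of leaf labels.
    """
    leaves = frozenset(leaves)
    result = set()
    for u in adj:
        for v in adj[u]:
            # Collect the leaves on v's side after deleting edge (u, v).
            side = set()
            stack = [(v, u)]
            while stack:
                node, parent = stack.pop()
                if node in leaves:
                    side.add(node)
                for nxt in adj[node]:
                    if nxt != parent:
                        stack.append((nxt, node))
            side = frozenset(side)
            other = leaves - side
            # Keep only bipartitions with at least two leaves per side.
            if len(side) >= 2 and len(other) >= 2:
                result.add(frozenset([side, other]))
    return result


def error_rates(true_adj, inferred_adj, leaves):
    """Return (FN rate, FP rate) of the inferred tree against the true tree."""
    c_true = bipartitions(true_adj, leaves)
    c_inf = bipartitions(inferred_adj, leaves)
    fn = len(c_true - c_inf)  # true edges missing from the inferred tree
    fp = len(c_inf - c_true)  # inferred edges absent from the true tree
    fn_rate = fn / len(c_true) if c_true else 0.0
    fp_rate = fp / len(c_inf) if c_inf else 0.0
    return fn_rate, fp_rate


# True tree ((a,b),c,(d,e)) versus inferred tree ((a,c),b,(d,e)).
true_tree = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"z"}, "e": {"z"},
             "x": {"a", "b", "y"}, "y": {"x", "c", "z"}, "z": {"y", "d", "e"}}
inferred = {"a": {"x"}, "c": {"x"}, "b": {"y"}, "d": {"z"}, "e": {"z"},
            "x": {"a", "c", "y"}, "y": {"x", "b", "z"}, "z": {"y", "d", "e"}}
print(error_rates(true_tree, inferred, "abcde"))  # (0.5, 0.5)
```

In the example the two trees share the bipartition abc|de and disagree on the other internal edge, so one of two true edges is missed and one of two inferred edges is false. An unresolved inferred tree (such as a star) has no false positives under this convention, which is why the two rates can differ for methods like the Buneman Tree method.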
There are several criteria that have been considered to be of fundamental importance by statisticians working in evolutionary tree reconstruction (Felsenstein, 1988):
• statistical consistency (with respect to a given model of evolution), which holds if the probability of recovering the leaf-labeled tree converges to 1 as the sequence length increases;
• convergence rate (with respect to a given model of evolution), which is the rate at which the probability (of recovering the leaf-labeled tree) goes to 1, as the sequence length increases; and
• accuracy (with respect to a given model of evolution), the expected number of topological errors at a given sequence length.

The reference to the model of evolution is important, as some methods are statistically consistent under some models but not under others. Under the Jukes-Cantor model, many distance methods (and all the methods we consider in this paper) are statistically consistent. However, if the assumption of i.i.d. site evolution (or that the sites evolve under different

rates but from a known distribution) is relaxed, then positive results become less likely. For example, under more general models of evolution, even maximum likelihood (under the correct model!) can be statistically inconsistent (Steel et al., 1994). The theorems we will develop in this paper are stated in terms of Jukes-Cantor site evolution, but apply more generally to other models for which statistical consistency of distance-based methods has been established.

Distance-based reconstruction

We begin with some definitions.

Definition 3. A matrix D is called additive (or said to be a tree-metric) if there exists a tree T with positive edge weighting w such that $D_{ij} = d^T_{ij} := \sum_{e \in P_{ij}} w(e)$, where $P_{ij}$ is the path in T between leaves i and j.

Additive matrices have the nice property that they correspond uniquely to positively edge-weighted trees (Buneman, 1971), and that given D, the tree T and the weighting w can be recovered in polynomial time (Waterman et al., 1977). Let $E(T)$ denote the set of edges of T. We represent the evolutionary process by a set $\{X_e : e \in E(T)\}$ of Poisson processes, where $X_e$ is the Poisson random variable for the number of mutations of a random site on the edge e. Let $X_{ij} = \sum_{e \in P_{ij}} X_e$. Then $X_{ij}$ is a Poisson random variable. Let $\lambda_{ij} = E[X_{ij}]$. Then $\lambda$ is an additive matrix. We will call $\lambda_{ij}$ the expected evolutionary distance, or true distance, between i and j. Clearly, $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $\lambda_e = E[X_e]$. Because the matrix $\lambda$ is additive, given $\lambda$ the tree T can be constructed in polynomial time using a number of different distance-based methods. Furthermore, a distance correction transformation exists for the Jukes-Cantor Markov model of site evolution, so that arbitrarily good approximations to the matrix $\lambda$ can be obtained if the sequence length is unboundedly large (Felsenstein, 1988; Warnow, 1996).

Definition 4.
The Jukes-Cantor distance correction is given as follows: $d_{ij} := -\frac{3}{4} \ln\left(1 - \frac{4}{3} h_{ij}\right)$, where $h_{ij}$ is the normalized Hamming distance, or $H(i, j)/k$, where $H(i, j)$ is the Hamming distance between sequences i and j, and k is the sequence length.

It is not hard to show that for all Jukes-Cantor trees $(T, \lambda)$ and for all i, j, $d_{ij} \to \lambda_{ij}$ as $k \to \infty$. Consequently, the following two-step process is a statistically consistent distance-based approach to Jukes-Cantor tree reconstruction: first, compute the Jukes-Cantor distance matrix d; then, map d (using some distance method M such as Neighbor-Joining) to a (nearby) additive matrix $M(d) = D$. Since accuracy in phylogenetic reconstruction is based upon comparisons between unrooted tree topologies, we will say that the method is accurate on input d if D and $\lambda$ define the same unrooted leaf-labeled tree, even if they assign different weights to the edges of the tree.

Some distance methods

In this paper we demonstrate the value of DCM boosting on three different distance-based methods, all of which are statistically consistent for the Jukes-Cantor model when used with properly corrected distances. We now describe these three methods.

Buneman. The Buneman Tree method was originally suggested by Peter Buneman (1971), and polynomial time algorithms for this method were obtained in Bandelt and Dress (1992) and Berry and Gascuel (1997). This method takes as input a dissimilarity matrix d, and computes a tree as follows. First, the topology on every quartet of taxa is inferred using the Four-Point Method (FPM), which computes trees on four-leaf subsets only. Given the $4 \times 4$ dissimilarity matrix on i, j, k, l, the topology $ij|kl$ is returned (meaning i, j are separated from k, l by an edge) if $d_{ij} + d_{kl} < \min\{d_{ik} + d_{jl}, d_{il} + d_{jk}\}$.
If the minimum of the three pairwise sums is not unique, then the FPM returns the star tree (that is, the tree with one interior node, and all leaves adjacent to that interior node).
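The Jukes-Cantor correction of Definition 4 and the Four-Point Method can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: the function names, the nested-list matrix format, and the choice to return infinity when the normalized Hamming distance reaches 3/4 (where the logarithm is undefined) are assumptions. The tie test uses exact equality, so integer or otherwise exactly representable distances are assumed; with estimated distances a tolerance would be needed.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Corrected distance d_ij = -(3/4) ln(1 - (4/3) h_ij)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    # Normalized Hamming distance h_ij = H(i, j) / k.
    h = sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)
    if h >= 0.75:
        # Assumption: treat saturated pairs as infinitely distant.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * h / 3.0)


def four_point(d, i, j, k, l):
    """Four-Point Method on the quartet {i, j, k, l}.

    Returns the split ((a, b), (c, d)) whose pairwise sum is the unique
    minimum, or None for the star tree when the minimum is not unique.
    """
    sums = {
        ((i, j), (k, l)): d[i][j] + d[k][l],
        ((i, k), (j, l)): d[i][k] + d[j][l],
        ((i, l), (j, k)): d[i][l] + d[j][k],
    }
    best = min(sums.values())
    winners = [split for split, total in sums.items() if total == best]
    return winners[0] if len(winners) == 1 else None


# Additive matrix of the quartet tree 01|23 with all five edges of weight 1.
d = [[0, 2, 3, 3],
     [2, 0, 3, 3],
     [3, 3, 0, 2],
     [3, 3, 2, 0]]
print(four_point(d, 0, 1, 2, 3))  # ((0, 1), (2, 3))
```

On an additive matrix the smallest of the three sums identifies the true split, because the other two sums each count the internal edge twice.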

Given a set Q of trees, one on each quartet of leaves, the Buneman Tree is defined to be the maximally resolved tree satisfying the following condition:
• for all quartets i, j, k, l, if T restricted to i, j, k, l induces a binary tree (instead of a star), then the tree in Q on i, j, k, l is the same binary tree.

Such a tree always exists, since the star tree satisfies this constraint. What is nice is that the maximally resolved tree with this property is unique, as the following lemma shows.

Lemma 1. Let d be an input dissimilarity matrix, let Q be the set of four-leaf trees defined by the FPM, and let T be the Buneman Tree defined by d. Then $C(T)$ is the set of bipartitions $(A, B)$ defined by
• for all $\{a, a'\} \subseteq A$ and $\{b, b'\} \subseteq B$, the tree $aa'|bb' \in Q$.

The Agarwala et al. Method. The $L_\infty$-nearest tree problem is to find, for input dissimilarity matrix d, an additive matrix D minimizing $L_\infty(d, D)$. This is an NP-hard optimization problem (Farach et al., 1995), but can be 3-approximated in polynomial time. Thus, a 3-approximation algorithm for the $L_\infty$-nearest tree problem takes as input a dissimilarity matrix d, and returns an additive metric $D'$ such that $L_\infty(d, D') \le 3 L_\infty(d, D)$ for all additive metrics D. There are several 3-approximation algorithms for this problem, of which the one by Agarwala et al. (1996) is the first (to our knowledge) with a provable performance guarantee.

Neighbor-Joining. This is a polynomial time method very much favored in the systematic biology community for use with large data sets. See Saitou and Nei (1987) for the first paper on this method and Atteson (1997) for a proof of statistical consistency and a bound on its convergence rate.

Convergence rates of some distance methods

We begin with the only established upper bounds on the convergence rates of these three methods. This analysis relates the convergence rate to the divergence of a tree, which we now define.

Definition 5.
Let T be an arbitrary Jukes-Cantor tree and let $\lambda_e$ be the expected number of mutations of a random site on an edge e. Then $\lambda_{ij} = \sum_{e \in P_{ij}} \lambda_e$, where $P_{ij}$ is the path in the tree T between leaves i and j. We define the divergence of the tree T to be $\lambda_{\max} = \max_{ij} \{\lambda_{ij}\}$. Note that $\lambda_{\max}$ is unbounded even when n (the number of leaves) is bounded, due to changes that are not observed.

Theorem 1. Let T be a Jukes-Cantor tree with n leaves, let $\lambda_e$ be the expected number of changes of a random site on edge e, and let $f = \min_e \lambda_e$. Let $d_{ij}$ be an estimation of $\lambda_{ij}$, the expected number of changes of a random site on the path between i and j. Let $\varepsilon = L_\infty(d, \lambda)$.
1. The Neighbor-Joining method is guaranteed to be accurate if $\varepsilon < f/2$.
2. The Buneman Tree method is guaranteed to be accurate if $\varepsilon < f/2$. Furthermore, every edge $e \in E(T)$ such that $\lambda_e > 2\varepsilon$ is recovered by the Buneman Tree Method.
3. Any 3-approximation algorithm for the $L_\infty$-nearest tree problem is guaranteed to be accurate if $\varepsilon < f/8$. Furthermore, if $\lambda_e > 8\varepsilon$, then the edge e is recovered by the algorithm.

For each method above, there is a constant C that depends upon f and $\delta$ such that if the sequence length k exceeds $C \log n \cdot e^{O(\lambda_{\max})}$ then with probability at least $1 - \delta$, the method is accurate given sequences of length k generated on T.

Proof. (1) was proven in Atteson (1997), and (2) and (3) were proven in Erdős et al. (1999) and Huson et al. (1998b). The bound on the convergence rate is based upon the sufficient condition for accuracy, established in items (1)-(3) above, and is given in Erdős et al. (1999).

Here is an intuition about why this result should be true. Recall that $\lambda_{ij} = E[X_{ij}]$. As $X_{ij}$ is a Poisson random variable, its expectation is the same as its variance. Consequently, $\lambda_{ij} = \mathrm{Var}[X_{ij}]$, and when the variance is high, errors in estimating $\lambda_{ij}$ are also high.

These results prove statistical consistency, and also provide upper bounds on the sequence length that suffices for accuracy with high probability for these three methods. They also provide bounds on the false-negative rate for the Buneman Tree Method, and for any 3-approximation algorithm for the $L_\infty$-nearest tree. The term $\lambda_{\max}$ is the important term in this upper bound, and it is bounded by $g \cdot \mathrm{diam}(T)$, where $g = \max_e \lambda_e$, and where $\mathrm{diam}(T)$ denotes the length of the longest path (measured in terms of the number of edges) in the tree T. The diameter of a tree on n leaves can be as much as $n - 1$, or as small as $O(\log n)$, but under the uniform distribution it is typically $\Omega(\sqrt{n})$, as was shown in Erdős et al. (1999). Therefore, this theorem shows that the convergence rate is bounded by a function of n that is at worst exponential, and is typically superpolynomial.

Fast convergence

Theorem 1 gives an upper bound on the sequence length that suffices for accuracy for three different distance methods, but the upper bound is high. This is unsatisfactory, because we would like methods to converge more quickly to the true tree than those upper bounds would suggest. We therefore define fast convergence as follows:

Definition 6. A method is said to be fast converging for the Jukes-Cantor model if for all fixed f, g with $0 < f \le g$, and all Jukes-Cantor trees T with $f \le \lambda_e \le g$ for all edges e, and all $\delta > 0$, there is a constant C that depends upon f, g, and $\delta$, and a polynomial $p(n)$ so that if k exceeds $C \cdot p(n)$ then the method recovers the true tree topology from sequences of length k with probability at least $1 - \delta$.
Until recently, there were no proofs of fast convergence, and in fact, no bounds on the convergence rates of any method. However, in recent years, there have been several papers providing bounds on the convergence rates for various distance-based methods (Ambainis et al., 1997; Erdős et al., 1997a, 1999; Farach and Kannan, 1996), and the introduction of the first provably fast-converging methods (Csűrös and Kao, 1999; Erdős et al., 1997a,b, 1999; Cryan et al., 1998).

Proving fast convergence

The fast-converging methods have a similar structure, in that they use only close relationships to determine the tree. The proofs of fast convergence differ somewhat, but there is in general a common idea behind the proofs, which suggests a common proof technique. We now introduce this proof technique.

Definition 7. Let $\varepsilon_{ij} = |d_{ij} - \lambda_{ij}|$ and let $\varepsilon(q) = \max\{\varepsilon_{ij} : \min(d_{ij}, \lambda_{ij}) \le q\}$.

If q is large, then $\varepsilon(q)$ will converge to 0 more slowly than if q is small, as we now show:

Theorem 2. Let T be a Jukes-Cantor tree. For all $w > 0$, $\delta > 0$, and $y > 0$, there exists a constant C that depends upon $\delta$ and y, such that the sequence length that suffices for $\varepsilon(w) < y$ with probability at least $1 - \delta$ is $C \log n \cdot e^{O(w)}$.

Proof. The proof follows from results obtained in the proofs of Theorems 8 and 9 from Erdős et al. (1999). The proof of Theorem 8 shows the following: For all Jukes-Cantor model trees $(T, \lambda)$, and for all $\delta > 0$, $q > 0$, $y > 0$, there exists a constant C that depends upon $\delta$ and y, such that if the sequence length k exceeds $C \log n \cdot e^{O(q)}$, then $\mathrm{Prob}[|d_{ij} - \lambda_{ij}| < y \ \forall i, j \text{ with } \lambda_{ij} \le q] > 1 - \delta$. Similarly the proof of Theorem 9 shows the following: For all Jukes-Cantor model trees $(T, \lambda)$, and for all $\delta > 0$, $q > 0$, $y > 0$, there exists a constant C that depends upon $\delta$ and y, such that if the sequence length k exceeds $C \log n \cdot e^{O(q)}$, then $\mathrm{Prob}[|d_{ij} - \lambda_{ij}| < y \ \forall i, j \text{ with } d_{ij} \le q] > 1 - \delta$. The proof of the theorem follows from these two observations.

Corollary 1.
Let M be a fixed phylogenetic method and assume that there exist functions $y(f)$, $A(g)$, and $F(g, n)$, so that for each Jukes-Cantor tree $(T, \lambda)$ on n leaves with $0 < f \le \lambda_e \le g$ for all edges e,

• $y(f) = \Theta(f)$,
• $F(g, n) = O(A(g) \log n)$, and
• for all input dissimilarity matrices d, whenever $\varepsilon(F(g, n)) < y(f)$ then M is correct on input d.
Then M is fast converging for the Jukes-Cantor model.

Proof. By Theorem 2, the sequence length that suffices for $\varepsilon(F(g, n)) < y(f)$ is $O(\log n \cdot e^{O(F(g, n))})$. But this is $O(\log n \cdot n^{O(A(g))})$, which is bounded by a polynomial in n since we have bounded g. Note that the smaller $A(g)$ is, the smaller the degree of the polynomial.

Thus, a proof technique for establishing fast convergence under the Jukes-Cantor model is to show that functions such as y, A, and F exist. The first methods that were proven fast converging were the Short Quartet Methods (Erdős et al., 1997a,b, 1999), followed by the Harmonic Greedy Triplets method (Csűrös and Kao, 1999), a method proposed by Cryan et al. (1998), and the fast-converging method in this paper. The techniques used to prove fast convergence in these papers are not alike, but the ones in Erdős et al. (1997b, 1999) can be restated as using the technique above, and we believe the other proofs can also be so restated. The new fast-converging method we present is obtained by using a general technique for boosting the performance of phylogenetic methods, which we call the Disk-Covering Method, or DCM.

3. DCM-BUNEMAN, A FAST-CONVERGING METHOD

The basic structure of DCM-Buneman is quite similar to the basic structure of the Short Quartet Methods, from which it is derived. These methods have two phases, and take as input a dissimilarity matrix d. During the first phase a collection of trees $T_q$ is constructed, one for each $q \in \{d_{ij}\}$. The input to the reconstruction of $T_q$ is the submatrix of d consisting only of those entries for which $d_{ij} \le q$. Note that for $q < q'$, the input to the reconstruction of $T_q$ is more reliable than the input to the reconstruction of $T_{q'}$, because $\varepsilon(q) \le \varepsilon(q')$.
On the other hand, the amount of data given as input to the reconstruction of $T_{q'}$ is more than the amount of data given as input to the reconstruction of $T_q$. Thus, the trees $T_q$ will differ as q ranges over the entries $d_{ij}$, because they will be based upon differing quantities and qualities of data, and the task of the second phase is to select, from the returned trees, one that made best use of the input. The particular instantiations of the two phases differ between these different fast-converging methods.

In the Short Quartet Methods, when reconstructing $T_q$, we infer a tree on each quartet of leaves if its maximum interleaf distance is bounded by q. A unique tree on the entire set of leaves is then sought that agrees with every quartet tree in the input; if no such tree can be found, then $T_q$ is left undefined. The analysis of the Short Quartet Method showed that when the sequences are long enough, then for all q such that $T_q$ is defined, $T_q$ is the true tree with high probability. The Short Quartet Methods are polynomial time and fast converging, but by design either reconstruct the true tree or fail (with high probability) to reconstruct anything.

The fast-converging method that we have designed is very similar to the Short Quartet Method, but it has two distinct advantages: first, it always reconstructs a tree, and second, it has much better performance in experimental performance studies. Thus, while it does not provide any theoretical advantage over the Short Quartet Methods, it provides empirical advantages. Furthermore, the fast-converging method we present is the result of applying the Disk-Covering Method (or DCM) to a simple polynomial time method. DCM is a very general phylogenetic method booster, which can be used with any phylogenetic reconstruction method.
Our experimental performance study shows that DCM boosting improves the accuracy of many distance-based methods at realistic sequence lengths.

Phase I of DCM-Buneman

We now describe how we compute each $T_q$, as q ranges over the entries of $d_{ij}$. Much of the algorithm for computing $T_q$ is based upon graph-theoretic concepts and results, and so we will begin with some graph theory.

Graph-theoretic material.

Definition 8. A graph is triangulated (or chordal) if no subset of nodes induces a cycle of size four or more.

There are many theoretical results established about triangulated graphs, and many NP-hard problems become solvable in polynomial time when restricted to triangulated graphs. The following can be found in Buneman (1974) and Golumbic (1980).

Lemma 2. Every triangulated graph is the intersection graph of subtrees of a tree, and vice versa. Every triangulated graph G has a simplicial elimination ordering, $v_1, v_2, \ldots, v_n$; this is an ordering of the nodes so that the set $X_i = \{v_j : j > i \text{ and } (v_i, v_j) \in E\}$ forms a clique [i.e., for all $\{v_k, v_l\} \subseteq X_i$, $(v_k, v_l) \in E$]. The maximal cliques (cliques that cannot be enlarged by the addition of any further vertices) in G are of the form $\{v_i\} \cup X_i$; hence there are at most n maximal cliques and these can be found in $O(n^2)$ time. Given a triangulated graph, a simplicial elimination ordering for the graph can be found in $O(n^2)$ time, and from it the maximal cliques can also be found in that time.

We now define threshold graphs.

Definition 9. Let d be an $n \times n$ dissimilarity matrix (i.e., a symmetric matrix that is 0 on the diagonal) and let q be any real number. The threshold graph $\mathrm{Thresh}(d, q)$ is defined as follows. The vertex set is $\{1, 2, \ldots, n\}$ and $(i, j)$ is an edge if and only if $d_{ij} \le q$.

Lemma 3. If d is an additive matrix, then $\mathrm{Thresh}(d, q)$ is triangulated.

Proof. (We are not the first to observe this fact, but we provide a proof because it is extremely simple.) By Lemma 2, to prove a graph is triangulated, it suffices to prove that it is isomorphic to the intersection graph of subtrees of a tree. Let d be an arbitrary additive matrix, and let $(T, w)$ be the edge-weighted tree associated uniquely to d. Let $q > 0$ be given.
Add intermediate vertices to the edges of T and reweight the edges so that the path distances between leaf pairs are unchanged, but so that for every pair of leaves u, v in T, if $d_{u,v} > q/2$ then there is a node x in the enlarged tree $T'$ so that $d_{T'}(u, x) = q/2$ and $d_{T'}(x, v) = d_{T'}(u, v) - q/2$. Now let $X_u$ denote the subtree of $T'$ consisting of the nodes at distance at most $q/2$ from u. Note that $X_u \cap X_v \ne \emptyset$ if and only if $d_{u,v} \le q$, and that the threshold graph $\mathrm{Thresh}(d, q)$ is identical to the intersection graph of the $X_u$, as u ranges over the leaves of T. Consequently $\mathrm{Thresh}(d, q)$ is triangulated.

Constructing $T_q$. We now describe how we compute a particular $T_q$.
• Step 1: Compute $\mathrm{Thresh}(d, q)$.
• Step 2: Triangulate $\mathrm{Thresh}(d, q)$: Add edges to $\mathrm{Thresh}(d, q)$ to make it triangulated, while minimizing the weight of the largest edge added. (The weight of edge i, j is given by $d_{i,j}$.) We call the resultant triangulated graph $\mathrm{Thresh}^*(d, q)$.
• Step 3: Compute Buneman Trees for all maximal cliques in $\mathrm{Thresh}^*(d, q)$. Each maximal clique defines a subset of the taxa (for example, as represented by the DNA sequences at the associated leaves of the tree). We compute the Buneman Tree for each such subset of the taxa.
• Step 4: Merge the subtrees into a supertree.

We now discuss the specific techniques we use to implement the various steps, and their computational complexity. Computing the threshold graph is polynomial time, but minimally triangulating the threshold graph is NP-hard (McMorris et al., 1994). [In practice, the triangulation can be obtained using greedy heuristics, and these will generally perform well when $\mathrm{Thresh}(d, q)$ is close to triangulated. Because we compute d using the Jukes-Cantor distance calculation, for long sequences d is close to additive, and so $\mathrm{Thresh}(d, q)$ will be close to triangulated. Thus, polynomial time techniques can be used to triangulate the threshold graph without too much loss in performance, as our experiments suggest.]
Calculating maximal cliques in triangulated graphs is polynomial time (Golumbic, 1980). The merger of the subtrees into a supertree is the only remaining task, but we show that we can accomplish this merger in polynomial time and ensure accuracy in the supertree, when the subtrees are correct and based upon a large enough threshold graph.
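As an illustration of Definition 9 and Lemma 2, the following minimal Python sketch builds a threshold graph, searches for a simplicial elimination ordering, and reads the maximal cliques off that ordering. The function names and the adjacency-set representation are assumptions made for this example, and the simple search shown here is slower than the $O(n^2)$ bound quoted above; it is meant only to make the definitions concrete.

```python
from itertools import combinations


def threshold_graph(d, q):
    """Adjacency sets of Thresh(d, q): i and j are adjacent iff d[i][j] <= q."""
    n = len(d)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if d[i][j] <= q:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def simplicial_elimination_ordering(adj):
    """Return an ordering in which the later neighbours of every vertex form
    a clique, or None if no such ordering exists (graph not triangulated)."""
    remaining = set(adj)
    order = []
    while remaining:
        for v in sorted(remaining):
            nbrs = adj[v] & remaining
            # v is simplicial if its remaining neighbours are pairwise adjacent
            if all(b in adj[a] for a, b in combinations(nbrs, 2)):
                order.append(v)
                remaining.remove(v)
                break
        else:
            return None  # no simplicial vertex left
    return order


def maximal_cliques(adj, order):
    """Candidate cliques {v_i} union X_i, filtered down to the maximal ones."""
    pos = {v: k for k, v in enumerate(order)}
    candidates = [
        frozenset({v} | {u for u in adj[v] if pos[u] > pos[v]}) for v in order
    ]
    return [c for c in candidates if not any(c < other for other in candidates)]


# Additive matrix of a five-leaf caterpillar tree with unit edge lengths
# (an illustrative input, not data from the paper).
d = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]
adj = threshold_graph(d, 3)
order = simplicial_elimination_ordering(adj)
cliques = maximal_cliques(adj, order)
```

For this input the threshold graph at $q = 3$ consists of two triangles sharing leaf 2, so the two maximal cliques are the taxon subsets on which a base method would be run.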

Supertree Construction Algorithm (SCA).

The construction of the supertree from the subtrees is an interesting problem, because we would like to ensure that if all the subtrees are correct (in that the true tree induces these subtrees when restricted to the subsets of leaves) then a supertree consistent with all the subtrees should be returned. However, this generalizes to the Subtree Compatibility Problem, which is NP-complete (Steel, 1992). Thus we will need a special case of the Subtree Compatibility Problem if we are to solve this problem exactly and in polynomial time. (Note that while we were willing to accept a suboptimal triangulation, we are not willing to suboptimally construct the supertree, because we need to obtain the true tree, if possible, and not an incorrect tree.) Also, even if the subtrees are correct, they may not uniquely define the supertree (i.e., many different trees may be consistent with the set of subtrees). For this reason, the set of subsets must be defined with care, with respect both to computational consequences and to uniqueness of the supertree compatible with the subtrees.

We now describe how we compute a supertree from a set of subtrees. Our algorithm has the nice property that when applied to properly defined inputs, it is guaranteed to reconstruct a unique supertree consistent with the inputs. We assume that the input to the supertree construction algorithm is a triangulated graph $G$ and a collection of subtrees, one for each maximal clique in $G$. We let $T_C$ denote the subtree for clique $C$. If $G$ is not connected, then the algorithm produces a forest (i.e., a tree on each component of $G$); thus we will assume that $G$ is connected.

Stage I: Preprocessing: First obtain a simplicial elimination ordering $v_1, v_2, \ldots, v_n$ for $G$. Compute $C_i = \{v_i\} \cup X_i$, where $X_i = \Gamma(v_i) \cap \{v_{i+1}, v_{i+2}, \ldots, v_n\}$ is the set of neighbors of $v_i$ that follow it in the simplicial elimination ordering.
The set of maximal cliques is a subset of the set $\{C_1, C_2, \ldots, C_n\}$. For each $C_i$, find a maximal clique $C$ containing $C_i$ and compute a tree for $C_i$ by deleting the leaves in $C \setminus C_i$ from $T_C$. In this way, we associate a tree $t_i$ with every $C_i$.

Stage II: Construct the tree: For $i = n-4, n-5, \ldots, 1$, compute the tree $T_i$ formed by merging $t_i$ and $T_{i+1}$, using the Strict Consensus Subtree Merger method.

Strict Consensus Subtree Merger. The Strict Consensus Subtree Merger method contracts a minimum set of edges in each of the two trees in order to make them identical on the subtrees they induce on their shared leaf set $X$. The strict consensus (Day, 1995) of the induced subtrees is defined to be the maximally resolved tree that is a common contraction of the two subtrees. We will call this subtree on $X$ the backbone. Merging the two trees together is then achieved by attaching the pieces of each tree appropriately to the different edges of the backbone.

It is worth noting that the strict consensus subtree merger of two trees, while it always exists, may not be unique. In other words, it may be that some piece of each tree attaches onto the same edge of the backbone. We call this a collision. For example, in Fig. 1, the common intersection of the two leaf sets is $X = \{1, 2, 3, 4\}$, and the strict consensus of the two subtrees induced by $X$ is the 4-star. This is the backbone; it has four edges, and there is a collision on the edge of the backbone incident to leaf 4, but no collision on any other edge.

Collisions are problematic, as the Strict Consensus Subtree Merger will potentially introduce false edges or lose true edges when they occur. However, as we will show, when the subtrees are correct and the threshold is selected to be large enough, there are no collisions. In this case, the true tree is reconstructed.

FIG. 1. Merging two trees together, by first transforming them (through edge contractions) so that they induce the same subtrees on their shared leaves.

Theorem 3. Let $G$ be a triangulated graph with $n$ vertices, let $\mathcal{T}$ be the associated set of trees on each maximal clique, and assume that $G$ and $\mathcal{T}$ are given as input. Then SCA takes $O(n^2)$ time.

Proof. The proof of this is straightforward. Computing the simplicial (perfect) elimination ordering takes $O(n^2)$ time, and the rest follows from the observation that merging two trees takes $O(n)$ time since computing the strict consensus of two trees takes $O(n)$ time (Day, 1995).

Conditions under which $T_q$ is the true tree.

We now describe the conditions under which the reconstructed tree $T_q$ is the true tree. We begin with some definitions.

Definition 10. Let $(T, w)$ be a binary tree edge weighted by $w : E(T) \to \mathbb{R}^+$, and leaf labeled by the set $S = \{1, 2, \ldots, n\}$ of taxa. Let $l$ be the additive distance matrix associated to $T$. Let $e$ be an edge in $T$ that is not incident to a leaf of $T$. Around $e$, there are four subtrees, $A$, $B$, $C$, and $D$. Let $a$, $b$, $c$, and $d$ be four leaves, one in each of the four subtrees $A$, $B$, $C$, and $D$, respectively, closest to $e$ [where the distance between nodes $p$ and $q$ is measured as $\sum_{e \in P_{pq}} w(e)$]. We call $\{a, b, c, d\}$ a short quartet around $e$, and the collection of all short quartets around internal edges of $T$ is denoted by $Q_{\text{short}}(T)$. The maximum $l_{ij}$ such that $i$ and $j$ are in a short quartet together is called the $l$-width$(T)$. The graph $G_{sq}$ on vertex set $S = \{1, 2, \ldots, n\}$ is defined by $(i, j) \in E(G_{sq})$ if $i$ and $j$ are in some short quartet together.

Theorem 4. Let $T$ be a fixed leaf-labeled tree, let $G$ be a triangulated graph such that $G_{sq} \subseteq G$, and assume that the Buneman Tree Method applied to each maximal clique in $G$ reconstructs the correct subtree (i.e., it reconstructs the subtree of $T$ induced by the maximal clique). Let $\mathcal{T}$ be the collection of Buneman Trees on maximal cliques of $G$, and let $T^*$ be the tree obtained by applying SCA to $(G, \mathcal{T})$. Then $T^* = T$.

Proof. Let $T$ be a tree whose leaves are labeled by $S = \{v_1, v_2, \ldots, v_n\}$.
Let $G$ be a triangulated graph on $S$, and let $\mathcal{T} = \{T_A\}$, where $T_A$ is a tree on leaf set $A$ for every maximal clique $A$ in $G$. Let $s = (v_1, v_2, \ldots, v_n)$ be a simplicial elimination ordering for $G$. Recall the definitions of $t_i$, $T_i$, $X_i$ from the description of SCA. The proof proceeds by showing that $T|_{\{v_i, v_{i+1}, \ldots, v_n\}} = T_i$ for all $i$.

The base case requires that we show that $T_{n-3} = T|_{\{v_{n-3}, v_{n-2}, v_{n-1}, v_n\}}$, but this follows trivially since we assume $T_{n-3}$ is correct. Now assume that $T_i = T|_{\{v_i, v_{i+1}, \ldots, v_n\}}$ for some $i \in \{2, 3, \ldots, n-3\}$. Consider $X_{i-1}$. Note that by definition $X_{i-1} = \Gamma(v_{i-1}) \cap \{v_i, v_{i+1}, \ldots, v_n\}$, and that $X_{i-1}$ forms the leaf set of the backbone of the strict consensus merger of $t_{i-1}$ and $T_i$. Also $X_{i-1}$ is a clique, and so by assumption $T_i|_{X_{i-1}} = t_{i-1}|_{X_{i-1}}$. Consequently there is no edge contraction when we compute the backbone.

To complete the proof that $T_{i-1} = T|_{\{v_{i-1}, v_i, \ldots, v_n\}}$ we need only show that there is no collision formed by the merger of the two trees. There can be a collision only if the backbone contains an edge onto which both $v_{i-1}$ and some other $v_j \notin X_{i-1}$ attach. Let $e$ be the edge onto which $v_{i-1}$ attaches, and suppose there is a collision on this edge $e$. Thus, some subtree $t'$ of $T_i$ attaches onto $e$. (Note that in this case, these are true attachments in the sense that $v_{i-1}$ and $t'$ also attach to the path associated to $e$ in the true tree.) Let the leaf set of $t'$ be $Y$, and note $Y \subseteq \{v_i, v_{i+1}, \ldots, v_n\} \setminus X_{i-1}$. Let $P$ be the path in $T$ corresponding to the edge $e$ and let its endpoints be $a$ and $b$. Consider the subtree $T'$ of $T$ obtained by deleting all the nodes in $T$ that are separated from $a$ by the deletion of $b$, or vice versa, and let $A_{a,b}$ be the leaves of $T'$. In other words, $T'$ consists of the path $P$ and all subtrees of $T$ that attach to interior nodes of $P$. The following conditions are then true:

1. $v_{i-1} \in A_{a,b}$ and all leaves in $t'$ are also in $A_{a,b}$.
2. $G_{sq}$ restricted to $A_{a,b}$ is path connected.
3. $X_{i-1} \cap A_{a,b} = \emptyset$.

The proofs of (1) and (3) follow from the fact that $T_i$ and $t_{i-1}$ are correct. Fact (2) can be proven by induction, and uses the fact that every short quartet in the true tree $T$ induces a four-clique in $G$. Now, let $P'$ be a path lying in $G_{sq} \cap A_{a,b}$ from $v_{i-1}$ to some node in $Y$. Let $y$ be the first node from $Y$ on $P'$; by definition, $y \notin X_{i-1}$. By (3), the path from $v_{i-1}$ to $y$ lies entirely in $\{v_1, v_2, \ldots, v_{i-1}\}$, so that $(v_{i-1}, y) \in E(G)$ [this follows from facts about simplicial elimination orderings, see Golumbic (1980)]. Consequently $y \in \Gamma(v_{i-1}) \cap \{v_i, v_{i+1}, \ldots, v_n\} = X_{i-1}$. However, this contradicts our earlier conclusion that $y \notin X_{i-1}$.

We now describe another condition under which $T_q$ is guaranteed to be the true tree.

Theorem 5. Let $(T, l)$ be a Jukes-Cantor tree, $d$ the input dissimilarity matrix, and $G$ a triangulated graph with $G_{sq} \subseteq G$; thus every short quartet in $T$ induces a four-clique in $G$. Furthermore assume that for every short quartet $\{i, j, k, l\}$ in $T$ the Buneman Tree on $\{i, j, k, l\}$ is $T|_{\{i, j, k, l\}}$ (i.e., the correct tree). If the supertree construction algorithm applied to Buneman Trees on the maximal cliques of $G$ produces a binary tree $T'$, then $T' = T$.

Proof. We begin by citing the result proved by Erdős et al. (1999).

Lemma 4. Let $(T, w)$ be an edge-weighted tree leaf labeled by $S$ and let $T'$ be a tree also leaf labeled by $S$. If for every short quartet $\{i, j, k, l\}$ in $T$, the tree $T'$ induces the same tree on $\{i, j, k, l\}$ as $T$, then $T = T'$.

Now suppose that the supertree construction algorithm produces a binary tree $T'$. In this case, every Buneman Tree on every maximal clique is binary, and there are no collisions during the merger. Therefore, $T'$ agrees with $T$ for every short quartet of $T$. Then by the above stated lemma, $T = T'$.

Phase II of DCM-Buneman

In the previous sections we showed how to compute each $T_q$, and also established two conditions under which $T_q$ would be the true tree. We now show how we select a particular tree $T_q$ to return as the output of the DCM-Buneman Method. We select $T_q$ using the following rule: Return the most resolved $T_q$ (i.e., the one with the most internal edges), and if there is more than one such tree, then return the one associated to the largest $q$.

Performance guarantees of DCM-Buneman

Theorem 6. Let $T$ be a Jukes-Cantor model tree and let $0 < f \le l_e \le g$ for all edges $e$. Recall that $l$-width$(T)$ is the largest $l$-distance between two leaves in a short quartet. Then DCM-Buneman is accurate on input $d$ if $e(l\text{-width}(T) + 3f/2) < f/2$.

Proof. Let $q = l\text{-width}(T) + 3f/2$ and assume that $e(q) < f/2$. Now consider the threshold graph Thresh$[d, l\text{-width}(T) + f/2]$.
Since $e(q) < f/2$, the following are true:

• $G_{sq} \subseteq$ Thresh$[d, l\text{-width}(T) + f/2]$.
• Thresh$[d, l\text{-width}(T) + f/2] \subseteq$ Thresh$[l, l\text{-width}(T) + f]$.
• Thresh$[l, l\text{-width}(T) + f] \subseteq$ Thresh$[d, l\text{-width}(T) + 3f/2]$.

Since $l$ is additive, Thresh$[l, l\text{-width}(T) + f]$ is triangulated by Lemma 3. Hence, the minimal triangulation of Thresh$[d, l\text{-width}(T) + f/2]$ is a subgraph of Thresh$[d, l\text{-width}(T) + 3f/2]$. Consequently the Buneman Tree method computes the correct tree for every maximal clique in Thresh$^*[d, l\text{-width}(T) + f/2]$. By Theorem 4, the strict consensus subtree merger reconstructs the true tree. Hence there is at least one threshold $p$ for which $T_p$ is the true tree, and it is $p = l\text{-width}(T) + f/2$. In Phase II of DCM-Buneman we select the most resolved tree, and if there is more than one equally resolved tree, we select the tree associated to the largest threshold. So suppose there is a $p' \ge p$ such that $T_{p'}$ is also binary. By Theorem 5, they are identical, and both equal $T$. Thus if $e(q) < f/2$ where $q = l\text{-width}(T) + 3f/2$, then DCM-Buneman reconstructs the true tree.

Theorem 7. DCM-Buneman is fast converging for the Jukes-Cantor model.

Proof. By Theorem 6 and Corollary 1, all we need to establish is that $l\text{-width}(T) + 3f/2 = O(g \log n)$. Clearly $l\text{-width}(T) + 3f/2 = O(l\text{-width}(T))$. Then $l\text{-width}(T) = O(g \log n)$, as was shown in Erdős et al. (1999). Thus, DCM-Buneman is fast converging.

A comparison between the convergence rates of the Buneman Tree Method and DCM-Buneman is interesting. Let $(T, l)$ be a fixed Jukes-Cantor model tree. The Buneman Tree Method is statistically consistent for the Jukes-Cantor model of evolution, but the only established upper bounds on the convergence rate indicate that it converges from sequence lengths that grow exponentially in $l_{\max} = \max_{ij} l_{ij}$, and so the Buneman

Tree Method is not likely to be fast converging (see discussion following Theorem 1). However, for the same tree, the convergence rate of the DCM-Buneman method is much faster, and in fact DCM-Buneman is fast converging. The difference in convergence rates is obtained by restricting attention to the small distances in the data set, rather than using all the distances. Thus DCM-Buneman is a boosted version of the Buneman method.

4. DCM BOOSTING OTHER METHODS

DCM-Buneman is actually a special instantiation of a very general two-phase technique (called the Disk-Covering Method, or DCM), which can be used in conjunction with any phylogenetic method. In the first phase, we construct a tree $T_q$ for each $q \in \{d_{ij}\}$, and in the second phase, we compute a tree on the entire set of taxa, by taking a consensus of the trees $T_q$. When used in conjunction with the phylogenetic method $M$, we call this DCM-$M$. First, the method $M$ is used to reconstruct the subtrees on the maximal cliques of the triangulated threshold graph; second, we design the second phase (taking the consensus of the trees $T_q$) to optimize the performance of the method.

Phase I

We now describe how we perform the first phase, in which a tree $T_q$ is computed for each $q \in \{d_{ij}\}$. Let $S$ be the input set of sequences, and let $q \in \{d_{ij}\}$ be the selected threshold. Let $M$ be the base phylogenetic method.

• We construct the threshold graph Thresh$(d, q)$. We triangulate each component of Thresh$(d, q)$, minimizing the weight of the largest edge added, thus obtaining a triangulated graph Thresh$^*(d, q)$.
• We compute the maximal cliques in Thresh$^*(d, q)$, and compute a tree on each maximal clique using the method $M$.
• We apply the Supertree Construction Algorithm to the set of trees defined for the maximal cliques, obtaining a tree $T_q$ if Thresh$^*(d, q)$ is connected, and otherwise obtaining a forest $F_q$.
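To make the outer loop of Phase I concrete, here is a small Python sketch that enumerates the candidate thresholds $q \in \{d_{ij}\}$ and reports, for each one, whether the threshold graph is connected (so that a single tree $T_q$ would result) or not (so that a forest $F_q$ would result). Triangulation, the base method $M$, and the supertree merger are deliberately left out; the function names and the example matrix are assumptions made for illustration only.

```python
def candidate_thresholds(d):
    """Distinct off-diagonal entries of d, in increasing order."""
    n = len(d)
    return sorted({d[i][j] for i in range(n) for j in range(i + 1, n)})


def components(d, q):
    """Connected components of Thresh(d, q), found by depth-first search."""
    n = len(d)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if v != u and v not in comp and d[u][v] <= q:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(sorted(comp))
    return comps


def phase_one_plan(d):
    """For each threshold, say whether Phase I would yield a tree or a forest."""
    return [
        (q, "tree" if len(components(d, q)) == 1 else "forest")
        for q in candidate_thresholds(d)
    ]


# Illustrative additive matrix (five-leaf caterpillar, unit edge lengths).
d = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]
plan = phase_one_plan(d)
```

Adding edges during triangulation of each component does not change connectivity, which is why the sketch can test the untriangulated threshold graph.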
This first phase can be modified to allow for the triangulation of the threshold graph to be done suboptimally, and in practice this is what we have done (greedy triangulations affect the performance of the method only very slightly, as our experiments show). The construction of a supertree from subtrees can be implemented in various ways, as well. We have elected to be quite conservative, and hence our Supertree Construction Algorithm employs the Strict Consensus Subtree Merger technique; however, this can also be modified.

Phase II

In the second phase, we take the trees $T_q$ we have computed in Phase I and infer a consensus of these trees. We have experimented with using DCM boosting in conjunction with several distance-based methods, including Neighbor-Joining (NJ), the Agarwala et al. (1996) algorithm that 3-approximates the $L_\infty$-nearest tree, and the Buneman Tree method. Our experimental studies for all these methods indicate that for almost every small $q$, the tree $T_q$ has very low false-positive rates, typically close to 0. Consequently, almost all $T_q$ are either contractions of the true tree or close to being contractions of the true tree. (This is perhaps not at all surprising, since we designed the merger using the strict consensus technique, and this collapses edges that are not supported by every subtree!) This suggests the following implementation of Phase II: take all the trees $T_q$, and compute the asymmetric median tree of these trees (Phillips and Warnow, 1996). We now define this consensus technique.

Definition 11. The asymmetric median tree of a set of leaf-labeled trees $\mathcal{T} = \{T_1, T_2, \ldots, T_p\}$ is a tree $T$ such that $C(T) \subseteq \bigcup_i C(T_i)$, and such that if each $c \in C(T)$ is weighted by the number of trees $T_i$ that contain $c$, then $w(T) = \sum_{c \in C(T)} w(c)$ is maximum.

The idea behind the asymmetric median tree is that when the input trees have low false-positive rates, the asymmetric median tree method recovers as many of the true edges as possible.
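Definition 11 asks for a maximum-weight set of pairwise compatible bipartitions. One simple heuristic, sketched below in Python, adds bipartitions in order of decreasing weight whenever they are compatible with everything chosen so far. This is only an illustration under stated assumptions: each tree is represented as a set of bipartitions, each bipartition is encoded as the frozenset of leaves on the side that does not contain a fixed reference leaf, and the function names are hypothetical. It is not claimed to be the specific greedy strategy used in the experiments, and it is not guaranteed to find an optimal solution.

```python
from collections import Counter


def compatible(a, b):
    """Two bipartitions (sides avoiding the reference leaf) can coexist in one
    tree iff the sides are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a


def greedy_asymmetric_median(trees):
    """Greedily pick a heavy, pairwise compatible set of bipartitions."""
    # Weight of a bipartition = number of input trees that contain it.
    weights = Counter(c for tree in trees for c in set(tree))
    chosen = []
    # Heaviest first; ties broken by sorted leaf labels for determinism.
    for c, _ in sorted(weights.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
        if all(compatible(c, other) for other in chosen):
            chosen.append(c)
    return set(chosen)


# Three toy trees on leaves 0..4, with leaf 0 as the reference leaf.
t1 = {frozenset({3, 4}), frozenset({2, 3, 4})}
t2 = {frozenset({3, 4}), frozenset({2, 3, 4})}
t3 = {frozenset({2, 3}), frozenset({1, 2, 3})}
median = greedy_asymmetric_median([t1, t2, t3])
```

In this toy input the two bipartitions supported by two trees are kept, and both bipartitions of the third tree are rejected because each conflicts with one already chosen.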
Computing the asymmetric median tree is NP-hard, and so in practice we have implemented this using a greedy strategy (which is not guaranteed to find an optimal solution). This greedy technique nevertheless has good empirical performance,

as our experimental study shows. This implementation of Phase II can be used with DCM-Buneman as well, but under these conditions we do not have provable performance guarantees.

5. EXPERIMENTAL RESULTS

We briefly describe a small portion of our experimental performance analysis. For additional performance results based upon simulating sequence evolution, see Huson et al. (1998a).

Model trees and simulated datasets

The basic model tree that we use has its topology and rates of evolution along the edges based upon reconstructions of the African Eve data set (Maddison et al., 1992) restricted to its human mitochondrial DNA sequences. We then scaled the rates of evolution on this basic model tree up, to produce a number of different trees on which there were high evolutionary rates. We used this larger set of model trees to generate several hundred different sets of DNA sequences using the ecat simulator (Rice, 1997), and using the Jukes-Cantor model of evolution. Later, we will report on the performance of DCM boosting on one particular scaled-up version of this basic tree (in which the largest probability of change on any edge is equal to 0.48).

Distance calculations

We computed Jukes-Cantor distance matrices for each data set. On some data sets the rate of evolution was high enough that some pairs of sequences differed in 75% or more of their positions. For such pairs of sequences, the standard Jukes-Cantor distance calculation cannot be used since the log cannot be computed. For these pairs, we defined $d_{ij}$ using the following version of the large value replacement technique (Swofford et al., 1996). We computed the maximum Jukes-Cantor distance, multiplied that value by the number $n$ of leaves in the matrix, and replaced all undefined values by this large number. These matrices were then input to six different distance-based methods: Neighbor-Joining (NJ), the Agarwala et al.
method, the Buneman Tree, and the DCM-boosted versions of these three methods.

Performance evaluation criteria

We explored performance with respect to accuracy of the topology recovered by each method, by comparing the reconstructed tree to the model tree. Recall that this accuracy is quantified by examining false-negative (FN) rates and false-positive (FP) rates (see Definition 2). Recall that sequence lengths beyond 5000 nucleotides are considered unusually long for tree reconstruction, and that in general convergence to the true tree, or to acceptable error rates, within 1000 nucleotides is thus the critical test of performance. (All these methods will converge to the true tree given long enough sequences, since they are all statistically consistent under this model of evolution; the question is at what rate.) Also, for systematic biology purposes, error rates below 5% can probably be tolerated, though of course this will depend upon the tree. Hence, we examined these experiments with the following specific questions in mind:

• At what sequence length do we get an error rate below 5%?
• At what sequence length (if any) do we recover the true tree reliably?
• How well do the different methods do when restricted to typical length sequences (between 200 and 1200 nucleotides)?

Since we are interested in how DCM boosting affects performance, we will specifically address how DCM-boosted methods differ from their base methods with respect to these three questions.

Summary of experimental results

We report on the results of a set of experiments on the African Eve tree with rates of evolution scaled up so that the maximum probability of change was 0.48. This model tree is a good example of how DCM boosting affects performance when the tree is a difficult one to reconstruct, due to the combination of large numbers of taxa and high divergence. Here are some of the basic observations about the performance of these six methods on this tree.
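The distance calculation described under "Distance calculations" above can be sketched in a few lines of Python. The sketch assumes the standard Jukes-Cantor correction $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the fraction of sites at which two aligned sequences differ, and applies the large value replacement rule as stated in the text: undefined entries are set to $n$ times the largest defined distance. The function names and the toy sequences are assumptions made for this example.

```python
import math


def p_distance(a, b):
    """Fraction of aligned positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_matrix(seqs):
    """Jukes-Cantor distances, with large value replacement for pairs that
    differ in 75% or more of their positions (where the log is undefined)."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    undefined = []
    for i in range(n):
        for j in range(i + 1, n):
            p = p_distance(seqs[i], seqs[j])
            if p < 0.75:
                d[i][j] = d[j][i] = -0.75 * math.log(1 - 4 * p / 3)
            else:
                undefined.append((i, j))
    # Largest defined distance; undefined entries are still 0.0 at this point.
    largest = max((d[i][j] for i in range(n) for j in range(i + 1, n)), default=0.0)
    for i, j in undefined:
        d[i][j] = d[j][i] = n * largest
    return d


# Toy alignment: the first pair differs at 25% of sites, the others at >= 75%.
dist = jukes_cantor_matrix(["AAAA", "AAAT", "TTTT"])
```

Scaling by $n$ keeps the replaced entries larger than any path a tree could assemble from defined distances, which is presumably the motivation for that choice.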


More information

arxiv: v1 [q-bio.pe] 1 Jun 2014

arxiv: v1 [q-bio.pe] 1 Jun 2014 THE MOST PARSIMONIOUS TREE FOR RANDOM DATA MAREIKE FISCHER, MICHELLE GALLA, LINA HERBST AND MIKE STEEL arxiv:46.27v [q-bio.pe] Jun 24 Abstract. Applying a method to reconstruct a phylogenetic tree from

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

CSCI1950 Z Computa4onal Methods for Biology Lecture 5 CSCI1950 Z Computa4onal Methods for Biology Lecture 5 Ben Raphael February 6, 2009 hip://cs.brown.edu/courses/csci1950 z/ Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC

More information

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Boundary cliques, clique trees and perfect sequences of maximal cliques of a chordal graph

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Boundary cliques, clique trees and perfect sequences of maximal cliques of a chordal graph MATHEMATICAL ENGINEERING TECHNICAL REPORTS Boundary cliques, clique trees and perfect sequences of maximal cliques of a chordal graph Hisayuki HARA and Akimichi TAKEMURA METR 2006 41 July 2006 DEPARTMENT

More information

The Strong Largeur d Arborescence

The Strong Largeur d Arborescence The Strong Largeur d Arborescence Rik Steenkamp (5887321) November 12, 2013 Master Thesis Supervisor: prof.dr. Monique Laurent Local Supervisor: prof.dr. Alexander Schrijver KdV Institute for Mathematics

More information

The Complexity of Constructing Evolutionary Trees Using Experiments

The Complexity of Constructing Evolutionary Trees Using Experiments The Complexity of Constructing Evolutionary Trees Using Experiments Gerth Stlting Brodal 1,, Rolf Fagerberg 1,, Christian N. S. Pedersen 1,, and Anna Östlin2, 1 BRICS, Department of Computer Science, University

More information

Strongly chordal and chordal bipartite graphs are sandwich monotone

Strongly chordal and chordal bipartite graphs are sandwich monotone Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its

More information

Efficient Reassembling of Graphs, Part 1: The Linear Case

Efficient Reassembling of Graphs, Part 1: The Linear Case Efficient Reassembling of Graphs, Part 1: The Linear Case Assaf Kfoury Boston University Saber Mirzaei Boston University Abstract The reassembling of a simple connected graph G = (V, E) is an abstraction

More information

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1

BIOINFORMATICS. Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data. Jijun Tang 1 and Bernard M.E. Moret 1 BIOINFORMATICS Vol. 1 no. 1 2003 Pages 1 8 Scaling Up Accurate Phylogenetic Reconstruction from Gene-Order Data Jijun Tang 1 and Bernard M.E. Moret 1 1 Department of Computer Science, University of New

More information

arxiv: v1 [cs.cc] 9 Oct 2014

arxiv: v1 [cs.cc] 9 Oct 2014 Satisfying ternary permutation constraints by multiple linear orders or phylogenetic trees Leo van Iersel, Steven Kelk, Nela Lekić, Simone Linz May 7, 08 arxiv:40.7v [cs.cc] 9 Oct 04 Abstract A ternary

More information

Reconstruction of certain phylogenetic networks from their tree-average distances

Reconstruction of certain phylogenetic networks from their tree-average distances Reconstruction of certain phylogenetic networks from their tree-average distances Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu October 10,

More information

RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS

RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS KT Huber, V Moulton, C Semple, and M Steel Department of Mathematics and Statistics University of Canterbury Private Bag 4800 Christchurch,

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

arxiv: v5 [q-bio.pe] 24 Oct 2016

arxiv: v5 [q-bio.pe] 24 Oct 2016 On the Quirks of Maximum Parsimony and Likelihood on Phylogenetic Networks Christopher Bryant a, Mareike Fischer b, Simone Linz c, Charles Semple d arxiv:1505.06898v5 [q-bio.pe] 24 Oct 2016 a Statistics

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

A Minimum Spanning Tree Framework for Inferring Phylogenies

A Minimum Spanning Tree Framework for Inferring Phylogenies A Minimum Spanning Tree Framework for Inferring Phylogenies Daniel Giannico Adkins Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-157

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

arxiv: v1 [cs.ds] 1 Nov 2018

arxiv: v1 [cs.ds] 1 Nov 2018 An O(nlogn) time Algorithm for computing the Path-length Distance between Trees arxiv:1811.00619v1 [cs.ds] 1 Nov 2018 David Bryant Celine Scornavacca November 5, 2018 Abstract Tree comparison metrics have

More information

Perfect matchings in highly cyclically connected regular graphs

Perfect matchings in highly cyclically connected regular graphs Perfect matchings in highly cyclically connected regular graphs arxiv:1709.08891v1 [math.co] 6 Sep 017 Robert Lukot ka Comenius University, Bratislava lukotka@dcs.fmph.uniba.sk Edita Rollová University

More information

A CLUSTER REDUCTION FOR COMPUTING THE SUBTREE DISTANCE BETWEEN PHYLOGENIES

A CLUSTER REDUCTION FOR COMPUTING THE SUBTREE DISTANCE BETWEEN PHYLOGENIES A CLUSTER REDUCTION FOR COMPUTING THE SUBTREE DISTANCE BETWEEN PHYLOGENIES SIMONE LINZ AND CHARLES SEMPLE Abstract. Calculating the rooted subtree prune and regraft (rspr) distance between two rooted binary

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Phylogenetics: Parsimony

Phylogenetics: Parsimony 1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent

More information

The minimum G c cut problem

The minimum G c cut problem The minimum G c cut problem Abstract In this paper we define and study the G c -cut problem. Given a complete undirected graph G = (V ; E) with V = n, edge weighted by w(v i, v j ) 0 and an undirected

More information

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Benny Chor Michael Hendy David Penny Abstract We consider the problem of finding the maximum likelihood rooted tree under

More information

BIOINFORMATICS DISCOVERY NOTE

BIOINFORMATICS DISCOVERY NOTE BIOINFORMATICS DISCOVERY NOTE Designing Fast Converging Phylogenetic Methods!" #%$&('$*),+"-%./ 0/132-%$ 0*)543768$'9;:(0'=A@B2$0*)A@B'9;9CD

More information

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS PETER J. HUMPHRIES AND CHARLES SEMPLE Abstract. For two rooted phylogenetic trees T and T, the rooted subtree prune and regraft distance

More information

arxiv: v1 [cs.dm] 29 Oct 2012

arxiv: v1 [cs.dm] 29 Oct 2012 arxiv:1210.7684v1 [cs.dm] 29 Oct 2012 Square-Root Finding Problem In Graphs, A Complete Dichotomy Theorem. Babak Farzad 1 and Majid Karimi 2 Department of Mathematics Brock University, St. Catharines,

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

An Investigation of Phylogenetic Likelihood Methods

An Investigation of Phylogenetic Likelihood Methods An Investigation of Phylogenetic Likelihood Methods Tiffani L. Williams and Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131-1386 Email: tlw,moret @cs.unm.edu

More information

Realization Plans for Extensive Form Games without Perfect Recall

Realization Plans for Extensive Form Games without Perfect Recall Realization Plans for Extensive Form Games without Perfect Recall Richard E. Stearns Department of Computer Science University at Albany - SUNY Albany, NY 12222 April 13, 2015 Abstract Given a game in

More information

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign

CS 581 Algorithmic Computational Genomics. Tandy Warnow University of Illinois at Urbana-Champaign CS 581 Algorithmic Computational Genomics Tandy Warnow University of Illinois at Urbana-Champaign Today Explain the course Introduce some of the research in this area Describe some open problems Talk about

More information

Representations of All Solutions of Boolean Programming Problems

Representations of All Solutions of Boolean Programming Problems Representations of All Solutions of Boolean Programming Problems Utz-Uwe Haus and Carla Michini Institute for Operations Research Department of Mathematics ETH Zurich Rämistr. 101, 8092 Zürich, Switzerland

More information

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

More information

arxiv: v1 [q-bio.pe] 4 Sep 2013

arxiv: v1 [q-bio.pe] 4 Sep 2013 Version dated: September 5, 2013 Predicting ancestral states in a tree arxiv:1309.0926v1 [q-bio.pe] 4 Sep 2013 Predicting the ancestral character changes in a tree is typically easier than predicting the

More information

Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems

Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems Fixed Parameter Algorithms for Interval Vertex Deletion and Interval Completion Problems Arash Rafiey Department of Informatics, University of Bergen, Norway arash.rafiey@ii.uib.no Abstract We consider

More information

Branching. Teppo Niinimäki. Helsinki October 14, 2011 Seminar: Exact Exponential Algorithms UNIVERSITY OF HELSINKI Department of Computer Science

Branching. Teppo Niinimäki. Helsinki October 14, 2011 Seminar: Exact Exponential Algorithms UNIVERSITY OF HELSINKI Department of Computer Science Branching Teppo Niinimäki Helsinki October 14, 2011 Seminar: Exact Exponential Algorithms UNIVERSITY OF HELSINKI Department of Computer Science 1 For a large number of important computational problems

More information

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data Mary E. Cosner Dept. of Plant Biology Ohio State University Li-San Wang Dept.

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

18.5 Crossings and incidences

18.5 Crossings and incidences 18.5 Crossings and incidences 257 The celebrated theorem due to P. Turán (1941) states: if a graph G has n vertices and has no k-clique then it has at most (1 1/(k 1)) n 2 /2 edges (see Theorem 4.8). Its

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Cographs; chordal graphs and tree decompositions

Cographs; chordal graphs and tree decompositions Cographs; chordal graphs and tree decompositions Zdeněk Dvořák September 14, 2015 Let us now proceed with some more interesting graph classes closed on induced subgraphs. 1 Cographs The class of cographs

More information

THE MAXIMAL SUBGROUPS AND THE COMPLEXITY OF THE FLOW SEMIGROUP OF FINITE (DI)GRAPHS

THE MAXIMAL SUBGROUPS AND THE COMPLEXITY OF THE FLOW SEMIGROUP OF FINITE (DI)GRAPHS THE MAXIMAL SUBGROUPS AND THE COMPLEXITY OF THE FLOW SEMIGROUP OF FINITE (DI)GRAPHS GÁBOR HORVÁTH, CHRYSTOPHER L. NEHANIV, AND KÁROLY PODOSKI Dedicated to John Rhodes on the occasion of his 80th birthday.

More information

arxiv: v1 [math.co] 5 May 2016

arxiv: v1 [math.co] 5 May 2016 Uniform hypergraphs and dominating sets of graphs arxiv:60.078v [math.co] May 06 Jaume Martí-Farré Mercè Mora José Luis Ruiz Departament de Matemàtiques Universitat Politècnica de Catalunya Spain {jaume.marti,merce.mora,jose.luis.ruiz}@upc.edu

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Parameterized Complexity of the Sparsest k-subgraph Problem in Chordal Graphs

Parameterized Complexity of the Sparsest k-subgraph Problem in Chordal Graphs Parameterized Complexity of the Sparsest k-subgraph Problem in Chordal Graphs Marin Bougeret, Nicolas Bousquet, Rodolphe Giroudeau, and Rémi Watrigant LIRMM, Université Montpellier, France Abstract. In

More information

Computability-Theoretic Properties of Injection Structures

Computability-Theoretic Properties of Injection Structures Computability-Theoretic Properties of Injection Structures Douglas Cenzer 1, Valentina Harizanov 2 and Je rey B. Remmel 3 Abstract We study computability-theoretic properties of computable injection structures

More information

7 The structure of graphs excluding a topological minor

7 The structure of graphs excluding a topological minor 7 The structure of graphs excluding a topological minor Grohe and Marx [39] proved the following structure theorem for graphs excluding a topological minor: Theorem 7.1 ([39]). For every positive integer

More information

Distance Corrections on Recombinant Sequences

Distance Corrections on Recombinant Sequences Distance Corrections on Recombinant Sequences David Bryant 1, Daniel Huson 2, Tobias Kloepper 2, and Kay Nieselt-Struwe 2 1 McGill Centre for Bioinformatics 3775 University Montréal, Québec, H3A 2B4 Canada

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

The Minimum Rank, Inverse Inertia, and Inverse Eigenvalue Problems for Graphs. Mark C. Kempton

The Minimum Rank, Inverse Inertia, and Inverse Eigenvalue Problems for Graphs. Mark C. Kempton The Minimum Rank, Inverse Inertia, and Inverse Eigenvalue Problems for Graphs Mark C. Kempton A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for

More information

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University Phylogenetics: Parsimony and Likelihood COMP 571 - Spring 2016 Luay Nakhleh, Rice University The Problem Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions

More information

A new algorithm to construct phylogenetic networks from trees

A new algorithm to construct phylogenetic networks from trees A new algorithm to construct phylogenetic networks from trees J. Wang College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia, China Corresponding author: J. Wang E-mail: wangjuanangle@hit.edu.cn

More information

A Polynomial-Time Algorithm for Pliable Index Coding

A Polynomial-Time Algorithm for Pliable Index Coding 1 A Polynomial-Time Algorithm for Pliable Index Coding Linqi Song and Christina Fragouli arxiv:1610.06845v [cs.it] 9 Aug 017 Abstract In pliable index coding, we consider a server with m messages and n

More information

Preliminaries and Complexity Theory

Preliminaries and Complexity Theory Preliminaries and Complexity Theory Oleksandr Romanko CAS 746 - Advanced Topics in Combinatorial Optimization McMaster University, January 16, 2006 Introduction Book structure: 2 Part I Linear Algebra

More information

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E. Math 239: Discrete Mathematics for the Life Sciences Spring 2008 Lecture 14 March 11 Lecturer: Lior Pachter Scribe/ Editor: Maria Angelica Cueto/ C.E. Csar 14.1 Introduction The goal of today s lecture

More information

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals

Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Maximum Likelihood Estimates for Binary Random Variables on Trees via Phylogenetic Ideals Robin Evans Abstract In their 2007 paper, E.S. Allman and J.A. Rhodes characterise the phylogenetic ideal of general

More information

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)}

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} Preliminaries Graphs G = (V, E), V : set of vertices E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) 1 2 3 5 4 V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} 1 Directed Graph (Digraph)

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

CMPUT 675: Approximation Algorithms Fall 2014

CMPUT 675: Approximation Algorithms Fall 2014 CMPUT 675: Approximation Algorithms Fall 204 Lecture 25 (Nov 3 & 5): Group Steiner Tree Lecturer: Zachary Friggstad Scribe: Zachary Friggstad 25. Group Steiner Tree In this problem, we are given a graph

More information

On the Block Error Probability of LP Decoding of LDPC Codes

On the Block Error Probability of LP Decoding of LDPC Codes On the Block Error Probability of LP Decoding of LDPC Codes Ralf Koetter CSL and Dept. of ECE University of Illinois at Urbana-Champaign Urbana, IL 680, USA koetter@uiuc.edu Pascal O. Vontobel Dept. of

More information