Reconciliation with Non-binary Gene Trees Revisited


Yu Zheng and Louxin Zhang
National University of Singapore

Abstract

By reconciling the phylogenetic tree of a gene family with the corresponding species tree, it is possible to infer lineage-specific duplications and losses with high confidence and hence annotate orthologs and paralogs. However, the currently available reconciliation methods for non-binary gene trees are too computationally expensive to be applied at a genomic scale. Here, we present an O(m + n) algorithm to reconcile an arbitrary gene tree with its corresponding species tree, where m and n are the numbers of nodes in the gene and species trees respectively. The improvement is achieved through two innovations: a fast computation of compressed child-image subtrees, and an efficient reconstruction of irreducible duplication histories. This method will be a valuable tool for genome-wide studies of the evolution of individual gene families.

1 Introduction

Given the importance of accurately annotated gene relationships in evolutionary and functional studies of biological systems [15, 23], significant efforts have been invested in developing methods to identify orthologs and paralogs [2, 6, 8, 13, 16, 21, 22]. A pair of genes in different species whose last common ancestor corresponds to a speciation event are orthologs [11]. Two genes (in the same or different species) that descend from a gene duplication event are paralogs. Knowing the orthologs and paralogs of species permits one to reconstruct the duplication history within a gene family. In practice, this is often done by reconciling the phylogenetic tree (the gene tree) of a family with the corresponding species tree, and inferring lineage-specific duplication and loss events [12, 16].

Although a plethora of reconciliation methods have been developed over the past two decades (see the review paper [7]), only recently has this reconciliation process been generalized to non-binary gene trees (see the survey articles [10, 24]). The ability to reconcile non-binary gene trees substantially expands the applications of this method in comparative genomics. First, it expands the range of tools: many widely used phylogenetic programs such as MrBayes produce non-binary gene trees if there is not enough signal in the data to time the divergences. Moreover, reconciling non-binary gene trees obtained by contracting weak branches in binary gene trees produces more accurate duplication events than working directly on the corresponding binary ones (our unpublished data). Second, it allows us to design fast heuristic programs for genome-wide mapping of orthologs and paralogs. For example, SYNERGY implicitly assumes that the gene tree of every gene family is a star tree and heuristically reconciles the star gene tree with the input species tree, which relieves the substantial preprocessing burden of building binary gene trees for individual gene families [23]. This inspired us to work on a bottom-up approach for reconciling non-binary gene trees.

For binary gene trees and species trees, there is an accepted reconciliation process which has been proven to produce the unique duplication history with the fewest gene duplication and gene loss events [4, 14], and whose computational complexity is linear with respect to the number of nodes in the gene and species trees [5, 19, 25]. However, the uniqueness of the result is not so clear for non-binary gene trees, where reconciliation may produce different duplication histories for different cost models [26]. Furthermore, no one has yet designed a linear-time reconciliation algorithm for non-binary gene trees that is guaranteed to generate the history with the minimum number of duplication and loss events. Chang and Eulenstein developed the first algorithm for the problem [3], but their solution has cubic complexity. The dynamic programming algorithm of Durand et al. has the same worst-case time complexity, but can also solve the problem under any affine cost model [9]. Recently, a quadratic algorithm was proposed in [17]. All these methods are computationally intensive when applied on a genomic scale.

In this paper, we present a linear-time algorithm that solves the problem. Our bottom-up approach can incorporate multiple sources of information on gene similarity, including sequence similarity and conserved gene order, and is efficient enough to be used on a genomic level. Hence, it provides a valuable framework for the genome-wide mapping of orthologs and paralogs in any group of species with a known phylogeny, while taking advantage of the rapid increase in fully sequenced genomes.

The rest of this paper is divided into six sections. The reconciliation problem and different cost models are introduced in Section 2. Section 3 presents an algorithm to simultaneously compute all compressed child-image subtrees of the species tree in linear time, which immediately leads to an improved reconciliation method. Section 4 introduces the concept of irreducible duplication history. Section 5 presents a simple algorithm that takes O(m + n) operations to reconcile a gene tree of n nodes and the corresponding species tree of m nodes. In Section 6, we use simulated data to compare the time efficiency of our algorithm with other methods. We conclude with suggestions for future work.

2 Concepts and Notions

2.1 Definitions

Let $T = (V, E)$ be a rooted tree in which one node is designated the root and the branches are oriented away from the root. $V$ is the set of all nodes, and $E$ is the set of all branches (directed edges). For two nodes $u, v \in V$, $v$ is the parent of $u$ if $(v, u) \in E$. Further, $v$ is an ancestor of $u$, and equivalently $u$ is a descendant of $v$, written $v \succ u$, if the unique path from the root to $u$ passes through $v$. We write $v \succeq u$ if $u = v$ or $v \succ u$. For $U \subseteq V$, $\mathrm{lca}(U)$ denotes the most recent common ancestor of the nodes in $U$. The depth of a node in $T$ is the number of branches in the path from the root to it.

In this paper, $|T|$ denotes the number of nodes in $T$. $p(u)$ denotes the parent of a non-root node $u \in V(T)$. $Ch(u)$ denotes the set of children of $u$ in $T$. $V_{lf}(T)$ denotes the set of leaves (terminal nodes) of $T$. $V_o(T)$ denotes the set of internal (non-leaf) nodes of $T$. $T(u)$ denotes the subtree rooted at $u$, which consists of $u$ and all descendants of $u$. $T|_U$ denotes the subtree induced by a subset $U \subseteq V$: the nodes of $T|_U$ are $V' = \{v \in V : \mathrm{lca}(U) \succeq v \succeq u \text{ for some } u \in U\}$ and the edges of $T|_U$ are $E(T) \cap (V' \times V')$.

A node $v$ is said to be binary if it has two children. $T$ is binary if every internal node is binary. If $T$ is non-binary, a binary tree $T'$ is said to be a binary refinement of $T$ if for every $u \in V(T)$ there exists $v \in V(T')$ such that $V_{lf}(T(u)) = V_{lf}(T'(v))$, or equivalently if $T$ can be obtained from $T'$ by branch contraction.

2.2 Species trees

A species tree is a rooted tree in which each leaf is associated with a unique species. For a node $u \in V_{lf}(S)$, the branch $(p(u), u)$ represents the species that labels $u$. For $u \in V_o(S)$, $(p(u), u)$ represents the common ancestor of all the species that label the leaves in $S(u)$, and $u$ represents a speciation event. Here we assume that a species tree is binary, and that the branch entering the root represents the common ancestor of all the species in the tree; this is called the root branch (Figure 1A).
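
The definitions of Section 2.1 can be made concrete with a small sketch. The Python below is illustrative only: the parent-map representation and the helper names `depths` and `lca` are assumptions of this example, not part of the paper, and the lca is found by naive climbing rather than by a constant-time method.

```python
def depths(parent):
    """Depth of every node: number of branches on the path from the root.

    `parent` maps each node to its parent; the root maps to None.
    """
    depth = {}
    for u in parent:
        path = []
        # Climb until the root or a node whose depth is already known.
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        base = -1 if u is None else depth[u]
        for i, v in enumerate(reversed(path), start=1):
            depth[v] = base + i
    return depth


def lca(parent, depth, nodes):
    """Most recent common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    a = nodes[0]
    for b in nodes[1:]:
        # Repeatedly lift the deeper of the two nodes until they meet.
        while a != b:
            if depth[a] >= depth[b]:
                a = parent[a]
            else:
                b = parent[b]
    return a
```

With this representation, $p(u)$ is `parent[u]`, and $v \succeq u$ holds exactly when `lca(parent, depth, [u, v]) == v`.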

Figure 1: A. A binary species tree over six species 1-6. B. A gene tree of nine genes: two each from species 2, 3 and 4, and one each from species 1, 5 and 6. The gene tree has two non-binary nodes. C, D. The child-image subtree of $g$ and its compressed version. Here, $\lambda(g_1) = u$, $\lambda(g_2) = 4$, $\lambda(g_3) = y$, $\lambda(g_4) = 3$, and $\lambda(g_5) = r$.

2.3 Gene trees and gene duplication history

The gene tree reconstructed from the DNA or protein sequences of a gene family represents evolutionary relationships among these genes. However, it may not explicitly represent the duplication history of the gene family. Without knowing the true orthologous and paralogous relationships among the family members, we do not need to distinguish the members that are sampled from the same species. Hence, we label each leaf in the gene tree that represents a gene with the species that hosts the gene today. Consequently, the leaves of the resulting tree are not uniquely labeled in general. Also, gene trees do not need to be binary.

Consider a family $F$ of genes sampled from a collection $X$ of species with a known phylogenetic tree $S$. Assume that $F$ evolved from a unique ancestral gene through $k$ gene duplications and $m$ gene losses in ancestral species (that is, branches) of $S$ (Figure 2A). We further assume that (i) each duplication event gives rise to one new copy of the involved gene; (ii) each copy, as well as the original duplicated gene, has exactly one descendant gene in a species, unless one of the $m$ loss events occurs in the ancestors. The topology $H$ of the duplication history of $F$ is a rooted tree whose leaves are labeled with genes. Since $S$ is binary, each degree-2 node $u \in V(H)$ corresponds to a gene loss, and each degree-3 node with children $u$ and $v$ represents a duplication if it does not correspond to a species tree node (Figure 2C). We use such trees to represent duplication histories.

The duplication (resp. loss) cost $d_H$ (resp. $l_H$) of $H$ is defined to be the number of duplication (resp. loss) events occurring in it. Its mutation cost is defined to be $d_H + l_H$. If we assign weights $w_d$ and $w_l$ respectively to duplication and loss events, the $(w_d, w_l)$-affine cost of $H$ is defined to be $w_d d_H + w_l l_H$.

2.4 The reconciliation problem

The duplication history $H$ can be inferred by reconciling $S$ and the gene tree $G$ of $F$. The symbol $g_s$ denotes a gene $g \in F$ in species $s$. For $U \subseteq V(G)$, $\lambda(U) \stackrel{\mathrm{def}}{=} \{\lambda(u) : u \in U\}$. The lca reconciliation is the map $\lambda : V(G) \to V(S)$ defined as:

$$\lambda(g) = \begin{cases} s & \text{if } g = g_s \in V_{lf}(G), \\ \mathrm{lca}(\lambda(Ch(g))) & \text{if } g \in V_o(G). \end{cases} \qquad (1)$$

If $G$ is binary, $\lambda$ induces the unique duplication history of $F$ that has the minimum duplication and loss costs [4, 14]. In other words, it finds the most parsimonious evolutionary history. Furthermore, for $g \in V_o(G)$, $g$ is inferred to be a duplication node if $\lambda(g) \in \{\lambda(g') : g' \in Ch(g)\}$. The corresponding gene duplication event occurs in the branch $(p(\lambda(g)), \lambda(g))$ in $S$, and there is a gene loss occurring in each branch off the path from $\lambda(g)$ to $\lambda(g')$ for each $g' \in Ch(g)$ in the inferred duplication history. We define the cost of the lca reconciliation of $G$ and $S$ to be the cost of the corresponding duplication history for each of the duplication, loss, and affine cost models.

If $G$ is non-binary, it is not clear how many duplication events can be inferred and where they should occur in the most parsimonious duplication history of $F$. The problem of reconciling an arbitrary gene tree $G$ and a binary species tree $S$ is formulated as follows:

Instance: The true gene tree $G$ of a family of genes $F$, observed in species with a known species tree $S$. The reconciliation cost is $c$.

Solution: A duplication history of $F$, represented as a binary tree $G'$, with the cost $\min_{G' \in BR(G)} c(G', S)$, where $BR(G)$ is the set of all binary trees that refine $G$.

Note that $V(G) \subseteq V(T)$ for every $T \in BR(G)$. The lca reconciliation of $T$ and $S$ maps every node in $G$ to the same node in $S$ for any $T \in BR(G)$. Therefore, we just need to infer the duplication history from each ancestral gene $g$ to its children in the subtree $S|_{\lambda(Ch(g))}$ (called its child-image subtree) (Figure 1C), for each $g \in V_o(G)$ separately. In the next section, we discuss our algorithm for non-binary nodes in $G$, which is identical to the simple rule mentioned above when applied to binary nodes.

3 Compressed Image Subtrees

By definition, $\lambda(g)$ is the root of $S|_{\lambda(Ch(g))}$. If $S|_{\lambda(Ch(g))}$ contains degree-2 nodes, its size can be much larger than $|Ch(g)|$. To design a fast algorithm for reconciling $G$ and $S$, we need to compress $S|_{\lambda(Ch(g))}$ by contracting all degree-2 nodes except for those in $\lambda(Ch(g))$ for each $g$ (Figure 1D). The compressed version of $S|_{\lambda(Ch(g))}$ is written $I(g)$.

Let $P$ be a path from $p_1$ to $p_2$ in $S|_{\lambda(Ch(g))}$ such that $p_1$ and $p_2$ are of degree 3 or in $\lambda(Ch(g))$ and all the middle nodes are of degree 2 and not in $\lambda(Ch(g))$. Note that any parsimonious duplication history from $g$ to its children can only have gene loss events in the first branch of $P$, gene duplication events in the last branch of $P$, or both. It is also true that if the depths of $p_1$ and $p_2$ in $S$ are known, we can compute the gene losses occurring in the branches leading away from $P$ when working on $I(g)$. $I(g)$ is obtained from $S|_{\lambda(Ch(g))}$ by replacing each such path with a single branch. Importantly, $|I(g)| \le 2|Ch(g)|$ for each $g$ and hence $\sum_{g \in V_o(G)} |I(g)| \le 2|G|$. Additionally, we have the following fact, whose proof is in Appendix A.

Theorem 1 It takes linear time $O(|G| + |S|)$ to construct the compressed child-image subtrees of all the internal nodes of $G$ in $S$.

Finally, we assume that for each $s \in I(g)$, its depth $d(s)$ in $S$ is computed and stored in the data structure along with other information on node $s$. Note that, with $p(s)$ the parent of $s$ in $I(g)$, $d(s) - d(p(s))$ is the number of branches in the path from $p(s)$ to $s$ in $S$, which is used to compute the gene loss cost of the duplication history from $g$ to its child genes in $S$.

Theorem 1 leads immediately to an improved method for tree reconciliation. By implementing a dynamic programming algorithm to resolve different non-binary gene tree nodes in their corresponding $I(g)$, we can compute an optimal reconciliation of $G$ and $S$ in time $O(|S| + d^2 |G|)$ in the affine cost model, where $d$ is the maximum node degree of $G$ (see Appendix B for details).

4 Irreducible Duplication Histories

To develop a linear-time algorithm for reconciling non-binary gene trees, we need to focus on a special type of duplication history of gene families, which we introduce in this section.

4.1 Equivalence of gene duplication histories

Consider a duplication history $H$ from $g$ to $Ch(g)$ in the child-image subtree $S|_{\lambda(Ch(g))}$. If a duplication and a loss occur in the same branch (Figure 2A), we can eliminate one duplication and one loss to obtain a new duplication history with fewer events (Figure 2B), because we do not distinguish the elements in $Ch(g)$. Hence, the duplication history of $Ch(g)$ with the smallest duplication cost does not allow both duplication and loss to occur in the same branch.

We use $n^{in}_H(e)$ and $n^{out}_H(e)$ to denote the numbers of genes flowing into and out of a branch $e$. For $u \in V(S|_{\lambda(Ch(g))})$, we define:

$$\omega(u) \stackrel{\mathrm{def}}{=} |\{g' \in Ch(g) : \lambda(g') = u\}|. \qquad (2)$$
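
The lca map of (1), the duplication test of Section 2.4 and the multiplicities $\omega(u)$ of (2) can be sketched as follows. This is a minimal illustration under stated assumptions: trees are plain dictionaries, all function names are hypothetical, the lca is computed by naive climbing instead of a constant-time query, and the duplication test is the one stated for binary gene tree nodes.

```python
from collections import Counter


def depth_of(parent):
    """Depth of each species-tree node (the root has depth 0)."""
    depth = {}
    for u in parent:
        path = []
        while u is not None and u not in depth:
            path.append(u)
            u = parent[u]
        base = -1 if u is None else depth[u]
        for i, v in enumerate(reversed(path), start=1):
            depth[v] = base + i
    return depth


def lca2(parent, depth, a, b):
    """Most recent common ancestor of two species-tree nodes."""
    while a != b:
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return a


def lca_reconcile(g_children, leaf_species, s_parent, g_root):
    """Return the map lambda of equation (1) as a dict over gene-tree nodes."""
    s_depth = depth_of(s_parent)
    lam = {}

    def visit(g):
        kids = g_children.get(g, [])
        if not kids:                      # leaf: the species hosting the gene
            lam[g] = leaf_species[g]
        else:                             # internal: lca of the children's images
            images = [visit(c) for c in kids]
            m = images[0]
            for s in images[1:]:
                m = lca2(s_parent, s_depth, m, s)
            lam[g] = m
        return lam[g]

    visit(g_root)
    return lam


def is_duplication(g, g_children, lam):
    """Binary-node rule: g is a duplication if some child has the same image."""
    return any(lam[c] == lam[g] for c in g_children.get(g, []))


def omega(g, g_children, lam):
    """Multiplicities of equation (2): image node -> number of children of g mapped to it."""
    return Counter(lam[c] for c in g_children[g])
```

For a non-binary node $g$, `omega` gives the multiset $\lambda(Ch(g))$ on which the remainder of the construction operates.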

The following conditions hold for a duplication history with the minimum duplication cost:

(C1) For each branch $e$, $n^{in}_H(e) \ge 1$ and $n^{out}_H(e) \ge 1$. If $e$ is the root branch, $n^{in}_H(e) = 1$.

(C2) For any leaf $u$, $n^{out}_H(e) = \omega(u)$ for $e = (p(u), u)$.

(C3) For any branches $e = (u, v)$ and $e' = (v, w)$ in $S|_{\lambda(Ch(g))}$, $n^{out}_H(e) = n^{in}_H(e') + \omega(v)$.

(C4) In every branch $e$, $k$ duplications occur iff $n^{out}_H(e) - n^{in}_H(e) = k$; similarly, $l$ losses occur in $e$ iff $n^{in}_H(e) - n^{out}_H(e) = l$.

Let

$$\Sigma_H = \left\{ \left(e, n^{in}_H(e), n^{out}_H(e)\right) : e \in E\left(S|_{\lambda(Ch(g))}\right) \right\}. \qquad (3)$$

Two duplication histories $H$ and $H'$ from $g$ to $Ch(g)$ are said to be equivalent if $\Sigma_H = \Sigma_{H'}$. Clearly, any given value of $\Sigma_H$ may be achieved by a large number of histories with the same duplication and loss costs. In this work, we infer a duplication history by determining the values of the three components defined in (3) for all branches. One benefit of taking this approach is that our method effectively outputs the full set of optimal duplication histories that reconcile the input gene and species trees.

Figure 2: A. A duplication history that does not have the minimum duplication cost, in whose rightmost lineage both a duplication and a loss occur. B. An irreducible duplication history equivalent to the duplication history in panel A. Here the oldest gene lineage is colored red, the right copies in the first two leaves are the descendants of the gene duplicate produced in the left lineage, and the right copy in the rightmost leaf is the descendant of the duplicate produced in the root branch. C. The gene tree that represents the duplication history in panel B, in which circle nodes correspond to species tree nodes and square nodes are duplication nodes.

4.2 Irreducible duplication histories

A duplication process copies an existing gene, giving rise to two versions of the gene. A duplication history from $g$ to $Ch(g)$ is irreducible if the ancestral gene representing $g$ in the root branch does not experience any loss event, so that it has a descendant in every leaf of $S|_{\lambda(Ch(g))}$ (the red lineage in Figure 2B), and if every duplication event copies the corresponding descendant of this oldest gene. Note that a history with no duplication is also irreducible. Such limiting cases are called speciation histories.

In general, several children of $g$ may be mapped to the same leaf in $S|_{\lambda(Ch(g))}$. We consider $\lambda(Ch(g))$ to be a multiset, meaning that each element can have a multiplicity. It is not hard to see that an irreducible duplication history $H$ from $g$ to $Ch(g)$ induces the following decomposition of $\lambda(Ch(g))$ in $S|_{\lambda(Ch(g))}$:

$$\lambda(Ch(g)) = D_0 \uplus D_1 \uplus \cdots \uplus D_k, \qquad (4)$$

where $\uplus$ is the sum operation for multisets [footnote 1], such that (i) $k$ equals the number of duplication events in $H$; (ii) $D_0 = V_{lf}(S|_{\lambda(Ch(g))})$, representing the oldest gene lineage; (iii) $D_i \stackrel{\mathrm{def}}{=} \{x \in \lambda(Ch(g)) : \text{the gene copy made by } E_i \text{ has a descendant in } x\}$ for $1 \le i \le k$, where $E_i$ is the $i$-th duplication event of $H$, occurring in the branch entering $\mathrm{lca}(D_i)$. Conversely, such a decomposition of $\lambda(Ch(g))$ uniquely defines an irreducible duplication history from $g$ to $Ch(g)$ in $S|_{\lambda(Ch(g))}$. The following theorem is proved in Appendix C.

Theorem 2 Every duplication history $H$ from $g$ to $Ch(g)$ is equivalent to an irreducible duplication history $H'$ such that $d_{H'} \le d_H$ and $l_{H'} \le l_H$.

Footnote 1: The multiplicity of an element in a multiset sum is equal to the sum of its multiplicities in the operands.
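
Decomposition (4) is a statement about multisets, so it can be checked mechanically. The sketch below uses Python's `collections.Counter` as a multiset; the function name `is_decomposition` and the example node names are hypothetical, and the check covers only the multiset identity, not conditions such as where each duplication event occurs.

```python
from collections import Counter


def is_decomposition(images, leaves, parts):
    """Check that images = D_0 (+) D_1 (+) ... (+) D_k as multisets.

    images: list of image nodes of the children of g (with repetitions)
    leaves: leaf set of the child-image subtree, playing the role of D_0
    parts:  list of node sets D_1, ..., D_k, one per duplication event
    """
    total = Counter(leaves)              # D_0: one basal copy in every leaf
    for part in parts:
        total += Counter(set(part))      # multiset sum adds multiplicities
    return total == Counter(images)


# Five children of g: two mapped to each of the leaves A and B, one to the root r.
images = ["A", "A", "B", "B", "r"]
leaves = {"A", "B"}
# Two duplications: one copy reaches both leaves, the other stays at r.
parts = [{"A", "B"}, {"r"}]
```

Here `len(parts)` plays the role of $k$, the number of duplication events in the irreducible history.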

5 Linear Time Algorithm

By the above theorem, in order to infer a duplication history with the minimum mutation cost, we need only to find a decomposition

$$\lambda(Ch(g)) \setminus V_{lf}\left(S|_{\lambda(Ch(g))}\right) = D_1 \uplus D_2 \uplus \cdots \uplus D_k$$

that minimizes $k + \sum_i l_i$, where $l_i$ is the loss cost of the speciation history defined by $D_i$. This is because the number of gene losses in the speciation history defined by $V_{lf}(S|_{\lambda(Ch(g))})$ is fixed. We refer to this as a minimum decomposition. Note that $D_1 \uplus D_2 \uplus \cdots \uplus D_k$ corresponds to the set of all child genes that are produced by duplication. For each leaf in $S|_{\lambda(Ch(g))}$, all but one of the genes mapped to the leaf were produced by duplication; these duplicates are called redundant gene copies. The descendant of the oldest gene in each leaf is called the basal gene copy.

We now present a linear-time algorithm for finding a minimum decomposition of the redundant gene copies by working on the compressed child-image subtree $I(g)$. For the sake of clarity, we also assume that for any $(u, v) \in E(I(g))$, the difference of the depths of $v$ and $u$ in the species tree $S$ is one. (We describe how to generalize to general cases later.)

A rooted tree is called a defect tree if there is at least one degree-2 node in the middle of every path from the root to a leaf. It is a good tree if there is a root-to-leaf path in which all but the end nodes are of degree 3. Note that a speciation history is a subtree of $I(g)$.

Theorem 3 Let $D: D_1 \uplus D_2 \uplus \cdots \uplus D_k$ be the minimum decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$. If $D$ gives a duplication history such that redundant gene copies have the minimum gene loss cost, compared to all other duplication histories with the same mutation cost, then for each $i$, the speciation history $I(g)|_{D_i}$ satisfies:
(1) The subtree $T(u)$ below any degree-2 node $u$ cannot be a defect tree.
(2) $I(g)|_{D_i}$ must be a good tree.

Theorem 3 is proved in Appendix D.
It motivates us to design a bottom-up recursive algorithm for finding the minimum decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$, thereby reconstructing the full duplication history from $g$ to its children.

By Theorem 3, any component in a minimal decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$ induces a good tree that has a special structural property. Hence, for a subset $V \subseteq V_{lf}(I(g))$, we use the induced subtree $I(g)|_V$ to represent $V$. As such, we use a set of subtrees to represent a partial decomposition obtained at each internal node. At a leaf $u \in V_{lf}(I(g))$, the partial decomposition consists of $\omega(u)$ singleton trees, which are considered good trees.

Let $u$ be a node with two children $u_1$ and $u_2$ in $I(g)$. Consider a partial decomposition $D_1$ of $[\lambda(Ch(g)) \cap V(T(u_1))] \setminus V_{lf}(I(g))$ into $b(u_1)$ trees and a partial decomposition $D_2$ of $[\lambda(Ch(g)) \cap V(T(u_2))] \setminus V_{lf}(I(g))$ into $b(u_2)$ trees. We attempt to merge these two partial decompositions to obtain a decomposition of $[\lambda(Ch(g)) \cap V(T(u))] \setminus V_{lf}(I(g))$. By Theorem 3, each component of a minimum decomposition induces a good subtree. However, for a good subtree $X$ and an internal node $y$, $X \cap I(g)(y)$ can be a defect tree. Hence, a partial decomposition may contain defect trees. We distinguish between defect trees and good trees. Assume that $a(u_1)$ out of $b(u_1)$ trees are good in $D_1$, and that $a(u_2)$ out of $b(u_2)$ trees are good in $D_2$, such that $a(u_2) \le a(u_1)$. We merge $D_1$ and $D_2$ by considering the following two cases (Figure 3).

1. $a(u_2) \le b(u_2) < a(u_1) \le b(u_1)$ (panel A in Figure 3). Merge $a(u_2)$ pairs of good trees and $b(u_2) - a(u_2)$ pairs of good and defect trees, extend $a(u_1) - b(u_2)$ good trees from $D_1$, and discard $b(u_1) - a(u_1)$ defect trees from $D_1$. Further, add $\omega(u)$ singleton trees, which are good trees.

2. $a(u_2) \le a(u_1) \le \min\{b(u_1), b(u_2)\}$ (panels B and C in Figure 3). Merge $a(u_2)$ pairs of good trees, $a(u_1) - a(u_2)$ pairs of good and defect trees, and $\min\{b(u_1), b(u_2)\} - a(u_1)$ pairs of defect trees, and discard $b(u_2) - b(u_1)$ defect trees from $D_2$ if $b(u_2) > b(u_1)$ or $b(u_1) - b(u_2)$ defect trees from $D_1$ otherwise. Add $\omega(u)$ singleton trees.

Figure 3: Schematic view of merging partial decompositions for the three possible cases (A-C) where $u$ has two children, and also for the case when $u$ has only one child (D). Good trees and defect trees are colored orange and blue respectively in decompositions $D_1$ (left) and $D_2$ (right). The $\omega(u)$ singleton trees added at the current node are not shown in each case.

Proposition 1 Let $m_1 \le m_2 \le m_3 \le m_4$ be the arrangement of $\{a_1, a_2, b_1, b_2\}$ from smallest to largest, where $a_i = a(u_i)$ and $b_i = b(u_i)$. Merging $D_1$ and $D_2$ produces $\omega(u) + m_2$ good trees and $m_3 - m_2$ defect trees to merge, and detects $m_4 - m_3$ defect trees to discard.

At an internal node $u$ with only one child $u_1$ (panel D in Figure 3), we create $\omega(u)$ singleton trees, extend all good trees, and discard all the defect trees in the decomposition $D_1$ obtained at $u_1$.

Using the above bottom-up merging procedure, we obtain a set of good and defect trees at the root of $I(g)$. This set of trees defines a minimal decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$. More specifically, each good tree corresponds to a component of the minimal decomposition. But each defect tree corresponds to $k \ge 2$ components, where $k$ equals the cardinality of the maximum set of incomparable degree-2 internal nodes in the tree. Similarly, each defect tree discarded at an internal node also corresponds to a set of components of the minimal decomposition.

For $m$ real numbers $i_1, i_2, \ldots, i_m$, we use $\mathrm{median}\{i_1, i_2, \ldots, i_m\}$ to denote their median if $m$ is odd. For each $u \in V(I(g))$, we use $b(u)$ to denote the number of trees in the decomposition obtained at $u$, in which $a(u)$ out of $b(u)$ trees are good trees. For $u$ and $k \ge 0$, we define

$$\mathrm{dist}(k, [a(u), b(u)]) = \min_{x \in [a(u), b(u)]} |x - k|,$$

$$k' = \mathrm{median}\{k, a(u), b(u)\} - \omega(u), \qquad (5)$$

$$f(u, k') = \begin{cases} 0 & \text{if } u \text{ is a leaf}, \\ C(u_1, k') + k' & \text{if } Ch(u) = \{u_1\}, \\ \sum_{u' \in Ch(u)} C(u', k') & \text{if } Ch(u) = \{u_1, u_2\}, \end{cases}$$

and

$$C(u, k) = \mathrm{dist}(k, [a(u), b(u)]) + f(u, k'). \qquad (6)$$

Theorem 4 Let $r$ be the root of $I(g)$.
The decomposition $D_r$ obtained by the above merging procedure determines a duplication history of redundant gene copies with the minimum mutation cost $C(r, 0)$.

Theorem 4 is proved in Appendix E. It suggests a two-step algorithm for reconstructing the evolution from $g$ to its children in linear time (Figure 4). First, we compute the numbers of good and defect trees obtained at the internal nodes in $I(g)$ by visiting all the nodes in order from leaf to root, which guarantees that we visit all the children of a node before the node itself. We then identify duplications and losses by computing the numbers of genes flowing into and out of the branches in $I(g)$, top down from root to leaf. To take into account the basal gene copies, we add one to the numbers of ancestral gene copies flowing into and out of each branch. Figure 5 gives an example to illustrate this algorithm.

Recall that we assume $d(u) = d(p(u)) + 1$ for each $u \in V(I(g))$ in the algorithm described above. It can be modified for general cases by (i) finding all maximal subtrees of $I(g)$ that do not contain any branch $(u, v)$ such that $d(v) > d(u) + 2$ in $S$ and then (ii) for each subtree $T$ found in (i), replacing every branch $(u, v)$ such that $d(v) = d(u) + 2$ by the two-branch path between $u$ and $v$ in $S$ and then applying the algorithm to the resulting subtree $T'$. The complete version of this algorithm can be found in Appendix F.
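
The two-step procedure can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: it assumes the unit-depth condition $d(u) = d(p(u)) + 1$, represents $I(g)$ by plain dictionaries, and uses the hypothetical function name `reconcile_children`. The root is handled by applying the median rule of (5) with $k = 0$, which is an assumption of this sketch, and the returned cost unrolls the recursion (6) starting from $C(r, 0)$.

```python
def reconcile_children(children, omega, root):
    """Two-pass sketch on a compressed child-image subtree I(g).

    children: node -> list of its 0, 1 or 2 children in I(g)
    omega:    node -> number of children of g mapped to that node
    Returns (a, b, flows, cost), where flows[u] = (in, out) for the
    branch entering u, basal copy included.
    """
    # Pre-order listing: every node appears after its parent.
    order, parent, stack = [], {}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        for c in children.get(u, []):
            parent[c] = u
            stack.append(c)

    # Step 1 (bottom-up): a(u) good trees out of b(u) trees at each node.
    a, b = {}, {}
    for u in reversed(order):
        kids, w = children.get(u, []), omega.get(u, 0)
        if not kids:                       # leaf: redundant copies only
            a[u] = b[u] = w - 1
        elif len(kids) == 2:               # merge the two partial decompositions
            max_a = max(a[c] for c in kids)
            min_b = min(b[c] for c in kids)
            a[u] = w + min(max_a, min_b)
            b[u] = w + max(max_a, min_b)
        else:                              # single child
            a[u] = w
            b[u] = a[kids[0]] + w

    # Step 2 (top-down): redundant copies entering (alpha) and leaving (beta).
    alpha, beta, cost = {}, {}, 0
    for u in order:
        if u == root:
            alpha[u] = 0
        else:
            alpha[u] = beta[parent[u]] - omega.get(parent[u], 0)
        beta[u] = sorted((alpha[u], a[u], b[u]))[1]   # median of three values
        cost += abs(beta[u] - alpha[u])               # duplications or losses in the branch
        if len(children.get(u, [])) == 1:
            cost += beta[u] - omega.get(u, 0)         # the k' term of f in (6)

    flows = {u: (1 + alpha[u], 1 + beta[u]) for u in order}
    return a, b, flows, cost
```

As a usage example, take a root `r` with two leaf children `u1` and `u2`, with `omega = {"r": 1, "u1": 2, "u2": 2}`. The call `reconcile_children({"r": ["u1", "u2"]}, omega, "r")` reports two duplications in the root branch and none below it, for a mutation cost of 2 on the redundant copies.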

Input: An annotated compressed child-image subtree I(g).
Output: The numbers of genes flowing into and out of branches in I(g).

1. Traverse I(g) in post-order and compute a(u) and b(u) at node u:
    if (u is a leaf) {
        a(u) = ω(u) - 1; b(u) = ω(u) - 1;
    } else if (Ch(u) = {u_1, u_2}) {
        max_a = max(a(u_1), a(u_2));
        min_b = min(b(u_1), b(u_2));
        a(u) = ω(u) + min(max_a, min_b);
        b(u) = ω(u) + max(max_a, min_b);
    } else if (Ch(u) = {u_1}) {
        a(u) = ω(u); b(u) = a(u_1) + ω(u);
    }

2. Traverse I(g) in pre-order and compute in(u) and out(u) at node u:
    /* in(u) and out(u) denote the numbers of genes flowing into */
    /* and out of the branch (p(u), u) */
    if (u is the root) {
        α(u) = 0; β(u) = ω(u) + a(u);
    } else {
        α(u) = β(p(u)) - ω(p(u));
        β(u) = median{α(u), a(u), b(u)};
    }
    /* factor in the basal copy in each branch */
    in(u) = 1 + α(u); out(u) = 1 + β(u);

Figure 4: A linear-time algorithm for reconstructing the evolution from g to its children. Here we assume that d(u) = d(p(u)) + 1 for each u in I(g). The general version of this algorithm is in Appendix F.

Figure 5: Illustration of the reconciliation algorithm. (A) A compressed child-image tree I(g) with the redundant child genes of g (bullets) drawn beside their image nodes. (B) The values of a and b at internal nodes. (C) The trees to be merged at each node in I(g). Two good trees are obtained when the merging process terminates at the root. A subtree obtained at each node is good if its root is connected to at most one blue branch. (D) The duplication history of the redundant child gene copies in the compressed child-image tree, with the minimum mutation cost of 6. Note that the basal gene copies are not shown.

6 Experimental Tests

We compared a naive dynamic programming method (DP) (found in [9]) and a modified dynamic programming method (DP+C) (which applies the dynamic programming technique to the compressed child-image subtrees) to the proposed linear-time method (LT) using simulated data. For fair comparison, we also implemented DP. Our implementation of DP is slightly faster than the dynamic programming approach found in NOTUNG [9], but to be fair the latter has several other features, such as listing all the inferred optimal solutions. All three programs were run to reconcile non-binary gene trees with the mutation cost, using the same machine (3.4 GHz CPU and 8 GB RAM).

We measured their run times for 100 reconciliations between a non-binary tree containing 1.2n genes and its corresponding species tree over n species. For each size n, both the species tree and a binary gene tree with 1.2n leaves were generated using the Yule model. The leaves in the gene tree were labeled with random species selected from a uniform distribution. Finally, a non-binary gene tree was obtained from the binary gene tree by contracting each edge with a fixed rate p. We examined 40 cases by allowing n to take 10 different values in the range from 1,000 to 10,000 and setting the edge contraction rate p to either 0.4, 0.6, 0.7 or 0.8 (Figure 6). We also ran LT on 20 different tree sizes in the range from 5,000 to 100,000, which are too large for the other two methods. The results are summarized in Figure S2 in Appendix G, which confirms that the run times of LT are linearly proportional to the size of the gene trees. LT is slightly faster than DP+C, and 5 to 20 times faster than DP for gene trees with thousands of genes.

7 Discussion and future work

Here we present a linear-time algorithm to reconcile the non-binary gene tree of a gene family and the corresponding species tree to reconstruct the duplication history of the gene family with the minimum number of duplications and losses. Its reconciliation times are an order of magnitude faster than those of other methods, which is achieved by using compressed child-image trees and working on irreducible duplication histories.

Our approach has several important benefits. First, we do not consider incomplete lineage sorting (ILS) events, which may not be rare and hence cannot be ignored in certain circumstances [18, 20]. Since the effect of an ILS event on the divergence of gene and species trees is similar to that of a duplication event, the concepts proposed here can easily be extended to take ILS into account. Second, the output of our program is actually a class of optimal duplication histories, not an individual history. This is because the program assigns multiple duplications to each branch in the species tree, and these duplications can be arranged in different ways.
Third, our linear-time algorithm is fast and hence is ideal for providing an online service for tree reconciliation (see our TxT server). Finally, our bottom-up approach can incorporate multiple sources of information on gene similarity, including sequence similarity and conserved gene order, when it is applied to genome-wide studies of the evolution of gene families. This is definitely an interesting future project.

Figure 6: Comparison of three algorithms: dynamic programming (DP), dynamic programming with compressed child-image subtrees (DP+C), and the proposed linear time (LT) algorithm. Four panels are drawn for the four different edge contraction rates 0.4 (top left), 0.6 (top right), 0.7 (bottom left), and 0.8 (bottom right). Each panel plots run time against the no. of species. The run time is given in microseconds and the no. of species is in thousands.

References

[1] Arvestad, L., Lagergren, J. et al.: The gene evolution model and computing its associated probabilities. J. ACM 56, 1-44 (2009)
[2] Bansal, M.S., Alm, E.J., Kellis, M.: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinform. 28:i283-i291 (2012)
[3] Chang, W.C., Eulenstein, O.: Reconciling gene trees with apparent polytomies. In Proc. COCOON 06.
[4] Chauve, C., El-Mabrouk, N.: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. In Proc. of RECOMB 09 (2009)

[5] Chen, K., Durand, D., Farach-Colton, M.: NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7 (2000)
[6] Chen, Z.Z., Deng, F., Wang, L.: Simultaneous identification of duplications, losses, and lateral gene transfers. IEEE/ACM TCBB 9 (2012)
[7] Doyon, J.P. et al.: Models, algorithms and programs for phylogeny reconciliation. Briefings Bioinform. 12 (2011)
[8] Dufayard, J.-F. et al.: Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics 21 (2005)
[9] Durand, D., Halldorsson, B., Vernot, B.: A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 13 (2006)
[10] Eulenstein, O. et al.: Reconciling phylogenetic trees. In Evolution After Duplication (eds: K. Dittmar, D. Liberles). Wiley-Blackwell, New Jersey, USA (2010)
[11] Fitch, W.M.: Distinguishing homologous from analogous proteins. Syst. Zool. 19 (1970)
[12] Goodman, M. et al.: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28 (1979)
[13] Goodstadt, L., Ponting, C.: Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput. Biol. 2: e133 (2006)
[14] Górecki, P., Tiuryn, J.: DLS-trees: a model of evolutionary scenarios. Theoret. Comput. Sci. 359 (2006)
[15] Kellis, M. et al.: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423 (2003)
[16] Kristensen, D.M., Wolf, Y.I., Mushegian, A.R., Koonin, E.V.: Computational methods for gene orthology inference. Briefings Bioinform. 12 (2011)
[17] Lafond, M., Swenson, K.M., El-Mabrouk, N.: An optimal reconciliation algorithm for gene trees with polytomies. In Alg. in Bioinform. Springer (2012)
[18] Pollard, et al.: Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet. 2(10), e173 (2006)
[19] Schieber, B., Vishkin, U.: On finding lowest common ancestors: simplification and parallelization. SIAM J. Comput. 17 (1988)
[20] Stolzer, M. et al.: Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics 28(18), i409-i415 (2012)
[21] Storm, C., Sonnhammer, E.: Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinform. 18:92-99 (2002)
[22] Tatusov, R.L. et al.: A genomic perspective on protein families. Science 278 (1997)
[23] Wapinski, I. et al.: Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54-61 (2007)
[24] Warnow, T.: Large-scale multiple sequence alignment and phylogeny estimation. In Models and Algorithms for Genome Evolution. Springer, London (2013)
[25] Zhang, L.X.: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J. Comput. Biol. 4 (1997)
[26] Zheng, Y., Wu, T., Zhang, L.X.: A linear-time algorithm for reconciliation of non-binary gene tree and binary species tree. In Proc. COCOA 13 (2013)

Appendixes

Appendix A: Proof of Theorem 1

By preprocessing a species tree, one can compute the lowest common ancestor of any two nodes in constant time (Schieber and Vishkin, SIAM J. Comput. 17 (1988)), which leads to computing the images of all gene tree nodes under $\lambda$ in linear time $O(|G| + |S|)$ (Zhang, J. Comput. Biol. 4 (1997)). In the rest of this section, we assume that $\mathrm{lca}(s, s')$ can be computed in constant time for any $s, s' \in V(S)$. We also assume that, for each $s \in V(S)$, a linked list $\mathrm{pre}(s)$ containing all the gene tree nodes mapped to $s$ is available, that is, $\mathrm{pre}(s) = \{g \in V(G) \mid \lambda(g) = s\}$.

Proposition S1 Let $g \in V(G)$. Assume $g_1, g_2, \ldots, g_k$ is the arrangement of $Ch(g)$ such that $\lambda(g_1), \lambda(g_2), \ldots, \lambda(g_k)$ are visited from the earliest to the latest in the post-order traversal of $S$. Then

$$V(I(g)) = \lambda(Ch(g)) \cup \{\mathrm{lca}(\lambda(g_i), \lambda(g_{i+1})) \mid 1 \le i \le k-1\}.$$

Proof Let $u \in V(I(g)) / \lambda(Ch(g))$. By definition, $u$ has two children $u_1$ and $u_2$ in $I(g)$ and there is at least one image node below each of $u_1$ and $u_2$. Without loss of generality, we may assume that the nodes in $T(u_1)$ are visited before those in $T(u_2)$ in the post-order traversal of $S$. Thus, $\lambda(g_j) \in V(T(u_1))$ and $\lambda(g_{j+1}) \in V(T(u_2))$ for some $j$. This implies that $u = \mathrm{lca}(\lambda(g_j), \lambda(g_{j+1}))$. Conversely, for any $j$, let $v = \mathrm{lca}(\lambda(g_j), \lambda(g_{j+1}))$. If $v = \lambda(g_{j+1})$, then $v$ is itself an image node. If $v \ne \lambda(g_{j+1})$, then $\lambda(g_j)$ and $\lambda(g_{j+1})$ are below different children of $v$. Hence, by definition, $v \in V(I(g))$.

Proposition S1 suggests that if the elements in $Ch(g)$ are arranged properly, we can compute the node set of $I(g)$ by simply applying the lca operation $|Ch(g)| - 1$ times, taking $O(|Ch(g)|)$ steps because each lca operation takes constant time in the preprocessed species tree. We now present a linear-time algorithm to compute $I(g)$ for all $g$ in $G$ simultaneously.

First, we use the following linear-time algorithm to rearrange the children in $Ch(g)$ for each $g$ so that the assumption in Proposition S1 holds. Here, we assume that a gene tree node $g$ of degree $k$ is represented by an array of $k$ pointers, denoted by $\mathrm{ptr}(g)$, so that swapping the positions of two children can be done just by exchanging the corresponding pointer values.

    Traverse $G$ in a depth-first order
        set $i_h = 0$ at each $h \in V_o(G)$;
    /* Arrange the children of each $h \in V(G)$ according to
       the positions of their images in the post-order traversal of $S$ */
    Traverse $S$ in post-order
        at each $s \in V(S)$, do {
            for each $h \in \mathrm{pre}(s)$ {
                swap $h$ and the $i_{p(h)}$-th child of $p(h)$;
                $i_{p(h)} \leftarrow i_{p(h)} + 1$;
            }
        }

Second, traversing $G$ in post-order, we compute for each $s \in V(S)$ an array $B(s)$ that contains all the gene tree nodes $g$ such that $s \in I(g)$.

    /* Compute $B(s) = \{h \in V(G) \mid s \in I(h)\}$ for all $s \in V(S)$ */
    Traverse $G$ in post-order
        at each $h \in V(G)$, do {
            for each child $h_i \in Ch(h)$ { add $h$ into $B(\lambda(h_i))$ }
            for $i = 1$ to $|Ch(h)| - 1$ { add $h$ into $B(\mathrm{lca}(\lambda(h_i), \lambda(h_{i+1})))$; }
        }

Third, we store all the species tree nodes that comprise the compressed child-image subtree $I(g)$ of $g \in V(G)$ in a stack named ImageTreeNodes($g$) and then construct all the $I(g)$'s.

    /* ImageTreeNodes(g) contains... */
    Traverse $S$ by following its Euler tour
        at each $s \in V(S)$, do {
            for each $h \in B(s)$ { push $s$ into ImageTreeNodes($h$) }
        }
    Traverse $G$ in post-order
        Construct $I(h)$ at each $h \in V_o(G)$ by:
            make a copy $a$ of the element $e_1$ popped from ImageTreeNodes($h$);
            do {
                pop an element $e_2$ from ImageTreeNodes($h$);
                if ($e_2$ does not have a copy) make a copy $b$ of the element $e_2$;
                else assign the copy to $b$;
                if depth($e_1$) < depth($e_2$) { add an edge $(a, b)$ if $a$ and $b$ are not connected; }
                else { add an edge $(b, a)$ if $a$ and $b$ are not connected; }
                $a \leftarrow b$; $e_1 \leftarrow e_2$;
            } until ImageTreeNodes($h$) is empty;

We now analyze the time complexity of the algorithm. At Step 1, the sub-procedure of setting the counters of all gene tree nodes simply takes $|G|$ operations. The sub-procedure of rearranging the children of all gene tree nodes takes $|S| + 2|G|$ operations, as it visits each species tree node once and needs to do a swap and a counter increment for each branch in the input gene tree. At Step 2, the procedure for constructing $B(s)$ for all species tree nodes requires two insertion operations for each branch in the input gene tree. Hence, it takes at most $2|G|$ operations. At Step 3, since the total size of all the compressed child-image subtrees is at most $2|G|$, each of the traversal procedures takes at most $6|G|$ operations. Hence, our algorithm takes linear time to compute all the compressed child-image subtrees of the gene tree nodes.

The algorithm outputs the compressed child-image subtree of each $g \in V_o(G)$ in $S$. If $g$ is a binary node, $g$ is a duplication node if and only if $I(g)$ is a singleton or a two-node tree. If $g$ is non-binary, we will infer the optimal binary refinement of $g$ by working on $I(g)$. Finally, we assume that for each $s \in I(g)$, its depth $d(s)$ in the species tree $S$ is also computed and saved at $s$. Note that $d(s) - d(p(s))$, where $p(s)$ is the parent of $s$ in $I(g)$, is the number of edges in the path from $p(s)$ to $s$ in $S$, which is needed to compute the gene loss cost of a binary refinement of $g$ in $S$.
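As a rough illustration of Step 1 and Proposition S1, the following Python sketch orders the children of a gene tree node by the post-order position of their images and then collects the node set of the compressed child-image subtree. It is only a sketch under stated assumptions: the tree encoding (dictionaries of child lists), the function names, and the toy data are all hypothetical, children are appended to a list instead of being swapped in a pointer array, and `naive_lca` walks up the tree instead of using the constant-time query assumed in the text.

```python
def postorder_and_depth(children, root):
    """Post-order list of nodes and depth of each node (iterative DFS)."""
    post, depth = [], {root: 0}
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            post.append(node)
            continue
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            depth[child] = depth[node] + 1
            stack.append((child, False))
    return post, depth


def naive_lca(parent, depth, a, b):
    """Walk upwards until the nodes meet (placeholder for an O(1) query)."""
    while a != b:
        if depth[a] < depth[b]:
            a, b = b, a
        a = parent[a]
    return a


def order_children_by_image(post, pre, gene_parent):
    """Step 1: list each gene node's children in post-order of their images."""
    ordered = {}
    for s in post:                    # visit species nodes in post-order
        for h in pre.get(s, []):      # gene nodes whose image is s
            if h in gene_parent:      # the gene-tree root has no parent
                ordered.setdefault(gene_parent[h], []).append(h)
    return ordered


def image_tree_nodes(ordered_children, lam, parent, depth):
    """Proposition S1: the images plus the LCAs of consecutive images."""
    images = [lam[h] for h in ordered_children]
    nodes = set(images)
    for x, y in zip(images, images[1:]):
        nodes.add(naive_lca(parent, depth, x, y))
    return nodes


# Toy species tree: r -> (x -> (A, B), y -> (C, D)).
species_children = {"r": ["x", "y"], "x": ["A", "B"], "y": ["C", "D"]}
species_parent = {c: p for p, cs in species_children.items() for c in cs}
post, depth = postorder_and_depth(species_children, "r")

# Toy gene node g whose three children are mapped to D, A and B.
lam = {"g1": "D", "g2": "A", "g3": "B"}
gene_parent = {"g1": "g", "g2": "g", "g3": "g"}
pre = {}
for h, s in lam.items():
    pre.setdefault(s, []).append(h)

ordered = order_children_by_image(post, pre, gene_parent)
nodes = image_tree_nodes(ordered["g"], lam, species_parent, depth)
```

In this toy case the node y is absent from the result because only one of its children lies above an image, which is the compression that the child-image subtree is meant to capture.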

Appendix B: An improved dynamic programming method

Theorem 1 leads immediately to an improved method for tree reconciliation. Let $A(u, k)$ be the cost of the optimal duplication history $H$ of the child genes in $I(g)(u)$ with $k$ ancestral genes flowing into $(p(u), u)$ in $I(g)$ in the $(w_d, w_l)$-affine cost model, where $k \le |Ch(g)|$. Obviously, the restriction of $H$ in $T(u')$ must be optimal for each child $u' \in Ch(u)$.

If $u$ is a leaf in $I(g)$, we set

$$w'' = \begin{cases} w_d & \text{if } \omega(u) > k, \\ w_l & \text{if } \omega(u) \le k. \end{cases}$$

There are $d(u) - d(p(u)) - 1$ nodes between $p(u)$ and $u$ in $S$. Assume that there are $t$ genes flowing into the branch entering $u$ in $S$. Then

$$A(u, k) = \min_{0 \le t \le \min\{k, \omega(u)\}} \left[(k - t) w_l + (\omega(u) - t) w_d + t\, w_l c_u\right] = w' \min\{k, \omega(u)\} + w'' |k - \omega(u)|, \qquad (7)$$

where $c_u = d(u) - d(p(u)) - 1$, $w' = \min\{w_d + w_l, c_u w_l\}$, and $\omega(u)$ is defined in (2).

If $u$ is a node with only one child $u_1$ in $I(g)$, we assume that $t$ genes flow into the branch entering $u$ and $k'$ genes flow out of $u$. The gene duplication and loss events in the path from $p(u)$ to $u$ then cost

$$\min_{t} \left[(k - t) w_l + (\omega(u) - t) w_d + t\, w_l c_u + k' w_l\right] = w' \min\{k, \omega(u)\} + w'' |k - \omega(u)| + k' w_l.$$

Hence, for this case,

$$A(u, k) = \min_{1 \le k' \le |Ch(g)|} \left[w' \min\{k, \omega(u)\} + w'' |k - \omega(u)| + k' w_l + A(u_1, k')\right]. \qquad (8)$$

Similarly, if $u$ is a node with two children $u_1$ and $u_2$ in $I(g)$, we have:

$$A(u, k) = \min_{1 \le k' \le |Ch(g)|} \left[w' \min\{k, \omega(u)\} + w'' |k - \omega(u)| + A(u_1, k') + A(u_2, k')\right]. \qquad (9)$$

By implementing a dynamic programming algorithm based on Eqn. (7)-(9) for reconciliation in $I(g)$, we can compute an optimal reconciliation of $G$ and $S$ in time $O(|S| + \sum_{g \in V_o(G)} d^2 |I(g)|) = O(|S| + d^2 |G|)$ under the affine cost model, where $d$ is the largest degree of a node in $G$. The time required for finding the compressed child-image subtrees is factored into this estimate.
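The closed form in Eqn. (7) can be sanity-checked numerically. The sketch below assumes one particular reading of the equation: the minimum is taken over the number $t$ of genes reaching $u$, with $0 \le t \le \min\{k, \omega(u)\}$; the weight multiplying $\min\{k, \omega(u)\}$ is $\min\{w_d + w_l, c_u w_l\}$; and the weight multiplying $|k - \omega(u)|$ is $w_d$ when $\omega(u) > k$ and $w_l$ otherwise. The function names and the integer ranges used for checking are illustrative only.

```python
def leaf_cost_bruteforce(k, omega, c_u, w_d, w_l):
    """Left-hand side of Eqn. (7): minimise over t, the genes reaching u."""
    return min(
        (k - t) * w_l + (omega - t) * w_d + t * w_l * c_u
        for t in range(min(k, omega) + 1)
    )


def leaf_cost_closed_form(k, omega, c_u, w_d, w_l):
    """Right-hand side of Eqn. (7) under the reading described above."""
    w_path = min(w_d + w_l, c_u * w_l)      # weight on min{k, omega(u)}
    w_gap = w_d if omega > k else w_l       # weight on |k - omega(u)|
    return w_path * min(k, omega) + w_gap * abs(k - omega)


def closed_form_matches(max_genes=5, max_path=4, max_weight=3):
    """Compare both sides on a small grid of integer parameters."""
    for k in range(max_genes + 1):
        for omega in range(max_genes + 1):
            for c_u in range(max_path + 1):
                for w_d in range(1, max_weight + 1):
                    for w_l in range(1, max_weight + 1):
                        lhs = leaf_cost_bruteforce(k, omega, c_u, w_d, w_l)
                        rhs = leaf_cost_closed_form(k, omega, c_u, w_d, w_l)
                        if lhs != rhs:
                            return False
    return True
```

The agreement follows from the objective being linear in $t$, so the minimum is attained at $t = 0$ or $t = \min\{k, \omega(u)\}$; the two endpoints give the two candidates inside $\min\{w_d + w_l, c_u w_l\}$.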

Appendix C: Proof of Theorem 2

Theorem 2 Every duplication history $H$ from $g$ to $Ch(g)$ is equivalent to an irreducible duplication history $H'$ such that $d_{H'} \le d_H$ and $l_{H'} \le l_H$.

Proof We prove the statement by induction on the number $k$ of duplications occurring in $H$. If $k = 0$, so that $H$ is a speciation history, then $\lambda(Ch(g)) = V_{lf}(S|_{\lambda(Ch(g))})$, derived from the definition of $S|_{\lambda(Ch(g))}$. Therefore, $H$ itself is irreducible.

Assume the statement is true for any duplication history with $k - 1$ duplications. Consider the most recent duplication event $E$ of $H$. Assume it occurs in a branch $(p(u), u)$, $u \in V(S|_{\lambda(Ch(g))})$, suggesting that $H$ has no duplication occurring in the subtree $T(u)$. Each ancestral gene derived from $E$ has at most one descendant gene copy in each leaf in $T(u)$. Fix such an ancestral gene $o$ and let $DS(o)$ be the set of leaves that contain a descendant of $o$. Note that $DS(o) \subseteq V_{lf}(S|_{\lambda(Ch(g))})$. Removing $E$ from the duplication history $H$ results in a duplication history $H''$. $H''$ has $k - 1$ duplications and covers all the gene copies that are not descendants of $o$. By induction,

$$\lambda(Ch(g)) / DS(o) = D_0 \uplus D_1 \uplus \cdots \uplus D_{k'}, \quad k' \le k - 1.$$

If $D_0 = V_{lf}(S|_{\lambda(Ch(g))})$, then $\lambda(Ch(g)) = D_0 \uplus D_1 \uplus \cdots \uplus D_{k'} \uplus DS(o)$ is a desired decomposition.

If $D_0 \ne V_{lf}(S|_{\lambda(Ch(g))})$, then $S|_{DS(o)/D_0}$ is a forest subgraph of $T(u)$. Assume it has $m > 0$ tree components, say $T_1, T_2, \ldots, T_m$. Define

$$D'_0 \stackrel{\mathrm{def}}{=} D_0 \uplus V_{lf}(T_1) \uplus V_{lf}(T_2) \uplus \cdots \uplus V_{lf}(T_m) = V_{lf}(S|_{\lambda(Ch(g))}),$$
$$D_{k'+1} = DS(o) / [V_{lf}(T_1) \cup V_{lf}(T_2) \cup \cdots \cup V_{lf}(T_m)].$$

We thus have

$$\lambda(Ch(g)) = D'_0 \uplus D_1 \uplus \cdots \uplus D_{k'} \uplus D_{k'+1}.$$

This decomposition defines a unique, irreducible history that is equivalent to $H$. By moving the leaves of all the $T_i$'s from the last term to the first term, the gene loss cost of the speciation history defined by the first term decreases by $m$, whereas the gene loss cost of the speciation history defined by the last term increases by at most $m$. Hence, we have obtained a desired decomposition.

Appendix D: Proof of Theorem 3

Theorem 3 Let $D: D_1 \uplus D_2 \uplus \cdots \uplus D_k$ be the minimum decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$. If $D$ gives a duplication history such that redundant gene copies have the minimum gene loss cost, compared to all other duplication histories with the same mutation cost, then the speciation history $I(g)|_{D_i}$ satisfies the following properties for each $i$:
(1) The subtree $T(u)$ below any degree-2 node $u$ cannot be a defect tree.
(2) $I(g)|_{D_i}$ must be a good tree.

Proof (1) Without loss of generality, assume that $u$ is a node of degree 2 in $I(g)|_{D_1}$. If $T(u)$ is a defect tree (panel B in Figure S1), we consider a maximal set of incomparable degree-2 nodes $\{u_1, \ldots, u_j\}$ in $T(u)$. We have:

$$V_{lf}(T(u)) = V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)).$$

By replacing $D_1$ with $\{V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j)), D_1 / V_{lf}(T(u))\}$, we obtain the following decomposition:

$$D': V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)) \uplus D_1 / V_{lf}(T(u)) \uplus D_2 \uplus \cdots \uplus D_k.$$

It is easy to see that the duplication cost of $D'$ is equal to $k + j$. Further, by partitioning $D_1$ into $V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j))$, and $D_1 / V_{lf}(T(u))$, the gene loss events occurring at the degree-2 nodes $u_1, u_2, \ldots, u_j$, and $u$ are eliminated, and a new gene loss is introduced at $p(u)$ in the corresponding speciation history of $D_1 / V_{lf}(T(u))$, as illustrated in Figure S1C. Hence, the mutation cost of $D'$ is equal to the mutation cost of $D$, but its gene loss cost is less than that of $D$. This contradicts the fact that $D$ is a minimum decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$ with the minimum gene loss cost.

(2) Without loss of generality, we may assume that $I(g)|_{D_1}$ is a defect tree. Consider a maximal set $\{u_1, \ldots, u_j\}$ of incomparable nodes of degree 2 in the tree. We have

$$D_1 = V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)).$$

By replacing $D_1$ with $\{V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j))\}$, we obtain the following decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$:

$$D'': V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)) \uplus D_2 \uplus \cdots \uplus D_k.$$

The duplication cost of $D''$ is $j - 1$ plus that of $D$. However, the gene loss cost of $D''$ is $j$ less than that of $D$. Hence, the mutation cost of $D''$ is less than that of $D$. This contradicts the assumption that $D$ is a minimum decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$ having the minimum gene loss cost.

Figure S1: A. A defect tree in which degree-2 nodes are colored blue. B. A defect subtree (below $u$) inside a speciation history, in which $\{u_1, u_2, u_3\}$ is a maximal set of incomparable nodes of degree 2. C. The speciation history in (B) is decomposed into a duplication history with the same mutation cost but a smaller gene loss cost.

Appendix E: Proof of Theorem 4

Recall that a good tree has at least one root-to-leaf path not containing any degree-2 nodes. Defect trees are those that are not good. For convenience, we let $a_u = a(u)$, $b_u = b(u)$, $a_i = a(u_i)$ and $b_i = b(u_i)$ for $i = 1, 2$. We also set $\mathrm{dist}(k, [a, b]) = d(k, [a, b])$.

Lemma S1 For a child $v$ of $u$, $a_v$ and $b_v$ are defined to be the numbers of good trees and defect trees respectively in the decomposition $D_v$ associated with $v$. The following facts are true for $k'$ defined in Eqn. (6).
(1) If $Ch(u) = \{u_1\}$, $k' = \mathrm{median}\{k - \omega(u), 0, a_1\}$.
(2) If $Ch(u) = \{u_1, u_2\}$, $k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\}$.

Proof (1) The first statement is derived from the facts that $a_u = 0 + \omega(u)$ and $b_u = a_1 + \omega(u)$ if $Ch(u) = \{u_1\}$.

(2) Let $m_1 \le m_2 \le m_3 \le m_4$ be the arrangement of $a_1, b_1, a_2, b_2$ from smallest to largest as in Proposition 2. By the merging procedure, we have $a_u = m_2 + \omega(u)$ and $b_u = m_3 + \omega(u)$. Hence,

$$\{k - \omega(u), a_u - \omega(u), b_u - \omega(u)\} = \{k - \omega(u), m_2, m_3\}$$

and

$$\{k - \omega(u), a_1, b_1, a_2, b_2\} = \{k - \omega(u), m_1, m_2, m_3, m_4\}.$$

Since $\mathrm{median}\{k - \omega(u), m_2, m_3\} = \min(m_3, \max(m_2, k - \omega(u)))$, we have

$$m_1 \le m_2 \le \mathrm{median}\{k - \omega(u), m_2, m_3\} \le m_3 \le m_4,$$

and therefore

$$\mathrm{median}\{k - \omega(u), m_1, m_2, m_3, m_4\} = \mathrm{median}\{k - \omega(u), m_2, m_3\}.$$

This concludes the proof.

Lemma S2 For any integer $k \ge 0$ and $u \in V(I(g))$,

$$C(u, k) = d(k, [a_u, b_u]) + C(u, a_u). \qquad (10)$$

Proof We prove the statement by induction. For a leaf $u$, $a_u = b_u = \omega(u)$. By definition, $C(u, k) = d(k, [a_u, b_u]) = |k - a_u|$ and $C(u, a_u) = d(a_u, [a_u, b_u]) = 0$. Hence, Eqn. (10) holds.

We now assume that Eqn. (10) holds for the children of $u$. If $u$ has only one child $u_1$, then $a_u = \omega(u)$ and $b_u = \omega(u) + a_1$, implying by part 1 of Lemma S1:

$$k' = \mathrm{median}\{k - \omega(u), 0, a_1\} \le a_1. \qquad (11)$$

By induction, Eqn. (10) holds for $u_1$. In particular,

$$C(u_1, 0) = a_1 + C(u_1, a_1),$$

and

$$C(u_1, k') = d(k', [a_1, b_1]) + C(u_1, a_1) + k'.$$

Applying Inequality (11), we obtain:

$$C(u, k) = d(k, [a_u, b_u]) + C(u_1, k') = d(k, [a_u, b_u]) + d(k', [a_1, b_1]) + k' + C(u_1, a_1) = d(k, [a_u, b_u]) + a_1 + C(u_1, a_1)$$

and

$$d(k, [a_u, b_u]) + C(u, a_u) = d(k, [a_u, b_u]) + C(u_1, 0) = d(k, [a_u, b_u]) + a_1 + C(u_1, a_1).$$

If $u$ has two children $u_1, u_2$,

$$C(u, k) = d(k, [a_u, b_u]) + C(u_1, k') + C(u_2, k') = d(k, [a_u, b_u]) + \sum_{i=1,2} d(k', [a_i, b_i]) + \sum_{i=1,2} C(u_i, a_i).$$

On the other hand, since $\mathrm{median}\{a_u - \omega(u), a_u - \omega(u), b_u - \omega(u)\} = a_u - \omega(u)$,

$$d(k, [a_u, b_u]) + C(u, a_u) = d(k, [a_u, b_u]) + C(u_1, a_u - \omega(u)) + C(u_2, a_u - \omega(u)) = d(k, [a_u, b_u]) + \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) + \sum_{i=1,2} C(u_i, a_i).$$

Without loss of generality, we may assume that $a_2 \le a_1$. We consider two cases to prove that $\sum_{i=1,2} d(k', [a_i, b_i]) = \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i])$.

If $a_2 \le b_2 \le a_1 \le b_1$, then $b_2 \le k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\} \le a_1$, and thus

$$\sum_{i=1,2} d(k', [a_i, b_i]) = (k' - b_2) + (a_1 - k') = a_1 - b_2,$$

and

$$\sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) = (b_2 - b_2) + (a_1 - b_2) = a_1 - b_2.$$

If $a_2 \le a_1 \le \min(b_1, b_2)$, then $a_1 \le k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\} \le \min(b_1, b_2)$ and thus

$$\sum_{i=1,2} d(k', [a_i, b_i]) = 0 \quad \text{and} \quad \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) = 0.$$

This concludes the proof of Lemma S2.

For any finite multiset $X$ of real numbers and a real number $r$, let

$$f_X(r) \stackrel{\mathrm{def}}{=} \sum_{x \in X} |x - r|.$$

It is not hard to see that

$$d(x, [i_1, i_2]) = \frac{1}{2}\left[f_{\{i_1, i_2\}}(x) - (i_2 - i_1)\right], \quad x \in \mathbb{R}. \qquad (12)$$

Lemma S3 For any disjoint real intervals $[i_1, i_2]$ and $[i_3, i_4]$ and any $i_5 \in \mathbb{R}$,

$$d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5) \ge d(m, [i_1, i_2]) + d(m, [i_3, i_4]) + d(m, i_5), \qquad (13)$$

for any $x \in \mathbb{R}$, where $m = \mathrm{median}\{i_1, i_2, i_3, i_4, i_5\}$.

Proof The inequality is derived from:

$$d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5) = \frac{1}{2}\left[f_{\{i_1, i_2, i_3, i_4, i_5, i_5\}}(x) - (i_2 - i_1) - (i_4 - i_3)\right].$$

Note that the middle values minimize the sum of distances $f_{\{i_1, i_2, i_3, i_4, i_5, i_5\}}(x)$ (Dasgupta, Papadimitriou and Vazirani, Algorithms, page 86). Hence, as a middle value of $\{i_1, i_2, i_3, i_4, i_5, i_5\}$, $\mathrm{median}\{i_1, i_2, i_3, i_4, i_5\}$ minimizes $d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5)$.

Lemma S4 For any real interval $[i_1, i_2]$, and any $i_3 \in \mathbb{R}$,

$$d(x, [i_1, i_2]) + d(x, i_3) + x \ge d(m, [i_1, i_2]) + d(m, i_3) + m,$$

for any $x \in \mathbb{R}$, where $m = \mathrm{median}\{0, i_1, i_3\}$.
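The median identities used in this appendix are easy to check on small integer inputs. The following Python sketch, with hypothetical helper names, tests three of them: the reduction of a five-element median to a three-element median in the proof of Lemma S1, the identity in Eqn. (12) (multiplied by 2 so that everything stays in integers), and the minimization claim of Lemma S3. It is a numerical sanity check on assumed small ranges, not a proof.

```python
from itertools import product
from statistics import median


def dist(x, lo, hi):
    """d(x, [lo, hi]): distance from x to the closed interval [lo, hi]."""
    return max(lo - x, 0, x - hi)


def f(points, x):
    """f_X(x): sum of |p - x| over the multiset X."""
    return sum(abs(p - x) for p in points)


def median5_matches_median3(x, values):
    """Lemma S1(2): dropping the smallest and largest of four values
    does not change the median once x is added."""
    m1, m2, m3, m4 = sorted(values)
    return median([x, m1, m2, m3, m4]) == median([x, m2, m3])


def eq12_holds(x, i1, i2):
    """Eqn. (12), multiplied by 2 to avoid fractions."""
    return 2 * dist(x, i1, i2) == f([i1, i2], x) - (i2 - i1)


def lemma_s3_holds(i1, i2, i3, i4, i5, xs):
    """Lemma S3: the median of the five values minimises the sum."""
    def total(x):
        return dist(x, i1, i2) + dist(x, i3, i4) + abs(x - i5)
    m = median([i1, i2, i3, i4, i5])
    return all(total(x) >= total(m) for x in xs)


# Exhaustive checks over small integer ranges (ranges chosen arbitrarily).
median_ok = all(
    median5_matches_median3(x, v)
    for x in range(-1, 6)
    for v in product(range(5), repeat=4)
)
eq12_ok = all(
    eq12_holds(x, i1, i2)
    for i1 in range(5) for i2 in range(i1, 5) for x in range(-2, 8)
)
s3_ok = all(
    lemma_s3_holds(i1, i2, i3, i4, i5, range(-2, 10))
    for i1 in range(4) for i2 in range(i1, 4)
    for i3 in range(i2 + 1, 6) for i4 in range(i3, 6)
    for i5 in range(-1, 7)
)
```

Each of the three flags is a plain Boolean, so the checks can be rerun with wider ranges if a larger sample is wanted.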


Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS PETER J. HUMPHRIES AND CHARLES SEMPLE Abstract. For two rooted phylogenetic trees T and T, the rooted subtree prune and regraft distance

More information

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003 CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1 Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003 Lecturer: Wing-Kin Sung Scribe: Ning K., Shan T., Xiang

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Minmax Tree Cover in the Euclidean Space

Minmax Tree Cover in the Euclidean Space Journal of Graph Algorithms and Applications http://jgaa.info/ vol. 15, no. 3, pp. 345 371 (2011) Minmax Tree Cover in the Euclidean Space Seigo Karakawa 1 Ehab Morsy 1 Hiroshi Nagamochi 1 1 Department

More information

Tree-average distances on certain phylogenetic networks have their weights uniquely determined

Tree-average distances on certain phylogenetic networks have their weights uniquely determined Tree-average distances on certain phylogenetic networks have their weights uniquely determined Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu

More information

Motif Extraction from Weighted Sequences

Motif Extraction from Weighted Sequences Motif Extraction from Weighted Sequences C. Iliopoulos 1, K. Perdikuri 2,3, E. Theodoridis 2,3,, A. Tsakalidis 2,3 and K. Tsichlas 1 1 Department of Computer Science, King s College London, London WC2R

More information

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline Phylogenetics Todd Vision iology 522 March 26, 2007 pplications of phylogenetics Studying organismal or biogeographic history Systematics ating events in the fossil record onservation biology Studying

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

More information

Bounded Treewidth Graphs A Survey German Russian Winter School St. Petersburg, Russia

Bounded Treewidth Graphs A Survey German Russian Winter School St. Petersburg, Russia Bounded Treewidth Graphs A Survey German Russian Winter School St. Petersburg, Russia Andreas Krause krausea@cs.tum.edu Technical University of Munich February 12, 2003 This survey gives an introduction

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2011 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

Online Sorted Range Reporting and Approximating the Mode

Online Sorted Range Reporting and Approximating the Mode Online Sorted Range Reporting and Approximating the Mode Mark Greve Progress Report Department of Computer Science Aarhus University Denmark January 4, 2010 Supervisor: Gerth Stølting Brodal Online Sorted

More information

Assignment 5: Solutions

Assignment 5: Solutions Comp 21: Algorithms and Data Structures Assignment : Solutions 1. Heaps. (a) First we remove the minimum key 1 (which we know is located at the root of the heap). We then replace it by the key in the position

More information

Generating p-extremal graphs

Generating p-extremal graphs Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

Binary Decision Diagrams

Binary Decision Diagrams Binary Decision Diagrams Binary Decision Diagrams (BDDs) are a class of graphs that can be used as data structure for compactly representing boolean functions. BDDs were introduced by R. Bryant in 1986.

More information

Realization Plans for Extensive Form Games without Perfect Recall

Realization Plans for Extensive Form Games without Perfect Recall Realization Plans for Extensive Form Games without Perfect Recall Richard E. Stearns Department of Computer Science University at Albany - SUNY Albany, NY 12222 April 13, 2015 Abstract Given a game in

More information

Basing Decisions on Sentences in Decision Diagrams

Basing Decisions on Sentences in Decision Diagrams Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence Basing Decisions on Sentences in Decision Diagrams Yexiang Xue Department of Computer Science Cornell University yexiang@cs.cornell.edu

More information

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D 7.91 Lecture #5 Database Searching & Molecular Phylogenetics Michael Yaffe B C D B C D (((,B)C)D) Outline Distance Matrix Methods Neighbor-Joining Method and Related Neighbor Methods Maximum Likelihood

More information

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms

A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms A General Lower ound on the I/O-Complexity of Comparison-based Algorithms Lars Arge Mikael Knudsen Kirsten Larsent Aarhus University, Computer Science Department Ny Munkegade, DK-8000 Aarhus C. August

More information

MINORS OF GRAPHS OF LARGE PATH-WIDTH. A Dissertation Presented to The Academic Faculty. Thanh N. Dang

MINORS OF GRAPHS OF LARGE PATH-WIDTH. A Dissertation Presented to The Academic Faculty. Thanh N. Dang MINORS OF GRAPHS OF LARGE PATH-WIDTH A Dissertation Presented to The Academic Faculty By Thanh N. Dang In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in Algorithms, Combinatorics

More information

Path Graphs and PR-trees. Steven Chaplick

Path Graphs and PR-trees. Steven Chaplick Path Graphs and PR-trees by Steven Chaplick A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy Graduate Department of Computer Science University of Toronto Copyright

More information

Data Structures for Disjoint Sets

Data Structures for Disjoint Sets Data Structures for Disjoint Sets Advanced Data Structures and Algorithms (CLRS, Chapter 2) James Worrell Oxford University Computing Laboratory, UK HT 200 Disjoint-set data structures Also known as union

More information

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships Chapter 26: Phylogeny and the Tree of Life You Must Know The taxonomic categories and how they indicate relatedness. How systematics is used to develop phylogenetic trees. How to construct a phylogenetic

More information

A fast algorithm to generate necklaces with xed content

A fast algorithm to generate necklaces with xed content Theoretical Computer Science 301 (003) 477 489 www.elsevier.com/locate/tcs Note A fast algorithm to generate necklaces with xed content Joe Sawada 1 Department of Computer Science, University of Toronto,

More information

arxiv: v1 [math.co] 28 Oct 2016

arxiv: v1 [math.co] 28 Oct 2016 More on foxes arxiv:1610.09093v1 [math.co] 8 Oct 016 Matthias Kriesell Abstract Jens M. Schmidt An edge in a k-connected graph G is called k-contractible if the graph G/e obtained from G by contracting

More information

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB I519 Introduction to Bioinformatics, 2015 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Whole genome comparison/alignment Build better phylogenies Identify polymorphism

More information

Binary Search Trees. Motivation

Binary Search Trees. Motivation Binary Search Trees Motivation Searching for a particular record in an unordered list takes O(n), too slow for large lists (databases) If the list is ordered, can use an array implementation and use binary

More information

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument

1 Basic Definitions. 2 Proof By Contradiction. 3 Exchange Argument 1 Basic Definitions A Problem is a relation from input to acceptable output. For example, INPUT: A list of integers x 1,..., x n OUTPUT: One of the three smallest numbers in the list An algorithm A solves

More information

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)}

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} Preliminaries Graphs G = (V, E), V : set of vertices E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) 1 2 3 5 4 V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)} 1 Directed Graph (Digraph)

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Disjoint Hamiltonian Cycles in Bipartite Graphs

Disjoint Hamiltonian Cycles in Bipartite Graphs Disjoint Hamiltonian Cycles in Bipartite Graphs Michael Ferrara 1, Ronald Gould 1, Gerard Tansey 1 Thor Whalen Abstract Let G = (X, Y ) be a bipartite graph and define σ (G) = min{d(x) + d(y) : xy / E(G),

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS.

GENETICS - CLUTCH CH.22 EVOLUTIONARY GENETICS. !! www.clutchprep.com CONCEPT: OVERVIEW OF EVOLUTION Evolution is a process through which variation in individuals makes it more likely for them to survive and reproduce There are principles to the theory

More information

On graphs having a unique minimum independent dominating set

On graphs having a unique minimum independent dominating set AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 68(3) (2017), Pages 357 370 On graphs having a unique minimum independent dominating set Jason Hedetniemi Department of Mathematical Sciences Clemson University

More information

Session 5: Phylogenomics

Session 5: Phylogenomics Session 5: Phylogenomics B.- Phylogeny based orthology assignment REMINDER: Gene tree reconstruction is divided in three steps: homology search, multiple sequence alignment and model selection plus tree

More information

Models of Computation. by Costas Busch, LSU

Models of Computation. by Costas Busch, LSU Models of Computation by Costas Busch, LSU 1 Computation CPU memory 2 temporary memory input memory CPU output memory Program memory 3 Example: f ( x) x 3 temporary memory input memory Program memory compute

More information

Algorithms for efficient phylogenetic tree construction

Algorithms for efficient phylogenetic tree construction Graduate Theses and Dissertations Graduate College 2009 Algorithms for efficient phylogenetic tree construction Mukul Subodh Bansal Iowa State University Follow this and additional works at: http://lib.dr.iastate.edu/etd

More information

On improving matchings in trees, via bounded-length augmentations 1

On improving matchings in trees, via bounded-length augmentations 1 On improving matchings in trees, via bounded-length augmentations 1 Julien Bensmail a, Valentin Garnero a, Nicolas Nisse a a Université Côte d Azur, CNRS, Inria, I3S, France Abstract Due to a classical

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Lecture 1 : Data Compression and Entropy

Lecture 1 : Data Compression and Entropy CPS290: Algorithmic Foundations of Data Science January 8, 207 Lecture : Data Compression and Entropy Lecturer: Kamesh Munagala Scribe: Kamesh Munagala In this lecture, we will study a simple model for

More information

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees

An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees Francesc Rosselló 1, Gabriel Valiente 2 1 Department of Mathematics and Computer Science, Research Institute

More information

FINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016)

FINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016) FINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016) The final exam will be on Thursday, May 12, from 8:00 10:00 am, at our regular class location (CSI 2117). It will be closed-book and closed-notes, except

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

NUMBERS WITH INTEGER COMPLEXITY CLOSE TO THE LOWER BOUND

NUMBERS WITH INTEGER COMPLEXITY CLOSE TO THE LOWER BOUND #A1 INTEGERS 12A (2012): John Selfridge Memorial Issue NUMBERS WITH INTEGER COMPLEXITY CLOSE TO THE LOWER BOUND Harry Altman Department of Mathematics, University of Michigan, Ann Arbor, Michigan haltman@umich.edu

More information

A new algorithm to construct phylogenetic networks from trees

A new algorithm to construct phylogenetic networks from trees A new algorithm to construct phylogenetic networks from trees J. Wang College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia, China Corresponding author: J. Wang E-mail: wangjuanangle@hit.edu.cn

More information

Phylogenetics: Parsimony

Phylogenetics: Parsimony 1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent

More information

SMT 2013 Power Round Solutions February 2, 2013

SMT 2013 Power Round Solutions February 2, 2013 Introduction This Power Round is an exploration of numerical semigroups, mathematical structures which appear very naturally out of answers to simple questions. For example, suppose McDonald s sells Chicken

More information

Pattern Popularity in 132-Avoiding Permutations

Pattern Popularity in 132-Avoiding Permutations Pattern Popularity in 132-Avoiding Permutations The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Rudolph,

More information

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2009 International Press Vol. 9, No. 4, pp. 295-302, 2009 001 THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT DAN GUSFIELD AND YUFENG WU Abstract.

More information