Reconciliation with Non-binary Gene Trees Revisited


Yu Zheng and Louxin Zhang
National University of Singapore

Abstract

By reconciling the phylogenetic tree of a gene family with the corresponding species tree, it is possible to infer lineage-specific duplications and losses with high confidence and hence annotate orthologs and paralogs. However, the currently available reconciliation methods for non-binary gene trees are too computationally expensive to be applied on a genomic level. Here, we present an O(m + n) algorithm to reconcile an arbitrary gene tree with its corresponding species tree, where m and n are the number of nodes in the gene and species trees respectively. The improvement is achieved through two innovations: a fast computation of compressed child-image subtrees, and efficient reconstruction of irreducible duplication histories. This method will be a valuable tool for genome-wide studies of the evolution of individual gene families.

1 Introduction

Given the importance of accurately annotated gene relationships in evolutionary and functional studies of biological systems [15, 23], significant efforts have been invested in developing methods to identify orthologs and paralogs [2, 6, 8, 13, 16, 21, 22]. A pair of genes in different species whose last common ancestor corresponds to a speciation event are orthologs [11]. Two genes (in the same or different species) that descend from a gene duplication event are paralogs. Knowing the orthologs and paralogs of species permits one to reconstruct the duplication history within a gene family. In practice, this is often done by reconciling the phylogenetic tree (the gene tree) of a family with the corresponding species tree, and inferring lineage-specific duplication and loss events [12, 16].
Although a plethora of reconciliation methods have been developed over the past two decades (see the review paper [7]), only recently has this reconciliation process been generalized to non-binary gene trees (see the survey articles [10, 24]). The ability to reconcile non-binary gene trees substantially expands the applications of this method in comparative genomics. First, it expands the range of tools: many widely-used phylogenetic programs such as MrBayes produce non-binary gene trees if there is not enough signal in the data to time the divergences. Moreover, reconciling non-binary gene trees obtained by contracting weak branches in binary gene trees produces more accurate duplication events than working directly on the corresponding binary ones (our unpublished data). Second, it allows us to design fast heuristic programs for genome-wide mapping of orthologs and paralogs. For example, SYNERGY implicitly assumes that the gene tree of every gene family is a star tree and heuristically reconciles the star gene tree with the input species tree, which relieves the substantial preprocessing burden of building binary gene trees for individual gene families [23]. This inspires us to work on a bottom-up approach for reconciling non-binary gene trees.

For binary gene trees and species trees, there is an accepted reconciliation process which has been proven to produce the unique duplication history with the fewest gene duplication and gene loss events [4, 14], and whose computational complexity is linear with respect to the number of nodes in the gene and species trees [5, 19, 25]. However, the uniqueness of the result is not so clear for non-binary gene trees, where reconciliation may produce different duplication histories for different cost models [26]. Furthermore, no one has yet designed a linear-time reconciliation algorithm for non-binary gene trees that is guaranteed to generate the history with the minimum number of duplication and loss events. Chang and Eulenstein developed the first algorithm for the problem [3], but their solution has cubic complexity. The dynamic programming algorithm of Durand et al. has the same worst-case time complexity, but can also solve the problem under any affine cost model [9]. Recently, a quadratic algorithm was proposed in [17]. All these methods are computationally intensive when applied on a genomic scale.

In this paper, we present a linear-time algorithm that solves the problem. Our bottom-up approach can incorporate multiple sources of information on gene similarity, including sequence similarity and conserved gene order, and is efficient enough to be used on a genomic level. Hence, it provides a valuable framework for the genome-wide mapping of orthologs and paralogs in any group of species with a known phylogeny, while taking advantage of the rapid increase in fully sequenced genomes.

The rest of this paper is divided into six sections. The reconciliation problem and different cost models are introduced in Section 2. Section 3 presents an algorithm to simultaneously compute all compressed child-image subtrees of the species tree in linear time, which immediately leads to an improved reconciliation method. Section 4 introduces the concept of irreducible duplication history. Section 5 presents a simple algorithm that takes O(m + n) operations to reconcile a gene tree of n nodes and the corresponding species tree of m nodes. In Section 6, we use simulated data to compare the time efficiency of our algorithm with other methods. We conclude with suggestions for future work.
2 Concepts and Notions

2.1 Definitions

Let $T = (V, E)$ be a rooted tree in which one node is designated the root and the branches are oriented away from the root. $V$ is the set of all nodes, and $E$ is the set of all branches (directed edges). For two nodes $u, v \in V$, $v$ is the parent of $u$ if $(v, u) \in E$. Further, $v$ is an ancestor of $u$ and equivalently $u$ is a descendant of $v$, written $v \prec u$, if the unique path from the root to $u$ passes through $v$. We write $v \preceq u$ if $u = v$ or $v \prec u$. For $U \subseteq V$, $\mathrm{lca}(U)$ denotes the most recent common ancestor of the nodes in $U$. The depth of a node in $T$ is the number of branches in the path from the root to it.

In this paper, $|T|$ denotes the number of nodes in $T$. $p(u)$ denotes the parent of a non-root node $u \in V(T)$. $Ch(u)$ denotes the set of children of $u$ in $T$. $V_{lf}(T)$ denotes the set of leaves (terminal nodes) of $T$. $V_o(T)$ denotes the set of internal (non-leaf) nodes of $T$. $T(u)$ denotes the subtree rooted at $u$, which consists of $u$ and all descendants of $u$. $T|_U$ denotes the subtree induced by a subset $U \subseteq V$: the nodes of $T|_U$ are $V' = \{v \in V : \mathrm{lca}(U) \preceq v \preceq u \text{ for some } u \in U\}$ and the edges of $T|_U$ are $E(T) \cap (V' \times V')$.

A node $v$ is said to be binary if it has two children. $T$ is binary if every internal node is binary. If $T$ is non-binary, a binary tree $T'$ is said to be a binary refinement of $T$ if for every $u \in V(T)$, there exists $v \in V(T')$ such that $V_{lf}(T(u)) = V_{lf}(T'(v))$, or equivalently if $T$ can be obtained from $T'$ by branch contraction.

2.2 Species trees

A species tree is a rooted tree in which each leaf is associated with a unique species. For a node $u \in V_{lf}(S)$, the branch $(p(u), u)$ represents the species that labels $u$. For $u \in V_o(S)$, $(p(u), u)$ represents the common ancestor of all the species that label the leaves in $S(u)$, and $u$ represents a speciation event. Here we assume that a species tree is binary, and that the branch entering the root represents the common ancestor of all the species in the tree; this is called the root branch (Figure 1A).
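The notions of depth, $\mathrm{lca}(U)$ and the induced subtree $T|_U$ can be made concrete with a small sketch. It assumes a tree stored as a child-to-parent dictionary and uses a naive climbing lca rather than a constant-time one; the function names and the example tree are hypothetical and are not taken from Figure 1.

```python
def node_depths(parent):
    """Depth of each node = number of branches on the path from the root."""
    depth = {}

    def visit(v):
        if v not in depth:
            depth[v] = 0 if parent[v] is None else visit(parent[v]) + 1
        return depth[v]

    for v in parent:
        visit(v)
    return depth


def lca(parent, depth, nodes):
    """Most recent common ancestor of a non-empty collection of nodes."""
    nodes = list(nodes)
    top = nodes[0]
    for v in nodes[1:]:
        a, b = top, v
        while a != b:  # lift the deeper node until the two walks meet
            if depth[a] >= depth[b]:
                a = parent[a]
            else:
                b = parent[b]
        top = a
    return top


def induced_subtree(parent, depth, U):
    """Nodes and edges of T|_U: all nodes on a path from lca(U) to some u in U."""
    root = lca(parent, depth, U)
    nodes = {root}
    for u in U:
        v = u
        while v != root:
            nodes.add(v)
            v = parent[v]
    edges = {(parent[v], v) for v in nodes if v != root}
    return nodes, edges


# Hypothetical species tree: r -> (x, y), x -> (1, 2), y -> (3, z), z -> (4, 5)
PARENT = {"r": None, "x": "r", "y": "r", "1": "x", "2": "x",
          "3": "y", "z": "y", "4": "z", "5": "z"}
DEPTH = node_depths(PARENT)
NODES, EDGES = induced_subtree(PARENT, DEPTH, ["1", "4", "5"])
```

In this example the induced subtree keeps the degree-2 nodes x and y, which is exactly what the compression step of Section 3 later removes.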

Figure 1: A. A binary species tree over six species 1-6. B. A gene tree of nine genes: two each from species 2, 3 and 4, and one each from species 1, 5 and 6. The gene tree has two non-binary nodes. The child-image subtree of $g$ and its compressed version are shown in panels C and D. Here, $\lambda(g_1) = u$, $\lambda(g_2) = 4$, $\lambda(g_3) = y$, $\lambda(g_4) = 3$, and $\lambda(g_5) = r$.

2.3 Gene trees and gene duplication history

The gene tree reconstructed from the DNA or protein sequences of a gene family represents the evolutionary relationships among these genes. However, it may not explicitly represent the duplication history of the gene family. Without knowing the true orthologous and paralogous relationships among the family members, we do not need to distinguish the members that are sampled from the same species. Hence, we label each leaf in the gene tree that represents a gene with the species that hosts the gene today. In the resulting tree, leaves are therefore not uniquely labeled in general. Also, gene trees do not need to be binary.

Consider a family $F$ of genes sampled from a collection $X$ of species with a known phylogenetic tree $S$. Assume that $F$ evolved from a unique ancestral gene through $k$ gene duplications and $m$ gene losses in ancestral species (that is, branches) of $S$ (Figure 2A). We further assume that (i) each duplication event gives rise to one new copy of the involved gene; (ii) each copy, as well as the original duplicated gene, has exactly one descendant gene in a species, unless one of the $m$ loss events occurs in the ancestors. The topology $H$ of the duplication history of $F$ is a rooted tree whose leaves are labeled with genes. Since $S$ is binary, each degree-2 node $u \in V(H)$ corresponds to a gene loss, and each degree-3 node with children $u$ and $v$ represents a duplication if it does not correspond to a species tree node (Figure 2C). We use such types of trees to represent duplication histories.

The duplication (resp. loss) cost $d_H$ (resp. $l_H$) of $H$ is defined to be the number of duplication (resp. loss) events occurring in it. Its mutation cost is defined to be $d_H + l_H$. If we assign weights $w_d$ and $w_l$ respectively to duplication and loss events, the $(w_d, w_l)$-affine cost of $H$ is defined to be $w_d d_H + w_l l_H$.

2.4 The reconciliation problem

The duplication history $H$ can be inferred by reconciling $S$ and the gene tree $G$ of $F$. The symbol $g^s$ denotes a gene $g \in F$ in species $s$. For $U \subseteq V(G)$, $\lambda(U) \stackrel{\mathrm{def}}{=} \{\lambda(u) : u \in U\}$. The lca reconciliation is the map $\lambda : V(G) \to V(S)$ defined as:

$$\lambda(g) = \begin{cases} s & \text{if } g = g^s \in V_{lf}(G), \\ \mathrm{lca}(\lambda(Ch(g))) & \text{if } g \in V_o(G). \end{cases} \qquad (1)$$

If $G$ is binary, $\lambda$ induces the unique duplication history of $F$ that has the minimum duplication and loss costs [4, 14]. In other words, it finds the most parsimonious evolution history. Furthermore, for $g \in V_o(G)$, $g$ is inferred to be a duplication node if $\lambda(g) \in \{\lambda(g') : g' \in Ch(g)\}$. The corresponding gene duplication event occurs in the branch $(p(\lambda(g)), \lambda(g))$ in $S$, and there is a gene loss occurring in each branch off the path from $\lambda(g)$ to $\lambda(g')$ for each $g' \in Ch(g)$ in the inferred duplication history. We define the cost of the lca reconciliation of $G$ and $S$ to be the cost of the corresponding duplication history for each of the duplication, loss, and affine cost models.

If $G$ is non-binary, it is not clear how many duplication events can be inferred and where they should occur in the most parsimonious duplication history of $F$. The problem of reconciling an arbitrary gene tree $G$ and a binary species tree $S$ is formulated as follows:

Instance: The true gene tree $G$ of a family of genes $F$, observed in species with a known species tree $S$. The reconciliation cost is $c$.

Solution: A duplication history of $F$, represented as a binary tree $G'$, with the cost $c(G', S) = \min_{T \in BR(G)} c(T, S)$, where $BR(G)$ is the set of all binary trees that refine $G$.

Note that $V(G) \subseteq V(T)$ for every $T \in BR(G)$. The lca reconciliation of $T$ and $S$ maps every node in $G$ to the same node in $S$ for any $T \in BR(G)$. Therefore, we just need to infer the duplication history from each ancestral gene $g$ to its children in the subtree $S|_{\lambda(Ch(g))}$ (called its child-image subtree) (Figure 1C), for each $g \in V_o(G)$ separately. In the next section, we discuss our algorithm for non-binary nodes in $G$, which is identical to the simple rule mentioned above when applied to binary nodes.

3 Compressed Image Subtrees

By definition, $\lambda(g)$ is the root of $S|_{\lambda(Ch(g))}$. If $S|_{\lambda(Ch(g))}$ contains degree-2 nodes, its size can be much larger than $|Ch(g)|$. To design a fast algorithm for reconciling $G$ and $S$, we need to compress $S|_{\lambda(Ch(g))}$ by contracting all degree-2 nodes except for those in $\lambda(Ch(g))$ for each $g$ (Figure 1D). The compressed version of $S|_{\lambda(Ch(g))}$ is written $I(g)$.

Let $P$ be a path from $p_1$ to $p_2$ in $S|_{\lambda(Ch(g))}$ such that $p_1$ and $p_2$ are of degree 3 or in $\lambda(Ch(g))$ and all the middle nodes are of degree 2 and not in $\lambda(Ch(g))$. Note that any parsimonious duplication history from $g$ to its children can only have gene loss events in the first branch of $P$, gene duplication events in the last branch of $P$, or both. It is also true that if the depths of $p_1$ and $p_2$ in $S$ are known, we can compute the gene losses occurring in the branches leading away from $P$ when working on $I(g)$. $I(g)$ is obtained from $S|_{\lambda(Ch(g))}$ by replacing each such path with a single branch. Importantly, $|I(g)| \le 2|Ch(g)|$ for each $g$ and hence $\sum_{g \in V_o(G)} |I(g)| \le 2|G|$. Additionally, we have the following fact, whose proof is in Appendix A.
Theorem 1 It takes linear time $O(|G| + |S|)$ to construct the compressed child-image subtrees of all the internal nodes of $G$ in $S$.

Finally, we assume that for each $s \in I(g)$, its depth $d(s)$ in $S$ is computed and is stored in the data structure along with other information on node $s$. Note that $d(s) - d(p(s))$, where $p(s)$ is the parent of $s$ in $I(g)$, is the number of branches in the path from $p(s)$ to $s$ in $S$, which is used to compute the gene loss cost of the duplication history from $g$ to its child genes in $S$.

Theorem 1 leads immediately to an improved method for tree reconciliation. By implementing a dynamic programming algorithm to resolve different non-binary gene tree nodes in their corresponding $I(g)$, we can compute an optimal reconciliation of $G$ and $S$ in time $O(|S| + d^2 |G|)$ in the affine cost model, where $d$ is the maximum node degree of $G$ (see Appendix B for details).

4 Irreducible Duplication Histories

To develop a linear-time algorithm for reconciling non-binary gene trees, we need to focus on a special type of duplication history of gene families. In this section, we introduce this type of duplication history.

4.1 Equivalence of gene duplication histories

Consider a duplication history $H$ from $g$ to $Ch(g)$ in the child-image subtree $S|_{\lambda(Ch(g))}$. If a duplication and a loss occur in the same branch (Figure 2A), we can eliminate one duplication and one loss to obtain a new duplication history with fewer events (Figure 2B), because we do not distinguish the elements in $Ch(g)$. Hence, the duplication history of $Ch(g)$ with the smallest duplication cost does not allow both duplication and loss to occur in the same branch.

We use $n^{\mathrm{in}}_H(e)$ and $n^{\mathrm{out}}_H(e)$ to denote the numbers of genes flowing into and out of a branch $e$. For $u \in V(S|_{\lambda(Ch(g))})$, we define:

$$\omega(u) \stackrel{\mathrm{def}}{=} |\{g' \in Ch(g) : \lambda(g') = u\}|. \qquad (2)$$
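As an illustration, the lca map $\lambda$ of equation (1), the duplication test of Section 2.4 for binary nodes, and the multiplicities $\omega(u)$ of equation (2) can be sketched as follows. The sketch assumes a naive lca instead of the constant-time structure of [19], and all function names and the example trees are hypothetical.

```python
from collections import Counter


def make_lca(parent):
    """Return a two-node lca function for a tree given as child -> parent."""
    depth = {}

    def d(v):
        if v not in depth:
            depth[v] = 0 if parent[v] is None else d(parent[v]) + 1
        return depth[v]

    for v in parent:
        d(v)

    def lca2(a, b):
        while a != b:  # lift the deeper node until both walks meet
            if depth[a] >= depth[b]:
                a = parent[a]
            else:
                b = parent[b]
        return a

    return lca2


def lca_reconciliation(gene_children, leaf_species, gene_root, lca2):
    """Compute lambda(g) for every gene tree node, bottom up (equation (1))."""
    lam = {}

    def visit(g):
        kids = gene_children.get(g, [])
        if not kids:
            lam[g] = leaf_species[g]  # a leaf maps to its host species
        else:
            image = visit(kids[0])
            for c in kids[1:]:
                image = lca2(image, visit(c))
            lam[g] = image
        return lam[g]

    visit(gene_root)
    return lam


def duplication_nodes(gene_children, lam):
    """Internal nodes mapped to the same species node as one of their children."""
    return {g for g, kids in gene_children.items()
            if kids and lam[g] in {lam[c] for c in kids}}


def omega(gene_children, lam, g):
    """Multiplicity of each image node among the children of g (equation (2))."""
    return Counter(lam[c] for c in gene_children[g])


# Hypothetical species tree r -> (x, y), x -> (1, 2), y -> (3, z), z -> (4, 5)
SPECIES_PARENT = {"r": None, "x": "r", "y": "r", "1": "x", "2": "x",
                  "3": "y", "z": "y", "4": "z", "5": "z"}
# Hypothetical binary gene tree g0 -> (g1, g2), g1 -> (a, b), g2 -> (c, d)
GENE_CHILDREN = {"g0": ["g1", "g2"], "g1": ["a", "b"], "g2": ["c", "d"]}
LEAF_SPECIES = {"a": "1", "b": "2", "c": "1", "d": "3"}

LAM = lca_reconciliation(GENE_CHILDREN, LEAF_SPECIES, "g0",
                         make_lca(SPECIES_PARENT))
DUPS = duplication_nodes(GENE_CHILDREN, LAM)
```

Here g0 is the only duplication node, because it and its child g2 are both mapped to the species tree root r.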

The following conditions hold for a duplication history with the minimum duplication cost:

(C1) For each branch $e$, $n^{\mathrm{in}}_H(e) \ge 1$ and $n^{\mathrm{out}}_H(e) \ge 1$. If $e$ is the root branch, $n^{\mathrm{in}}_H(e) = 1$.
(C2) For any leaf $u$, $n^{\mathrm{out}}_H(e) = \omega(u)$ for $e = (p(u), u)$.
(C3) For any branches $e = (u, v)$ and $e' = (v, w)$ in $S|_{\lambda(Ch(g))}$, $n^{\mathrm{out}}_H(e) = n^{\mathrm{in}}_H(e') + \omega(v)$.
(C4) In every branch $e$, $k$ duplications occur iff $n^{\mathrm{out}}_H(e) - n^{\mathrm{in}}_H(e) = k$; similarly, $l$ losses occur in $e$ iff $n^{\mathrm{in}}_H(e) - n^{\mathrm{out}}_H(e) = l$.

Let

$$\Sigma_H = \left\{ \left(e, n^{\mathrm{in}}_H(e), n^{\mathrm{out}}_H(e)\right) : e \in E\left(S|_{\lambda(Ch(g))}\right) \right\}. \qquad (3)$$

Two duplication histories $H$ and $H'$ from $g$ to $Ch(g)$ are said to be equivalent if $\Sigma_H = \Sigma_{H'}$. Clearly, any given value of $\Sigma_H$ may be achieved by a large number of histories with the same duplication and loss costs. In this work, we infer a duplication history by determining the values of the three arguments defined in (3) for all branches. One benefit of taking this approach is that our method effectively outputs the full set of optimal duplication histories that reconcile the input gene and species trees.

Figure 2: A. A duplication history that does not have the minimum duplication cost, in whose rightmost lineage a duplication and a loss occur. B. An irreducible duplication history equivalent to the duplication history in panel A. Here the oldest gene lineage is colored red, the right copies in the first two leaves are the descendants of the gene duplicate produced in the left lineage, and the right copy in the rightmost leaf is the descendant of the duplicate produced in the root branch. C. The gene tree that represents the duplication history in panel B, in which circle nodes correspond to species tree nodes and square nodes are duplication nodes.

4.2 Irreducible duplication histories

A duplication process copies an existing gene, giving rise to two versions of the gene. A duplication history from $g$ to $Ch(g)$ is irreducible if the ancestral gene representing $g$ in the root branch does not experience any loss event, so that it has a descendant in every leaf of $S|_{\lambda(Ch(g))}$ (the red lineage in Figure 2B), and if every duplication event copies the corresponding descendant of this oldest gene. Note that a history with no duplication is also irreducible. Such limiting cases are called speciation histories.

In general, several children of $g$ may be mapped to the same leaf in $S|_{\lambda(Ch(g))}$. We consider $\lambda(Ch(g))$ to be a multiset, meaning that each element can have a multiplicity. It is not hard to see that an irreducible duplication history $H$ from $g$ to $Ch(g)$ induces the following decomposition of $\lambda(Ch(g))$ in $S|_{\lambda(Ch(g))}$:

$$\lambda(Ch(g)) = D_0 \uplus D_1 \uplus \cdots \uplus D_k, \qquad (4)$$

where $\uplus$ is the sum operation for multisets (the multiplicity of an element is equal to the sum of the multiplicities in the operands), such that
(i) $k$ equals the number of duplication events in $H$;
(ii) $D_0 = V_{lf}(S|_{\lambda(Ch(g))})$, representing the oldest gene lineage;
(iii) $D_i \stackrel{\mathrm{def}}{=} \{x \in \lambda(Ch(g)) : \text{the gene copy made by } E_i \text{ has a descendant in } x\}$ for $1 \le i \le k$, where $E_i$ is the $i$-th duplication event of $H$, occurring in the branch entering $\mathrm{lca}(D_i)$.

Conversely, such a decomposition of $\lambda(Ch(g))$ defines uniquely an irreducible duplication history from $g$ to $Ch(g)$ in $S|_{\lambda(Ch(g))}$. The following theorem is proved in Appendix C.

Theorem 2 Every duplication history $H$ from $g$ to $Ch(g)$ is equivalent to an irreducible duplication history $H'$ such that $d_{H'} \le d_H$ and $l_{H'} \le l_H$.
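Conditions (C1)-(C4) of Section 4.1 can be read as a simple consistency check on the triples in $\Sigma_H$. The following sketch, under the assumption that each branch is identified by the node it enters, validates (C1)-(C3) and then counts duplications and losses with (C4). The names and the two example flow assignments are hypothetical.

```python
def count_events(children, omega, flows, root):
    """Check (C1)-(C3) and return (duplications, losses) according to (C4).

    flows[v] = (n_in, n_out) for the branch entering node v; the branch
    entering `root` is the root branch.
    """
    if flows[root][0] != 1:
        raise ValueError("C1: exactly one gene enters the root branch")
    dups = losses = 0
    for v, (n_in, n_out) in flows.items():
        if n_in < 1 or n_out < 1:
            raise ValueError(f"C1 violated on the branch entering {v}")
        kids = children.get(v, [])
        if not kids and n_out != omega.get(v, 0):
            raise ValueError(f"C2 violated at leaf {v}")
        for w in kids:
            if n_out != flows[w][0] + omega.get(v, 0):
                raise ValueError(f"C3 violated between {v} and {w}")
        dups += max(n_out - n_in, 0)    # (C4): surplus of outgoing genes
        losses += max(n_in - n_out, 0)  # (C4): deficit of outgoing genes
    return dups, losses


# Hypothetical child-image subtree: root r with leaves x and y,
# two children of g mapped to each leaf and none mapped to r.
CHILDREN = {"r": ["x", "y"]}
OMEGA = {"x": 2, "y": 2}

# One duplication in the root branch, inherited by both leaves.
EARLY = {"r": (1, 2), "x": (2, 2), "y": (2, 2)}
# One duplication in each leaf branch instead.
LATE = {"r": (1, 1), "x": (1, 2), "y": (1, 2)}
```

Both assignments satisfy (C1)-(C3), but the first needs one duplication and the second needs two, which is the kind of difference the minimization in the next section resolves.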

5 Linear Time Algorithm

By the above theorem, in order to infer a duplication history with the minimum mutation cost, we need only to find a decomposition

$$\lambda(Ch(g)) / V_{lf}\left(S|_{\lambda(Ch(g))}\right) = D_1 \uplus D_2 \uplus \cdots \uplus D_k$$

that minimizes $k + \sum_i l_i$, where $l_i$ is the loss cost of the speciation history defined by $D_i$. This is because the number of gene losses in the speciation history defined by $V_{lf}(S|_{\lambda(Ch(g))})$ is fixed. We refer to this as a minimum decomposition. Note that $D_1 \uplus D_2 \uplus \cdots \uplus D_k$ corresponds to the set of all child genes that are produced by duplication. For each leaf in $S|_{\lambda(Ch(g))}$, all but one of the genes mapped to the leaf were produced by duplication; these duplicates are called redundant gene copies. The descendant of the oldest gene in each leaf is called the basal gene copy.

We now present a linear-time algorithm for finding a minimum decomposition of the redundant gene copies by working on the compressed child-image subtree $I(g)$. For the sake of clarity, we also assume that for any $(u, v) \in E(I(g))$, the difference of the depths of $v$ and $u$ in the species tree $S$ is one. (We describe how to generalize to general cases later.)

A rooted tree is called a defect tree if there is at least one degree-2 node in the middle of every path from the root to a leaf. It is a good tree if there is a root-to-leaf path in which all but the end nodes are of degree 3. Note that a speciation history is a subtree of $I(g)$.

Theorem 3 Let $\mathcal{D} : D_1 \uplus D_2 \uplus \cdots \uplus D_k$ be the minimum decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$. If $\mathcal{D}$ gives a duplication history such that redundant gene copies have the minimum gene loss cost, compared to all other duplication histories with the same mutation cost, then for each $i$, the speciation history $I(g)|_{D_i}$ satisfies:
(1) The subtree $T(u)$ below any degree-2 node $u$ cannot be a defect tree.
(2) $I(g)|_{D_i}$ must be a good tree.

Theorem 3 is proved in Appendix D.
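The definitions of good and defect trees can be checked mechanically. The sketch below assumes a tree stored as a node-to-children dictionary in which every internal node has one or two children, so that a middle node with one child has degree 2 and a middle node with two children has degree 3; under that assumption a tree is a defect tree exactly when it is not good. The function names and examples are hypothetical.

```python
def is_good_tree(children, root):
    """True if some root-to-leaf path has only degree-3 middle nodes."""

    def reaches_leaf(v, is_root):
        kids = children.get(v, [])
        if not kids:
            return True  # reached a leaf without meeting a degree-2 middle node
        if not is_root and len(kids) != 2:
            return False  # a degree-2 middle node blocks every path through v
        return any(reaches_leaf(c, False) for c in kids)

    return reaches_leaf(root, True)


def is_defect_tree(children, root):
    """True if every root-to-leaf path contains a degree-2 middle node."""
    return not is_good_tree(children, root)


SINGLETON = {}                                   # a single node
CHERRY = {"r": ["x", "y"]}                       # root with two leaves
CHAIN = {"r": ["m"], "m": ["x"]}                 # m is a degree-2 middle node
MIXED = {"r": ["a", "m"], "m": ["x"]}            # the path r-a avoids m
BLOCKED = {"r": ["m1", "m2"], "m1": ["x"], "m2": ["y"]}
```

Singleton trees come out as good, which matches the convention used for the singleton trees created at each node in the merging procedure.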
Theorem 3 motivates us to design a bottom-up recursive algorithm for finding the minimum decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$, thereby reconstructing the full duplication history from $g$ to its children. By Theorem 3, any component in a minimal decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$ induces a good tree that has a special structural property. Hence, for a subset $V' \subseteq V_{lf}(I(g))$, we use the induced subtree $I(g)|_{V'}$ to represent $V'$. As such, we use a set of subtrees to represent a partial decomposition obtained at each internal node.

At a leaf $u \in V_{lf}(I(g))$, the partial decomposition consists of $\omega(u)$ singleton trees, which are considered good trees. Let $u$ be a node with two children $u_1$ and $u_2$ in $I(g)$. Consider a partial decomposition $\mathcal{D}_1$ of $[\lambda(Ch(g)) \cap V(T(u_1))] / V_{lf}(I(g))$ into $b(u_1)$ trees and a partial decomposition $\mathcal{D}_2$ of $[\lambda(Ch(g)) \cap V(T(u_2))] / V_{lf}(I(g))$ into $b(u_2)$ trees. We attempt to merge these two partial decompositions to obtain a decomposition of $[\lambda(Ch(g)) \cap V(T(u))] / V_{lf}(I(g))$. By Theorem 3, each component of a minimum decomposition induces a good subtree. However, for a good subtree $X$ and an internal node $y$, $X \cap I(g)(y)$ can be a defect tree. Hence, a partial decomposition may contain defect trees. We distinguish between defect trees and good trees.

Assume that $a(u_1)$ out of $b(u_1)$ trees are good in $\mathcal{D}_1$, and that $a(u_2)$ out of $b(u_2)$ trees are good in $\mathcal{D}_2$, such that $a(u_2) \le a(u_1)$. We merge $\mathcal{D}_1$ and $\mathcal{D}_2$ by considering the following two cases (Figure 3).

1. $a(u_2) \le b(u_2) < a(u_1) \le b(u_1)$ (panel A in Figure 3). Merge $a(u_2)$ pairs of good trees and $b(u_2) - a(u_2)$ pairs of good and defect trees, extend $a(u_1) - b(u_2)$ good trees from $\mathcal{D}_1$, and discard $b(u_1) - a(u_1)$ defect trees from $\mathcal{D}_1$. Further, add $\omega(u)$ singleton trees, which are good trees.

2. $a(u_2) \le a(u_1) \le \min\{b(u_1), b(u_2)\}$ (panels B and C in Figure 3). Merge $a(u_2)$ pairs of good trees, $a(u_1) - a(u_2)$ pairs of good and defect trees, and $\min\{b(u_1), b(u_2)\} - a(u_1)$ pairs of defect trees, and discard $b(u_2) - b(u_1)$ defect trees from $\mathcal{D}_2$ if $b(u_2) > b(u_1)$, or $b(u_1) - b(u_2)$ defect trees from $\mathcal{D}_1$ otherwise. Add $\omega(u)$ singleton trees.

Figure 3: Schematic view of merging partial decompositions for the three possible cases (A-C) where $u$ has two children, and also for the case when $u$ has only one child (D). Good trees and defect trees are colored orange and blue respectively in decompositions $\mathcal{D}_1$ (left) and $\mathcal{D}_2$ (right). The $\omega(u)$ singleton trees added at the current node are not shown in each case.

Proposition 1 Let $m_1 \le m_2 \le m_3 \le m_4$ be the arrangement of $\{a_1, a_2, b_1, b_2\}$ from smallest to largest, where $a_i = a(u_i)$ and $b_i = b(u_i)$. Merging $\mathcal{D}_1$ and $\mathcal{D}_2$ produces $\omega(u) + m_2$ good trees and $m_3 - m_2$ defect trees to merge, and detects $m_4 - m_3$ defect trees to discard.

At an internal node $u$ with only one child $u_1$ (panel D in Figure 3), we create $\omega(u)$ singleton trees, extend all good trees, and discard all the defect trees in the decomposition $\mathcal{D}_1$ obtained at $u_1$.

Using the above bottom-up merging procedure, we obtain a set of good and defect trees at the root of $I(g)$. This set of trees defines a minimal decomposition of $\lambda(Ch(g)) / V_{lf}(I(g))$. More specifically, each good tree corresponds to a component of the minimal decomposition, whereas each defect tree corresponds to $k \ge 2$ components, where $k$ equals the cardinality of the maximum set of incomparable degree-2 internal nodes in the tree. Similarly, each defect tree discarded at an internal node also corresponds to a set of components of the minimal decomposition.

For $m$ real numbers $i_1, i_2, \ldots, i_m$, we use $\mathrm{median}\{i_1, i_2, \ldots, i_m\}$ to denote their median if $m$ is odd. For each $u \in V(I(g))$, we use $b(u)$ to denote the number of trees in the decomposition obtained at $u$, in which $a(u)$ out of $b(u)$ trees are good trees. For $u$ and $k \ge 0$, we define

$$\mathrm{dist}(k, [a(u), b(u)]) = \min_{x \in [a(u), b(u)]} |x - k|,$$

$$k' = \mathrm{median}\{k, a(u), b(u)\} - \omega(u), \qquad (5)$$

$$f(u, k') = \begin{cases} 0 & \text{if } u \text{ is a leaf}, \\ C(u_1, k') + k' & \text{if } Ch(u) = \{u_1\}, \\ \sum_{u' \in Ch(u)} C(u', k') & \text{if } Ch(u) = \{u_1, u_2\}, \end{cases}$$

and

$$C(u, k) = \mathrm{dist}(k, [a(u), b(u)]) + f(u, k'). \qquad (6)$$

Theorem 4 Let $r$ be the root of $I(g)$. The decomposition $\mathcal{D}_r$ obtained by the above merging procedure determines a duplication history of redundant gene copies with the minimum mutation cost $C(r, 0)$.

Theorem 4 is proved in Appendix E. It suggests a two-step algorithm for reconstructing the evolution from $g$ to its children in linear time (Figure 4). First, we compute the numbers of good and defect trees obtained at the internal nodes in $I(g)$ by visiting all the nodes in order from leaf to root, which guarantees that we visit all the children of a node before the node itself. We then identify duplications and losses by computing the numbers of genes flowing into and out of the branches in $I(g)$, top down from root to leaf. To take into account the basal gene copies, we add one to the numbers of ancestral gene copies flowing into and out of each branch. Figure 5 gives an example to illustrate this algorithm.

Recall that we assume $d(u) = d(p(u)) + 1$ for each $u \in V(I(g))$ in the algorithm described above. It can be modified for general cases by (i) finding all maximal subtrees of $I(g)$ that do not contain any branch $(u, v)$ such that $d(v) > d(u) + 2$ in $S$ and then (ii) for each subtree $T$ found in (i), replacing every branch $(u, v)$ such that $d(v) = d(u) + 2$ by the two-branch path between $u$ and $v$ in $S$ and then applying the algorithm to the resulting subtree $T'$. The complete version of this algorithm can be found in Appendix F.
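The bottom-up counts of Proposition 1 and the cost recurrence of equations (5) and (6) can be sketched as follows. The sketch assumes unit branch lengths, $d(u) = d(p(u)) + 1$, and takes $a(u) = b(u) = \omega(u) - 1$ at a leaf (the redundant copies only), which is how Figure 4 initializes the leaves. The function names and the three small examples are hypothetical.

```python
def median3(x, y, z):
    """Median of three numbers."""
    return sorted((x, y, z))[1]


def good_and_total_counts(children, omega, root):
    """Post-order pass: a[u] good trees out of b[u] trees at every node u."""
    a, b = {}, {}

    def visit(u):
        kids = children.get(u, [])
        for c in kids:
            visit(c)
        w = omega.get(u, 0)
        if not kids:                       # leaf: redundant copies only
            a[u] = b[u] = w - 1
        elif len(kids) == 2:               # Proposition 1: m2 and m3
            max_a = max(a[kids[0]], a[kids[1]])
            min_b = min(b[kids[0]], b[kids[1]])
            a[u] = w + min(max_a, min_b)
            b[u] = w + max(max_a, min_b)
        else:                              # single child
            a[u] = w
            b[u] = w + a[kids[0]]

    visit(root)
    return a, b


def mutation_cost(children, omega, a, b, u, k):
    """C(u, k) of equation (6), with k' taken from equation (5)."""
    dist = max(a[u] - k, k - b[u], 0)      # distance from k to [a(u), b(u)]
    kids = children.get(u, [])
    if not kids:
        return dist
    k_next = median3(k, a[u], b[u]) - omega.get(u, 0)
    below = sum(mutation_cost(children, omega, a, b, c, k_next) for c in kids)
    if len(kids) == 1:
        below += k_next                    # the extra k' term of f(u, k')
    return dist + below


def minimum_cost(children, omega, root):
    a, b = good_and_total_counts(children, omega, root)
    return mutation_cost(children, omega, a, b, root, 0)


# Two leaves with two children of g each: one duplication in the root branch.
COST_CHERRY = minimum_cost({"r": ["x", "y"]}, {"x": 2, "y": 2}, "r")
# One child of g at the root r and two at its only child x.
COST_CHAIN = minimum_cost({"r": ["x"]}, {"r": 1, "x": 2}, "r")
# Three children of g at x and one at y.
COST_SKEWED = minimum_cost({"r": ["x", "y"]}, {"x": 3, "y": 1}, "r")
```

The costs count only the events affecting redundant gene copies, in line with Theorem 4; the fixed losses of the basal lineage are not included.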

Input: An annotated compressed child-image subtree I(g).
Output: The numbers of genes flowing into and out of the branches in I(g).

1. Traverse I(g) in post-order. Compute a(u) and b(u) at node u:

    if (u is a leaf) {
        a(u) = ω(u) − 1; b(u) = ω(u) − 1;
    } else if (Ch(u) = {u1, u2}) {
        max_a = max(a(u1), a(u2));
        min_b = min(b(u1), b(u2));
        a(u) = ω(u) + min(max_a, min_b);
        b(u) = ω(u) + max(max_a, min_b);
    } else if (Ch(u) = {u1}) {
        a(u) = ω(u); b(u) = a(u1) + ω(u);
    }

2. Traverse I(g) in pre-order. Compute in(u) and out(u) at node u:

    /* in(u) and out(u) denote the numbers of genes flowing */
    /* into and out of the branch (p(u), u)                 */
    if (u is the root) {
        α(u) = 0; β(u) = ω(u) + a(u);
    } else {
        α(u) = β(p(u)) − ω(p(u));
        β(u) = median{α(u), a(u), b(u)};
    }
    /* factor in the basal copy in each branch */
    in(u) = 1 + α(u); out(u) = 1 + β(u);

Figure 4: A linear-time algorithm for reconstructing the evolution from $g$ to its children. Here we assume that $d(u) = d(p(u)) + 1$ for each $u$ in $I(g)$. The general version of this algorithm is in Appendix F.

Figure 5: Illustration of the reconciliation algorithm. (A) A compressed child-image tree $I(g)$ with the redundant child genes of $g$ (bullets) drawn beside their image nodes. (B) The values of $a$ and $b$ at internal nodes. (C) The trees to be merged at each node in $I(g)$. Two good trees are obtained when the merging process terminates at the root. A subtree obtained at each node is good if its root is connected to at most one blue branch. (D) The duplication history of the redundant child gene copies in the compressed child-image tree, with the minimum mutation cost of 6. Note that the basal gene copies are not shown.

6 Experimental Tests

We compared a naive dynamic programming method (DP) (found in [9]) and a modified dynamic programming method (DP+C) (which applies the dynamic programming technique to the compressed child-image subtrees) to the proposed linear-time method (LT) using simulated data. For a fair comparison, we also implemented DP. Our implementation of DP is slightly faster than the dynamic programming approach found in NOTUNG [9], but to be fair the latter has several other features, such as listing all the inferred optimal solutions. All three programs were run to reconcile non-binary gene trees with the mutation cost, using the same machine (3.4 GHz and 8 GB RAM).

We measured their run times for 100 reconciliations between a non-binary tree containing 1.2n genes and its corresponding species tree over n species. For each size n, both the species tree and a binary gene tree with 1.2n leaves were generated using the Yule model. The leaves in the gene tree were labeled with random species selected from a uniform distribution. Finally, a non-binary gene tree was obtained from the binary gene tree by contracting each edge with a fixed rate p. We examined 40 cases by allowing n to take 10 different values in the range from 1,000 to 10,000 and setting the edge contraction rate p to either 0.4, 0.6, 0.7 or 0.8 (Figure 6). We also ran LT on 20 different tree sizes in the range from 5,000 to 100,000, which are too large for the other two methods. The results are summarized in Figure S2 in Appendix G and confirm that the run times of LT are linearly proportional to the size of the gene trees. LT is slightly faster than DP+C, and 5 to 20 times faster than DP for gene trees with thousands of genes.

7 Discussion and future work

Here we present a linear-time algorithm to reconcile the non-binary gene tree of a gene family and the corresponding species tree to reconstruct the duplication history of the gene family with the minimum number of duplications and losses. The reconciliation is an order of magnitude faster than with other methods, which is achieved by using compressed child-image trees and working on irreducible duplication histories.

Our approach has several important benefits. First, we do not consider incomplete lineage sorting (ILS) events, which may not be rare and hence cannot be ignored in certain circumstances [18, 20]. Since the effect of an ILS event on the divergence of gene and species trees is similar to that of a duplication event, the concepts proposed here can easily be extended to take into account ILS. Second, the output of our program is actually a class of optimal duplication histories, not an individual history. This is because the program assigns multiple duplications to each branch in the species trees, and these duplications can be arranged in different ways.
Third, our linear-time algorithm is fast and hence is ideal for providing an online service for tree reconciliation (see our TxT server). Finally, our bottom-up approach can incorporate multiple sources of information on gene similarity, including sequence similarity and conserved gene order, when it is applied to genome-wide studies of the evolution of gene families. This is definitely an interesting future project.

Figure 6: Comparison of three algorithms: dynamic programming (DP), dynamic programming with compressed child-image subtrees (DP+C), and the proposed linear-time (LT) algorithm. Four plots of run time against the number of species are drawn for the four different edge contraction rates 0.4 (top left), 0.6 (top right), 0.7 (bottom left), and 0.8 (bottom right). The run time is given in microseconds and the number of species is in thousands.

References

[1] Arvestad, L., Lagergren, J. et al.: The gene evolution model and computing its associated probabilities. J. ACM 56, 1-44 (2009)
[2] Bansal, M.S., Alm, E.J., Kellis, M.: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinform. 28:i283-i291 (2012)
[3] Chang, W.C., Eulenstein, O.: Reconciling gene trees with apparent polytomies. In Proc. COCOON 06. pp
[4] Chauve, C., El-Mabrouk, N.: New perspectives on gene family evolution: losses in reconciliation and a link with supertrees. In Proc. of RECOMB 09, pp (2009)

[5] Chen, K., Durand, D., Farach-Colton, M.: NOTUNG: a program for dating gene duplications and optimizing gene family trees. J. Comput. Biol. 7, (2000)
[6] Chen, Z.Z., Deng, F., Wang, L.: Simultaneous identification of duplications, losses, and lateral gene transfers. IEEE/ACM TCBB 9: (2012)
[7] Doyon, J.P. et al.: Models, algorithms and programs for phylogeny reconciliation. Briefings Bioinform. 12: (2011)
[8] Dufayard, J.-F. et al.: Tree pattern matching in phylogenetic trees: automatic search for orthologs or paralogs in homologous gene sequence databases. Bioinformatics 21: (2005)
[9] Durand, D., Halldorsson, B., Vernot, B.: A hybrid micro-macroevolutionary approach to gene tree reconstruction. J. Comput. Biol. 13: (2006)
[10] Eulenstein, O. et al.: Reconciling phylogenetic trees. In Evolution After Duplication (eds: K. Dittmar, D. Liberles), pp Wiley-Blackwell, New Jersey, USA (2010)
[11] Fitch, W.M.: Distinguishing homologous from analogous proteins. Syst. Zool. 19: (1970)
[12] Goodman, M. et al.: Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28: (1979)
[13] Goodstadt, L., Ponting, C.: Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput. Biol. 2: e133 (2006)
[14] Górecki, P., Tiuryn, J.: DLS-trees: a model of evolutionary scenarios. Theoret. Comput. Sci. 359: (2006)
[15] Kellis, M. et al.: Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: (2003)
[16] Kristensen, D.M., Wolf, Y.I., Mushegian, A.R., Koonin, E.V.: Computational methods for gene orthology inference. Briefings Bioinform. 12: (2011)
[17] Lafond, M., Swenson, K.M., El-Mabrouk, N.: An optimal reconciliation algorithm for gene trees with polytomies. In Alg. in Bioinform., pp Springer (2012)
[18] Pollard, et al.: Widespread discordance of gene trees with species tree in Drosophila: evidence for incomplete lineage sorting. PLoS Genet. 2(10), e173 (2006)
[19] Schieber, B., Vishkin, U.: On finding lowest common ancestors: simplification and parallelization. SIAM J. Comput. 17: (1988)
[20] Stolzer, M. et al.: Inferring duplications, losses, transfers and incomplete lineage sorting with nonbinary species trees. Bioinformatics 28(18), i409-i415 (2012)
[21] Storm, C., Sonnhammer, E.: Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinform. 18:92-99 (2002)
[22] Tatusov, R.L. et al.: A genomic perspective on protein families. Science 278: (1997)
[23] Wapinski, I. et al.: Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54-61 (2007)
[24] Warnow, T.: Large-scale multiple sequence alignment and phylogeny estimation. In Models and Algorithms for Genome Evolution, pp Springer, London (2013)
[25] Zhang, L.X.: On a Mirkin-Muchnik-Smith conjecture for comparing molecular phylogenies. J. Comput. Biol. 4: (1997)
[26] Zheng, Y., Wu, T., Zhang, L.X.: A linear-time algorithm for reconciliation of non-binary gene tree and binary species tree. In Proc. COCOA 13, pp (2013)

Appendixes

Appendix A: Proof of Theorem 1

By preprocessing a species tree, one can compute the lowest common ancestor of any two nodes in constant time (Schieber and Vishkin, SIAM J. Comput. 17 (1988)), which leads to computing the images of all gene tree nodes under $\lambda$ in linear time $O(|G| + |S|)$ (Zhang, J. Comput. Biol. 4 (1997)). In the rest of this section, we assume that $\mathrm{lca}(s, s')$ can be computed in constant time for any $s, s' \in V(S)$. We also assume that, for each $s \in V(S)$, a linked list $pre(s)$ containing all the gene tree nodes mapped to $s$ is available for use, that is, $pre(s) = \{g \in V(G) \mid \lambda(g) = s\}$.

Proposition S1 Let $g \in V(G)$. Assume $g_1, g_2, \ldots, g_k$ is the arrangement of $Ch(g)$ such that $\lambda(g_1), \lambda(g_2), \ldots, \lambda(g_k)$ are visited from the earliest to the latest in the post-order traversal of $S$. Then
\[ V(I(g)) = \lambda(Ch(g)) \cup \{\mathrm{lca}(\lambda(g_i), \lambda(g_{i+1})) \mid 1 \le i \le k-1\}. \]

Proof Let $u \in V(I(g)) \setminus \lambda(Ch(g))$. By definition, $u$ has two children $u_1$ and $u_2$ in $I(g)$ and there is at least one image node below each of $u_1$ and $u_2$. Without loss of generality, we may assume that the nodes in $T(u_1)$ are visited before those in $T(u_2)$ in the post-order traversal of $S$. Thus, $\lambda(g_j) \in V(T(u_1))$ and $\lambda(g_{j+1}) \in V(T(u_2))$ for some $j$. This implies that $u = \mathrm{lca}(\lambda(g_j), \lambda(g_{j+1}))$.

Conversely, for any $j$, we let $v$ be $\mathrm{lca}(\lambda(g_j), \lambda(g_{j+1}))$. If $v \ne \lambda(g_{j+1})$, then $\lambda(g_j)$ and $\lambda(g_{j+1})$ are below different children of $v$. Hence, by definition, $v \in V(I(g))$. If $v = \lambda(g_{j+1})$, then $v \in \lambda(Ch(g)) \subseteq V(I(g))$ trivially.

Proposition S1 suggests that if the elements in $Ch(g)$ are arranged properly, we can compute the node set of $I(g)$ by simply applying the lca operation $|Ch(g)| - 1$ times, taking $O(|Ch(g)|)$ steps because each lca operation takes constant time in the preprocessed species tree.

We now present a linear time algorithm to compute $I(g)$ for all $g$ in $G$ simultaneously. First, we use the following linear time algorithm to properly rearrange the children in $Ch(g)$ for each $g$ so that the assumption in Proposition S1 holds. Here, we assume that a gene tree node $g$ of degree $k$ is represented by an array of $k$ pointers, denoted by $ptr(g)$, so that swapping the positions of two children can be done just by exchanging the corresponding pointer values.

    Traverse $G$ in a depth-first order
        set $i_h = 0$ at each $h \in V_o(G)$;
    /* Arrange the children of each $h \in V(G)$ according to */
    /* the positions of their images in the post-order traversal of $S$ */
    Traverse $S$ in post-order
        at each $s \in V(S)$, do {
            for each $h \in pre(s)$ {
                swap $h$ and the $i_{p(h)}$-th child of $p(h)$;
                $i_{p(h)} \leftarrow i_{p(h)} + 1$;
            }
        }

Second, traversing $G$ in post-order, we compute for each $s \in V(S)$ an array $B(s)$ that contains all the gene tree nodes $g$ such that $s \in I(g)$.

    /* Compute $B(s) = \{h \in V(G) \mid s \in I(h)\}$ for all $s \in V(S)$ */
    Traverse $G$ in post-order
        at each $h \in V(G)$, do {
            for each child $h_i \in Ch(h)$ { add $h$ into $B(\lambda(h_i))$ }
            for $i = 1$ to $|Ch(h)| - 1$ { add $h$ into $B(\mathrm{lca}(\lambda(h_i), \lambda(h_{i+1})))$; }
        }

Third, we store all the species tree nodes that comprise the compressed child-image subtree $I(g)$ of $g \in V(G)$ in a stack named ImageTreeNodes($g$) and then construct all the $I(g)$'s.

    /* ImageTreeNodes(g) contains... */
    Traverse $S$ by following its Euler tour
        at each $s \in V(S)$, do {
            for each $h \in B(s)$ { push $s$ into ImageTreeNodes($h$) }
        }
    Traverse $G$ in post-order
        Construct $I(h)$ at each $h \in V_o(G)$ by:
            make a copy $a$ of the elm. $e_1$ popped from ImageTreeNodes($h$);
            do {
                pop an elm. $e_2$ from ImageTreeNodes($h$);
                if ($e_2$ does not have a copy) make a copy $b$ of the elm. $e_2$;
                else assign the copy to $b$;
                if depth($e_1$) < depth($e_2$) { add an edge $(a, b)$ if $a$ and $b$ are not connected; }
                else { add an edge $(b, a)$ if $a$ and $b$ are not connected; }
                $a \leftarrow b$; $e_1 \leftarrow e_2$;
            } until ImageTreeNodes($h$) is empty;

We now analyze the time complexity of the algorithm. At Step 1, the sub-procedure of setting the counters of all gene tree nodes simply takes $|G|$ operations. The sub-procedure of rearranging the children of all gene tree nodes takes $|S| + 2|G|$ operations, as it visits each species tree node once and needs to do a swap and a counter increment for each branch in the input gene tree. At Step 2, the procedure for constructing $B(s)$ for all species tree nodes requires two insertion operations for each branch in the input gene tree. Hence, it takes at most $2|G|$ operations. At Step 3, since the total size of all the compressed child-image subtrees is at most $2|G|$, each of the traversal procedures takes at most $6|G|$ operations. Hence, our algorithm takes linear time to compute all the compressed child-image subtrees of the gene tree nodes.

The algorithm outputs the compressed child-image subtree of each $g \in V_o(G)$ in $S$. If $g$ is a binary node, $g$ is a duplication node if and only if $I(g)$ is a singleton or a two-node tree. If $g$ is non-binary, we will infer the optimal binary refinement of $g$ by working on $I(g)$. Finally, we assume that for each $s \in I(g)$, its depth $d(s)$ in the species tree $S$ is also computed and saved at $s$. Note that $d(s) - d(p(s))$ is the number of edges in the path from $p(s)$ to $s$ in $S$, which is needed to compute the gene loss cost of a binary refinement of $g$ in $S$.
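The node-set characterisation in Proposition S1 can be illustrated with a minimal Python sketch. It is not the linear-time procedure analysed above: it assumes a naive lca that climbs parent pointers instead of the constant-time lca of Schieber and Vishkin, and it sorts each child list instead of using the pre(s) lists. The function names, the dictionary-based tree encoding, and the toy trees are all hypothetical choices made for illustration.

```python
from functools import reduce


def postorder(children, root):
    """Return the nodes of a rooted tree in post-order (children before parent)."""
    order, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    return order


def child_image_nodes(s_children, s_root, g_children, g_root, leaf_species):
    """Return (lam, image_nodes): the lca mapping and V(I(g)) for each internal g."""
    order = postorder(s_children, s_root)
    rank = {s: i for i, s in enumerate(order)}      # post-order position in S
    parent, depth = {s_root: None}, {s_root: 0}
    for s in reversed(order):                       # parents come before children
        for c in s_children.get(s, []):
            parent[c], depth[c] = s, depth[s] + 1

    def lca(x, y):
        # Naive lca (assumption): repeatedly lift the deeper node.
        while x != y:
            if depth[x] < depth[y]:
                x, y = y, x
            x = parent[x]
        return x

    lam, image_nodes = dict(leaf_species), {}
    for g in postorder(g_children, g_root):
        kids = g_children.get(g, [])
        if not kids:
            continue                                # leaves are mapped by leaf_species
        # Arrange the child images by their post-order position in S.
        images = sorted((lam[c] for c in kids), key=rank.__getitem__)
        nodes = set(images)
        # Proposition S1: add the lca of each pair of consecutive images.
        for a, b in zip(images, images[1:]):
            nodes.add(lca(a, b))
        image_nodes[g] = nodes
        lam[g] = reduce(lca, images)                # image of g under the lca mapping
    return lam, image_nodes


# Toy species tree ((A,B)X,(C,D)Y)R and a gene tree with one non-binary node h.
s_children = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"]}
g_children = {"r": ["h", "g5"], "h": ["g1", "g2", "g3"]}
leaf_species = {"g1": "A", "g2": "B", "g3": "A", "g5": "D"}
lam, image_nodes = child_image_nodes(s_children, "R", g_children, "r", leaf_species)
```

In this toy input the species node Y is not an image and has only one image below it, so it is compressed out of the node set computed for the gene root.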

Appendix B: An improved dynamic programming method

Theorem 1 leads immediately to an improved method for tree reconciliation. Let $A(u, k)$ be the cost of the optimal duplication history $H$ of the child genes in $I(g)(u)$ with $k$ ancestral genes flowing into $(p(u), u)$ in $I(g)$ in the $(w_d, w_l)$-affine cost model, where $k \le |Ch(g)|$. Obviously, the restriction of $H$ in $T(u')$ must be optimal for each child $u' \in Ch(u)$.

If $u$ is a leaf in $I(g)$, we set
\[ w' = \begin{cases} w_d & \text{if } \omega(u) > k, \\ w_l & \text{if } \omega(u) \le k. \end{cases} \]
There are $d(u) - d(p(u)) - 1$ nodes between $p(u)$ and $u$ in $S$. Assume that there are $t$ genes flowing into the branch entering $u$ in $S$. Then
\[ A(u, k) = \min_{t} \bigl[(k - t)\, w_l + (\omega(u) - t)\, w_d + t\, w_l\, c_u\bigr] = w'' \min\{k, \omega(u)\} + w'\, |k - \omega(u)|, \tag{7} \]
where $c_u = d(u) - d(p(u)) - 1$, $w'' = \min\{w_d + w_l,\; c_u w_l\}$, and $\omega(u)$ is defined in (2).

If $u$ is a node with only one child $u_1$ in $I(g)$, we assume that $t$ genes flow into the branch entering $u$ and $k'$ genes flow out of $u$. The gene duplication and loss events in the path from $p(u)$ to $u$ then cost
\[ \min_{t} \bigl[(k - t)\, w_l + (\omega(u) - t)\, w_d + t\, w_l\, c_u + k' w_l\bigr] = w'' \min\{k, \omega(u)\} + w'\, |k - \omega(u)| + k' w_l. \]
Hence, for this case,
\[ A(u, k) = \min_{1 \le k' \le |Ch(g)|} \bigl[\, w'' \min\{k, \omega(u)\} + w'\, |k - \omega(u)| + k' w_l + A(u_1, k') \bigr]. \tag{8} \]

Similarly, if $u$ is a node with two children $u_1$ and $u_2$ in $I(g)$, we have:
\[ A(u, k) = \min_{1 \le k' \le |Ch(g)|} \bigl[\, w'' \min\{k, \omega(u)\} + w'\, |k - \omega(u)| + A(u_1, k') + A(u_2, k') \bigr]. \tag{9} \]

By implementing a dynamic programming algorithm based on Eqn. (7)-(9) for reconciliation in $I(g)$, we can compute an optimal reconciliation of $G$ and $S$ in time $O\bigl(|S| + \sum_{g \in V_o(G)} d^2\, |I(g)|\bigr) = O(|S| + d^2 |G|)$ under the affine cost model, where $d$ is the largest degree of a node in $G$. The time required for finding the compressed child-image subtrees is factored into this estimate.
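As a sanity check on the leaf case, the following Python sketch compares a brute-force minimisation with the closed form of Eqn. (7). It rests on stated assumptions: Eqn. (7) is read as minimising $(k - t) w_l + (\omega(u) - t) w_d + t\, w_l\, c_u$ over integers $0 \le t \le \min\{k, \omega(u)\}$ (the range of $t$ is not spelled out in the text), the case-defined weight multiplies $|k - \omega(u)|$, and $\min\{w_d + w_l, c_u w_l\}$ multiplies $\min\{k, \omega(u)\}$. The function names are hypothetical.

```python
from itertools import product


def leaf_cost_bruteforce(k, omega, c_u, w_d, w_l):
    """Minimise the leaf cost over t genes surviving the branch (assumed range)."""
    return min(
        (k - t) * w_l + (omega - t) * w_d + t * w_l * c_u
        for t in range(min(k, omega) + 1)
    )


def leaf_cost_closed_form(k, omega, c_u, w_d, w_l):
    """Closed form of Eqn. (7) under the reading described in the lead-in."""
    w_case = w_d if omega > k else w_l          # weight on |k - omega|
    w_min = min(w_d + w_l, c_u * w_l)           # weight on min{k, omega}
    return w_min * min(k, omega) + w_case * abs(k - omega)


def closed_form_matches(max_count=6, max_gap=4, weights=(1, 2, 3)):
    """Compare both formulas on a small grid of integer parameters."""
    grid = product(range(max_count), range(max_count), range(max_gap), weights, weights)
    return all(
        leaf_cost_bruteforce(k, om, c, wd, wl) == leaf_cost_closed_form(k, om, c, wd, wl)
        for k, om, c, wd, wl in grid
    )
```

Because the objective is linear in $t$, its minimum is attained at $t = 0$ or $t = \min\{k, \omega(u)\}$, which is why the closed form only needs the smaller of the two slopes $w_d + w_l$ and $c_u w_l$.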

Appendix C: Proof of Theorem 2

Theorem 2 Every duplication history $H$ from $g$ to $Ch(g)$ is equivalent to an irreducible duplication history $H'$ such that $d_{H'} \le d_H$ and $l_{H'} \le l_H$.

Proof We prove the statement by induction on the number $k$ of duplications occurring in $H$. If $k = 0$, so that $H$ is a speciation history, then $\lambda(Ch(g)) = V_{lf}(S|_{\lambda(Ch(g))})$, derived from the definition of $S|_{\lambda(Ch(g))}$. Therefore, $H$ itself is irreducible.

Assume the statement is true for any duplication history with $k - 1$ duplications. Consider the most recent duplication event $E$ of $H$. Assume it occurs in a branch $(p(u), u)$, $u \in V(S|_{\lambda(Ch(g))})$, suggesting that $H$ has no duplication occurring in the subtree $T(u)$. Each ancestral gene derived from $E$ has at most one descendant gene copy in each leaf in $T(u)$. Fix such an ancestral gene $o$ and let $DS(o)$ be the set of leaves that contain a descendant of $o$. Note that $DS(o) \subseteq V_{lf}(S|_{\lambda(Ch(g))})$.

Removing $E$ from the duplication history $H$ results in a duplication history $H'$. $H'$ has $k - 1$ duplications and covers all the gene copies that are not descendants of $o$. By induction,
\[ \lambda(Ch(g)) \setminus DS(o) = D_0 \uplus D_1 \uplus \cdots \uplus D_{k'}, \quad k' \le k - 1. \]
If $D_0 = V_{lf}(S|_{\lambda(Ch(g))})$, then $\lambda(Ch(g)) = D_0 \uplus D_1 \uplus \cdots \uplus D_{k'} \uplus DS(o)$ is a desired decomposition.

If $D_0 \ne V_{lf}(S|_{\lambda(Ch(g))})$, then $S|_{DS(o) \setminus D_0}$ is a forest subgraph of $T(u)$. Assume it has $m \ge 1$ tree components, say $T_1, T_2, \ldots, T_m$. Define
\[ D'_0 \overset{\mathrm{def}}{=} D_0 \cup V_{lf}(T_1) \cup V_{lf}(T_2) \cup \cdots \cup V_{lf}(T_m) = V_{lf}(S|_{\lambda(Ch(g))}), \]
\[ D_{k'+1} = DS(o) \setminus [V_{lf}(T_1) \cup V_{lf}(T_2) \cup \cdots \cup V_{lf}(T_m)]. \]
We thus have
\[ \lambda(Ch(g)) = D'_0 \uplus D_1 \uplus \cdots \uplus D_{k'} \uplus D_{k'+1}. \]
This decomposition defines a unique, irreducible history that is equivalent to $H$. By moving the leaves of all $T_i$'s from the last term to the first term, the gene loss cost of the speciation history defined by the first term decreases by $m$, whereas the gene loss cost of the speciation history defined by the last term increases by at most $m$. Hence, we have obtained a desired decomposition.

Appendix D: Proof of Theorem 3

Theorem 3 Let $D: D_1 \uplus D_2 \uplus \cdots \uplus D_k$ be the minimum decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$. If $D$ gives a duplication history such that redundant gene copies have the minimum gene loss cost, compared to all other duplication histories with the same mutation cost, then the speciation history $I(g)|_{D_i}$ satisfies the following properties for each $i$:
(1). The subtree $T(u)$ below any degree-2 node $u$ cannot be a defect tree.
(2). $I(g)|_{D_i}$ must be a good tree.

Proof (1). Without loss of generality, assume that $u$ is a node of degree 2 in $I(g)|_{D_1}$. If $T(u)$ is a defect tree (panel B in Figure S1), we consider a maximal set of incomparable degree-2 nodes $\{u_1, \ldots, u_j\}$ in $T(u)$. We have:
\[ V_{lf}(T(u)) = V_{lf}(T(u_1)) \cup V_{lf}(T(u_2)) \cup \cdots \cup V_{lf}(T(u_j)). \]
By replacing $D_1$ with $\{V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j)), D_1 \setminus V_{lf}(T(u))\}$, we obtain the following decomposition
\[ D': V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)) \uplus D_1 \setminus V_{lf}(T(u)) \uplus D_2 \uplus \cdots \uplus D_k. \]
It is easy to see that the duplication cost of $D'$ is equal to $k + j$. Further, by partitioning $D_1$ into $V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j))$, and $D_1 \setminus V_{lf}(T(u))$, the gene loss events occurring at the degree-2 nodes $u_1, u_2, \ldots, u_j$, and $u$ are eliminated, and a new gene loss is introduced at $p(u)$ in the corresponding speciation history of $D_1 \setminus V_{lf}(T(u))$, as illustrated in Figure S1C. Hence, the mutation cost of $D'$ is equal to the mutation cost of $D$, but its gene loss cost is less than that of $D$. This contradicts the fact that $D$ is a minimum decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$ with the minimum gene loss cost.

(2). Without loss of generality, we may assume that $I(g)|_{D_1}$ is a defect tree. Consider a maximal set $\{u_1, \ldots, u_j\}$ of incomparable nodes of degree 2 in the tree. We have
\[ D_1 = V_{lf}(T(u_1)) \cup V_{lf}(T(u_2)) \cup \cdots \cup V_{lf}(T(u_j)). \]
By replacing $D_1$ with $\{V_{lf}(T(u_1)), V_{lf}(T(u_2)), \ldots, V_{lf}(T(u_j))\}$, we obtain the following decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$:
\[ D': V_{lf}(T(u_1)) \uplus V_{lf}(T(u_2)) \uplus \cdots \uplus V_{lf}(T(u_j)) \uplus D_2 \uplus \cdots \uplus D_k. \]
The duplication cost of the corresponding speciation history of $D'$ is $j - 1$ plus that of $D$. However, the gene loss cost of $D'$ is $j$ less than that of $D$. Hence, the mutation cost of $D'$ is less than that of $D$. This contradicts the assumption that $D$ is a minimum decomposition of $\lambda(Ch(g)) \setminus V_{lf}(I(g))$ having the minimum gene loss cost.

Figure S1: A. A defect tree in which degree-2 nodes are colored blue. B. A defect subtree (below $u$) inside a speciation history, in which $\{u_1, u_2, u_3\}$ is a maximal set of incomparable nodes of degree 2. C. The speciation history in (B) is decomposed into a duplication history with the same mutation cost but a smaller gene loss cost.

Appendix E: Proof of Theorem 4

Recall that a good tree has at least one root-to-leaf path not containing any degree-2 nodes. Defect trees are those that are not good. For convenience, we let $a_u = a(u)$, $b_u = b(u)$, $a_i = a(u_i)$ and $b_i = b(u_i)$ for $i = 1, 2$. We also set $\mathrm{dist}(k, [a, b]) = d(k, [a, b])$.

Lemma S1 For a child $v$ of $u$, $a_v$ and $b_v$ are defined to be the numbers of good trees and defect trees respectively in the decomposition $D_v$ associated with $v$. The following facts are true for $k'$ defined in Eqn. (6).
(1) If $Ch(u) = \{u_1\}$, $k' = \mathrm{median}\{k - \omega(u), 0, a_1\}$.
(2) If $Ch(u) = \{u_1, u_2\}$, $k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\}$.

Proof (1). The first statement is derived from the facts that $a_u = 0 + \omega(u)$ and $b_u = a_1 + \omega(u)$ if $Ch(u) = \{u_1\}$.

(2) Let $m_1 \le m_2 \le m_3 \le m_4$ be the arrangement of $a_1, b_1, a_2, b_2$ from smallest to largest as in Proposition 2. By the merging procedure, we have $a_u = m_2 + \omega(u)$, $b_u = m_3 + \omega(u)$. Hence,
\[ \{k - \omega(u), a_u - \omega(u), b_u - \omega(u)\} = \{k - \omega(u), m_2, m_3\}, \]
and
\[ \{k - \omega(u), a_1, b_1, a_2, b_2\} = \{k - \omega(u), m_1, m_2, m_3, m_4\}. \]
Since
\[ m_1 \le m_2 \le \mathrm{median}\{k - \omega(u), m_2, m_3\} \le m_3 \le m_4, \]
adding $m_1$ and $m_4$ places one value on each side of the median and leaves it unchanged, so
\[ \mathrm{median}\{k - \omega(u), m_1, m_2, m_3, m_4\} = \mathrm{median}\{k - \omega(u), m_2, m_3\}. \]
This concludes the proof.

Lemma S2 For any integer $k \ge 0$ and $u \in V(I(g))$,
\[ C(u, k) = d(k, [a_u, b_u]) + C(u, a_u). \tag{10} \]

Proof We prove the lemma by induction. For a leaf $u$, $a_u = b_u = \omega(u)$. By definition, $C(u, k) = d(k, [a_u, b_u]) = |k - a_u|$ and $C(u, a_u) = d(a_u, [a_u, b_u]) = 0$. Hence, Eqn. (10) holds.

We now assume that Eqn. (10) holds for the children of $u$. If $u$ has only one child $u_1$, then $a_u = \omega(u)$ and $b_u = \omega(u) + a_1$, implying, by part 1 of Lemma S1:
\[ k' = \mathrm{median}\{k - \omega(u), 0, a_1\} \le a_1. \tag{11} \]
By induction, Eqn. (10) holds for $u_1$. In particular,
\[ C(u_1, 0) = a_1 + C(u_1, a_1), \]
and
\[ C(u_1, k') = d(k', [a_1, b_1]) + C(u_1, a_1). \]
Applying Inequality (11), which gives $d(k', [a_1, b_1]) + k' = a_1$, we obtain:
\[ C(u, k) = d(k, [a_u, b_u]) + k' + C(u_1, k') = d(k, [a_u, b_u]) + d(k', [a_1, b_1]) + k' + C(u_1, a_1) = d(k, [a_u, b_u]) + a_1 + C(u_1, a_1) \]
and
\[ d(k, [a_u, b_u]) + C(u, a_u) = d(k, [a_u, b_u]) + C(u_1, 0) = d(k, [a_u, b_u]) + a_1 + C(u_1, a_1). \]

If $u$ has two children $u_1, u_2$,
\[ C(u, k) = d(k, [a_u, b_u]) + C(u_1, k') + C(u_2, k') = d(k, [a_u, b_u]) + \sum_{i=1,2} d(k', [a_i, b_i]) + \sum_{i=1,2} C(u_i, a_i). \]
On the other hand, since $\mathrm{median}\{a_u - \omega(u), a_u - \omega(u), b_u - \omega(u)\} = a_u - \omega(u)$,
\[ d(k, [a_u, b_u]) + C(u, a_u) = d(k, [a_u, b_u]) + C(u_1, a_u - \omega(u)) + C(u_2, a_u - \omega(u)) = d(k, [a_u, b_u]) + \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) + \sum_{i=1,2} C(u_i, a_i). \]
Without loss of generality, we may assume that $a_2 \le a_1$. We consider two cases to prove that $\sum_{i=1,2} d(k', [a_i, b_i]) = \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i])$.

If $a_2 \le b_2 \le a_1 \le b_1$, then $b_2 \le k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\} \le a_1$, and thus
\[ \sum_{i=1,2} d(k', [a_i, b_i]) = (k' - b_2) + (a_1 - k') = a_1 - b_2, \]
and
\[ \sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) = (b_2 - b_2) + (a_1 - b_2) = a_1 - b_2. \]
If $a_2 \le a_1 \le \min(b_1, b_2)$, then $a_1 \le k' = \mathrm{median}\{k - \omega(u), a_1, b_1, a_2, b_2\} \le \min(b_1, b_2)$, and thus $\sum_{i=1,2} d(k', [a_i, b_i]) = 0$ and $\sum_{i=1,2} d(a_u - \omega(u), [a_i, b_i]) = 0$. This concludes the proof of Lemma S2.

For any finite set $X$ of real numbers (repeated elements allowed) and a real number $r$, define
\[ f_X(r) \overset{\mathrm{def}}{=} \sum_{x \in X} |x - r|. \]
It is not hard to see that
\[ d(x, [i_1, i_2]) = \tfrac{1}{2}\bigl[f_{\{i_1, i_2\}}(x) - (i_2 - i_1)\bigr], \quad x \in \mathbb{R}. \tag{12} \]

Lemma S3 For any disjoint real intervals $[i_1, i_2]$ and $[i_3, i_4]$ and any $i_5 \in \mathbb{R}$,
\[ d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5) \ge d(m, [i_1, i_2]) + d(m, [i_3, i_4]) + d(m, i_5), \tag{13} \]
for any $x \in \mathbb{R}$, where $m = \mathrm{median}\{i_1, i_2, i_3, i_4, i_5\}$.

Proof The inequality is derived from:
\[ d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5) = \tfrac{1}{2}\bigl[f_{\{i_1, i_2, i_3, i_4, i_5, i_5\}}(x) - (i_2 - i_1) - (i_4 - i_3)\bigr]. \]
Note that the middle values minimize the sum of distances $f_{\{i_1, i_2, i_3, i_4, i_5, i_5\}}(x)$ (Dasgupta, Papadimitriou and Vazirani, Algorithms, page 86). Hence, as a middle value of $\{i_1, i_2, i_3, i_4, i_5, i_5\}$, $\mathrm{median}\{i_1, i_2, i_3, i_4, i_5\}$ minimizes $d(x, [i_1, i_2]) + d(x, [i_3, i_4]) + d(x, i_5)$.

Lemma S4 For any real interval $[i_1, i_2]$, and any $i_3 \in \mathbb{R}$,
\[ d(x, [i_1, i_2]) + d(x, i_3) + x \ge d(m, [i_1, i_2]) + d(m, i_3) + m, \]
for any $x \in \mathbb{R}$, where $m = \mathrm{median}\{0, i_1, i_3\}$.
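The two median inequalities can be spot-checked numerically. The Python sketch below is an empirical check on small integer grids, not a proof. It assumes that the second interval in Lemma S3 is $[i_3, i_4]$ and that $i_1$ and $i_3$ in Lemma S4 are non-negative, as they are when they count gene copies. The helper names are hypothetical.

```python
from itertools import product
from statistics import median


def dist(x, lo, hi):
    """Distance from x to the closed interval [lo, hi]; zero when x lies inside."""
    return max(lo - x, 0, x - hi)


def lemma_s3_holds(i1, i2, i3, i4, i5, xs):
    """Check that m = median{i1..i5} is no worse than any x in xs (Lemma S3)."""
    def cost(x):
        return dist(x, i1, i2) + dist(x, i3, i4) + abs(x - i5)
    m = median([i1, i2, i3, i4, i5])
    return all(cost(x) >= cost(m) for x in xs)


def lemma_s4_holds(i1, i2, i3, xs):
    """Check that m = median{0, i1, i3} is no worse than any x in xs (Lemma S4)."""
    def cost(x):
        return dist(x, i1, i2) + abs(x - i3) + x
    m = median([0, i1, i3])
    return all(cost(x) >= cost(m) for x in xs)


def check_small_grids(limit=5):
    """Run both checks over all intervals with integer endpoints in [0, limit)."""
    xs = range(-3, limit + 4)
    intervals = [(lo, hi) for lo in range(limit) for hi in range(lo, limit)]
    s3 = all(
        lemma_s3_holds(a, b, c, d, e, xs)
        for (a, b), (c, d), e in product(intervals, intervals, range(limit))
    )
    s4 = all(
        lemma_s4_holds(a, b, c, xs)
        for (a, b), c in product(intervals, range(limit))
    )
    return s3, s4
```

Both sums are convex and piecewise linear in $x$, so checking integer points around the interval endpoints is enough to catch a wrong minimiser on such grids.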
