BIOINFORMATICS

Gabriel Valiente
Algorithms, Bioinformatics, Complexity and Formal Methods Research Group (ALBCOM)
Technical University of Catalonia
2005-2006

Introduction

Phylogenetic reconstruction
- April 27: Ultrametric trees
- May 4: Additive and non-additive trees
- May 11: Perfect phylogenies

Taxonomic reconstruction
- May 18: Compatibility
- May 25: Consensus
- June 1: Combination

- Michael S. Waterman (University of Southern California). Introduction to Computational Biology. Chapman & Hall, 1995.
- Dan Gusfield (University of California, Davis). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
- Roderic D. M. Page (University of Glasgow) and Edward C. Holmes (University of Oxford). Molecular Evolution: A Phylogenetic Approach. Blackwell Science, 1998.
- Gabriel Valiente (Technical University of Catalonia). Algorithms on Trees and Graphs. Springer-Verlag, 2002.
- Neil C. Jones, Pavel A. Pevzner (University of California, San Diego). An Introduction to Bioinformatics Algorithms. The MIT Press, 2004.
- Arthur M. Lesk (Pennsylvania State University). Introduction to Bioinformatics. 2nd Edition. Oxford University Press, 2005.

Ultrametric trees

The (evolutionary) distance $D_{i,j}$ between two species $i$ and $j$ measures the length of time since the species diverged.


[Figure: evolutionary tree of Brown Bear, Polar Bear, Black Bear, Spectacled Bear, Giant Panda, Raccoon, and Red Panda, with divergence times of 2.5, 5, 10, 20, 30, and 35 MYA at the internal nodes. No correlation between evolutionary distances and edge lengths.]

[Figure: the same tree annotated with edge lengths of 2.5, 2.5, 2.5, 5, 5, 5, 10, 10, 15, 20, 30, and 30 MYA. No correlation between evolutionary distances and edge lengths.]

[Figure: the same tree drawn against a time axis running from 40 MYA to 0 in steps of 5.]

Given a weighted tree $T$ with $n$ leaves, compute the length $d^T_{i,j}$ of the path between any two leaves $i$ and $j$.

[Figure: weighted tree with leaves 1 to 6 and edge weights 12, 16, 13, 13, 14, 17, 12, 12, 13.]

The length of the path between any two nodes is the sum of the weights of the edges on the path between them. For example, $d^T_{1,5} = 12 + 13 + 14 + 17 + 12 = 68$.
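A minimal Python sketch of the path-length computation. The edge list is an assumption: it is one topology (internal nodes named "a" to "d" here for illustration) that uses the nine edge weights of the figure and reproduces $d^T_{1,5} = 68$; the drawing itself is not recoverable from the text.

```python
from collections import defaultdict

# Hypothetical reconstruction of the example tree: leaves 1..6, internal nodes "a".."d".
EDGES = [
    (1, "a", 12), (6, "a", 13), ("a", "b", 13), (2, "b", 12),
    ("b", "c", 14), (3, "c", 16), ("c", "d", 17), (4, "d", 13), (5, "d", 12),
]

def build_adjacency(edges):
    """Undirected adjacency lists: node -> [(neighbor, weight), ...]."""
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj

def path_length(adj, source, target):
    """Sum of the edge weights on the unique tree path from source to target."""
    stack = [(source, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for neighbor, weight in adj[node]:
            if neighbor != parent:          # never walk back along the same edge
                stack.append((neighbor, node, dist + weight))
    raise ValueError("target not reachable from source")

adj = build_adjacency(EDGES)
print(path_length(adj, 1, 5))  # 12 + 13 + 14 + 17 + 12 = 68
```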

Given an $n \times n$ distance matrix $D$, find a tree $T$ with $n$ leaves that fits the data, that is, such that $d^T_{i,j} = D_{i,j}$ for every two leaves $i$ and $j$.

\[
D = \begin{pmatrix}
0 & 37 & 55 & 69 & 68 & 25 \\
37 & 0 & 42 & 56 & 55 & 38 \\
55 & 42 & 0 & 46 & 45 & 56 \\
69 & 56 & 46 & 0 & 25 & 70 \\
68 & 55 & 45 & 25 & 0 & 69 \\
25 & 38 & 56 & 70 & 69 & 0
\end{pmatrix}
\]

- A matrix $D$ is symmetric non-negative if $D_{i,j} = D_{j,i}$ and $D_{i,j} \geq 0$ for all $i$ and $j$.
- A matrix $D$ satisfies the triangle inequality if $D_{i,j} + D_{j,k} \geq D_{i,k}$ for all $i$, $j$, and $k$.
- A matrix $D$ is a distance matrix if it is symmetric non-negative, it satisfies the triangle inequality, and $D_{i,j} > 0$ for all $i \neq j$.
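The three conditions can be checked directly. The following Python sketch is only an illustration of the definition (the function name is hypothetical), applied to the matrix of the example and to a small matrix that violates the triangle inequality.

```python
def is_distance_matrix(D):
    """Check the definition: symmetric non-negative, triangle inequality,
    and strictly positive off-diagonal entries."""
    idx = range(len(D))
    symmetric_nonneg = all(D[i][j] == D[j][i] and D[i][j] >= 0
                           for i in idx for j in idx)
    triangle = all(D[i][j] + D[j][k] >= D[i][k]
                   for i in idx for j in idx for k in idx)
    positive = all(D[i][j] > 0 for i in idx for j in idx if i != j)
    return symmetric_nonneg and triangle and positive

D = [
    [0, 37, 55, 69, 68, 25],
    [37, 0, 42, 56, 55, 38],
    [55, 42, 0, 46, 45, 56],
    [69, 56, 46, 0, 25, 70],
    [68, 55, 45, 25, 0, 69],
    [25, 38, 56, 70, 69, 0],
]
print(is_distance_matrix(D))                                   # True
print(is_distance_matrix([[0, 1, 5], [1, 0, 1], [5, 1, 0]]))   # False: 1 + 1 < 5
```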

There are many ways in which distance matrices can be generated:

- Sequence a particular gene in $n$ species and define $D_{i,j}$ as the edit distance between this gene in species $i$ and species $j$.
- Sequence a particular gene in $n$ species and define $D_{i,j}$ as the alignment distance between this gene in species $i$ and species $j$.

There is only one unrooted binary tree topology $T$ with $n = 3$ leaves: the leaves $i$, $j$, and $k$ are joined to a central node $c$ by edges of lengths $d^T_{i,c}$, $d^T_{j,c}$, and $d^T_{k,c}$.

The length of each edge in $T$ is defined by three equations in three variables,

\[
d^T_{i,c} + d^T_{j,c} = D_{i,j}, \qquad
d^T_{i,c} + d^T_{k,c} = D_{i,k}, \qquad
d^T_{j,c} + d^T_{k,c} = D_{j,k},
\]

whose solution is

\[
d^T_{i,c} = \frac{D_{i,j} + D_{i,k} - D_{j,k}}{2}, \qquad
d^T_{j,c} = \frac{D_{i,j} + D_{j,k} - D_{i,k}}{2}, \qquad
d^T_{k,c} = \frac{D_{i,k} + D_{j,k} - D_{i,j}}{2}.
\]
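As a quick numerical check of the closed-form solution, the sketch below (function name and input values chosen for illustration) computes the three edge lengths and verifies that they add up to the given distances.

```python
def three_leaf_edges(d_ij, d_ik, d_jk):
    """Edge lengths (i-c, j-c, k-c) of the unique tree on three leaves."""
    return ((d_ij + d_ik - d_jk) / 2,
            (d_ij + d_jk - d_ik) / 2,
            (d_ik + d_jk - d_ij) / 2)

# Illustrative distances: D_ij = 37, D_ik = 25, D_jk = 38.
e_i, e_j, e_k = three_leaf_edges(37, 25, 38)
print(e_i, e_j, e_k)          # 12.0 25.0 13.0
print(e_i + e_j, e_i + e_k, e_j + e_k)  # 37.0 25.0 38.0
```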

An unrooted binary tree with $n$ leaves has $2n - 3$ edges. Fitting any given tree $T$ with $n$ leaves to an $n \times n$ distance matrix $D$ involves solving a system of $\binom{n}{2}$ equations in $2n - 3$ variables.

For $n = 4$, this amounts to solving a system of six equations in only five variables. Such a system does not always have a solution, which makes it hard or impossible to construct such a tree $T$ from $D$.

A distance matrix $D$ is ultrametric if, for every three leaves $i$, $j$, and $k$, the two largest of the three distances $D_{i,j}$, $D_{i,k}$, $D_{j,k}$ are equal (three point condition):

\[
D_{i,j} \leq D_{i,k} = D_{j,k}
\quad\text{or}\quad
D_{i,k} \leq D_{i,j} = D_{j,k}
\quad\text{or}\quad
D_{j,k} \leq D_{i,j} = D_{i,k}.
\]

[Figure: tree on leaves $i$, $j$, $k$ with edge lengths $a$, $a$, and $b$.] $D_{i,j} = 2a \leq a + b = D_{i,k} = D_{j,k}$ implies $a \leq b$.

It can be determined in $O(n^3)$ time whether or not an $n \times n$ distance matrix $D$ is ultrametric.
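A direct $O(n^3)$ test of the three point condition can be sketched in Python as follows. The two test matrices are small illustrative inputs; exact comparison is used, so real-valued data would need a tolerance.

```python
from itertools import combinations

def is_ultrametric(D):
    """Three point condition: among D[i][j], D[i][k], D[j][k],
    the two largest values must be equal, for every triple of leaves."""
    for i, j, k in combinations(range(len(D)), 3):
        smallest, middle, largest = sorted((D[i][j], D[i][k], D[j][k]))
        if middle != largest:
            return False
    return True

ultrametric = [[0, 2, 4, 6], [2, 0, 4, 6], [4, 4, 0, 6], [6, 6, 6, 0]]
not_ultrametric = [[0, 2, 2, 2], [2, 0, 3, 2], [2, 3, 0, 2], [2, 2, 2, 0]]
print(is_ultrametric(ultrametric))      # True
print(is_ultrametric(not_ultrametric))  # False: the triple (0, 1, 2) gives 2, 2, 3
```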

Ultrametric distance matrices model evolutionary trees.

- An evolutionary tree is a rooted binary tree with internal nodes labeled by a number and with strictly decreasing labels along any root-to-leaf path.
- For every two leaves $i$ and $j$, their distance $D_{i,j}$ is the label of the least common ancestor of species $i$ and $j$.

\[
\begin{array}{c|ccccccc}
  & A & B & C & D & E & F & G \\ \hline
A & 0 & 1 & 2 & 4 & 8 & 14 & 14 \\
B & 1 & 0 & 2 & 4 & 8 & 14 & 14 \\
C & 2 & 2 & 0 & 4 & 8 & 14 & 14 \\
D & 4 & 4 & 4 & 0 & 8 & 14 & 14 \\
E & 8 & 8 & 8 & 8 & 0 & 14 & 14 \\
F & 14 & 14 & 14 & 14 & 14 & 0 & 12 \\
G & 14 & 14 & 14 & 14 & 14 & 12 & 0
\end{array}
\]

[Figure: evolutionary tree with leaves A to G and internal nodes labeled 14 (root), 12, 8, 4, 2, and 1.]

Unweighted Pair Group Method with Arithmetic Mean (UPGMA) is an algorithm for reconstructing a tree $T$ from an ultrametric distance matrix $D$.

P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman, San Francisco, 1973

- Starting with $n$ clusters of one element each, merge the two closest clusters until only a single cluster remains.
- The distance between two disjoint clusters $C_i$ and $C_j$ is defined as the average inter-cluster pairwise distance,
  \[ D(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} D_{p,q}. \]
- The length of an edge $(u, v)$ is defined as the difference in heights of the vertices $u$ and $v$.
- The height plays the role of the molecular clock, and allows one to date the divergence point for every vertex in the evolutionary tree.

UPGMA in pseudocode:

    Form n clusters, each with a single element
    Construct a graph T with a vertex v of height h(v) = 0 for each cluster
    while there is more than one cluster do
        Find the two closest clusters C_i and C_j
        Merge C_i and C_j into a new cluster C
        for every cluster C' ≠ C do
            Set D(C, C') to the average distance between elements of C and C'
        end for
        Add a new vertex C to T and connect it to vertices C_i and C_j
        Assign h(C) = D(C_i, C_j) / 2
        Assign length h(C) − h(C_i) to edge (C_i, C)
        Assign length h(C) − h(C_j) to edge (C_j, C)
        Remove rows and columns of D corresponding to C_i and C_j
        Add a row and column to D for the new cluster C
    end while
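The pseudocode can be sketched in Python as follows. This is an illustrative version, not an efficient one: instead of updating rows and columns of $D$, it recomputes the average inter-cluster distance from the original matrix each time, which favors clarity over running time. The function name, the Newick-style output string, and the input matrix are choices made for this sketch.

```python
from itertools import combinations

def upgma(names, D):
    """Return (newick_string, root_height) for the distance matrix D."""
    # cluster id -> (member indices, height of its vertex, Newick text of its subtree)
    clusters = {i: ([i], 0.0, names[i]) for i in range(len(names))}
    next_id = len(names)

    def avg(c1, c2):
        """Average pairwise distance between the members of two clusters."""
        m1, m2 = clusters[c1][0], clusters[c2][0]
        return sum(D[p][q] for p in m1 for q in m2) / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Find the two closest clusters (ties are broken by iteration order).
        ci, cj = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        (mi, hi, si), (mj, hj, sj) = clusters[ci], clusters[cj]
        h = avg(ci, cj) / 2                      # height of the new vertex
        newick = f"({si}:{h - hi:g},{sj}:{h - hj:g})"
        del clusters[ci], clusters[cj]
        clusters[next_id] = (mi + mj, h, newick)
        next_id += 1

    (_, height, newick), = clusters.values()
    return newick + ";", height

D = [[0, 2, 4, 6], [2, 0, 4, 6], [4, 4, 0, 6], [6, 6, 6, 0]]
print(upgma(["A", "B", "C", "D"], D))
# ('(D:3,(C:2,(A:1,B:1):1):1);', 3.0)
```

On this ultrametric input, A and B join at height 1, C joins at height 2, and D joins at height 3, so every leaf is at the same distance from the root, as the molecular clock interpretation requires.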

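Given an $n \times n$ ultrametric distance matrix $D$, the unique tree $T$ with $n$ leaves that fits the data can be reconstructed in $O(n^2)$ time using the UPGMA algorithm.

Example

\[
\begin{pmatrix} 0 & 2 & 4 & 6 \\ 2 & 0 & 4 & 6 \\ 4 & 4 & 0 & 6 \\ 6 & 6 & 6 & 0 \end{pmatrix}
\]

[Figure: the corresponding tree on leaves A, B, C, D, with edge lengths 1, 1, 1, 1, 2, 3.]

Example

\[
\begin{pmatrix} 0 & 2 & 4 & 4 \\ 2 & 0 & 4 & 4 \\ 4 & 4 & 0 & 2 \\ 4 & 4 & 2 & 0 \end{pmatrix}
\]

[Figure: the corresponding tree on leaves A, B, C, D, with all six edges of length 1.]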

Given an $n \times n$ distance matrix $D$, the tree $T$ with $n$ leaves that can be reconstructed using the UPGMA algorithm is not necessarily unique, nor does it fit the data, unless $D$ is ultrametric.

Example

\[
\begin{pmatrix} 0 & 2 & 2 & 2 \\ 2 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix}
\]

[Figure: two different UPGMA trees on leaves A, B, C, D for this matrix, one with edge lengths 0, 1, 1, 1, 1/6, 7/6 and one with edge lengths 1, 1, 1, 1, 1/8, 1/8.]

Additive and non-additive trees

A distance matrix $D$ is additive if, for every four leaves $i$, $j$, $k$, and $l$, the two largest of the three sums of distances $D_{i,j} + D_{k,l}$, $D_{i,k} + D_{j,l}$, $D_{i,l} + D_{j,k}$ are equal (four point condition):

\[
\begin{aligned}
D_{i,j} + D_{k,l} &\leq D_{i,k} + D_{j,l} = D_{i,l} + D_{j,k}, \quad\text{or} \\
D_{i,k} + D_{j,l} &\leq D_{i,j} + D_{k,l} = D_{i,l} + D_{j,k}, \quad\text{or} \\
D_{i,l} + D_{j,k} &\leq D_{i,j} + D_{k,l} = D_{i,k} + D_{j,l}.
\end{aligned}
\]

[Figure: unrooted tree on leaves $i$, $j$, $k$, $l$.]

It can be determined in $O(n^4)$ time whether or not an $n \times n$ distance matrix $D$ is additive.
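The four point condition translates into a direct $O(n^4)$ test. The Python sketch below is illustrative (the function name and the two small input matrices are chosen for this example) and uses exact comparison, so real-valued data would need a tolerance.

```python
from itertools import combinations

def is_additive(D):
    """Four point condition: of the three pairwise sums for every quadruple
    of leaves, the two largest must be equal."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted((D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]))
        if sums[1] != sums[2]:
            return False
    return True

additive = [[0, 2, 4, 4], [2, 0, 4, 4], [4, 4, 0, 2], [4, 4, 2, 0]]
not_additive = [[0, 2, 2, 2], [2, 0, 3, 2], [2, 3, 0, 2], [2, 2, 2, 0]]
print(is_additive(additive))      # True: the sums are 4, 8, 8
print(is_additive(not_additive))  # False: the sums are 4, 4, 5
```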

Neighbor Joining (NJ) is an algorithm for reconstructing a tree $T$ from an additive distance matrix $D$.

Closest leaves (leaves $i$ and $j$ with minimum $D_{i,j}$) are not necessarily neighbors.

[Figure: unrooted tree on leaves $i$, $j$, $k$, $l$ with edge lengths 11, 2, 6, 4, and 7.]

Leaves $i$ and $j$ are neighbors, but $D_{i,j} = 13 > 12 = D_{j,k}$.

- Starting with $n$ clusters of one element each, repeatedly merge the two clusters that are closest to each other and far apart from the rest, until only a single cluster remains.
- Define the separation of cluster $C$ from the other clusters as
  \[ u(C) = \frac{1}{\# - 2} \sum_{C' \neq C} D(C, C'), \]
  where $\#$ is the number of clusters.
- Simultaneously minimize $D(C_i, C_j)$ and maximize $u(C_i) + u(C_j)$, that is, minimize $D(C_i, C_j) - u(C_i) - u(C_j)$.

Neighbor Joining in pseudocode:

    Form n clusters, each with a single element
    Construct a graph T with an isolated vertex for each cluster
    while there is more than one cluster do
        Find clusters C_i and C_j minimizing D(C_i, C_j) − u(C_i) − u(C_j)
        Merge C_i and C_j into a new cluster C
        for every cluster C' ≠ C do
            Set D(C, C') to the average of D(C_i, C') and D(C_j, C')
        end for
        Add a new vertex C to T and connect it to vertices C_i and C_j
        Assign length (D(C_i, C_j) + u(C_i) − u(C_j)) / 2 to edge (C_i, C)
        Assign length (D(C_i, C_j) + u(C_j) − u(C_i)) / 2 to edge (C_j, C)
        Remove rows and columns of D corresponding to C_i and C_j
        Add a row and column to D for the new cluster C
    end while
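A hedged Python sketch of neighbor joining follows. It departs from the pseudocode in two places, both assumptions of this sketch: the distance from the new cluster is set with the update of Saitou and Nei and of Studier and Keppler, $D(C, C') = (D(C_i, C') + D(C_j, C') - D(C_i, C_j))/2$, rather than the plain average, and the loop stops when two clusters remain, which are then joined by a single edge (the separation $u$ is undefined for two clusters). Internal node names "n0", "n1", ... are hypothetical.

```python
def neighbor_joining(names, D):
    """Return the edges (u, v, length) of an unrooted NJ tree for matrix D."""
    nodes = list(names)
    dist = {(a, b): D[i][j] for i, a in enumerate(names) for j, b in enumerate(names)}
    edges, counter = [], 0
    while len(nodes) > 2:
        n = len(nodes)
        # Separation of every cluster from the others.
        u = {a: sum(dist[a, b] for b in nodes if b != a) / (n - 2) for a in nodes}
        # Pair minimizing D(Ci, Cj) - u(Ci) - u(Cj); ties go to the first pair found.
        ci, cj = min(((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
                     key=lambda p: dist[p] - u[p[0]] - u[p[1]])
        c = f"n{counter}"                 # hypothetical name for the new vertex
        counter += 1
        dij = dist[ci, cj]
        edges.append((ci, c, (dij + u[ci] - u[cj]) / 2))
        edges.append((cj, c, (dij + u[cj] - u[ci]) / 2))
        nodes = [a for a in nodes if a not in (ci, cj)]
        for a in nodes:                   # Studier-Keppler style distance update
            dist[c, a] = dist[a, c] = (dist[ci, a] + dist[cj, a] - dij) / 2
        dist[c, c] = 0
        nodes.append(c)
    a, b = nodes
    edges.append((a, b, dist[a, b]))      # join the last two clusters
    return edges

D = [[0, 2, 2, 2], [2, 0, 3, 2], [2, 3, 0, 2], [2, 2, 2, 0]]
for edge in neighbor_joining(["A", "B", "C", "D"], D):
    print(edge)
```

For this non-additive input the sketch yields leaf edges of length 0.75, 1.25, 1.25, 0.75 and an internal edge of length 0.25, so the tree distance between B and C is 2.75 rather than 3: the tree does not fit the data.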

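Given an $n \times n$ additive distance matrix $D$, the unique tree $T$ with $n$ leaves that fits the data can be reconstructed in $O(n^5)$ time using the NJ algorithm.

Example

\[
\begin{pmatrix} 0 & 2 & 4 & 6 \\ 2 & 0 & 4 & 6 \\ 4 & 4 & 0 & 6 \\ 6 & 6 & 6 & 0 \end{pmatrix}
\]

[Figure: the corresponding tree on leaves A, B, C, D, with edge lengths 1, 1, 1, 1, 2, 3.]

Example

\[
\begin{pmatrix} 0 & 2 & 4 & 4 \\ 2 & 0 & 4 & 4 \\ 4 & 4 & 0 & 2 \\ 4 & 4 & 2 & 0 \end{pmatrix}
\]

[Figure: the corresponding tree on leaves A, B, C, D, with edge lengths 2, 1, 1, 1, 1.]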

Given an $n \times n$ distance matrix $D$, the tree $T$ with $n$ leaves that can be reconstructed using the NJ algorithm does not necessarily fit the data unless $D$ is additive.

Example

\[
\begin{pmatrix} 0 & 2 & 2 & 2 \\ 2 & 0 & 3 & 2 \\ 2 & 3 & 0 & 2 \\ 2 & 2 & 2 & 0 \end{pmatrix}
\]

[Figure: the NJ tree on leaves A, B, C, D, with edge lengths 0.75, 1.25, 0.25, 1.25, 0.75.]

References on Neighbor Joining, with the running time of each formulation:

- N. Saitou, M. Nei. The Neighbor-Joining Method: A New Method for Reconstructing Phylogenetic Trees. Molecular Biology and Evolution 4(4):406-425, 1987. $O(n^5)$
- J. A. Studier, K. J. Keppler. A Note on the Neighbor-Joining Algorithm of Saitou and Nei. Molecular Biology and Evolution 5(6):729-731, 1988. $O(n^3)$
- Richard Durbin (Sanger Centre), Sean R. Eddy, Anders Krogh, Graeme Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. Appendix 7.8: Proof of neighbour-joining theorem
- T. Mailund, G. S. Brodal, R. Fagerberg, C. N. S. Pedersen, D. Phillips. Recrafting the Neighbor-Joining Method. BMC Bioinformatics 7:29, 2006. $O(n^3)$

J. E. Stajich and twenty others. The BioPerl Toolkit: Perl Modules for the Life Sciences. Genome Research, 12(10):1611-1618, 2002

http://www.bioperl.org/
http://doc.bioperl.org/
http://search.cpan.org/search?dist=bioperl
http://code.open-bio.org/cgi/viewcvs.cgi/

    perl -MCPAN -e "install Bundle::BioPerl"
    sudo apt-get install bioperl

The distance matrix can be parsed from a Phylip file

    use Bio::Matrix::IO;

    my $filename = $ARGV[0];    # path to the Phylip distance matrix
    my $parser = Bio::Matrix::IO->new(
        -format => 'phylip',
        -file   => $filename
    );
    my $mat = $parser->next_matrix;

The phylogenetic tree can be reconstructed using Neighbor-Joining (NJ) or Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

    use Bio::Tree::DistanceFactory;

    my $method = $ARGV[1];    # UPGMA or NJ
    my $dfactory = Bio::Tree::DistanceFactory->new(-method => $method);
    my $tree = $dfactory->make_tree($mat);

The phylogenetic tree can be output in Newick format

    use Bio::TreeIO;

    my $output = Bio::TreeIO->new(-format => 'newick');
    $output->write_tree($tree);

The phylogenetic tree can also be typeset as a rectangular cladogram

    use Bio::Tree::Draw::Cladogram;
    use Bio::TreeIO;

    my $input  = $ARGV[0];    # tree in Newick format
    my $output = $ARGV[1];    # file for the drawing
    my $treeio = Bio::TreeIO->new(-format => 'newick', -file => $input);
    my $tree = $treeio->next_tree;
    my $obj = Bio::Tree::Draw::Cladogram->new(-tree => $tree, -compact => 0);
    $obj->print(-file => $output);

Input distance matrix in Phylip format

    7
    brown        0  5 10 20 40 70 70
    polar        5  0 10 20 40 70 70
    black       10 10  0 20 40 70 70
    spectacled  20 20 20  0 40 70 70
    giant       40 40 40 40  0 70 70
    raccoon     70 70 70 70 70  0 60
    red         70 70 70 70 70 60  0

Output phylogenetic tree (UPGMA) in Newick format

    (((((brown:2.50000,polar:2.50000):2.50000,black:5.00000):5.00000,spectacled:10.00000):10.00000,giant:20.00000):15.00000,(red:30.00000,raccoon:30.00000):5.00000);

Output phylogenetic tree (NJ) in Newick format, for the same input distance matrix

    (brown:2.50000,polar:2.50000,(black:5.00000,(spectacled:10.00000,(giant:20.00000,(raccoon:30.00000,red:30.00000):20.00000):10.00000):5.00000):2.50000);

[Figure: tree of Brown Bear, Polar Bear, Black Bear, Spectacled Bear, Giant Panda, Raccoon, and Red Panda with edge lengths 2.5, 2.5, 2.5, 5, 5, 5, 10, 10, 15, 20, 30, and 30.]

[Figure: the UPGMA output tree drawn as a cladogram, in two renderings, with leaves brown, polar, black, spectacled, giant, red, raccoon.]

[Figure: the NJ output tree drawn as a cladogram, in two renderings, with leaves brown, polar, black, spectacled, giant, raccoon, red.]

Perfect phylogenies

Given an $n \times m$ genomic matrix $M$, find a tree $T$ with $n$ leaves that fits the data.

Example

[Figure: tree with leaves Lamprey, Salmon, Shark, Lizard.]

               paired fins   jaws   large dermal bones   fin rays   lungs   rasping tongue
    lamprey         0          0             0               0         0           1
    shark           1          1             0               1         0           0
    salmon          1          1             1               1         0           0
    lizard          1          1             1               0         1           0

[Figure: tree whose nodes are labeled with the character vectors 000000, 000001, 110000, 110100, 111000, 111100, 111010.]

Let $M$ be an $n \times m$ genomic matrix.

Biological interpretation (Cladistics)
- $n$ taxa
- $m$ cladistic characters
- two states, 0 (absent) and 1 (present)
- unordered

Biological interpretation (Genomics)
- $n$ sequences
- $m$ sites, possibly SNP sites
- two states, 0 and 1
- ordered (on the chromosome)

Example (same matrix and tree under both interpretations)

\[
M = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 0 & 1 & 0 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 \\
1 & 1 & 1 & 0 & 1 & 0
\end{pmatrix}
\]

[Figure: tree with nodes 000000, 000001, 110000, 110100, 111000, 111100, 111010 and edges labeled with the characters 6; 1,2; 3; 4; 4; 5.]

Dan Gusfield. Efficient Algorithms for Inferring Evolutionary Trees. Networks 21(1):19-28, 1991

    Radix sort M by columns in decreasing order
    Transform M into M' by removing repeated columns
    Let O be the set of entries in M' with value 1
    for each (i, j) ∈ O do
        set L(i, j) to the largest index k < j such that M'(i, k) ∈ O
        set L(i, j) to 0 if there is no such index k
    end for
    for each 1 ≤ j ≤ m do
        set L(j) to the largest L(i, j) such that (i, j) ∈ O
    end for

Theorem. $M$ has a phylogenetic tree if and only if $L(i, j) = L(j)$ for every $(i, j) \in O$.
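The test can be sketched in Python as below. This is an illustration under stated assumptions: columns are compared as tuples read from the first row down, which is one way to sort them "in decreasing order" as binary numbers, and the function name and the two input matrices are chosen for this sketch.

```python
def has_perfect_phylogeny(M):
    """Gusfield-style test on a 0/1 matrix M given as a list of rows."""
    # Sort the distinct columns in decreasing order (as binary numbers).
    columns = sorted(set(zip(*M)), reverse=True)
    rows = list(zip(*columns))              # rows of the transformed matrix M'
    m = len(columns)

    # L(i, j): largest column k < j with a 1 in row i, or 0 (columns are 1-based).
    L_entry = {}
    for i, row in enumerate(rows):
        last = 0
        for j, bit in enumerate(row, start=1):
            if bit:
                L_entry[i, j] = last
                last = j

    # L(j): largest L(i, j) over the 1-entries of column j.
    L_col = [0] * (m + 1)
    for (i, j), value in L_entry.items():
        L_col[j] = max(L_col[j], value)

    return all(value == L_col[j] for (i, j), value in L_entry.items())

yes = [[1, 0, 0, 1, 0], [0, 1, 0, 0, 0], [1, 0, 0, 1, 1], [0, 1, 1, 0, 0], [1, 0, 0, 0, 0]]
no = [[0, 1, 0, 0, 1], [1, 0, 1, 0, 0], [1, 1, 0, 0, 1], [0, 0, 1, 1, 0], [1, 1, 0, 0, 0]]
print(has_perfect_phylogeny(yes))  # True
print(has_perfect_phylogeny(no))   # False
```

In the second matrix, columns 1 and 2 share rows 3 and 5 but neither contains the other, which is exactly the kind of overlap the condition $L(i, j) = L(j)$ detects.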

Once the test succeeds, the internal structure of the tree is built from the values $L(j)$:

    Create a node n_j for each column j of M'
    for each node n_j with L(j) > 0 do
        Make node n_L(j) the parent of node n_j
        Label the edge with j and the indexes of all columns identical to j
    end for
    Create a root node r
    for each node n_j with L(j) = 0 do
        Make node r the parent of node n_j
        Label the edge with j and the indexes of all columns identical to j
    end for

The rows of the matrix are then attached as leaves:

    for each 1 ≤ i ≤ n do
        Let c_i be the largest index such that M'[i, c_i] = 1
        Let (n_j, n_k) be the edge labeled with c_i
        if node n_k is a leaf then
            Label node n_k with i
        else
            Create a leaf node n_l
            Make node n_k the parent of node n_l
            Label node n_l with i
        end if
    end for

Theorem. The resulting tree $T$ is a phylogenetic tree for the genomic matrix $M$.

Given an $n \times m$ genomic matrix $M$, the unique tree $T$ with $n$ leaves that fits the data, if it exists, can be reconstructed in $O(nm)$ time using the previous algorithm.

Example (Test M for a phylogenetic tree)

\[
M = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{pmatrix}
\qquad
L = \begin{pmatrix}
0 & \cdot & \cdot & 1 & \cdot \\
\cdot & 0 & \cdot & \cdot & \cdot \\
0 & \cdot & \cdot & 1 & 4 \\
\cdot & 0 & 2 & \cdot & \cdot \\
0 & \cdot & \cdot & \cdot & \cdot
\end{pmatrix}
\]

$\Pi = (2\ 3\ 4\ 1\ 5)$, $L = (0\ 0\ 2\ 1\ 4)$

$M$ has a phylogenetic tree, because $L(i, j) = L(j)$ for every $(i, j) \in O$.

Example (Test M for a phylogenetic tree)

\[
M = \begin{pmatrix}
0 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 \\
0 & 0 & 1 & 1 & 0 \\
1 & 1 & 0 & 0 & 0
\end{pmatrix}
\qquad
L = \begin{pmatrix}
\cdot & 0 & \cdot & \cdot & 2 \\
0 & \cdot & 1 & \cdot & \cdot \\
0 & 1 & \cdot & \cdot & 2 \\
\cdot & \cdot & 0 & 3 & \cdot \\
0 & 1 & \cdot & \cdot & \cdot
\end{pmatrix}
\]

$\Pi = (5\ 2\ 3\ 4\ 1)$, $L = (0\ 1\ 1\ 3\ 2)$

$M$ has no phylogenetic tree, because $L(1, 2) \neq L(2)$, and also because $L(4, 3) \neq L(3)$.

Example (Build a phylogenetic tree T for M)

\[
M = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 1 & 1 \\
0 & 1 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0
\end{pmatrix}
\]

$\Pi = (2\ 3\ 4\ 1\ 5)$, $L = (0\ 0\ 2\ 1\ 4)$

[Figure, built up over several steps: the nodes n1 to n5 are created; edges labeled 4, 1, and 5 are added; the root r is added with edges labeled 3 and 2; nodes n4 and n5 become the leaves D and C; new leaves B, E, and A are attached. In the final tree, the root has children n3 (edge 3) and n2 (edge 2); n3 has children D (edge 4) and B; n2 has children n1 (edge 1) and E; n1 has children A and C (edge 5).]

The genomic matrix can be parsed from a Phylip file

    use Bio::Matrix::IO;

    my $filename = $ARGV[0];    # path to the Phylip matrix file
    my $parser = Bio::Matrix::IO->new(
        -format => 'phylip',
        -file   => $filename
    );
    my $mat = $parser->next_matrix;

The phylogenetic tree, if it exists, can be reconstructed using the previous algorithm.

Exercise. Implement a Bio::Tree::PerfectPhylogeny module.

Taxonomic reconstruction: Compatibility

- Compatible trees with overlapping taxa can be combined into a single supertree containing the evolutionary information of the given trees. Incompatible trees do not admit their simultaneous inclusion into a common supertree.
- Two or more phylogenetic trees with nested taxa are ancestrally compatible if they can be refined into a common supertree.
- Two or more phylogenetic trees with nested taxa are perfectly compatible if there exists a common supertree whose topological restriction to the taxa in each tree is isomorphic to that tree.

Philip Daniel, Charles Semple. Supertree Algorithms for Nested Taxa. In: Olaf R. P. Bininda-Emonds (ed.) Phylogenetic Supertrees: Combining Information to Reveal the Tree of Life, Computational Biology, vol. 4, chap. 7, pp. 151-171. Kluwer (2004).

Charles Semple, Philip Daniel, Wim Hordijk, Roderic D. M. Page, Mike Steel. Supertree Algorithms for Ancestral Divergence Dates and Nested Taxa. Bioinformatics 20(15), 2355-2360 (2004).

Incompatible phylogenetic trees can still be partially combined into a maximum agreement subtree.

Mike A. Steel, Tandy Warnow. Kaikoura Tree Theorems: Computing the Maximum Agreement Subtree. Information Processing Letters 48(2), 77-82 (1993).

Compatible phylogenetic trees can be combined into a common supertree.

B. R. Baum. Combining Trees as a Way of Combining Datasets for Phylogenetic Inference, and the Desirability of Combining Gene Trees. Taxon 41(1), 3-10 (1992).

M. A. Ragan. Phylogenetic Inference based on Matrix Representation of Trees. Molecular Phylogenetics and Evolution 1(1), 53-58 (1992).

Charles Semple, Mike A. Steel. A Supertree Method for Rooted Trees. Discrete Applied Mathematics 105(1-3), 147-158 (2000).

Roderic D. M. Page. Modified Mincut Supertrees. In: Proc. 2nd Int. Workshop Algorithms in Bioinformatics, Lecture Notes in Computer Science, vol. 2452, pp. 537-552. Springer-Verlag (2002).

Definition. Let $\mathcal{A}$ be a fixed set of labels.

- A node of a rooted tree with only one child is an elementary node.
- A semi-labeled tree over $\mathcal{A}$ is a rooted tree with some of its nodes, including all its leaves and all its elementary nodes, injectively labeled in the set $\mathcal{A}$.
- An A-tree is a rooted tree with some of its nodes, including all its leaves, injectively labeled in the set $\mathcal{A}$.
- The set of the labels of the leaves of an A-tree $T$ is denoted by $\mathcal{L}(T)$, and the set of the labels of all its nodes is denoted by $\mathcal{A}(T)$.

Example. [Figure: a semi-labeled tree with labels X, A, H, V, C, D, L (left) and an A-tree with labels A, H, V, C, D, L (right).]

Definition. Let $T$ be an A-tree. For every $v \in V(T)$, the cluster of $v$ in $T$ is the set $\mathcal{A}_T(v)$ of the labels of all its descendants, including itself. The cluster representation of $T$ is $C_{\mathcal{A}}(T) = \{ \mathcal{A}_T(v) \mid v \in V(T) \}$.

Example. The cluster representation of the A-tree [Figure: A-tree with labels A, H, V, C, D, L] is
{ {C}, {D}, {H}, {L}, {V}, {C, V}, {D, L}, {A, D, H, L}, {A, C, D, H, L, V} }.
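The cluster representation can be computed with one bottom-up traversal. In the Python sketch below, a node is a pair (label or None, list of children); this encoding and the concrete tree are assumptions made for illustration. The tree is one A-tree whose clusters coincide with the representation listed in the example, since the drawing itself is not recoverable here.

```python
def cluster_representation(tree):
    """Set of clusters (as frozensets of labels) of a tree given as
    (label_or_None, [children])."""
    clusters = set()

    def visit(node):
        label, children = node
        cluster = set() if label is None else {label}
        for child in children:
            cluster |= visit(child)       # labels of all descendants
        clusters.add(frozenset(cluster))
        return cluster

    visit(tree)
    return clusters

# Hypothetical encoding: an unlabeled root, an internal node labeled A,
# and two unlabeled internal nodes above the leaf pairs (D, L) and (C, V).
tree = (None, [
    ("A", [("H", []), (None, [("D", []), ("L", [])])]),
    (None, [("C", []), ("V", [])]),
])

for cluster in sorted(cluster_representation(tree), key=lambda c: (len(c), sorted(c))):
    print(sorted(cluster))
```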

Definition. The restriction $T|X$ of an A-tree $T$ to a set $X \subseteq \mathcal{A}$ of labels is the subtree of $T$ supported on the set of nodes $V(T|X) = \{ v \in V(T) \mid \mathcal{A}_T(v) \cap X \neq \emptyset \}$, where a node is labeled when it is labeled in $T$ and this label belongs to $X$, in which case its label in $T|X$ is the same as in $T$.

Example. [Figure: an A-tree with labels V, A, H, C, D, L (left) and its restriction to the set of labels {A, C, D} (right).]

Theorem (Llabrés, Rocha, Rosselló, Valiente 2006). Let $T_1$ and $T_2$ be two A-trees with $\mathcal{A}(T_1) = \mathcal{A}(T_2)$. Then, $T_1$ and $T_2$ are ancestrally compatible if and only if $C_{\mathcal{A}}(T_1)$ and $C_{\mathcal{A}}(T_2)$ jointly satisfy the following two conditions:

- For every $A \in \mathcal{A}(T_1) = \mathcal{A}(T_2)$, the smallest member of $C_{\mathcal{A}}(T_1)$ containing $A$ is equal to the smallest member of $C_{\mathcal{A}}(T_2)$ containing this label.
- For every $X \in C_{\mathcal{A}}(T_1)$ and $Y \in C_{\mathcal{A}}(T_2)$, if $X \cap Y \neq \emptyset$, then $X \subseteq Y$ or $Y \subseteq X$.

Mercè Llabrés, Jairo Rocha, Francesc Rosselló, Gabriel Valiente. On the Ancestral Compatibility of Two Phylogenetic Trees with Nested Taxa. To appear in J. Math. Biol. (2006).

Example. Two incompatible phylogenetic trees obtained from TreeBASE.

[Figure: two trees over the taxa Convallaria, Peliosanthes, Geitonoplesium, Phormium, Herreria, Asparagus, Ruscus, Uvularia, Tricyrtis, Trillium, Alstroemeria, Luzuriaga, Philesia, Dioscoreaceae, Smilax, Stemonaceae, Ripogonum, Petermannia, Taccaceae.]

Example. Two incompatible semi-labeled trees obtained from TreeBASE, one of which has a cluster labeled by a taxon in the other tree.

[Figure: one tree with labels Vernonia, Asteroideae, Blumea, Inula, and another with labels Asteroideae, Vernonia, Inula, Gnaphalium, Antennaria.]

Example. Two compatible phylogenetic trees obtained from TreeBASE.

[Figure: one tree with labels Loganiaceae, Rubiaceae, Viburnum, Columellia, and another with labels Caprifoliaceae, Viburnum, Columellia, Heptacodium, Diervilla, Linnaea.]

Example. Two incompatible semi-labeled trees obtained from TreeBASE, in which an incompatible triple of labels involves three taxa in one tree and two taxa plus one internal label in the other tree.

[Figure: one tree with labels Loganiaceae, Rubiaceae, Viburnum, Columellia, Caprifoliaceae, and another with labels Caprifoliaceae, Viburnum, Columellia, Heptacodium, Diervilla, Linnaea.]

Algorithm (Ancestral compatibility)

    A := A(T_1) ∩ A(T_2); T'_1 := T_1|A; T'_2 := T_2|A
    for each label a ∈ A do
        let X_1 be the smallest member of C_A(T'_1) containing a
        let X_2 be the smallest member of C_A(T'_2) containing a
        if X_1 ≠ X_2 then
            return X_1 and X_2 are incompatible
        end if
    end for
    for each cluster X_1 ∈ C_A(T'_1) do
        for each cluster X_2 ∈ C_A(T'_2) do
            if X_1 ∩ X_2 ≠ ∅ and X_1 ⊄ X_2 and X_2 ⊄ X_1 then
                return X_1 and X_2 are incompatible
            end if
        end for
    end for
    return T_1 and T_2 are compatible
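The two loops of the algorithm can be sketched in Python on cluster representations given directly as sets of frozensets. The sketch assumes both trees have already been restricted to their common labels, so the restriction step is omitted; function names and the three toy trees are illustrative.

```python
def smallest_cluster(clusters, label):
    """Smallest cluster containing the label (the clusters containing a
    given label form a chain in a tree, so the minimum is unique)."""
    return min((c for c in clusters if label in c), key=len)

def incompatibility(clusters1, clusters2):
    """Return None if the two cluster representations are ancestrally
    compatible, otherwise a pair of clusters witnessing incompatibility."""
    labels = set().union(*clusters1)
    for label in sorted(labels):
        x1 = smallest_cluster(clusters1, label)
        x2 = smallest_cluster(clusters2, label)
        if x1 != x2:
            return x1, x2
    for x1 in clusters1:
        for x2 in clusters2:
            # Overlapping clusters must be nested.
            if x1 & x2 and not x1 <= x2 and not x2 <= x1:
                return x1, x2
    return None

def clusters(*groups):
    return {frozenset(g) for g in groups}

t1 = clusters("a", "b", "c", "ab", "abc")   # ((a,b),c)
t2 = clusters("a", "b", "c", "bc", "abc")   # (a,(b,c))
t3 = clusters("a", "b", "c", "abc")         # (a,b,c)

print(incompatibility(t1, t2))  # the clusters {a, b} and {b, c} properly intersect
print(incompatibility(t1, t3))  # None: the trees are compatible
```

Both loops compare clusters pairwise, which matches the quadratic behavior of the algorithm as stated rather than any improved variant.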

The A-trees can be parsed from a Newick file, and they can be tested for ancestral compatibility

    use Bio::Tree::Compatible;
    use Bio::TreeIO;

    my $filename = $ARGV[0];    # file with two trees in Newick format
    my $input = Bio::TreeIO->new(-format => 'newick', -file => $filename);
    my $t1 = $input->next_tree;
    my $t2 = $input->next_tree;
    my ($incompat, $ilabels, $inodes) =
        $t1->Bio::Tree::Compatible::is_compatible($t2);

Taxonomic reconstruction Compatibility The cluster representation of the trees is the basis for a certificate of incompatibility if ( $incompat ) { print " the trees are incompatible \n"; my % cluster1 = %{ $t1 -> Bio :: Tree :: Compatible :: cluster_representation }; my % cluster2 = %{ $t2 -> Bio :: Tree :: Compatible :: cluster_representation }; Gabriel Valiente (ALBCOM) Bioinformatics 2005 2006 65 / 86

Taxonomic reconstruction Compatibility T 1 and T 2 are incompatible if for some label A A(T 1 ) = A(T 2 ), the smallest member of C A (T 1 ) and of C A (T 2 ) containing A differ. if ( scalar ( @$ilabels )) { foreach my $label ( @$ilabels ) { my $n1 = $t1 -> find_node (-id => $label ); my $n2 = $t2 -> find_node (-id => $label ); my @c1 = sort @{ $cluster1 { $n1 } }; my @c2 = sort @{ $cluster2 { $n2 } }; print " label $label "; print " cluster "; map { print " ",$_ } @c1 ; print " cluster "; map { print " ",$_ } @c2 ; print "\n"; } } Gabriel Valiente (ALBCOM) Bioinformatics 2005 2006 66 / 86

$T_1$ and $T_2$ are incompatible if, for some $X \in C_{\mathcal{A}}(T_1)$ and $Y \in C_{\mathcal{A}}(T_2)$, the clusters $X$ and $Y$ overlap but neither is contained in the other.

        if (scalar(@$inodes)) {
            # The nodes come in pairs: one from each tree.
            while (@$inodes) {
                my $n1 = shift @$inodes;
                my $n2 = shift @$inodes;
                my @c1 = sort @{ $cluster1{$n1} };
                my @c2 = sort @{ $cluster2{$n2} };
                print "cluster";
                print " ", $_ for @c1;
                print " properly intersects cluster";
                print " ", $_ for @c2;
                print "\n";
            }
        }
    } else {
        print "the trees are compatible\n";
    }

Lemma
The ancestral compatibility algorithm takes time quadratic in the size of the trees.

Proof.
The size of the cluster representation is bounded by the size of the tree, and the bound is tight in the worst case.

Exercise
Improve the ancestral compatibility algorithm to run in time linear in the size of the trees.

Exercise
Implement a linear-time Bio::Tree::Compatible module.

Taxonomic reconstruction Consensus

Definition
A path $(v_0, v_1, \ldots, v_k)$ in an $\mathcal{A}$-tree $T$ is elementary if, for every $i = 1, \ldots, k-1$, $v_{i+1}$ is the only child of $v_i$; in other words, if all its intermediate nodes have out-degree 1. In particular, an arc forms an elementary path.

Example
[Figure: tree $T$ on nodes 1 to 6; the examples in this section are consistent with arcs (1,2), (1,3), (3,4), (4,5), (4,6).]
Path $1 \leadsto 5$ is not elementary.

Definition
Two non-trivial paths $(a, v_1, \ldots, v_k)$ and $(a, w_1, \ldots, w_l)$ in an $\mathcal{A}$-tree $T$ are said to diverge if their origin $a$ is their only common node.

Example
[Figure: tree $T$ on nodes 1 to 6.]
Paths $1 \leadsto 5$ and $1 \leadsto 6$ do not diverge.
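The two definitions above can be checked mechanically on the example tree. The following Perl sketch assumes that $T$ has arcs (1,2), (1,3), (3,4), (4,5), (4,6), which is a reconstruction consistent with the worked examples rather than a copy of the original figure; the helper names `path`, `is_elementary` and `diverge` are hypothetical.

```perl
use strict;
use warnings;

# Assumed example tree (reconstructed from the worked examples, not from
# the figure): arcs (1,2), (1,3), (3,4), (4,5), (4,6).
my %children = (1 => [2, 3], 3 => [4], 4 => [5, 6]);
my %parent;
for my $v (keys %children) {
    $parent{$_} = $v for @{ $children{$v} };
}

# Node list of the path from $u down to $v, or () if there is none.
sub path {
    my ($u, $v) = @_;
    my @p = ($v);
    while ($p[0] != $u) {
        return () unless exists $parent{ $p[0] };
        unshift @p, $parent{ $p[0] };
    }
    return @p;
}

# A path is elementary if all its intermediate nodes have out-degree 1.
sub is_elementary {
    my @p = @_;
    for my $v (@p[1 .. $#p - 1]) {
        return 0 if @{ $children{$v} || [] } != 1;
    }
    return 1;
}

# Two non-trivial paths with the same origin diverge if the origin is
# their only common node.
sub diverge {
    my ($p, $q) = @_;
    return 0 unless @$p > 1 && @$q > 1 && $p->[0] == $q->[0];
    my %on_p   = map { $_ => 1 } @$p[1 .. $#$p];
    my $shared = grep { $on_p{$_} } @$q[1 .. $#$q];
    return $shared ? 0 : 1;
}

my @to2 = path(1, 2);
my @to5 = path(1, 5);
my @to6 = path(1, 6);
printf "1~>5 elementary: %s\n", is_elementary(@to5) ? 'yes' : 'no';        # no
printf "1~>5, 1~>6 diverge: %s\n", diverge(\@to5, \@to6) ? 'yes' : 'no';   # no
printf "1~>2, 1~>6 diverge: %s\n", diverge(\@to2, \@to6) ? 'yes' : 'no';   # yes
```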

Definition
An $\mathcal{A}$-tree $S$ is a minor of an $\mathcal{A}$-tree $T$ if there exists an injective mapping $f : V(S) \to V(T)$ satisfying the following condition: for every $a, b \in V(S)$, if $(a, b) \in E(S)$, then there exists a path $f(a) \leadsto f(b)$ in $T$ with no intermediate node in $f(V(S))$. In this case, the mapping $f$ is said to be a minor embedding $f : S \to T$.

Example
[Figure: tree $S$ on nodes $r$, $x$, $y$ and tree $T$ on nodes 1 to 6.]

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 1$, $f(x) = 3$ and $f(y) = 4$ is not a minor embedding because, although it transforms arcs in $S$ into paths in $T$, the path $f(r) \leadsto f(y)$ contains the node $3 = f(x)$, which belongs to $f(V(S))$.

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 1$, $f(x) = 5$ and $f(y) = 6$ is a minor embedding, because the arcs $(r, x), (r, y) \in E(S)$ become paths $f(r) \leadsto f(x)$ and $f(r) \leadsto f(y)$ in $T$ with no intermediate node in $f(V(S))$. It is not a topological embedding, because the paths $f(r) \leadsto f(x)$ and $f(r) \leadsto f(y)$ do not diverge.

Definition
An $\mathcal{A}$-tree $S$ is a topological subtree of an $\mathcal{A}$-tree $T$ if there exists a minor embedding $f : S \to T$ such that, for every $(a, b), (a, c) \in E(S)$ with $b \neq c$, the paths $f(a) \leadsto f(b)$ and $f(a) \leadsto f(c)$ in $T$ diverge. In this case, $f$ is called a topological embedding $f : S \to T$.

Example
[Figure: tree $S$ on nodes $r$, $x$, $y$ and tree $T$ on nodes 1 to 6.]

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 1$, $f(x) = 2$ and $f(y) = 6$ is a topological embedding, because the arcs $(r, x), (r, y) \in E(S)$ become divergent paths $f(r) \leadsto f(x)$ and $f(r) \leadsto f(y)$ in $T$ without intermediate nodes in $f(V(S))$. It is not a homeomorphic embedding, because the path $f(r) \leadsto f(y)$ contains an intermediate node with more than one child.

Definition
An $\mathcal{A}$-tree $S$ is a homeomorphic subtree of an $\mathcal{A}$-tree $T$ if there exists a minor embedding $f : S \to T$ satisfying the following extra condition: for every $(a, b) \in E(S)$, the path $f(a) \leadsto f(b)$ in $T$ is elementary. In this case, $f$ is said to be a homeomorphic embedding $f : S \to T$.

Example
[Figure: tree $S$ on nodes $r$, $x$, $y$ and tree $T$ on nodes 1 to 6.]

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 1$, $f(x) = 2$ and $f(y) = 4$ is a homeomorphic embedding, because the arcs $(r, x), (r, y) \in E(S)$ become elementary paths $f(r) \leadsto f(x)$ and $f(r) \leadsto f(y)$ in $T$ with no intermediate node in $f(V(S))$. It is not an isomorphic embedding, because the path $f(r) \leadsto f(y)$ is not an arc.

Definition
An $\mathcal{A}$-tree $S$ is an isomorphic subtree of an $\mathcal{A}$-tree $T$ if there exists an injective mapping $f : V(S) \to V(T)$ satisfying the following condition: if $(a, b) \in E(S)$, then $(f(a), f(b)) \in E(T)$. Such a mapping $f$ is called an isomorphic embedding $f : S \to T$.

Example
[Figure: tree $S$ on nodes $r$, $x$, $y$ and tree $T$ on nodes 1 to 6.]

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 1$, $f(x) = 2$ and $f(y) = 3$ is an isomorphic embedding, because it transforms every arc in $S$ into an arc in $T$.

The mapping $f : V(S) \to V(T)$ defined by $f(r) = 4$, $f(x) = 5$ and $f(y) = 6$ is also an isomorphic embedding, because it transforms every arc in $S$ into an arc in $T$.
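The four kinds of embedding form a hierarchy, and the example mappings above can be classified with a short Perl sketch. It assumes that $S$ has arcs $(r,x)$, $(r,y)$ and that $T$ has arcs (1,2), (1,3), (3,4), (4,5), (4,6), a reconstruction consistent with the examples rather than a copy of the figure. Node labels are ignored, and the function name `classify` is hypothetical.

```perl
use strict;
use warnings;

# Assumed trees: S has arcs (r,x) and (r,y); T has arcs (1,2), (1,3),
# (3,4), (4,5), (4,6), a reconstruction consistent with the examples.
my @arcs_S   = (['r', 'x'], ['r', 'y']);
my %children = (1 => [2, 3], 3 => [4], 4 => [5, 6]);
my %parent;
for my $v (keys %children) {
    $parent{$_} = $v for @{ $children{$v} };
}

# Node list of the path in T from $u down to $v, or () if there is none.
sub path {
    my ($u, $v) = @_;
    my @p = ($v);
    while ($p[0] != $u) {
        return () unless exists $parent{ $p[0] };
        unshift @p, $parent{ $p[0] };
    }
    return @p;
}

# Strongest kind of embedding realised by the mapping %$f, or 'none'.
sub classify {
    my ($f) = @_;
    my %image = map { $_ => 1 } values %$f;
    return 'none' if keys(%image) != keys(%$f);        # f must be injective

    my ($iso, $homeo, $topo) = (1, 1, 1);
    my %seen;    # $seen{s}{v}: an earlier path leaving f(s) passes through v
    for my $arc (@arcs_S) {
        my ($s, $t) = @$arc;
        my @p = path($f->{$s}, $f->{$t});
        return 'none' if @p < 2;                       # no path in T
        my @inner = @p[1 .. $#p - 1];
        return 'none' if grep { $image{$_} } @inner;   # not a minor embedding
        $iso   = 0 if @inner;                          # path is not an arc
        $homeo = 0 if grep { @{ $children{$_} || [] } != 1 } @inner;
        $topo  = 0 if grep { $seen{$s}{$_} } @p[1 .. $#p];
        $seen{$s}{$_} = 1 for @p[1 .. $#p];
    }
    return $iso   ? 'isomorphic'
         : $homeo ? 'homeomorphic'
         : $topo  ? 'topological'
         :          'minor';
}

for my $img ([1, 3, 4], [1, 5, 6], [1, 2, 6], [1, 2, 4], [1, 2, 3], [4, 5, 6]) {
    my %f;
    @f{qw(r x y)} = @$img;
    printf "f(r)=%d f(x)=%d f(y)=%d: %s\n", @$img, classify(\%f);
}
```

The classifier reports only the strongest kind, since every isomorphic embedding is homeomorphic, every homeomorphic embedding is topological, and every topological embedding is a minor embedding.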

Remark
Minor, topological, homeomorphic, and isomorphic embeddings of $\mathcal{A}$-trees are injective mappings satisfying the additional condition that node labels are preserved and reflected.

Example
[Figure: $\mathcal{A}$-trees $T_1$, $S$ and $T_2$ with node labels drawn from $A$, $B$, $C$, $D$, $E$, $F$.]

There are isomorphic embeddings $S \to T_1$ and $S \to T_2$.

There are topological embeddings $S \to T_1$ and $S \to T_2$.

Lemma
The size $M(T_1, T_2)$ of a largest common topological subtree of two $\mathcal{A}$-trees $T_1$ and $T_2$ can be computed in $O(n^{4.5} \log n)$ time.

Proof.
Mike Steel, Tandy Warnow. Kaikoura Tree Theorems: Computing the Maximum Agreement Subtree. Inform. Process. Lett. 48(2), 77-82 (1993).

Remark
The size $M(T_1, T_2)$ of a largest common topological subtree of two binary $\mathcal{A}$-trees $T_1$ and $T_2$ follows a simple recurrence.

[Figure: node $v$ of $T_1$ with children $a$ and $b$; node $w$ of $T_2$ with children $c$ and $d$.]

$M(T_1[v], T_2[w])$ is the size of $L(T_1[v]) \cap L(T_2[w])$ if $T_1[v]$ or $T_2[w]$ is a singleton; otherwise

\[
M(T_1[v], T_2[w]) = \max \left\{ \begin{array}{l}
M(T_1[a], T_2[c]) + M(T_1[b], T_2[d]) \\
M(T_1[a], T_2[d]) + M(T_1[b], T_2[c]) \\
M(T_1[a], T_2[w]) \\
M(T_1[b], T_2[w]) \\
M(T_1[v], T_2[c]) \\
M(T_1[v], T_2[d])
\end{array} \right.
\]
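The recurrence translates almost literally into a memoized recursive function. The Perl sketch below is a direct, unoptimized transcription for leaf-labeled binary trees, not the algorithm of either cited paper. It assumes a hypothetical encoding in which a leaf is a label string and an internal node is a reference to a two-element array; the function names `leaves` and `mast` are also hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Assumed encoding: a leaf is a label string, an internal node is a
# reference to a two-element array [left, right].

# Leaf labels of the subtree rooted at $t.
sub leaves {
    my ($t) = @_;
    return ref($t) ? (map { leaves($_) } @$t) : ($t);
}

# Size M(T1[v], T2[w]) of a largest common topological subtree, following
# the recurrence; %$memo caches the values already computed.
sub mast {
    my ($v, $w, $memo) = @_;
    $memo ||= {};
    my $key = "$v|$w";
    return $memo->{$key} if exists $memo->{$key};
    my $m;
    if (!ref($v) || !ref($w)) {
        # Singleton case: number of common leaf labels.
        my %in_v = map { $_ => 1 } leaves($v);
        $m = grep { $in_v{$_} } leaves($w);
    }
    else {
        my ($v1, $v2) = @$v;    # children a, b of v
        my ($w1, $w2) = @$w;    # children c, d of w
        $m = max(
            mast($v1, $w1, $memo) + mast($v2, $w2, $memo),
            mast($v1, $w2, $memo) + mast($v2, $w1, $memo),
            mast($v1, $w,  $memo), mast($v2, $w,  $memo),
            mast($v,  $w1, $memo), mast($v,  $w2, $memo),
        );
    }
    return $memo->{$key} = $m;
}

# Hypothetical input: two caterpillar trees that differ in the placement
# of c and d.
my $t1 = [[['a', 'b'], 'c'], 'd'];
my $t2 = [[['a', 'b'], 'd'], 'c'];
print mast($t1, $t2), "\n";    # 3
```

The memo is created afresh for each top-level call, so that results for one pair of trees are never reused for another.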

Lemma
The size $M(T_1, T_2)$ of a largest common topological subtree of two $\mathcal{A}$-trees $T_1$ and $T_2$ can be computed in $O(n^2)$ time.

Proof.
Martin Farach, Mikkel Thorup. Fast Comparison of Evolutionary Trees. Inform. Comput. 123(1), 29-37 (1995).

Exercise
Implement an efficient Bio::Tree::Agreement module.

Taxonomic reconstruction Combination

Remark
Let $x$ denote isomorphic, homeomorphic, topological, or minor.

Theorem
The problems of finding a largest common $x$-subtree and a smallest common $x$-supertree of two trees, in each case together with a pair of witness $x$-embeddings, are reducible to each other in time linear in the size of the trees.

Proof.
Francesc Rosselló, Gabriel Valiente. An Algebraic View of the Relation between Largest Common Subtrees and Smallest Common Supertrees. To appear in Theoret. Comput. Sci. (2006).

Definition
The intersection of two trees $T_1$ and $T_2$ obtained through minor embeddings $f_1 : T_1 \to T$ and $f_2 : T_2 \to T$ into a tree $T$ is the graph $T_p$ with set of nodes $V(T_p) = V(T_1) \cap V(T_2)$ and set of arcs defined in the following way: for every $a, b \in V(T_1) \cap V(T_2)$, $(a, b) \in E(T_p)$ if and only if there are paths $a \leadsto b$ in $T_1$ and in $T_2$ without intermediate nodes in $V(T_1) \cap V(T_2)$.

Example
Let $T$ be a tree with nodes $a_1, a_2, b, c$ and arcs $(a_1, a_2), (a_2, b), (a_2, c)$, let $T_1$ be its minor with nodes $a_1, b, c$ and arcs $(a_1, b), (a_1, c)$, and let $T_2$ be its minor with nodes $a_2, b, c$ and arcs $(a_2, b), (a_2, c)$. In this case $T_p$ is the graph with nodes $b, c$ and no arc, and in particular it is not a tree.
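The example can be reproduced with a small Perl sketch of the definition. It assumes that the embeddings into $T$ are inclusions, so that common nodes carry the same name in both trees, and that each tree is given as a list of arcs; the function names `parent_map`, `clean_path` and `intersection` are hypothetical.

```perl
use strict;
use warnings;

# Assumption: nodes identified by the embeddings into T carry the same
# name in both trees, and each tree is a list of arcs [source, target].

# Map from each node to its parent.
sub parent_map {
    my %parent;
    $parent{ $_->[1] } = $_->[0] for @_;
    return \%parent;
}

# True if there is a non-trivial path from $u down to $v whose
# intermediate nodes all lie outside %$common.
sub clean_path {
    my ($parent, $common, $u, $v) = @_;
    my $w = $parent->{$v};
    while (defined $w && $w ne $u) {
        return 0 if $common->{$w};
        $w = $parent->{$w};
    }
    return defined $w ? 1 : 0;
}

# Node list and arc list of the intersection T_p.
sub intersection {
    my ($arcs1, $arcs2) = @_;
    my ($p1, $p2) = (parent_map(@$arcs1), parent_map(@$arcs2));
    my (%n1, %n2);
    $n1{$_} = 1 for map { @$_ } @$arcs1;
    $n2{$_} = 1 for map { @$_ } @$arcs2;
    my %common = map { $_ => 1 } grep { $n2{$_} } keys %n1;
    my @arcs;
    for my $u (sort keys %common) {
        for my $v (sort keys %common) {
            push @arcs, [$u, $v]
                if clean_path($p1, \%common, $u, $v)
                && clean_path($p2, \%common, $u, $v);
        }
    }
    return ([sort keys %common], \@arcs);
}

# The example: T1 and T2 share the nodes b and c only.
my ($nodes, $arcs) = intersection(
    [['a1', 'b'], ['a1', 'c']],
    [['a2', 'b'], ['a2', 'c']],
);
print "nodes: @$nodes; arcs: ", scalar(@$arcs), "\n";    # nodes: b c; arcs: 0
```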

Theorem
For every two trees $T_1$ and $T_2$, any intersection of $T_1$ and $T_2$ obtained through $x$-embeddings into a smallest common $x$-supertree of them is a largest common $x$-subtree of $T_1$ and $T_2$.

Corollary
Every largest common $x$-subtree of a pair of trees $T_1$ and $T_2$ is, up to an isomorphism, the intersection of $T_1$ and $T_2$ obtained through their embeddings into a smallest common $x$-supertree.

Definition
The $x$-join of two trees $T_1$ and $T_2$ obtained through $x$-embeddings $m_1 : T_\mu \to T_1$ and $m_2 : T_\mu \to T_2$ of a largest common $x$-subtree $T_\mu$ of them is the quotient graph $T_{po}$ of the disjoint sum $T_1 + T_2$ by the equivalence relation $\theta$ defined, up to symmetry, by the following condition: $(a, b) \in \theta$ if and only if $a = b$ or there exists some $c \in V(T_\mu)$ such that $a = m_1(c)$ and $b = m_2(c)$.

Example
Let $T_\mu$ be the graph with nodes $b, c$ and no arc, let $T_1$ be the tree with nodes $a_1, b, c$ and arcs $(a_1, b), (a_1, c)$, and let $T_2$ be the tree with nodes $a_2, b, c$ and arcs $(a_2, b), (a_2, c)$. The $x$-join $T_{po}$ of $T_1$ and $T_2$ through the obvious embeddings of $T_\mu$ is the graph with nodes $a_1, a_2, b, c$ and arcs $(a_1, b), (a_1, c), (a_2, b), (a_2, c)$, but it is not a smallest common $x$-supertree of $T_1$ and $T_2$, because $T_\mu$ is not a largest common $x$-subtree of $T_1$ and $T_2$ (it is not even a tree).
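The quotient construction can be sketched in Perl as follows. This is an illustration under stated assumptions: trees are lists of arcs, the embeddings $m_1$ and $m_2$ are hashes from nodes of $T_\mu$ to nodes of $T_1$ and $T_2$, nodes of the disjoint sum are tagged with `1:` or `2:`, each identified pair is represented by the name of its node in $T_\mu$, and only nodes incident to some arc are reported. The function name `x_join` is hypothetical.

```perl
use strict;
use warnings;

# Arc list of the quotient of the disjoint sum T1 + T2 by the relation
# that identifies m1(c) with m2(c) for every node c of the common subtree.
sub x_join {
    my ($arcs1, $arcs2, $m1, $m2) = @_;
    # Representative of each identified node: the common node c itself.
    my %rep;
    for my $c (keys %$m1) {
        $rep{ '1:' . $m1->{$c} } = $c;
        $rep{ '2:' . $m2->{$c} } = $c;
    }
    my %arcs;
    for my $side ([1, $arcs1], [2, $arcs2]) {
        my ($tag, $arcs) = @$side;
        for my $arc (@$arcs) {
            # Nodes outside the images keep their tagged name.
            my ($u, $v) = map { $rep{"$tag:$_"} // "$tag:$_" } @$arc;
            $arcs{"$u $v"} = [$u, $v];    # identical arcs collapse
        }
    }
    return [ map { $arcs{$_} } sort keys %arcs ];
}

# The example: T_mu has nodes b, c, embedded in the obvious way.
my $join = x_join(
    [['a1', 'b'], ['a1', 'c']],
    [['a2', 'b'], ['a2', 'c']],
    { b => 'b', c => 'c' },
    { b => 'b', c => 'c' },
);
print join(' ', map { sprintf '(%s,%s)', @$_ } @$join), "\n";
# (1:a1,b) (1:a1,c) (2:a2,b) (2:a2,c)
```

Up to the tags on $a_1$ and $a_2$, the output is the graph described in the example, with four arcs and two nodes of in-degree 2.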

Definition
The $x$-sum of two trees $T_1$ and $T_2$ obtained through $x$-embeddings $m_1 : T_\mu \to T_1$ and $m_2 : T_\mu \to T_2$ of a largest common $x$-subtree $T_\mu$ of them is the graph $T_\sigma$ obtained from the $x$-join $T_{po}$ of $T_1$ and $T_2$ by removing every arc that is subsumed by a path: that is, we remove from $T_{po}$ each arc $(v, w)$ for which there is another path $v \leadsto w$ in $T_{po}$.

Example
[Figure: trees $T_\mu$, $T_1$ and $T_2$, the $x$-join $T_{po}$ and the $x$-sum $T_\sigma$, drawn on nodes $a$, $b$, $c$, $d$, $e$.]
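Removing the arcs subsumed by paths can be sketched in Perl with a plain reachability test per arc. The sketch assumes that the $x$-join is acyclic and is given as a list of arcs; it is a simple quadratic illustration and not the linear-time traversal used in the proof of the reduction. The input graph is hypothetical, and so are the function names `reachable` and `remove_subsumed`.

```perl
use strict;
use warnings;

# True if $to can be reached from $from in the graph %$succ without
# using the arc ($from, $to) itself.
sub reachable {
    my ($succ, $from, $to) = @_;
    my @stack = ($from);
    my %seen  = ($from => 1);
    while (@stack) {
        my $u = pop @stack;
        for my $v (@{ $succ->{$u} || [] }) {
            next if $u eq $from && $v eq $to;    # skip the arc under test
            return 1 if $v eq $to;
            push @stack, $v unless $seen{$v}++;
        }
    }
    return 0;
}

# Keep only the arcs (v, w) for which there is no other path from v to w.
sub remove_subsumed {
    my @arcs = @_;
    my %succ;
    push @{ $succ{ $_->[0] } }, $_->[1] for @arcs;
    return grep { !reachable(\%succ, $_->[0], $_->[1]) } @arcs;
}

# Hypothetical join: the arc (a,d) is subsumed by the path a, b, d.
my @join = (['a', 'b'], ['a', 'c'], ['b', 'd'], ['a', 'd'], ['c', 'e']);
my @sum  = remove_subsumed(@join);
print join(' ', map { sprintf '(%s,%s)', @$_ } @sum), "\n";
# (a,b) (a,c) (b,d) (c,e)
```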

Theorem
For every pair of trees $T_1$ and $T_2$, any $x$-sum of $T_1$ and $T_2$ is a smallest common $x$-supertree of them.

Corollary
Every smallest common $x$-supertree of a pair of trees $T_1$ and $T_2$ is, up to an isomorphism, the $x$-sum of $T_1$ and $T_2$ obtained through the embeddings of a largest common $x$-subtree into them.

Theorem
The problems of finding a largest common $x$-subtree and a smallest common $x$-supertree of two trees, in each case together with a pair of witness $x$-embeddings, are reducible to each other in time linear in the size of the trees.

Proof.
Next, remove from $T_\sigma$ all arcs subsumed by paths, as follows. For each node $y \in V(T_\sigma)$ of in-degree 2, let $x, x' \in V(T_\sigma)$ be the source nodes of the two arcs coming into $y$. Now, perform a simultaneous traversal of the paths of arcs coming into $x$ and $x'$, until reaching node $x'$ along the first path or $x$ along the second path. The simultaneous traversal of incoming paths may stop along either path, but continue along the other one, because a node of in-degree 0 or in-degree 2 is reached. Finally, remove from $T_\sigma$ either arc $(x', y)$, if node $x'$ was reached along the first path, or arc $(x, y)$, if node $x$ was reached along the second path.