Parsimony via Consensus

Similar documents
arxiv: v1 [q-bio.pe] 16 Aug 2007

X X (2) X Pr(X = x θ) (3)

NOTE ON THE HYBRIDIZATION NUMBER AND SUBTREE DISTANCE IN PHYLOGENETICS

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

A phylogenomic toolbox for assembling the tree of life

Axiomatic Opportunities and Obstacles for Inferring a Species Tree from Gene Trees

arxiv: v1 [q-bio.pe] 1 Jun 2014

Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees

Points of View Matrix Representation with Parsimony, Taxonomic Congruence, and Total Evidence

Evolution, University of Auckland, New Zealand PLEASE SCROLL DOWN FOR ARTICLE

Effects of Gap Open and Gap Extension Penalties

Distances that Perfectly Mislead

arxiv: v1 [cs.ds] 1 Nov 2018

Phylogenetic Networks, Trees, and Clusters

A Fitness Distance Correlation Measure for Evolutionary Trees

Dr. Amira A. AL-Hosary

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Letter to the Editor. The Effect of Taxonomic Sampling on Accuracy of Phylogeny Estimation: Test Case of a Known Phylogeny Steven Poe 1

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

Phylogenetic Tree Reconstruction

A CLUSTER REDUCTION FOR COMPUTING THE SUBTREE DISTANCE BETWEEN PHYLOGENIES

arxiv: v1 [q-bio.pe] 6 Jun 2013

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

TheDisk-Covering MethodforTree Reconstruction

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS

Exploring Treespace. Katherine St. John. Lehman College & the Graduate Center. City University of New York. 20 June 2011

Finding the best tree by heuristic search

Evolutionary Tree Analysis. Overview

Algorithms in Bioinformatics

A Phylogenetic Network Construction due to Constrained Recombination

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

The Information Content of Trees and Their Matrix Representations

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Letter to the Editor. Department of Biology, Arizona State University

SPR Distance Computation for Unrooted Trees

Properties of normal phylogenetic networks

On the Uniqueness of the Selection Criterion in Neighbor-Joining

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

reconciling trees Stefanie Hartmann postdoc, Todd Vision s lab University of North Carolina the data

A 3-APPROXIMATION ALGORITHM FOR THE SUBTREE DISTANCE BETWEEN PHYLOGENIES. 1. Introduction

ALGORITHMIC STRATEGIES FOR ESTIMATING THE AMOUNT OF RETICULATION FROM A COLLECTION OF GENE TREES

Assessing Congruence Among Ultrametric Distance Matrices

Fast Phylogenetic Methods for the Analysis of Genome Rearrangement Data: An Empirical Study

Phylogenetic Inference and Parsimony Analysis

Charles Semple, Philip Daniel, Wim Hordijk, Roderic D M Page, and Mike Steel

THE TRIPLES DISTANCE FOR ROOTED BIFURCATING PHYLOGENETIC TREES

Ratio of explanatory power (REP): A new measure of group support

arxiv: v5 [q-bio.pe] 24 Oct 2016

What is Phylogenetics

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Molecular Evolution & Phylogenetics

Chapter 9 BAYESIAN SUPERTREES. Fredrik Ronquist, John P. Huelsenbeck, and Tom Britton

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

A Faster Algorithm for the Perfect Phylogeny Problem when the Number of Characters is Fixed

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Increasing Data Transparency and Estimating Phylogenetic Uncertainty in Supertrees: Approaches Using Nonparametric Bootstrapping

Consensus Methods. * You are only responsible for the first two

Phylogenetics: Parsimony

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

Walks in Phylogenetic Treespace

Solving the Maximum Agreement Subtree and Maximum Comp. Tree problems on bounded degree trees. Sylvain Guillemot, François Nicolas.

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Phylogenetics. BIOL 7711 Computational Bioscience

UNICYCLIC NETWORKS: COMPATIBILITY AND ENUMERATION

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

A (short) introduction to phylogenetics

Exact ILP Solutions for Phylogenetic Minimum Flip Problems

Properties of Consensus Methods for Inferring Species Trees from Gene Trees

Consistency Index (CI)

Supertree Algorithms for Ancestral Divergence Dates and Nested Taxa

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Notes 3 : Maximum Parsimony

Kei Takahashi and Masatoshi Nei

Reconstructing Phylogenetic Networks Using Maximum Parsimony

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

A new algorithm to construct phylogenetic networks from trees

A New Fast Heuristic for Computing the Breakpoint Phylogeny and Experimental Phylogenetic Analyses of Real and Synthetic Data

On the Subnet Prune and Regraft distance

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Comparative Performance of Supertree Algorithms in Large Data Sets Using the Soapberry Family (Sapindaceae) as a Case Study

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Constructing Evolutionary/Phylogenetic Trees

A GA for Maximum Likelihood Phylogenetic Inference using Neighbour-Joining as a Genotype to Phenotype Mapping

1.1 The (rooted, binary-character) Perfect-Phylogeny Problem

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic analyses. Kirsi Kostamo

Phylogenetics in the Age of Genomics: Prospects and Challenges

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Unsupervised Learning in Spectral Genome Analysis

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Beyond Galled Trees Decomposition and Computation of Galled Networks

An Investigation of Phylogenetic Likelihood Methods

Disk-Covering, a Fast-Converging Method for Phylogenetic Tree Reconstruction ABSTRACT

Transcription:

Syst. Biol. 57(2):251 256, 2008 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150802040597 Parsimony via Consensus TREVOR C. BRUEN 1 AND DAVID BRYANT 2 1 Department of Mathematics, University of California, Berkeley, USA; E-mail: tbruen@math.berkeley.edu 2 Department of Mathematics, University of Auckland, New Zealand; E-mail: d.bryant@auckland.ac.nz Abstract. The parsimony score of a character on a tree equals the number of state changes required to fit that character onto the tree. We show that for unordered, reversible characters this score equals the number of tree rearrangements required to fit the tree onto the character. We discuss implications of this connection for the debate over the use of consensus trees or total evidence and show how it provides a link between incongruence of characters and recombination. [Compatibility; homoplasy; maximum parsimony; SPR distance.] The parsimony length of an unordered, reversible character on a tree equals the minimum number of state changes required to fit the character onto a tree. Here we introduce an alternative way to view parsimony length, one that is dual to the original definition. We show how the parsimony length of a character equals the minimum number of changes in the tree required to fit the tree onto the character. This new perspective on parsimony has applications to consensus trees and recombination detection. The first application we discuss is how the reformulation of parsimony provides a closer link between parsimony-based analysis and supertree methods. We demonstrate that the maximum parsimony tree can be viewed as a type of median consensus tree, where the median is computed with respect to the SPR distance (see below). As well, the result shows how to conduct a parsimony-based analysis not just on characters but on trees, without having to recode the trees as binary character matrices. This opens the way to a hybrid between the consensus approach and the total evidence approach, where the data are a mixture of characters, trees, and subtrees. The second application of the main result is to the analysis of pairs of characters. We show that the score of the maximum parsimony tree for two characters is a simple function of the smallest number of recombinations required to explain the incongruence between the characters without homoplasy. This result provides the basis of a highly efficient test for recombination (Bruen et al., 2006). Here and throughout the paper we assume that all phylogenetic trees are fully resolved (bifurcating) and that by parsimony we refer to Fitch or Wagner parsimony, where the character states are unordered and reversible. Some of the results presented here can be extended to other forms of parsimony (Bruen, 2006), and possibly to incompletely resolved trees, but these lie beyond the scope of this paper. Note that in this paper we are dealing with unrooted SPR rearrangements, which are those used in tree searches. There is a related, but distinct, concept of rooted SPR rearrangements, where the rearrangements are restricted to obey a type of temporal constraint (Song, 2003). It is this latter class of rooted SPR rearrangments that are used to model lateral gene transfers and recombination. It would be a worthwhile, but challenging, goal to investigate whether any of the results on unrooted SPR rearrangements in this paper can be extended to rooted SPR rearrangements. LINKING PARSIMONY WITH SPR A subtree-prune and regraft (SPR) rearrangement is an operation on phylogenetic trees whereby a subtree is removed from one part of the tree and regrafted to another part of the tree, see Figure 1 (Felsenstein, 2004; Swofford et al., 1996). These SPR rearrangements are widely used by tree searching software packages like PAUP (Swofford, 1998) and Garli (Zwickl, 2006). The SPR distance between two trees can be defined as the minimum number of SPR rearrangements required to transform one tree into the other (Hein, 1990; Allen and Steel, 2001; Goloboff, 2007). For example, the two trees T 1 and T 3 in Figure 1 can be transformed into each other using a minimum of two SPR rearrangements, via the tree T 2, so their SPR distance is two. The parsimony length of a character on a tree is the minimum number of steps required to fit that character on the tree, as computed by the algorithm of Fitch (1971). The length of a character X i onatreet is denoted l(x i,t). A character with r i states therefore has parsimony length at least (r i 1), as every state not at the root has to arise at least once. A character is compatible withatreeifitrequires at most (r i 1) changes on that tree (Felsenstein, 2004). Instead of fitting a character onto a tree we could just as well fit the tree onto the character. If the character and the tree are compatible, then we have a perfect fit. When there is not a perfect fit we can measure how many SPR rearrangements are required to give a tree that does make a perfect fit. It turns out that this measure gives an equivalent score to parsimony length. More formally: Theorem 1. Let X i be a character with r i states and let T be a fully resolved phylogenetic tree. It takes exactly l(x i,t) (r i 1) SPR rearrangements to transform T into a tree compatible with X i. The result still holds if X i has some missing states. 251

252 SYSTEMATIC BIOLOGY VOL. 57 FIGURE 1. Two trees, T 1 and T 3, separated by two SPR rearrangements via the intermediate tree T 2. A binary character of parsimony length 3 is indicated on tree T 1 by the node colors. The character is compatible with a tree (T 3 ) within SPR distance two, illustrating Theorem 1. As an example, consider the character X 1 mapping taxa A,C,D,F to one and B,E,G to zero. The length of this character on tree T 1 of Figure 1 is three, and the number of SPR rearrangements needed to transform T 1 onto some tree T 3 compatible with X 1 is two. Note that there could be other trees compatible with X 1 that are further than two SPR rearrangements away: the result only gives the number of rearrangements required to obtain the closest tree. Once stated, the theorem is not too difficult to prove. First show that performing an SPR rearrangement decreases the length by at most one step. Hence it takes at least l(x i,t) (r i 1) SPR rearrangements to transform T into a tree compatible with the character X i. Then show that this is the minimum required. A formal proof is presented in Appendix 1. A restricted (binary character) version of this theorem was proved in Bryant (2003). The theorem captures an issue that is central to the interpretation of incongruence: whether an observed incongruence is to be explained by positing homoplasy or by modifying the tree. Define the SPR distance from a tree T to a character X i to be the SPR distance from T to the closest tree T that is compatible with X i. Theorem 1 then tells us that the SPR distance from T to a character X i is equal to the difference between the length l(x i,t) of X i on T and the minimum possible length of X i on any tree. That is, the amount that the length of a character exceeds the minimum possible length (r i 1) is equal to the amount we have to modify T in order to remove all incongruence between T and X i. CONSENSUS TREES, SUPERTREES, AND PARSIMONY In their insightful overview of supertree methods, Thorley and Wilkinson (2003) characterize a family of supertree methods that all minimize a sum of the form n d(t, t i ) = d(t, t 1 ) + d(t, t 2 ) + +d(t, t n ). (1)

2008 BRUEN AND BRYANT PARSIMONY VIA CONSENSUS 253 Here t 1,t 2,...,t n are the input trees and d(t, t i ) is a measure of the distance between the input tree t i with the supertree T. There are many choices for the distance measure d, and it need not be the case that the distance measure satisfies the symmetry condition d(t, t i ) = d(t i,t). Gordon (1986) was the first to propose this description of supertrees. Many supertree methods can be described in these terms, including matrix representation with parsimony (MRP) (Baum, 1992; Ragan, 1992), minimum flip supertrees (Chen et al., 2006), the median supertree (Bryant, 1997), majority rule supertree (Cotton and Wilkinson, 2007) and the average consensus supertree (Lapointe and Cucumel, 1997). Let d s (T, X i ) denote the SPR distance from T to the closest fully resolved tree T i that is compatible with X i. By Theorem 1, a maximum parsimony tree for X 1,...,X m is one that minimizes the expression m d s (T, X i ) = d s (T, X 1 ) + d s (T, X 2 ) + +d s (T, X m ). (2) In this way, maximum parsimony is a form of median consensus. The significance of this observation does not come from the fact that we can write the the parsimony score of T in the form (2); it is from the close connection with SPR distances, and from the way we will now use this connection to combine different kinds of data in the same theoretical framework. An SPR median tree for fully resolved trees t 1,...,t n on the same leaf set is a tree T that minimizes n d s (T, t i ) = d s (T, t 1 ) + d s (T, t 2 ) + +d s (T, t n ), where here d(t, t i ) denotes the SPR distance from T to t i (Hill, 2007). We extend this directly to a supertree method by mimicking the situation for characters. Suppose that t i is a phylogenetic tree, not necessarily fully resolved, on a subset of the set of leaves. We say that a fully resolved tree T on the full set of leaves is compatible with t i (equivalently, T displays t i ) if we can obtain t i from T by pruning off leaves and contracting edges. In this general situation, we let d s (T, t i ) denote the SPR distance from T to the closest fully resolved tree T i that is compatible with t i. This is equivalent to pruning leaves off T and then computing the distance from this pruned tree to t i. Now suppose that we have both characters and trees in the input. Both types of phylogenetic data can summarized by an SPR median tree T, chosen to minimize the sum n m d s (T, t i ) + d s (T, X i ). We have, then, a way to bring together both the supertree/consensus methodology and the total evidence methodology. In the case that the data comprise only trees, the tree is a median supertree; in the case that the data comprise only character data, the tree is the maximum parsimony tree. It is important to note the difference between this approach and the MRP method (Baum, 1992; Ragan, 1992), which could also be used to combine trees and characters. In MRP, the trees are broken down into multiple independent characters. This is a problem, because the characters encoding a tree are nowhere near independent. In contrast, the SPR median tree approach treats a tree as a single indivisible unit of information. There is one critical issue that has been side-stepped: computation time. At present, computational limitations make the construction of SPR median trees infeasible for all but the smallest data sets. Computing the SPR distance between two trees is an NP-hard problem (Hickey et al., 2006). In contrast, total evidence and MRP approaches are possible for at least 100 taxa. However, there are now good heuristics for unrooted SPR distance (Goloboff, 2007) and exact special case algorithms (Hickey et al., 2006) that could be applied to the problem. Below we describe a lower bound method for the SPR distance that should also aid construction of these SPR median trees. PARSIMONY ON PAIRS OF CHARACTERS Another valuable application of Theorem 1 follows when we consider parsimony analysis of just two unordered and reversible characters. The concept of pairwise character compatibility was introduced by Le Quesne (1969); see also Felsenstein (2004). Two binary undirected characters with states 0 and 1 are incompatible if and only if all four combinations of 00, 01, 10, and 11 are present as combination of states for the two characters. In a standard setting, character incompatibility is interpreted as implying that at least one of the characters has undergone convergent or recurrent mutation. In other words, for every possible phylogeny describing the history of the two characters, at least one homoplasy is posited for one of the characters (assuming the characters were correctly scored). Another interpretation of incompatibility of two characters is that characters evolved without homoplasy on two different phylogenies, where the phylogenies differ by one or more SPR rearrangement (Sneath et al., 1975; Hudson and Kaplan, 1985). Define the total incongruence score i(x 1,X 2 ) for two multistate unordered characters X 1 and X 2 (with r 1 and r 2 states respectively) as i(x 1,X 2 ) = min[l(x 1,T) + l(x 2,T)] T (r 1 1) (r 2 1). (3) This is the maximum parsimony score of the two characters X 1,X 2 minus the minimum number of changes required for each character. Equation (3) generalizes the incompatibility notion for two binary characters. It is also equivalent to the incongruence length difference statistic applied to only two characters (Farris et al., 1995). Importantly, the total incongruence score can be computed rapidly (Bruen and Bryant, 2008). The following

254 SYSTEMATIC BIOLOGY VOL. 57 consequence of Theorem 1 strengthens the connection between incongruence and SPR rearrangements. Theorem 2. The total incongruence score i(x 1,X 2 ) for two characters equals the minimum SPR distance between a tree T 1 andatreet 2 such that T 1 is compatible with X 1 and T 2 is compatible with X 2. Although the notion of total incongruence for two characters has been considered before in the context of character selection and weighting (Penny and Hendy, 1986), it has not been considered in the context of genealogical similarity. Essentially, Theorem 2 shows that the total incongruence score equals the minimum possible number of SPR rearrangements that could have occurred between the phylogenetic histories for both characters, assuming that the characters have different histories with which they are each perfectly compatible. Indeed, Theorem 2 suggests a natural way to interpret genealogical similarity between two characters, which we have used to develop a powerful test for recombination (Bruen et al., 2006). Choosing two characters from two different genes (which have possibly different histories) gives a simple approach to identify the distinctiveness of the histories of the genes. We can also apply Theorem 2 to obtain a lower bound on the SPR distance between two trees. Suppose that we have two trees T 1 and T 2 and we wish to obtain a lower bound on the SPR distance d(t 1,T 2 ) between the two trees. If we choose any character X 1 compatible with T 1 and any character X 2 compatible with T 2 then, by Theorem 2, we have that i(x 1,X 2 ) d(t 1,T 2 ). By carefully choosing X 1 and X 2 we can obtain tighter bounds. DISCUSSION AND EXTENSIONS We have presented a reformulation of parsimony that is complementary or dual to the standard definitions. Instead of measuring how well a character fits onto a tree, we look at how well the tree fits onto the character. A consequence of this new perspective is that we can combine trees and character data using one general SPR framework, and we also obtain new results connecting incongruence measures and recombination. Nevertheless, it is not immediately clear how the new reformulation can be interpreted in itself. Consider the information a single character, or tree, represents. Given a single character, we can define the set or cloud of trees around the character comprising exactly those trees compatible with the character (Fig. 2). If the character evolved without homoplasy, then the true evolutionary tree must be contained somewhere within the cloud. Now suppose we have multiple characters, each with its own cloud. There may not be a single tree contained in the intersection of all of these clouds. FIGURE 2. Cartoon representation of parsimony in terms of tree rearrangements. Each character X i gives a cloud of trees containing those trees compatible with X i. The distance from T to a cloud is the SPR distance d(t, t i ) to the tree t i in the cloud that is closest to T. The maximum parsimony tree is then the tree closest to these clouds under the SPR distance.

2008 REFERENCES 255 Instead, we search for a tree that is as close as possible to all of the clouds. The distance from T to the cloud associated to character X i is exactly d(t, X i ), so by Theorem 1 a tree closest to all of the clouds is a maximum parsimony tree. We note that several of the results in this article can be extended; for details, see Bruen (2006). Both Theorems 1 and 2 remain valid if we replace the SPR distance with the tree bisection and reconnection (TBR) distance. In a TBR rearrangement, a subtree is removed from the tree and then reattached elsewhere with the tree, the difference with SPR being that with SPR we can only reattach the subtree at the node where the branch was broken, whereas with TBR we can reattach from any node in the subtree (Allen and Steel, 2001; Felsenstein, 2004). The TBR distance between two trees is the minimum number of TBR rearrangements required to transform one tree into the other. That Theorems 1 and 2 hold for the TBR distance may appear surprising because the TBR distance between two trees is always less than, or equal to, the SPR distance between the trees. However, the extension follows by a small change to the proof of Theorem 1, noting that a TBR move can only reduce the parsimony score of a character by at most one. We have also explored extensions of the result to other distances between trees, notably the Robinson-Foulds distance and the nearest neighbor interchange distance. See Bruen (2006) for details. ACKNOWLEDGEMENTS We would like to thank Mike Steel, Sebastien Böcker, Olaf Bininda- Emonds, Pablo Golloboff, Melanie Pierson, Mark Wilkinson, and an anonymous referee for their valuable suggestions. This research was partially supported by the New Zealand Marsden Fund. REFERENCES Allen, B., and M. Steel. 2001. Subtree transfer operations and their induced metrics on evolutionary trees. Ann. Comb. 5:1 13. Baum, B. 1992. Combining trees as a way of combining datasets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3 10. Bruen, T. 2006. Discrete and statistical approaches to genetics. PhD thesis, McGill University School of Computer Science, Montreal, Canada. Bruen, T., and D. Bryant. 2008. A subdivision approach to maximum parsimony. Ann. Comb. 12:45 51. Bruen, T., H. Philippe, and D. Bryant. 2006. A simple and robust statistical test to detect the presence of recombination. Genetics 172:1 17. Bryant, D. 1997. Building trees, hunting for trees and comparing trees. PhD thesis, Department Mathematics, University of Canterbury, UK. Bryant, D. 2003. A classification of consensus methods for phylogenetics. Pages 163 184 in Bioconsensus (M. F. Janowitz, F.-J. Lapointe, F. R. McMorris, B. Mirkin, and F. S. Roberts, eds.), volume 61 of DI- MACS. American Mathematical Society, Providence, Rhode Island. Bryant, D. 2004. The splits in the neighborhood of a tree. Ann. Combinatorics 8:1 11. Chen, D., O. Eulenstein, D. Fernandez-Baca, and M. Sanderson. 2006. Minimum-flip supertrees: Complexity and algorithms. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3:165 173. Cotton, J., and M. Wilkinson. 2007. Majority-rule supertrees. Syst. Biol. 56:445 452. Farris, J. S., M. Källersjö, A. G. Kluge, and C. Bult. 1995. Constructing a significance test for incongruence. Syst. Biol. 44:570 572. Felsenstein, J. 2004. Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts. Fitch, W. M. 1971. Towards defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. 20:406 416. Goloboff, P. 2007. Calculating SPR distances between trees. Cladistics 23:1 7. Gordon, A. D. 1986. Consensus supertrees: The synthesis of rooted trees containing overlapping sets of labeled leaves. J. Class. 3:335 348. Hein, J. 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98:185 200. Hickey, G., F. Dehne, A. Rau-Chaplin, and C. Blouin. 2006. The computational complexity of the unrooted subtree prune and regraft distance. Tech. Rep. CS-2006-06. Faculty of Computer Science, Dalhousie University. Hill, T. 2007. Development of new methods for inferring and evaluating phylogenetic trees. PhD thesis, Uppsala Universitet, Uppsala, Sweden. Hudson, R. R., and N. L. Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111:147 164. Lapointe, F.-J., and G. Cucumel. 1997. The average consensus procedure: Combination of weighted taxa containing identical or overlapping sets of taxa. Syst. Biol. 46:306 312. Le Quesne, W. J. 1969. A method of selection of characters in numerical taxonomy. Syst. Zool. 18:201 205. Penny, D., and M. Hendy. 1986. Estimating the reliability of evolutionary trees. Mol. Biol. Evol. 3:403 417. Ragan, M. A. 1992. Phylogenetic inference based on matrix representations of trees. Mol. Phyl. Evol. 1:53 58. Sneath, P., M. Sackin, and R. Ambler. 1975. Detecting evolutionary incompatibilities from protein sequences. Systematic Zoology 24:311 332. Song, Y. S. 2003. On the combinatorics of rooted binary phylogenetic trees. Ann. Comb. 7:365 379. Swofford, D., G. Olsen, P. Waddell, and D. Hillis. 1996. Molecular sytematics, Pages 407 514 in Phylogenetic inference. Sinauer Associates, Sunderland, Massachusetts. Swofford, D. L. 1998. PAUP*. Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, Massachusetts. Thorley, J. L., and M. Wilkinson. 2003. A view of supertree methods. Pages 185 194 in Bioconsensus (M. F. Janowitz, F.-J. Lapointe, F. R. McMorris, B. Mirkin, and F. S. Roberts, eds.), volume 61 of DIMACS, Series in Discrete Mathematics and Theoretical Computer Science, The American Mathematical Society, New York. Zwickl, D. 2006. Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. PhD thesis, University of Texas at Austin. First submitted 3 April 2006; reviews returned 26 August 2006; final acceptance 6 June 2007 Associate Editor: Olaf Bininda-Emonds Editor in Chief: Roderic Page APPENDIX The first observation is that a TBR rearrangement of a tree increases the length of a character by at most one. As SPR rearrangements are a special case of TBR rearrangements, the same result holds for SPR. Lemma 1. Let T be a fully resolved phylogenetic tree and X i an unordered reversible character. Let T be a phylogenetic tree that differs from T by a single TBR rearrangement. Then l(χ, T ) l(χ, T) + 1. Proof. The proof of Lemma 5.1 in Bryant (2004) for binary characters applies directly to the multistate case. Let d SPR (T, T ) denote the unrooted SPR distance between two phylogenetic trees T and T. Theorem 1 Let X i be a character with r i states and let T be a fully resolved phylogenetic tree. It takes exactly l(x i,t) (r i 1) SPR rearrangements to transform T into a tree compatible with X i. The result still holds if X i has some missing states.

256 SYSTEMATIC BIOLOGY VOL. 57 Proof. Let T be a fully resolved phylogenetic tree compatible with X i for which d SPR (T, T ) is minimized and let m = d SPR (T, T ). Then there exists a sequence of trees T = T 0,...,T m = T such that every adjacent pair of trees in the sequence differs by exactly one SPR rearrangement. By Lemma 1 the existence of this sequence implies that l(t, X i ) l(t,x i ) d SPR (T, T ) and because X i is compatible with X i we have l(t,x i ) = r i 1, giving l(t, X i ) (r i 1) d SPR (T, T ). For the other direction, we show that we can construct a sequence of l(t, X i ) (r i 1) SPR rearrangements that transform T into a tree T compatible with X i. Firstly, if l(t, X i ) (r i 1) = 0, then T is compatible with X i, so the proof is finished. Otherwise, let X i be an assignment of states to internal nodes that minimizes the number of state changes (that is, a minimum extension of X i ). Then because X i is not compatible with T there exist three vertices u, v, and w, where {u, v} E(T), v lies on the path from u to w and X i (u) = X i (w) X i (v). Perform an SPR rearrangement by removing edge {u, v}, supressing the v vertex and creating a new edge {u, t} where t is a new vertex on an edge adjacent to w. Furthermore, set X i (t) = X i (w). Then the number of edges on which a change has occurred has decreased by 1, thereby decreasing the parsimony length by 1. This procedure can be repeated until the parsimony length equals r i 1, constructing the desired sequence of trees and completing the proof. Theorem 2. The total incongruence score i(x 1,X 2 ) for two characters equals the minimum SPR distance between a tree T 1 and T 2 such that X 1 is compatible with T 1 and X 2 is compatible with T 2. Proof. Let T 1 and T 2 be any two trees compatible with X 1 and X 2, respectively. Then l(x 1,T 1 ) = r 1 1 and by Theorem 1, l(x 2,T 1 ) (r 2 1) d SPR (T 1,T 2 ). We have then that i(x 1,X 2 ) l(x 1,T 1 ) + l(x 2,T 1 ) (r 1 1) (r 2 1) d SPR (T 1,T 2 ) and so i(x 1,X 2 ) is a lower bound for d SPR (T 1,T 2 ). We show that this bound can be achieved. Let T be a maximum parsimony tree for the pair of characters X 1,X 2. By Theorem 1 there exist two trees T 1 and T 2 such that T 1 is compatible with X 1, T 2 is compatible with X 2 and d SPR (T 1,T) + d SPR (T 2,T) = i(x 1,X 2 ), implying that d SPR (T 1,T 2 ) d SPR (T 1,T) + d SPR (T 2,T) i(x 1,X 2 ) and hence d SPR (T 1,T 2 ) = i(x 1,X 2 ).