Visualising Phylogenetic Trees

Wan Nazmee Wan Zainon & Paul Calder
School of Informatics and Engineering
Flinders University of South Australia
PO Box 21, Adelaide 51, South Australia
wanz1@infoeng.flinders.edu.au calder@infoeng.flinders.edu.au

Abstract

This paper describes techniques for visualising pairs of similar trees. Our aim is to develop ways of presenting the information so as to highlight both the common structure of the trees and their points of difference. The impetus for the work comes from the field of bioinformatics, where geneticists construct complex phylogenetic trees to represent the evolution of species or genes. But the techniques can also be used for other tree-structured data such as file systems, parse trees, decision trees, and organisational hierarchies. To investigate our techniques, we have built a prototype application that reads and displays phylogenetic trees in the popular Nexus format. The application incorporates a variety of interactive and automated visualisation techniques, and is implemented in Java. We are working with biologists to see how well the techniques work for real-world data.

Keywords: Interactive visualisation, phylogenetic trees, bioinformatics.

1 Introduction

Tree-structured data occurs in many domains: file systems, parse trees, organisational hierarchies, and classification schemes of many kinds. The impetus for the work described in this paper is the domain of phylogenetic classification, which is used by geneticists to describe possible evolutionary relationships between species or individuals. Although we have developed our techniques specifically for that domain, many of them could also be applied to other domains that use similar trees.

This paper presents techniques for visualising pairs of phylogenetic trees in order to emphasise the similarity of the trees while also highlighting how they differ.
We have implemented these techniques in the context of a prototype tool for interactively visualising phylogenetic trees, and are in the process of evaluating the effectiveness of the tool for real phylogenetic data.

The remainder of this paper is organised as follows. Section 2 provides an introduction to the bioinformatics basis of phylogenetic trees and outlines other work that has investigated the visualisation and comparison of such data. Section 3 presents our approach to the problem and details several of the algorithms we use to compute visualisations. Section 4 briefly describes implementation details for the prototype visualisation tool, shows examples of its interface, and discusses its use.

Copyright 2006, Australian Computer Society, Inc. This paper appeared at the Seventh Australasian User Interface Conference (AUIC2006), Hobart, Australia. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 50. Wayne Piekarski, Ed. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

2 Related Work

2.1 Bioinformatics Context

Biologists and geneticists use phylogenetic trees to represent the evolutionary interrelationships between collections of related species or genes. The discovery and analysis of those relationships may help in many practical applications such as drug discovery, forensics, disease control, and ecological modelling.

Biologists construct phylogenetic trees by examining the phenotypes or genotypes of a collection of organisms and attempting to infer the evolutionary process by which the organisms came to be. For example, a geneticist might obtain DNA sequence data from a range of species or from individuals within a population. Then, by comparing the sequences, she could infer how the sampled organisms might have evolved via a series of mutations, each caused by one change in the DNA sequence.
This hypothesised evolutionary history is then represented as a tree of life showing how possible ancestors could have led to the current organisms.

Bioinformaticists have devised a range of algorithms, based on strategies such as Maximum Likelihood (Felsenstein et al. 1982) and Maximum Parsimony (Farris 1983), for computing such phylogenetic trees. However, there is no gold standard; current practice dictates that several different methods be applied to the sequence data (Thorup 1994). When this happens, biologists often need to compare several similar trees in order to get a more complete picture of the relationships involved. A similar situation arises when several species have evolved in close association (co-evolution); the biologist might be interested in understanding how the phylogenetic tree for one species compares with that for the co-evolved species.

In its simplest form, a phylogenetic tree is drawn as a rooted binary tree. Each leaf node represents an actual species or organism; each internal node represents a hypothetical ancestor at which mutation is assumed to have occurred (and which therefore has exactly two branches). For example, Figure 1 shows two (clearly fictitious) trees that suggest two possible ways in which 4 present-day species might be related. Tree 1 implies that and diverged recently from a common ancestor, that the / ancestor and share a more distant common ancestry, and finally that the whole // tree split from the branch even further in the past. Tree 2, on the other hand, suggests that a common ancestor split into two branches, one ultimately leading to and and the other to and.

Figure 1: Fictitious phylogenetic trees (Tree 1 and Tree 2)

Real phylogenetic trees will of course be much larger and thus more complex. Understanding such trees requires visual inspection, structural comparison, and interactive manipulation and exploration, and thus presents a number of visualisation challenges (Carrizo 2004). Biologists faced with inadequate visualisation tools for comparing trees have had to rely instead on paper, tape, and highlighter pens (Munzner et al. 2003).

2.2 Tree Comparison Techniques

Bioinformaticists use a variety of techniques to compare phylogenetic trees. Section 3.1 describes how we apply and extend some of these techniques in visualising trees.

Consensus trees are widely used to summarise the agreement between a set of trees. A consensus tree represents a lowest common denominator of two or more trees; it depicts those aspects that the individual trees all agree on. Bryant (1997) reports on a variety of methods for creating a consensus tree, including the strict, majority rule, semi-strict, and Nelson and Adams techniques. For example, the strict consensus tree of the trees in Figure 1 is as follows:

The consensus tree indicates that both trees agree that and had a recent common ancestor, but disagree about how and fit into the picture. The best that can be said is that and both shared a common ancestor with the / ancestor at some time in the past.
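The strict consensus idea can be illustrated with a small sketch. It is written in Python rather than the Java of the prototype, and it is not taken from the tool. The leaf labels a to d are hypothetical stand-ins (the labels of Figure 1 are not available here), and the two topologies are inferred from the description of Figure 1 above. Each tree is a nested tuple, and the strict consensus is characterised by the leaf groupings that occur in both trees.

```python
def clusters(tree):
    """Collect the leaf set found below every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):              # a leaf is represented by its label
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


# Hypothetical labels a-d; the topologies are inferred from the Figure 1 description.
tree1 = ((("a", "b"), "c"), "d")
tree2 = (("a", "b"), ("c", "d"))

# Strict consensus: keep only the groupings that appear in both trees.
strict = clusters(tree1) & clusters(tree2)
for group in sorted(strict, key=len):
    print(sorted(group))
```

Under these assumptions the only shared groupings are the pair a, b and the full leaf set, which corresponds to a consensus tree in which the a/b ancestor, c, and d all hang from a single unresolved node.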
Note that a consensus tree includes all of the original leaf nodes, but is normally not fully resolved; areas of disagreement generally result in interior nodes with more than two branches.

An agreement subtree is a subtree that is common to two or more trees. Conceptually, a subtree can be obtained by pruning leaf nodes (and collapsing the parent internal nodes) from the original tree. An agreement subtree is a subtree that can be extracted in such a manner from all of the trees. A greatest agreement subtree (GAS) is an agreement subtree with the greatest number of leaf nodes. For example, the trees in Figure 1 have two greatest agreement subtrees:

Note that a greatest agreement subtree does not normally include all of the leaf nodes (unless, of course, the trees are identical).

A triplet is a 3-node subtree and represents the smallest informative subtree of a rooted tree. The structure of a tree is fully characterised by enumerating the structure of its triplets. For example, Tree 1 has the following triplets:

Triplets can be used as a basis for quantifying the difference between rooted trees. Using this approach, the structural difference between two trees is the number of triplets whose structure is different in the two trees. For example, Tree 2 has the following triplets.

Since 2 triplets (the second and third) are different from the corresponding triplets in Tree 1, the structural triplet difference between the two trees is 2.

The nearest neighbour interchange (NNI) technique (Robinson 1971) is also used to quantify the difference between trees. A nearest neighbour interchange is an interchange of two nearest neighbour branches. The NNI difference between two trees is the minimum number of such interchanges needed to convert one tree into the other. NNI is usually applied to unrooted trees, but can be adapted for rooted trees. For rooted trees, the nearest neighbour of a branch is one of the sub-branches (if they exist) of its sibling.
For example, in Tree 1 the nearest neighbours of the branch are the and branches, and the nearest neighbours of the branch are the branch and the (unlabelled) common / ancestor branch.
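The rooted nearest-neighbour definition above can be sketched in a few lines of Python. This is an illustrative reading of the definition, not code from the prototype: the labels a to d are hypothetical, the topologies are inferred from the description of Figure 1, and `canon` is a helper introduced here so that trees differing only in branch order compare equal.

```python
def nni_neighbours(node):
    """Yield every tree that is one rooted NNI step away from a nested-tuple tree."""
    if isinstance(node, str):
        return
    left, right = node
    # Interchange one branch with a sub-branch of its sibling (sibling must be internal).
    if not isinstance(right, str):
        r1, r2 = right
        yield (r1, (left, r2))
        yield (r2, (r1, left))
    if not isinstance(left, str):
        l1, l2 = left
        yield ((right, l2), l1)
        yield ((l1, right), l2)
    # The same step may also be applied anywhere inside either branch.
    for changed in nni_neighbours(left):
        yield (changed, right)
    for changed in nni_neighbours(right):
        yield (left, changed)


def canon(node):
    """Order-independent form, so that flipped drawings of a tree compare equal."""
    if isinstance(node, str):
        return node
    return tuple(sorted((canon(child) for child in node), key=repr))


# Hypothetical labels a-d; topologies inferred from the Figure 1 description.
tree1 = ((("a", "b"), "c"), "d")
tree2 = (("a", "b"), ("c", "d"))

one_step = {canon(t) for t in nni_neighbours(tree1)}
print(len(one_step), canon(tree2) in one_step)
```

With these stand-in trees there are four distinct one-step neighbours of Tree 1, and Tree 2 is among them.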

Using this definition of nearest neighbours, the following trees can all be obtained by one NNI step from Tree 1.

Since one of these (the bottom right) is structurally identical with Tree 2, the NNI difference between the two trees is 1.

2.3 Tools for Phylogenetic Tree Analysis and Visualisation

Biologists use many applications to analyse and understand phylogenetic data. This section briefly describes four of the most popular tools that are freely available over the Internet. A comprehensive list of other tools is provided on the PHYLIP web site (Felsenstein 2005).

Gibas and Jambeck (2001) report that the most widely used phylogenetic analysis package is PHYLIP (Felsenstein 2005), which contains more than 30 programs that implement different phylogenetic algorithms. It has programs for tree plotting, heuristic tree search, interactive tree manipulation, and other phylogenetic analysis methods. Table 1 lists the PHYLIP programs that users are most likely to use to analyse protein and DNA sequence data.

Program     Use
PROTPARS    Infers phylogenies from protein sequences using parsimony method
NEIGHBOR    Infers phylogenies from distance matrix data using either pairwise
            clustering or neighbour joining methods
DRAWGRAM    Draws a rooted tree based on output from one of the phylogeny
            inference programs
CONSENSE    Computes a consensus tree from a group of phylogenies
RETREE      Allows interactive manipulation of a tree

Table 1: Selected PHYLIP programs

The COMPONENT application (Roderic 1993) can both display and analyse phylogenetic trees. Its emphasis is on computing comparative metrics between trees, although it includes simple interactive editing operations such as rearranging tree branches, deleting nodes, and rerooting trees.

Mesquite (Maddison and Maddison 2005) is a system that its developers describe as "a modular system for evolutionary analysis". Available modules include components for construction and comparison of phylogenetic trees.
The TreeSet Visualisation module (Klinger and Amenta 2002) produces point-set visualisations that suggest clustering within large sets of trees.

TreeJuxtaposer (Munzner et al. 2003) supports structural comparison of trees. The tool can highlight parts of several trees that are structurally similar, although its emphasis is on efficiently handling very large trees (up to several hundred thousand nodes) rather than on identifying the specific differences between the trees.

3 Visualising Tree Differences

Our approach to visualising tree similarities and differences makes use of the fact that a tree with unordered branches can be drawn in many arrangements. In a phylogenetic tree, the order in which branches appear is usually less important than the structural relationships between nodes. In such cases, we can take advantage of this flexibility to draw a pair of trees to highlight both their similarities and differences.

Our technique is to draw the pair of trees face-to-face, with the arrangements of each tree chosen to best emphasise the similarities and highlight the differences. For example, the trees in Figure 1 could be drawn as follows:

This arrangement shows the greatest agreement subtree (,, ) and also how the differing node () connects in the two trees. In essence, it suggests that in one case diverged from the / line, whereas in the other it diverged from the line.

Typical phylogenetic trees can often have 5 or more nodes, and since the number of possible arrangements of a fully resolved tree of size n is $2^{n-1}$ (one independent flip for each of the n-1 internal nodes of a tree with n leaves), it is usually impractical to manually determine the best arrangement. To help in the process we have considered several strategies for automatically arranging the trees. The minimum triplet difference (MTD) algorithm computes arrangements of two trees for which the difference, as measured by triplet arrangement pattern, is minimised.
The maximum branch similarity (MBS) algorithm arranges one tree so that its branches have as many leaf nodes as possible in common with the corresponding branch in the other tree.

The all-but-n (ABn) algorithm attempts to arrange the common structures of the two trees so that the nodes that differ can be drawn in alignment.

3.1 Minimum Triplet Difference

Nodes in a triplet can be labelled in 3 distinct ways, and there are 4 distinct arrangements for each labelling, making a total of 12 possible labelled triplet patterns, as shown in Figure 2.

Figure 2: Labelled triplet arrangement patterns (twelve panels, labelled A to L)

The nodes in the figure are labelled to suggest how the triplet pattern is assigned to a particular labelled tree. The label is assigned to the tree node with the lowest ordinal number (in some domain-specific ordering) of the three triplet nodes. Similarly, the label is assigned to the tree node with the highest ordinal number, and the label is assigned to the tree node with the intermediate ordinal number. For example, using an alphabetic ordering for the labels in the trees of Figure 1 and considering the triplet (,, ), label would map to (the label with the lowest ordering), to, and to (the highest ordering). Thus this triplet in Tree 1 would match pattern A, and the same triplet in Tree 2 would match pattern J.

The triplet difference between two trees is computed by considering all triplets and counting the number of triplets for which the pattern in the two trees is different. For example, Table 2 lists the triplet patterns for all four of the triplets in the trees of Figure 1. Since three of the triplets have different patterns in the two trees, the triplet difference for these tree arrangements is 3.

Triplet     Tree 1 pattern   Tree 2 pattern
(,, )       A                J
(,, )       A                D
(,, )       A                D
(,, )       G                G

Table 2: Triplet patterns for Tree 1 and Tree 2

Axes: Tree 2 arrangement, Tree 1 arrangement (8 arrangements each)

3 4 4 4 2 4 3 4
3 4 4 4 2 4 3 4
4 3 4 4 4 2 4 3
4 3 4 4 4 2 4 3
3 4 2 4 4 4 3 4
3 4 2 4 4 4 3 4
4 3 4 2 4 4 4 3
4 3 4 2 4 4 4 3
Table 3: Triplet difference matrix for all possible arrangements of Tree 1 and Tree 2

The minimum triplet difference (MTD) algorithm finds an arrangement for each tree that minimises the triplet difference. In principle, the algorithm considers each possible arrangement of each of the two trees, then chooses the pair of arrangements for which the triplet difference is smallest. In general, there may be many pairs of arrangements with the same minimum triplet difference; MTD does not specify which such pair should be chosen. For example, Table 3 lists the triplet difference for each pairing of the 8 possible arrangements of Tree 1 with the 8 possible arrangements of Tree 2. In this case the minimum difference is 2, which is achieved by 8 pairs, of which one is as follows:

3.2 Maximum Branch Similarity

The maximum branch similarity (MBS) algorithm arranges one tree so that the branches of each internal node have the largest number of leaf nodes in common with the corresponding branches of the equivalent node in the other tree. For example, consider the original arrangements of the trees in Figure 1. The set of leaf nodes comprised by the upper branch of the root node of Tree 1 is {}, and the set comprised by the lower branch is {,, }. Similarly, the Tree 2 root node upper branch comprises {, } and lower branch {, }. Thus for this arrangement there are no nodes common to the upper branches, and only one () common to the lower branches, for a total common node count of 1. However, if the branches of the Tree 2 root node were exchanged, then the upper branches would have 1 common node () and the lower branches would have 2 common nodes ( and ), for a total common node count of 3. Thus MBS indicates that the root node of Tree 2 should be flipped (its branches swapped), giving the following arrangement:

The algorithm then recursively considers the upper and lower children of the original nodes, ultimately terminating at the leaf nodes. In this simple example, no further swaps occur since the upper branch of Tree 1 is already a leaf, and since flipping the lower branch of Tree 2 would not result in an increase in the number of common nodes (both alternatives have only 1 node in common).

3.3 All-But-n

We have explored a class of algorithms, which we call All-But-n (ABn), that can arrange trees to maximise leaf node alignment in a face-to-face display where the GAS of the two trees is almost as large as the trees themselves (in other words, where the trees differ with respect to just a few nodes).

The simplest situation (AB1) occurs for trees for which the GAS includes all but one node. In this case, the aim of the algorithm is to choose an arrangement for the GAS so that, when the differing node is re-inserted into the tree (which will be in a different position in the two trees), the differing nodes will be aligned. For example, the trees of Figure 1, which have a GAS that excludes the single node, could be drawn as follows.

AB1 partitions the GAS into three components at the nearest common ancestor (NCA) of the points in the two original trees at which the different node is attached to the GAS.
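Returning briefly to Section 3.2, the MBS flip rule can be sketched as follows. This is a minimal Python illustration rather than the prototype's Java code. It assumes hypothetical leaf labels a to d, drawn arrangements inferred from the counts given in the MBS example (with the a/b pair assumed to be drawn above c inside Tree 1), and a strict "flip only if the count increases" rule as described in the text. Each tree is a nested tuple of the form (upper, lower).

```python
def leaves(node):
    """Set of leaf labels below a node of a nested-tuple (upper, lower) tree."""
    return {node} if isinstance(node, str) else leaves(node[0]) | leaves(node[1])


def common_leaves(ref, other):
    """Leaves shared by the upper branches plus leaves shared by the lower branches."""
    return (len(leaves(ref[0]) & leaves(other[0]))
            + len(leaves(ref[1]) & leaves(other[1])))


def mbs(ref, other):
    """Rearrange other to resemble ref, flipping a node only if that raises the count."""
    if isinstance(ref, str) or isinstance(other, str):
        return other                            # recursion stops at a leaf
    flipped = (other[1], other[0])
    if common_leaves(ref, flipped) > common_leaves(ref, other):
        other = flipped
    return (mbs(ref[0], other[0]), mbs(ref[1], other[1]))


# Hypothetical labels a-d; drawn arrangements inferred from the MBS example.
tree1 = ("d", (("a", "b"), "c"))
tree2 = (("a", "b"), ("c", "d"))

arranged = mbs(tree1, tree2)
print(common_leaves(tree1, tree2), common_leaves(tree1, arranged), arranged)
```

For these stand-in trees the root of Tree 2 is flipped, raising the common node count from 1 to 3, and no further flips occur below the root, in line with the worked example.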
Of the three AB1 components, the one above the NCA (the outer tree) plays no further part in the algorithm. The algorithm proceeds by arranging the upper and lower inner branches of the NCA so that the missing node attachment for one tree is on the lower boundary of the upper inner branch, while for the other tree it is on the upper boundary of the lower inner branch. Then, when the two trees are constructed around face-to-face copies of the GAS, the missing node insertion points will coincide. Since it is always possible to arrange a tree so that any one particular node is on the tree boundary, it is always possible to achieve this arrangement when the GAS is only one node short of the full trees.

When more than one node must be pruned (and subsequently reinserted), the situation is more complex; sometimes full alignment can be achieved, but sometimes only partial alignment is possible. A full explanation of the ABn algorithm is beyond the scope of this paper.

4 A Visual Tree Comparison Tool

We have implemented a prototype application for visualising pairs of phylogenetic trees and used it as a vehicle for developing and evaluating our ideas. The application is implemented in Java using the Swing components.

Figure 3 shows the prototype tool displaying two 5-node trees. The program can read standard Nexus-format tree files (David et al. 1997) and display a selected pair of trees. It provides controls for specifying basic parameters of the tree display, including the separation between branches and the depth of each node. The information display area at the bottom of the window provides basic information about the trees and is used largely for debugging.

Figure 3 shows the trees displayed in the raw arrangement specified in the Nexus file; in this example, that arrangement does not make it easy to compare the trees. However, the node connection display (between the two trees), which visually connects common leaf nodes in the two trees, provides some indication of similarities in the trees.
Horizontal connection lines (coloured green in the application) indicate nodes whose vertical position is the same in the two trees. Clearly, if the two trees (or parts of the trees) are identical, then they can be drawn so that all nodes are aligned, in which case the connection display would consist entirely of parallel horizontal lines. In Figure 3, few nodes are aligned (the exception is a group of 3 towards the top of the display). Slanted connection lines (coloured red or yellow depending on whether the position of the node in the left tree is higher or lower than that in the right tree) indicate nodes that are not aligned. However, parallel slanted lines indicate groups of nodes whose relative positions are the same in the two trees, suggesting a similar structure for those groups in the two trees. Figure 3 shows several such groups.
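The classification of connection lines described above can be sketched as follows. This is an assumption-laden Python illustration, not the prototype's drawing code: leaves are assumed to be drawn on evenly spaced rows with no inserted gaps, row 0 is the top of the display, the leaf labels a to d are hypothetical, and the function names are introduced here. Because the text does not say which of red or yellow means "higher", the sketch reports the direction rather than a colour.

```python
from collections import defaultdict


def leaf_order(node):
    """Leaves of a nested-tuple (upper, lower) tree, listed from top to bottom."""
    return [node] if isinstance(node, str) else leaf_order(node[0]) + leaf_order(node[1])


def connections(left, right):
    """Map each shared leaf to (offset, kind), where offset = right row - left row."""
    left_row = {leaf: row for row, leaf in enumerate(leaf_order(left))}
    right_row = {leaf: row for row, leaf in enumerate(leaf_order(right))}
    result = {}
    for leaf in left_row.keys() & right_row.keys():
        offset = right_row[leaf] - left_row[leaf]
        if offset == 0:
            kind = "aligned"                    # horizontal line
        else:
            kind = "left higher" if offset > 0 else "left lower"
        result[leaf] = (offset, kind)
    return result


def parallel_groups(conn):
    """Group slanted lines by offset; equal offsets give parallel lines."""
    groups = defaultdict(list)
    for leaf, (offset, _) in conn.items():
        if offset != 0:
            groups[offset].append(leaf)
    return {offset: sorted(ls) for offset, ls in groups.items() if len(ls) > 1}


# Hypothetical four-leaf example with stand-in labels a-d.
left = ("d", (("a", "b"), "c"))
right = (("a", "b"), ("c", "d"))
conn = connections(left, right)
print(conn["d"], parallel_groups(conn))
```

In this small example no leaf is aligned, but a, b, and c share the same offset, so their connection lines are parallel and hint at a common substructure.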

Figure 3: Visualisation tool interface

Figure 4: Collapsing interior nodes

4.1 Using the Application

To rearrange the trees (in order to better compare them), the user can use a combination of manual interaction and automatic rearrangement. The palette on the left of Figure 3 includes tools for interactively modifying the tree appearance, including selecting tree nodes, collapsing selected branches, controlling the spacing between branches of a node, swapping the upper and lower branches of a node, and manually setting branch colours and line thicknesses.

The collapse tool is used to temporarily hide various parts of the tree, as shown in Figure 4. Collapsing nodes enhances visibility, especially for larger trees, because it enables the user to focus on specific parts of the tree while ignoring other parts. Collapsed nodes can subsequently be expanded (and themselves arranged) once their containing structure has been dealt with.

The insert gap and decrease gap tools are used to adjust the space between branches in order to arrange a group of nodes so that they are located at the same level in both of the trees. The flip tool is used to swap the positions of the branches of a given interior node, which allows manual manipulation of the tree arrangement and may provide a simpler view of the tree structures.

The visualisation tool currently implements the MTD and MBS automatic rearrangement algorithms, but not the ABn algorithm. To apply the algorithms, the user selects a branch (or perhaps the entire tree) in both left and right trees, then invokes the desired algorithm. The application computes the new arrangements, then redraws the trees with the selected nodes rearranged.

4.2 Evaluation

Informal evaluation of our prototype visualisation tool has shown that a combination of automatic rearrangement and manual rearrangement is often effective in rapidly generating an arrangement that facilitates tree comparison, even for quite large trees.
For example, Figure 5 shows an arrangement of the trees in Figure 3 for which most nodes are aligned. The arrangement was achieved by a combination of MBS (applied to the whole trees to align high-level structure), MTD (to sort out the tangles indicated by groups of nearly parallel connecting lines), manual node flipping (to fine-tune a few branches), and manual gap insertion (to move relatively aligned groups into absolute alignment).
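The MTD step mentioned above can be sketched, for very small trees, as an exhaustive search over arrangements in the manner Section 3.1 describes "in principle". The Python below is an illustration and not the prototype's implementation. The leaf labels a to d are hypothetical, and instead of the pattern names A to L of Figure 2 (which are not available here) the drawn pattern of a triplet is represented by the tree pruned to its three leaves with drawing order preserved; that encoding is an assumption of this sketch, chosen because it also distinguishes 12 patterns per triplet.

```python
from itertools import combinations, product


def arrangements(node):
    """Yield every drawing of a binary tree: each internal node flipped or not."""
    if isinstance(node, str):
        yield node
        return
    for upper, lower in product(arrangements(node[0]), arrangements(node[1])):
        yield (upper, lower)
        yield (lower, upper)


def leaves(node):
    return {node} if isinstance(node, str) else leaves(node[0]) | leaves(node[1])


def restrict(node, keep):
    """Prune a drawn tree to the leaves in keep, preserving top-to-bottom order."""
    if isinstance(node, str):
        return node if node in keep else None
    parts = [p for p in (restrict(node[0], keep), restrict(node[1], keep)) if p is not None]
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else tuple(parts)


def triplet_difference(drawn1, drawn2):
    """Count the triplets whose drawn pattern differs between two arrangements."""
    triplets = combinations(sorted(leaves(drawn1) & leaves(drawn2)), 3)
    return sum(restrict(drawn1, set(t)) != restrict(drawn2, set(t)) for t in triplets)


def mtd(tree1, tree2):
    """Exhaustive search for a pair of arrangements with the smallest difference."""
    return min(product(arrangements(tree1), arrangements(tree2)),
               key=lambda pair: triplet_difference(*pair))


# Hypothetical labels a-d; arrangements inferred from the examples in Section 3.
tree1 = ("d", (("a", "b"), "c"))
tree2 = (("a", "b"), ("c", "d"))

all_diffs = [triplet_difference(a, b)
             for a, b in product(arrangements(tree1), arrangements(tree2))]
best1, best2 = mtd(tree1, tree2)
print(triplet_difference(tree1, tree2), min(all_diffs), all_diffs.count(min(all_diffs)))
```

Because the search visits $2^{n-1}$ arrangements of each tree, this approach is only practical for a handful of leaves; it is meant to make the definition concrete, not to suggest how larger trees should be handled.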

Figure 5: An arrangement with greater alignment

Note, however, that alignment of nodes does not necessarily indicate commonality of structure, although it does make it much easier to see such commonality. Figure 6 shows the same arrangement as Figure 5, but with common leaf-level branches highlighted in colour. The colouring algorithm finds nodes that have the same siblings in both trees, then recursively examines their parents. Note that not all aligned nodes have common structure (although most do) and that not all nodes with common structure are aligned (although most are).

Figure 6: The final presentation

Our current investigations suggest that the combination of alignment (to simplify the display) and colouring (to identify common structures) is a promising way to understand the two trees. We are working with our bioinformaticist colleagues to validate and further develop our ideas and to determine whether interactive visualisation is a viable technique for data of this kind.

5 Conclusion

Information visualisation can play a major role in the analysis of phylogenetic data by allowing geneticists to visually compare and therefore better understand their data. We have developed and are in the process of evaluating a prototype tool that domain specialists who deal with phylogenies can use to help understand the data that they confront. Although we have not yet done so, we believe that our ideas will also be of value in other domains where similarly structured data is used, and where comparisons are key in understanding the implications of that data.

Acknowledgements

We gratefully acknowledge the contribution of Rejmond Sejic, who built an early version of the prototype tool and implemented the MTD algorithm as part of his Honours project (Sejic 2004). Thank you also to our School of Biological Sciences colleagues Dr Cathy Abbott and Assoc. Prof. Mike Schwarz for their valuable insights and bioinformatics expertise.

References

Carrizo, S. F. (2004): Phylogenetic Trees: An Information Visualisation Perspective. In Proc. 2nd Asia-Pacific Bioinformatics Conference (APBC2004), Dunedin, New Zealand, Australian Computer Society, Inc.

Klinger, J. and Amenta, N. (2002): Case Study: Visualizing Sets of Evolutionary Trees. In Proc. IEEE Symposium on Information Visualization, Boston, Massachusetts, USA.

Roderic, D. M. (1993): Component 2. User Guide, http://taxonomy.zoology.gla.ac.uk/rod/cplite/manual.html (last accessed 8/8/25).

Bryant, D. (1997): Building Trees, Hunting for Trees and Comparing Trees. Ph.D. Thesis, University of Canterbury.

Sejic, R. (2004): Visual Comparison of Phylogenetic Trees. Honours Thesis, Flinders University of South Australia.

Gibas, C. and Jambeck, P. (2001): Bioinformatics Computer Skills. O'Reilly, USA.

David, R. M., David, L. S., and Wayne, P. M. (1997): NEXUS: An Extensible File Format for Systematic Information. Systematic Biology, 46(4):59, 62.

Munzner, T., Guimbretiere, F., Tasiran, S., Zhang, L. and Zhou, Y. (2003): TreeJuxtaposer: Scalable Tree Comparison Using Focus+Context with Guaranteed Visibility. In Proc. SIGGRAPH 2003.

Thorup, M. and Farach, M. (1994): Fast Comparison of Evolutionary Trees. In Proc. 5th Annual ACM-SIAM Symposium on Discrete Algorithms.

Maddison, W. P. and Maddison, D. R. (2005): Mesquite: a modular system for evolutionary analysis. Version 1.6 http://mesquiteproject.org

Felsenstein, J., Sawyer, S., and Kochin, R. (1982): An efficient method for matching nucleic acid sequences. Nucleic Acids Research 1(1): 133-139.

Farris, J. S. (1983): The logical basis of phylogenetic analysis. In Advances in Cladistics, Platnick N.I. & Funk V.A., eds, pp. 1-36. Columbia Uni. Press, New York.

Robinson, D. F. (1971): Comparison of labeled trees with valency three. J. Combin. Theory 11:15-119.

Felsenstein, J. (accessed 27/1/25): PHYLIP web site. http://evolution.genetics.washington.edu/phylip.html