Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring



Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring

Research Thesis

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Sagi Snir

Submitted to the Senate of the Technion - Israel Institute of Technology
Av, 5764, Haifa, August 2004

This research thesis was done under the supervision of Professor Benny Chor in the Department of Computer Science.

I'm glad for the opportunity to thank the people who helped me. First, I want to thank Benny. Above all, Benny has been a friend. Our collaboration started with a request for advice sent to New Zealand, which was answered by: "Why won't you come work with me?" Despite the complex student-advisor relationship, I could always turn to Benny with a request for advice or help. Our meetings were spent mainly on sharing stories, adventures, jokes or advice. Benny taught me that cafes are the ideal locations to advance research. Benny's questions helped me focus on the important things. It sometimes took me a few weeks to find that the whole issue was encompassed in such a question. Benny has sent me around the world to his friends, trips that resulted in all these friendly, productive collaborations. I'm very grateful to Benny for all of these.

I also want to thank Shlomo very much. In our very intensive, enjoyable collaboration during the last year, every meeting resulted in a new algorithm, some key definition or a deep insight. Although we did not continue to a Ph.D in distributed algorithms, and pursued another fruitless direction, I feel very lucky for eventually succeeding in dragging him into the convex recoloring project. I'm sure that convex recoloring will be a big thing one day. However, without Shlomo, it would not have moved beyond a sketch on paper.

I want to thank my other collaborators: Mike Hendy, whose Hadamard conjugation amazed me back when Benny sent me a bunch of papers from New Zealand. It has been thrilling to walk in Mike's footsteps. I want to thank Steve Skiena for our efficient collaboration on the SB(H+RE) project. I very much want to thank Zohar Yakhini, a friend and a running companion, for our long discussions (which could be even better if Zohar talked while running) and the productive ideas.

Thanks also to Eliezer Eskin, for getting me involved in the homology kernel project, for the nice conversations and advice, and for the wonderful time we spent throughout these years. Thanks to Mike Fellows for the unforgettable visit to Newcastle. The exposure to his parameterized complexity, stories and personality, and to the beaches of Newcastle, was a big, dense adventure. Thanks also to Amit Khetan for elevating the comb work to the level of a professional algebraic contribution. I very much want to thank Arie Freund and Dror Rawitz. Their friendship has been a big asset. Many thanks to Danny Geiger for his support and wise advice, to Ron Pinter for his informal enrichment and advice, to Ron Shamir and Mike Steel for enlightening ideas, to Esti for her very pleasant company during these four years, and to the friends on the sixth floor: Irad, Reuven, Orna, Hadas, Ami, Eldar, Eli, Shaul and Seffi. To Yahalomit and Hertzel (ozer) for facilitating getting to the Technion and Tel Aviv Univ. (resp.).

Above all, I want to thank my devoted partner Rachel, who supported me so much during all these travels and hard work, and took all of the family burden on her strong, wide shoulders.

The generous financial help of the Technion and the Center for Complexity Science Scholarship is gratefully acknowledged.

Contents

Abstract
Notation
1 Introduction
  Brief Biological Background
  Phylogenetic Trees
  Analytical Maximum Likelihood Phylogenetic Trees
  Convex Recoloring of Trees and Strings
2 Substitution Models and the Hadamard Conjugation
  Notations and the Hadamard Matrix
  Neyman Two States Substitution Models
  Kimura's 3-substitution Model
  Hadamard Conjugation of K3ST on a Phylogenetic Tree
3 Maximum Likelihood on Evolutionary Trees
  Introduction
  Maximum Likelihood on Four Taxa Trees
  General ML System
  Simplifying Identities
  Solving the molecular clock fork
  Additional Results
  Solving the molecular clock Comb
  Overview
  Obtaining the Solution
  Solving the system
  Unique ML molecular clock comb Proof
  Solving the Jukes Cantor Triplets
  Preliminaries
  Obtaining the Maximum Likelihood Solution
  Results on Genomic Sequences
  Consistency of a Tree Reconstruction Method
  Concluding Remarks
4 Minimum Convex Recoloring of Phylogenetic Trees
  Introduction
  Preliminaries
  NP-Hardness results
  Minimal Convex Recoloring of Strings is NP-Hard
  NP Hardness of Minimal Convex Recoloring of Leaves
  NP Hardness of Minimum Block-Recolorings
  Exact Algorithms
  Non-uniform Convex String Recoloring
  Enhanced algorithm
  Non-uniform Optimal Convex Recoloring of Binary Trees
  Fixed Parameter Tractable Recoloring Algorithms
  FPT Recoloring Algorithm for Bounded Degree Trees
  Approximation Algorithms for Convex Recoloring
  Lower Bounds via Penalties
  A 2-Approximation Algorithm for a String
  A 3-approximation algorithm for a tree
  Discussion and Future Work
References
Hebrew Abstract

List of Figures

1.1 The fork and comb two rooted topologies on four taxa
1.2 A rooted tree over 3 species
2.1 The three 4 taxa unrooted trees
(a) Kimura's 3 substitution model (K3ST). (b) Substitution types $t_\alpha = t_{01}$, $t_\beta = t_{10}$, $t_\gamma = t_{11}$ and t ɛ = t
(a): Example edge length spectra for the tree $T_{13}$. (b): Q = Q T
The fork and comb two rooted topologies on four taxa
A rooted tree over 3 species
Rooted layout of the (12)(34)-molecular clock-fork (left) and its unrooted version (right)
In the (12)(34) molecular clock fork, $q_1 = q_2$ and q 3 = q
A general tree with two sister taxa i and j s.t. $q_i = q_j$
r ε, log(r), q 12 as function of c
r as a function of ε
A triplet tree under the molecular clock satisfies q 1 = q
4.1 The primates tree, taken from The Tree of Life project
A schematic view of the colored string corresponding to F. Informative segments appear white (in the figure) where junk segments are longer and have distinct colors
A clause segment. The literals are $l_1$, $l_2$ and $l_3$, and the clause of size 3A consists of A repetitions of the corresponding triplet. Each block is a single vertex
$S_{x_j}$, the segment of the literal $x_j$. $m + 1$ $d_j$-blocks are interleaved by the $m$ blocks $c_{i,x_j}$, $i = 1, \ldots, m$
A recoloring of segments S xj and S xj corresponding to a satisfying assignment
4.6 A caterpillar of length
A reduction from a fully colored string to a leaf colored caterpillar. Two leaves in the two ends are colored with two new colors. All other leaves are colored with the same color as the corresponding vertex in the string
A convex recoloring of the input caterpillar. All blue (triangle) and green (circles) blocks were recolored, so a single vertex of each color can be retained without violating convexity
Removing the mutation at the left edge implies the removal of the one at the right
The input string (S, C) and the corresponding informative segment in (S z, C z )
The counter-weight segment in S z
Changing u's color to red (circle) reduces the number of violations by 1 and the number of violations by
The white good block is only partially overwritten in the optimal coloring
The red (circles) core defined by the three leaf blocks in the core. Filled shapes (vertices) are in the red core
The three right hand vertices are in the signature. The signature is invalid since the right red (second from right) is totally overwritten by the black core
A block in the signature (left side) that is partially overwritten by two other cores (partially filled vertices), is maximally expanded (right side)
The maximal sub block in a good block for a signature, is expanded
C is a convex recoloring for C which defines the following penalties: p green (C ) = 1, p red (C ) = 2, p blue (C ) =
The upper part of the figure shows the optimal blocks on the string and the lower part shows the coloring returned by the algorithm
Case 2: a vertex v is contained in 3 different containers
Case 3: Not case 1 nor 2. T is the subtree rooted at rd0 and ˆT = T \ T
Case 3a: No vertices of ˆT are colored by d
Case 3b: r d0 T d0 T d
REDUCE of case 3b: T is replaced with T0 where w1(v 0 ) = C high C min = 2 and w1(r d0 ) = C medium C min =

List of Tables

2.1 (a): Four aligned sequences with sixteen sites. (b): The corresponding observed sequence spectrum
The observed sequence spectrum of NK cell receptor D gene of human, mouse and rat

Abstract

This thesis is in the area of computational biology known as molecular phylogenetics. Specifically, we focus on two issues in phylogenetics: analytic maximum likelihood (ML) solutions and minimum convex recoloring for phylogenetic trees.

The first part is dedicated to the ML issue. Up to now, the only known general analytical ML solutions were for rooted triplets under the simplest evolutionary model, the Neyman two-state model under a molecular clock. We present here three ML solutions for three different topologies under two different evolutionary models. Each extends the existing knowledge by considering either a larger topology or a more general evolutionary model. The molecular clock fork and the comb extend the state of the art from three species to four while retaining the evolutionary model. The JC triplet retains the topology and the molecular clock assumption while extending the substitution model to a more realistic, four-state substitution model, the Jukes-Cantor (JC) model. A by-product of this last work is a first formal proof of the Hadamard conjugation for the Kimura substitution models. Although this technique has existed for ten years, a formal proof of it was never published.

The second part of this thesis deals with Minimum Convex Recoloring for Phylogenetic Trees. A coloring of a tree is convex if the vertices that pertain to any color induce a connected subtree; a partial coloring (which assigns colors to some of the vertices) is convex if it can be completed to a convex (total) coloring. Convex coloring of trees arises in areas such as phylogenetics and linguistics. For example, a perfect phylogenetic tree is one in which the states of each character induce a convex coloring of the tree. Research on perfect phylogeny is usually focused on finding a tree such that a few predetermined partial colorings of its vertices (i.e. each character is a different partial coloring of the leaves) are convex.

When a coloring of a tree is not convex, it is natural to ask how far it is from a convex one. That is, what is the minimal number of color changes at the vertices needed to make the coloring convex? This can be viewed as minimizing the number of exceptional vertices with respect to a closest convex coloring. We also study a similar measure, which aims at minimizing the number of exceptional edges with respect to a closest convex coloring. We show that finding each of these distances is NP-hard even for paths (or strings). We then focus on the first measure and generalize it to weighted trees, and then to non-uniform coloring costs (that is, a change between any two colors is associated with a different cost, in contrast to the uniform model, where a change between any two colors has unit cost).

On the positive side, we present a few algorithms for convex recoloring of strings and bounded degree trees. First, we present algorithms for optimal convex recolorings of strings and trees with non-uniform coloring costs which, for any fixed number of colors, are linear in the input size. Then we present algorithms for strings and bounded degree trees that run in time exponential in the number of changes required and linear in the input size. Finally, we present polynomial time approximation algorithms for convex recoloring of strings and trees.

Notation

(12)(34)-fork: A rooted tree with two sister taxa on each side of the root
((12)3)4-comb: A rooted tree with three taxa on one side of the root, in which 1 and 2 are siblings
$H_n$: Hadamard matrix of order n
$q$: The edge length spectrum
$s$: The expected sequence spectrum
$X$: The set of species
$e_\alpha$: The edge inducing the split $(\alpha, X \setminus \alpha)$
$\Pi_\alpha$: The path set induced by $\alpha$


Chapter 1

Introduction

This thesis is in an interdisciplinary area called computational biology, which is motivated by computational questions arising in molecular biology. This area offers many challenging algorithmic, combinatorial, probabilistic and optimization problems. The goal of this research is to devise algorithms, analyze complexity, solve open problems and give analytic results in the following specific topics of computational biology: analytic solutions of maximum likelihood evolutionary trees, and convex recoloring of phylogenetic trees. The remainder of the introduction is devoted to a brief molecular biological background and an overview of the rest of the dissertation.

1.1 Brief Biological Background

The complete set of instructions encoding an organism is called its genome. It contains the master blueprint for all cellular structures and biochemical processes for the lifetime of the cell or organism. The human genome physically consists of tightly coiled threads of deoxyribonucleic acid (DNA) and associated protein molecules, organized in 23 pairs of distinct, physically separate microscopic units called chromosomes. The nucleus of most human cells contains pairs of chromosomes, where in each pair, one chromosome originates from each parent. Each cell has 23 pairs of chromosomes: 22 pairs of regular autosomal chromosomes and a pair of sex chromosomes X, Y (females carry two X chromosomes, and males carry one X and one Y chromosome). Chromosomes can be seen under a light microscope and, when stained with certain dyes, reveal a pattern of light and dark bands. Differences in size and banding pattern allow the 23 chromosomes to be distinguished from each other, an analysis called a karyotype. A few types of major chromosomal abnormalities, including missing or extra copies of a chromosome or gross breaks and rejoinings (translocations), can be detected by microscopic examination; Down's syndrome, in which an individual's cells contain a third copy of chromosome 21, can be diagnosed by karyotype analysis. Most changes in DNA, however, are too subtle to be detected by such a technique and require finer molecular analysis. These subtle DNA abnormalities (mutations) are responsible for many inherited diseases such as cystic fibrosis and sickle cell anemia, or may predispose an individual to cancer, major psychiatric illnesses, and other complex diseases.

A DNA molecule is a polymer consisting of two interwound helical strands. Each strand is a sequence of nucleotides, or bases, drawn from the set {A, C, T, G}, where A stands for Adenine, C for Cytosine, T for Thymine and G for Guanine. The two strands are complementary in the sense that each A on one strand is bound to its complementary nucleotide T on the other, and each C is bound to a G. The particular order of these bases is called the DNA sequence; the sequence specifies the exact genetic instructions required to create a particular organism with its own unique traits. DNA molecules can be very long. For example, the size of an average human chromosome is about 150 million base pairs, while the entire human genome contains about 3 billion base pairs.

The genes, which are the basic physical and functional units of heredity, are arranged linearly along the chromosomes. A gene is a specific sequence of nucleotide bases, whose sequence carries the information required for constructing one or more proteins. Proteins provide the structural components of cells and tissues as well as enzymes for essential biochemical reactions. The human genome, for example, is estimated to contain approximately 30,000 genes.

1.2 Phylogenetic Trees

Given a set of taxa (a group of related biological species), the goal of phylogenetic reconstruction is to build a tree which best represents the course of evolution for this set over time.
The leaves of the tree are labelled with the given, extant taxa. Internal nodes correspond to hypothesized, extinct taxa. Because events of taxon divergence are assumed to be rare, the sought-after tree is bifurcating (or binary), with internal nodes of degree three. (In case of ambiguous data one might have to resort to multifurcating trees, which are less informative.) In the early days, morphological features were mostly used to study evolution. Today, molecular data are the primary basis for phylogenetic analysis of evolution, but other sources of information (for example palaeontological, anatomical, and morphological) are also in use. Still, our exposition will concentrate on molecular sequence data.

The first step in constructing a tree is to collect from an up-to-date database either DNA (typically genes), RNA, or amino acid sequences for all taxa under study. Homologous sequences (detected by similarities, or low edit distances) from different taxa are then grouped together. Homologous sequences for different taxa often have the same functionality (e.g. insulin, hemoglobin, etc.) and are assumed to be descendants of a common ancestral sequence. Their degree of similarity gives an indication of the time when two taxa diverged. Since the mutational process is assumed to be probabilistic in nature and to operate locally, we expect that longer periods of time since divergence usually imply more accumulated mutations. Still, different proteins may evolve at different rates. Combining this with the stochastic nature of the process, single proteins, viewed separately, may give conflicting indications as to the history of evolution. To overcome this random noise effect it is thus advisable to employ longer sequences, obtained by concatenating many sequences together.

Phylogeny reconstruction methods are broadly divided into character-based and distance-based methods. Distance-based methods start by computing evolutionary distances between pairs of taxa. Then a tree with weighted edges whose pairwise tree distances approximate the evolutionary distances is sought, typically by some version of the neighbor joining clustering paradigm [47]. In contrast, character-based methods work directly on character data. The best known and most widely used character-based methods are maximum parsimony [20] and maximum likelihood [16]. Maximum parsimony (MP) is a non-parametric combinatorial method, while maximum likelihood (ML) is a parametric statistical method. Despite its popularity, MP has the drawback that for certain ranges of input parameters it is inconsistent. This means that the sequence data can lead to the construction of an incorrect tree even as the number of sample points tends to infinity [15].

1.3 Analytical Maximum Likelihood Phylogenetic Trees

Maximum likelihood (ML) on molecular sequence data uses a model of evolution, which is usually a family of trees with n taxa at their leaves, and a substitution model. The parameters of the substitution model describe probabilities of changes in character states (e.g. point mutations in DNA nucleotides).
Given a set of n observed sequences, the goal is to find the best explanation for the data within the model space. In our context, this usually means a weighted tree (where the weights are parameters of the substitution model for each edge) that maximizes the likelihood (the conditional probability, under the model, of generating the observed sequences). One of the attractive properties of maximum likelihood is that two competing explanations for the same data can be evaluated not only qualitatively but also quantitatively, by comparing their log likelihood values (or lod scores). The current application of maximum likelihood for reconstructing evolution is due to Felsenstein [16], and has gained wide acceptance [7, 24, 46, 56]. The method is computationally intensive, but for tractable cases it is the method of choice.

Algorithmically, the likelihood is maximized separately for each tree in the family (pruning in the tree space is sometimes possible). The weighted tree (or trees) with maximum value(s) is then reported. There is no known analytical solution or direct algorithm that optimizes the edge parameters for a given tree. Existing algorithms [18, 55] use an iterative, hill climbing approach. For hill climbing to be guaranteed to find the maximum, there must be a single local and global maximum in the parameter space. Fukami and Tateno [21], and subsequently Tillier [58], have argued that for each tree, the ML point is indeed unique. However, Steel [52] showed that their proof was erroneous, and constructed a surprisingly simple counterexample (utilizing sequences with two sites and just four taxa). The example shown by Steel used highly pathological input data. Chor et al. [10] showed that multiple maxima can also occur on reasonable, non-pathological input data.

The molecular clock hypothesis assumes a constant rate of evolution across all lineages of the phylogenetic tree. This implies that a tree satisfying the molecular clock assumption can be rooted such that the length of the path from the root to each of the leaves is the same.

ML can be divided into two related problems: Big ML, which seeks the weighted tree that maximizes the likelihood of the data over the whole tree space, and Small ML, which operates on a given tree and seeks optimal edge weights (i.e. weights that maximize the likelihood on that tree). When the number of taxa is small, the big problem can be solved by solving the small problem for every tree in the tree space. The current state of analytical solutions for ML applies only to the small problem. However, since the number of species handled analytically is also small, this is not a serious limitation. This focuses the challenge on analytical small ML.

The first to consider analytical solutions for small ML with simple substitution models was Yang (2000), who worked on three taxa with the Neyman evolutionary model of symmetric two-state characters under a molecular clock [59]. Yang described his work as the simplest phylogeny estimation problem, but added that it has many of the conceptual and statistical complexities involved in phylogenetic estimation.
The solution of Yang was generalized and its derivation was simplified by Chor, Hendy and Penny [11] using the Hadamard conjugation of Hendy, Penny, and Steel (1994) [28, 29], together with convexity arguments.

The first part of this thesis deals with analytical ML solutions. Chapter 2 introduces the notations used in this part of the thesis, describes the two substitution models we work with, and explains the basic mechanism behind all these solutions: the Hadamard conjugation. Moreover, when working under the more complex substitution model of Kimura, the Hadamard conjugation becomes substantially more involved and requires comprehensive understanding. A first correctness proof for the Hadamard conjugation for the Kimura models is provided in that chapter as well.

Chapter 3 of this thesis presents three different extensions to the analytical small ML works on rooted triplets of [59, 11]. In each of these extensions, either the number of species is increased while the model is retained, or the model is extended and the number of species is retained. In the first two works, the number of taxa is increased from three to four while the underlying model of substitution is retained. There are two families of topologies on four taxa under the molecular clock model: topologies with two taxa in each subtree of the root, which we call fork topologies, and topologies where one subtree of the root has three taxa, which we call comb topologies. Recall that under a molecular clock, the distance from each of the four leaves to the root is the same (Figure 1.1).

[Figure 1.1: The fork and comb two rooted topologies on four taxa. Panels: (((1 2) 3) 4)-MC-comb and (1 2)(3 4)-MC-fork.]

We start with the simpler topology, the fork, in Section 3.3 and later extend it to the comb topology in Section 3.4. Section 3.5 investigates the same topology handled by [59, 11]: the molecular clock triplet (see Figure 1.2). However, the substitution model is extended from the simple two-state model of Neyman to the four-state model of JC.

1.4 Convex Recoloring of Trees and Strings

The second part of this thesis is in Chapter 4, and deals with combinatorial character-based approaches in phylogenetics. In biology, characters describe attributes of the species under consideration and are the data that biologists use to reconstruct phylogenetic trees. Characters can be morphological (for example, wings versus no wings), biochemical, physiological, behavioral, embryological, or molecular (for example, the nucleotide at a particular DNA sequence position, or the order of certain genes on a chromosome). In a rooted phylogenetic tree, one can view the character states along a path from the root to some species as evolving to that species' state.

[Figure 1.2: A rooted tree over 3 species.]

A natural biological goal is that the reconstructed phylogeny has the property that each of the characters could have evolved without reverse or convergent transitions. In a reverse transition, some species regains a character state of some old ancestor whilst its direct ancestor has lost this state. A convergent transition occurs if two species possess the same character state, while their least common ancestor possesses a different state. The concept behind this constraint is that of innovation. That is, each time the character state changes, it acquires a new state. The innovation assumption excludes reverse and convergent transitions, and a character satisfying it is alternatively called a homoplasy-free character [50].

A character in a phylogenetic tree can be viewed as a coloring of the tree vertices, where each color represents one of the character's states. A character is homoplasy free iff the corresponding coloring is convex, that is, the set of vertices having the same color (state) induces a subtree. Thus, the above discussion implies that in a phylogenetic tree, each character is likely to be convex. This makes convexity a fundamental property in the context of phylogenetic trees.

In this work we introduce a natural criterion, suggested by the concept of Hamming distance between vectors: the minimal number of species whose states should be changed to make the given character convex. This measure is motivated by a scenario in which the evolutionary tree of a set of species is already known (e.g. the primates tree shown in Figure 4.1). Given a character relating to a subset of all the species (extant and extinct) in the tree, we want to know how much this character agrees with the tree, that is, how far this character is from perfect phylogeny on the given tree.

Indeed, the evolution of a character which can be made convex by removing a small number of exceptions can be explained by searching for biological reasons for a few exceptional phenomena, as was done for the two above mentioned cases (see e.g. [40]). If, however, a very large number of state changes is needed to make the character convex, a biological explanation becomes less probable, and the reliability of the given phylogeny as a correct description of the evolution of this character diminishes accordingly. Another measure that we discuss is the minimal number of state changes that should be removed to make the character convex on the tree, or the minimum number of changes of both types which are needed to achieve this purpose.

We note that our problem for partially colored trees bears some similarity to the small parsimony problem. In both problems a tree with a (partial) assignment of states to the vertices is given. The small parsimony problem finds a coloring with the minimum number of violations of the perfect phylogeny property, and the convex recoloring problem finds the minimum number of color changes needed to achieve perfect phylogeny. It is therefore a bit surprising that while the weighted version of the maximum parsimony problem has an efficient optimal solution [48], the unweighted version of minimum convex recoloring is NP-hard, as we show here.

We also study two more general cost functions, which enable more flexibility in measuring the distance of a given coloring from a convex one. The first allows weights on the vertices (the weight of a vertex reflects the certainty we have in its state/color). The most general criterion also allows a non-uniform cost function for color changes (details are in Section 4.2).
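The convexity property defined above can be checked directly from its definition. The following is a minimal sketch, assuming a totally colored tree given as an adjacency dictionary; the function name `is_convex` and the data layout are illustrative choices, not taken from the thesis.

```python
from collections import defaultdict


def is_convex(adj, color):
    """Return True if every color class induces a connected subtree.

    adj   -- dict mapping each vertex to a list of its neighbours (a tree)
    color -- dict mapping each vertex to its color (a total coloring)
    """
    by_color = defaultdict(set)
    for v, c in color.items():
        by_color[c].add(v)

    for verts in by_color.values():
        # Depth-first search restricted to vertices of this color.
        start = next(iter(verts))
        seen = {start}
        stack = [start]
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w in verts and w not in seen:
                    seen.add(w)
                    stack.append(w)
        # If some vertex of this color was not reached, the class is disconnected.
        if seen != verts:
            return False
    return True


# A path a - b - c - d, i.e. a string of four vertices.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
convex_example = is_convex(path, {"a": "red", "b": "red", "c": "blue", "d": "blue"})
nonconvex_example = is_convex(path, {"a": "red", "b": "blue", "c": "red", "d": "blue"})
```

In the second coloring the red vertices are separated by a blue one, which corresponds to a reverse or convergent transition in the terminology above.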

In studying our problem, we first focus on a very simple form of a tree, a string (or a path), which seems to be interesting in its own right. Then we extend the results (if possible) to a tree. Our negative results show that the (unweighted) versions of the above problems are hard even for a very simple tree topology, a string, and for the case where character states are given only at the leaves (so that changes on extant species are not counted); we also prove that finding the minimum number of mutation removals needed to obtain convexity, in a sense to be defined, is NP-hard.

On the positive side, we present dynamic programming algorithms for strings and bounded degree trees. The first algorithm is designed to solve the non-uniform versions of the problem. The algorithm runs in linear time for any fixed number of colors (i.e. the time is linear in the input size and exponential in the number of colors). Then we show that for strings and trees of bounded degree, the (unweighted version of the) problem can be solved by a fixed parameter tractable algorithm. The proof of this result is based on identifying signatures of recolorings, so that the number of possible signatures is bounded from above by the number of recolored vertices in an optimal solution, and then presenting a polynomial time algorithm for a minimal convex recoloring with a given signature. Finally, we present polynomial time algorithms for 2-approximation for strings and 3-approximation for trees, for the weighted versions of the problem.


Chapter 2

Substitution Models and the Hadamard Conjugation

A central tool in our analytical solutions is the Hadamard conjugation [28, 29]. It is applicable to group-like models of substitution. In this chapter we define these applicable models of substitution. Hadamard conjugation is an invertible transformation that links the probabilities of site substitutions on edges of an evolutionary tree, T, to the probabilities of obtaining each possible combination of characters. The Hadamard conjugation is applicable to a number of site substitution models: the Neyman 2-state model, the Jukes-Cantor model [33], and the Kimura 2ST and 3ST models [39] (the last three are applicable to normal, four-state DNA). For these models, the transformation yields a powerful tool which greatly simplifies and unifies the analysis of phylogenetic data, and in particular the analytical approach to ML.

We begin by introducing notations that are used by all analytic solutions. We also define the specific Hadamard matrix we use. Next, we introduce the simplest model, the Neyman 2-state model [42], along with the corresponding Hadamard conjugation. Subsequently, we define the family of four-state substitution models of Kimura. Although the Hadamard conjugation for the Kimura models is seemingly identical to that of the Neyman model, the underlying mathematics is substantially different. Moreover, the correctness of the Hadamard conjugation for the Kimura models was never formally proved. In the last section of this chapter, we explain and prove for the first time the appropriate Hadamard conjugation for this family of models.

2.1 Notations and the Hadamard Matrix

We start with notations that are common to both models and will be useful for the rest of the work: tree labelling. Let $X = \{1, 2, \ldots, n\}$ be a set of n taxa represented by the sequences $\sigma_1, \sigma_2, \ldots, \sigma_n$ (these sequences are of equal length, and are assumed to be homologous and aligned).
We select a reference taxon, n, and let X = X {n} be the non-referenced taxa. Consider now an evolutionary tree, T, on the taxa set 13

$X$. The leaves of $T$ are labelled by the elements of $X$. For $i, j \in X$ we define the path $\Pi_{i,j}$ to be the set of edges in $T$ connecting leaf $i$ to leaf $j$. Each edge $e$ of $T$ defines a unique split $A \mid X \setminus A$ among the taxa set $X$, induced by deleting $e$ from $T$. We index $e$ as $e_A$, where $A = \{i \mid e \in \Pi_{i,n}\} \subseteq X'$ is the set of taxa separated from $n$ by $e$.

Figure 2.1: The three 4-taxa unrooted trees $T_{12}$, $T_{13}$ and $T_{23}$.

Figure 2.1 shows the three unrooted trees on the taxa set $X = \{1, 2, 3, 4\}$. The subscript $A$ of $e_A$ ($A \subseteq X' = \{1, 2, 3\}$) is the set of taxa isolated from the reference taxon 4 on the deletion of that edge. The subscripts on the edges are abbreviated so that, for example, $e_{23}$ stands for $e_{\{2,3\}}$. The trees are indexed as $T_A$, where $A$ is the index of the internal edge.

The Hadamard matrix is used by both types of the Hadamard conjugation and is defined below.

Definition. A Hadamard matrix of order $l$ is an $l \times l$ matrix $A$ with $\pm 1$ entries such that $A^t A = l I_l$.

We will use a special family of Hadamard matrices, called Sylvester matrices in MacWilliams and Sloane (1977, p. 45), defined inductively for $n \ge 0$ by

$$H_0 = [1] \quad \text{and} \quad H_{n+1} = \begin{bmatrix} H_n & H_n \\ H_n & -H_n \end{bmatrix}.$$

For example,

$$H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \quad \text{and} \quad H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}.$$

We encode a subset of $\{1, \ldots, n\}$ as an $n$-bit binary number where the $i$th least significant bit ($i = 1, \ldots, n$) is 1 if $i$ is in the subset, and 0 otherwise. Using this representation, it is convenient to index the rows and columns of $H_n$ by subsets of $\{1, \ldots, n\}$ in lexicographically increasing order (i.e. $\emptyset, \{1\}, \{2\}, \{1, 2\}, \ldots$). Denote by $h_{\alpha,\gamma}$ the general element of $H_n$, that is, the element at the $(\alpha, \gamma)$ entry of $H_n$.

Observation. $h_{\alpha,\gamma} = (-1)^{|\alpha \cap \gamma|}$. This implies that $H_n$ is symmetric, namely $H_n^t = H_n$, and thus by the definition of Hadamard matrices $H_n^{-1} = \frac{1}{2^n} H_n$.
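The inductive definition and the observation above can be checked mechanically. The following is a small sketch in plain Python; the helper names `sylvester` and `sign_from_subsets` are hypothetical. Subsets are encoded as integers exactly as described in the text, so the subset $\{1, 3\}$ is the bitmask `0b101`.

```python
from typing import List


def sylvester(n: int) -> List[List[int]]:
    """Return H_n, built by H_0 = [1] and H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    h = [[1]]
    for _ in range(n):
        top = [row + row for row in h]
        bottom = [row + [-x for x in row] for row in h]
        h = top + bottom
    return h


def sign_from_subsets(a: int, g: int) -> int:
    """(-1)^{|a ∩ g|} for subsets encoded as bitmasks (bit i-1 is set iff i is in the subset)."""
    return -1 if bin(a & g).count("1") % 2 else 1


n = 3
H = sylvester(n)
size = 2 ** n

# Observation: entry (a, g) of H_n equals (-1)^{|a ∩ g|}.
matches_observation = all(
    H[a][g] == sign_from_subsets(a, g) for a in range(size) for g in range(size)
)

# Definition: H_n^t H_n = 2^n I, hence H_n^{-1} = H_n / 2^n.
gram = [
    [sum(H[k][i] * H[k][j] for k in range(size)) for j in range(size)]
    for i in range(size)
]
is_hadamard = all(
    gram[i][j] == (size if i == j else 0) for i in range(size) for j in range(size)
)

print(matches_observation, is_hadamard)
```

Because row and column indices are the integer codes of the subsets, the lexicographic order $\emptyset, \{1\}, \{2\}, \{1,2\}, \ldots$ is simply the order 0, 1, 2, 3, and so on.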

Observation.

$$h_{A,B} = h_{B,A}, \quad \text{and} \quad h_{A,B} = h_{A,C}\, h_{A,D}, \tag{2.1}$$

where $C, D$ form a partition of $B$ (i.e. $C \cup D = B$, $C \cap D = \emptyset$).

2.2 Neyman Two States Substitution Models

In the Neyman 2-state model [42], each character at a species admits one of two states, without loss of generality $\{x, y\}$. Hence, a character evolving along an evolutionary tree $T$ with $n$ leaves induces a split pattern between the leaves admitting the state $x$ and those admitting the state $y$. To every edge $e \in E(T)$ we assign a length $q_e$, defined as the expected number of substitutions (changes) per site along that edge. Given the edge lengths of $T$, $q = [q_e]_{e \in E(T)}$ ($0 \le q_e < \infty$), the probability of generating an $\alpha$-split pattern ($\alpha \subseteq \{1, \ldots, n-1\}$) is well defined. Denote this probability by $s_\alpha = \Pr(\alpha\text{-split} \mid T, q)$. Using the same indexing scheme as above, we define the expected sequence spectrum (expected spec) $s = [s_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$. The edge length spectrum (edge spec) of a tree $T$ with $n$ leaves is the $2^{n-1}$ dimensional vector $q = [q_\alpha]_{\alpha \subseteq \{1, \ldots, n-1\}}$, defined for any subset $\alpha \subseteq \{1, \ldots, n-1\}$ by

$$q_\alpha = \begin{cases} q_e & \text{if } e \in E(T) \text{ induces the split } \alpha, \\ -\sum_{e \in E(T)} q_e & \text{if } \alpha = \emptyset, \\ 0 & \text{otherwise.} \end{cases}$$

The Hadamard conjugation specifies a relation between the expected sequence spectrum $s$ and the edge length spectrum $q$ of the tree as follows.

Proposition (Hendy and Penny 1993). Let $T$ be a phylogenetic tree on $n$ leaves with finite edge lengths ($q_e < \infty$ for all $e \in E(T)$). Assume that sites mutate according to a symmetric substitution model, with equal rates across sites. Let $s$ be the expected sequence spectrum. Then

$$s = s(q) = H_{n-1}^{-1} \exp(H_{n-1}\, q),$$

where the exponentiation function $\exp(x) = e^x$ is applied element-wise to the vector $\rho = H_{n-1}\, q$. That is, for $\alpha \subseteq \{1, \ldots, n-1\}$,

$$s_\alpha = 2^{-(n-1)} \sum_{\gamma} h_{\alpha,\gamma} \exp\Big( \sum_{\delta} h_{\gamma,\delta}\, q_\delta \Big).$$

The Hadamard conjugation for the two-state symmetric model was presented and proved by Hendy and Penny in [28].¹

¹ Some earlier versions of spectral analysis also appeared in [26, 27].
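As an illustration of the proposition, the sketch below evaluates $s = H_{n-1}^{-1} \exp(H_{n-1} q)$ in plain Python for a two-leaf tree and for a three-leaf star tree. The function names and the numerical edge lengths are made up for the example; the index order of the spectrum is $\emptyset, \{1\}, \{2\}, \{1,2\}$, with taxon 3 as the reference.

```python
import math
from typing import List


def sylvester(n: int) -> List[List[int]]:
    """Sylvester-type Hadamard matrix H_n of order 2^n."""
    h = [[1]]
    for _ in range(n):
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def mat_vec(m: List[List[int]], v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def expected_spectrum(q: List[float]) -> List[float]:
    """s = H^{-1} exp(H q), with exp applied element-wise and H^{-1} = H / len(q)."""
    size = len(q)
    n_bits = size.bit_length() - 1
    if 2 ** n_bits != size:
        raise ValueError("the edge length spectrum must have 2^(n-1) entries")
    H = sylvester(n_bits)
    rho = mat_vec(H, q)
    return [x / size for x in mat_vec(H, [math.exp(r) for r in rho])]


def star_tree_spectrum(q1: float, q2: float, q12: float) -> List[float]:
    """Edge length spectrum of the 3-leaf tree: edges e_1, e_2 and e_12 (the edge to taxon 3)."""
    return [-(q1 + q2 + q12), q1, q2, q12]


# Two leaves joined by one edge of length 0.3: spectrum indexed by the empty set and {1}.
two_leaf = expected_spectrum([-0.3, 0.3])

# Three leaves; the edge lengths are arbitrary illustrative values.
three_leaf = expected_spectrum(star_tree_spectrum(0.1, 0.2, 0.3))

print(two_leaf)
print(three_leaf, sum(three_leaf))
```

For the two-leaf tree the formula reduces to the familiar closed form: the two leaves differ with probability $(1 - e^{-2q})/2$. In both cases the entries sum to 1, because the first component of $H_{n-1} q$ is the sum of all entries of $q$, which is 0 by the definition of $q_\emptyset$.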

Definition. A vector $\hat{s} \in \mathbb{R}^{2^{n-1}}$ satisfying $\sum_{\alpha \subseteq \{1, \ldots, n-1\}} \hat{s}_\alpha = 1$ and $H\hat{s} > 0$ is called conservative.

For conservative data $\hat{s}$, the Hadamard conjugation is invertible, yielding

$$\gamma = \gamma(\hat{s}) = H_{n-1}^{-1} \ln(H\hat{s}),$$

where the $\ln$ function is applied element-wise to the vector $H\hat{s}$. We note that $\gamma$ is not necessarily the edge length spectrum of any tree. On the other hand, the expected sequence spectrum of any tree $T$ is always conservative.

2.3 Kimura's 3-substitution Model

In 1981 Kimura [38] introduced his 3 substitution type model of RNA nucleotide substitution (K3ST) (see Figure 2.2(a)), where he identified 3 classes of substitution: the transitions (with rate $\alpha$), the type I transversions (with rate $\beta$) and the type II transversions (with rate $\gamma$). Setting $\beta = \gamma$ gives his 2 substitution type (2ST) model [37], and setting $\alpha = \beta = \gamma$ gives the 1 parameter model of Jukes and Cantor [33]. We take the characters as the DNA or RNA nucleotides A, G, T (or U) and C. A particular advantage of K3ST is that the substitutions form a group acting on the set of nucleotides, and it is from this property that Hadamard conjugation can be derived.

In Figure 2.2(b) we illustrate the substitution types $t_\alpha = t_{01}$ for transitions, and $t_\beta = t_{10}$ and $t_\gamma = t_{11}$ for transversions. $t_\epsilon = t_{00}$ is the identity (no substitution). With the binary codes for the nucleotides (e.g. $(0, 1)$ for C, cf. Figure 2.2(a)), $t_{xy}(a, b) = (c, d)$, where $c \equiv a + x \pmod 2$ and $d \equiv b + y \pmod 2$.

Kimura identified $\alpha$, $\beta$ and $\gamma$ as three rate classes; however, we take our parameters to be the probabilities of each substitution type, $p_\alpha$, $p_\beta$ and $p_\gamma$, with $p_\epsilon = 1 - (p_\alpha + p_\beta + p_\gamma)$ being the probability of no substitution. These probabilities are recorded in a $2 \times 2$ probability matrix

$$P = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix}.$$

For each edge $e$ of a tree $T$, we set $P_e = P$, where each entry $p_\theta$ is the probability that the characters at the vertices are transformed by $t_\theta$. (Thus $X$ at one vertex is transformed to $t_\theta(X)$ at the other. The direction is irrelevant, as Kimura's model is symmetric.) From $P_e$ (provided $HP_eH > 0$) we derive the matrix

$$Q_e = H^{-1}[\mathrm{Ln}(HP_eH)]H^{-1} = \begin{bmatrix} -K & q_\alpha \\ q_\beta & q_\gamma \end{bmatrix}, \tag{2.2}$$

where

$$H = H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \tag{2.3}$$

is the $2 \times 2$ Hadamard matrix, having inverse $H^{-1} = \frac{1}{2}H$, and $\mathrm{Ln}$ is the natural logarithm function applied individually to each entry of the matrix $M = HP_eH$.

Figure 2.2: (a) Kimura's 3 substitution model (K3ST). (b) Substitution types $t_\alpha = t_{01}$, $t_\beta = t_{10}$, $t_\gamma = t_{11}$ and $t_\epsilon = t_{00}$.

The entries $q_\alpha$, $q_\beta$ and $q_\gamma$ of $Q_e$ are additive parameters which we will refer to as the three edge-length parameters. Their sum $q_\alpha + q_\beta + q_\gamma = K$ is the value Kimura refers to as evolutionary distance. If we assume a Poisson model of substitution with the three Kimura rates over time $t$, then we find that $q_\alpha = 2\alpha t$, $q_\beta = 2\beta t$ and $q_\gamma = 2\gamma t$ are the expected numbers of substitutions of each type occurring along $e$, and $R = \frac{1}{2t} Q$ is the rate matrix.

Equation (2.2) can be inverted, giving

$$P_e = H^{-1}[\mathrm{Exp}(HQ_eH)]H^{-1} = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix}, \tag{2.4}$$

where $\mathrm{Exp}$ refers to the exponential function applied individually to each entry of the matrix $HQ_eH$. We see

$$HQ_eH = -2 \begin{bmatrix} 0 & q_\alpha + q_\gamma \\ q_\beta + q_\gamma & q_\alpha + q_\beta \end{bmatrix}.$$

Hence

$$HP_eH = \begin{bmatrix} 1 & 1 - 2(p_\alpha + p_\gamma) \\ 1 - 2(p_\beta + p_\gamma) & 1 - 2(p_\alpha + p_\beta) \end{bmatrix} = \begin{bmatrix} 1 & e^{-2(q_\alpha + q_\gamma)} \\ e^{-2(q_\beta + q_\gamma)} & e^{-2(q_\alpha + q_\beta)} \end{bmatrix}, \tag{2.5}$$

and

$$P_e = \begin{bmatrix} p_\epsilon & p_\alpha \\ p_\beta & p_\gamma \end{bmatrix} = \frac{1}{4} \begin{bmatrix} 1 + e^{-2(q_\alpha + q_\gamma)} + e^{-2(q_\beta + q_\gamma)} + e^{-2(q_\alpha + q_\beta)} & 1 - e^{-2(q_\alpha + q_\gamma)} + e^{-2(q_\beta + q_\gamma)} - e^{-2(q_\alpha + q_\beta)} \\ 1 + e^{-2(q_\alpha + q_\gamma)} - e^{-2(q_\beta + q_\gamma)} - e^{-2(q_\alpha + q_\beta)} & 1 - e^{-2(q_\alpha + q_\gamma)} - e^{-2(q_\beta + q_\gamma)} + e^{-2(q_\alpha + q_\beta)} \end{bmatrix}. \tag{2.6}$$

In Figure 2.2(b), a reference nucleotide $\Sigma$ can be inserted at $(0, 0)$; then $t_\theta(\Sigma)$ occurs at the point $(a, b)$, where $(0, 1) = \alpha$, $(1, 0) = \beta$ and $(1, 1) = \gamma$. The set of substitutions $\{t_{ab} \mid a, b \in \mathbb{Z}_2\}$ is a group acting on the nucleotides.

The $i$th site of the sequence $\sigma_j$ is the $i$th character $\Sigma_{ij}$ of $\sigma_j$. Selecting $\sigma_0$ as the reference sequence, the $i$th site pattern is a vector of $n$ substitution types, whose $j$th component is the substitution which transforms $\Sigma_{i0}$ to $\Sigma_{ij}$. This site pattern can be referenced as a pair $(C, D)$ of subsets of $X = \{1, \ldots, n\}$, where

$$C = \{j : \Sigma_{ij} = t_\theta(\Sigma_{i0}),\ \theta = \alpha \text{ or } \gamma\}, \qquad D = \{j : \Sigma_{ij} = t_\theta(\Sigma_{i0}),\ \theta = \beta \text{ or } \gamma\}. \tag{2.7}$$

Table 2.1(a) illustrates four sample DNA sequences with sixteen sites. $\sigma_0$ is the reference sequence; the pair of binary digits above each character of $\sigma_1, \sigma_2, \sigma_3$ is the substitution type that derives that character from the homologous character of $\sigma_0$. For example, the entry 11 above G at site #10 of $\sigma_3$ indicates that the substitution to this nucleotide from the corresponding T of the reference sequence $\sigma_0$ is of type $t_\gamma$. In (b), the frequencies of each of the site patterns from (a) are summarized in the observed sequence spectrum $F$. The rows of $F$ are indexed by the triple of first digits of the binary pairs, and the columns by the triple of second digits, in the order 000, 100, 010, 110, 001, 101, 011, 111. The site pattern of site #10 is represented by the pair (011, 001), so the entry corresponding to it is in row 011 and column 001 of $F$. As this pattern occurs only at site #10, the entry in row 011 and column 001 of $F$ is 1 (highlighted in bold font).

2.4 Hadamard Conjugation of K3ST on a Phylogenetic Tree

In this section we prove the Hadamard conjugation for Kimura's substitution models. Although the transformation is identical to the case of the two-state model, the expected sequence spectrum is defined substantially differently, as is the edge length spectrum.

In the K3ST model, for each edge $e_A$ of $T$, instead of one length, we specify three edge-length parameters $q_{e_A}(\alpha)$, $q_{e_A}(\beta)$ and $q_{e_A}(\gamma)$. Using Equation (2.4) we can express the probabilities of each type of substitution across $e_A$ as functions of the edge lengths,

$$P_{e_A} = H^{-1}(\mathrm{Exp}(HQ_{e_A}H))H^{-1},$$

(a)
site #  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
σ0 =    A  C  A  G  T  A  G  T  G  T  T  A  C  C  A  G
σ1 =    A  C  A  G  C  A  A  T  G  T  T  A  T  C  T  C
σ2 =    C  C  A  T  T  G  A  A  G  A  T  G  C  G  T  T
σ3 =    C  C  A  T  C  A  A  A  C  G  T  G  T  G  A  C

Table 2.1: (a) Four aligned sequences with sixteen sites. (b) The corresponding observed sequence spectrum $F$.
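The construction of the observed sequence spectrum from Table 2.1(a) can be sketched as follows. The nucleotide coding T = (0, 0), C = (0, 1), A = (1, 0), G = (1, 1) is our reading of Figure 2.2 (it agrees with the stated code (0, 1) for C and with the worked example at site #10); the names `CODE`, `site_pattern` and `ORDER` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Binary codes for the nucleotides (assumed from Figure 2.2; U is treated as T).
CODE: Dict[str, Tuple[int, int]] = {
    "T": (0, 0), "U": (0, 0), "C": (0, 1), "A": (1, 0), "G": (1, 1),
}

# The four aligned sequences of Table 2.1(a); the first one is the reference.
REFERENCE = "ACAGTAGTGTTACCAG"
OTHERS = ["ACAGCAATGTTATCTC", "CCATTGAAGATGCGTT", "CCATCAAACGTGTGAC"]


def site_pattern(site: int) -> Tuple[str, str]:
    """Return (first digits, second digits) of the substitution types at a 0-based site."""
    ref = CODE[REFERENCE[site]]
    first, second = "", ""
    for seq in OTHERS:
        cur = CODE[seq[site]]
        # t_xy maps (a, b) to (a + x, b + y) mod 2, so (x, y) is the bitwise XOR.
        first += str(ref[0] ^ cur[0])
        second += str(ref[1] ^ cur[1])
    return first, second


patterns: List[Tuple[str, str]] = [site_pattern(i) for i in range(len(REFERENCE))]
counts = Counter(patterns)

# Row and column order used for F in the text.
ORDER = ["000", "100", "010", "110", "001", "101", "011", "111"]
F = [[counts[(row, col)] for col in ORDER] for row in ORDER]

print(patterns[9])  # the pattern of site #10
for row_label, row in zip(ORDER, F):
    print(row_label, row)
```

Rows of `F` follow the first digits of the binary pairs and columns follow the second digits, in the order given in the text, so the entry for site #10 is `F[ORDER.index("011")][ORDER.index("001")]`.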

where

$$P_{e_A} = \begin{bmatrix} p_{e_A}(\epsilon) & p_{e_A}(\alpha) \\ p_{e_A}(\beta) & p_{e_A}(\gamma) \end{bmatrix} \quad \text{and} \quad Q_{e_A} = \begin{bmatrix} -K_{e_A} & q_{e_A}(\alpha) \\ q_{e_A}(\beta) & q_{e_A}(\gamma) \end{bmatrix}.$$

From these equations we can derive the transition matrix

$$P(e_A) = \begin{bmatrix} p_{e_A}(\epsilon) & p_{e_A}(\alpha) & p_{e_A}(\beta) & p_{e_A}(\gamma) \\ p_{e_A}(\alpha) & p_{e_A}(\epsilon) & p_{e_A}(\gamma) & p_{e_A}(\beta) \\ p_{e_A}(\beta) & p_{e_A}(\gamma) & p_{e_A}(\epsilon) & p_{e_A}(\alpha) \\ p_{e_A}(\gamma) & p_{e_A}(\beta) & p_{e_A}(\alpha) & p_{e_A}(\epsilon) \end{bmatrix}.$$

Further, if $W = \{e_A, e_B, \ldots, e_C\}$ is a set of edges of $T$, then the product of transition matrices

$$P(W) = P(e_A)P(e_B) \cdots P(e_C) = \begin{bmatrix} p_W(\epsilon) & p_W(\alpha) & p_W(\beta) & p_W(\gamma) \\ p_W(\alpha) & p_W(\epsilon) & p_W(\gamma) & p_W(\beta) \\ p_W(\beta) & p_W(\gamma) & p_W(\epsilon) & p_W(\alpha) \\ p_W(\gamma) & p_W(\beta) & p_W(\alpha) & p_W(\epsilon) \end{bmatrix}$$

is the transition matrix of the probabilities $p_W(\theta)$ that the substitutions multiplied across the edges of $W$ are of type $\theta$. Then we see from Equation (2.5) that

$$P_W = \begin{bmatrix} p_W(\epsilon) & p_W(\alpha) \\ p_W(\beta) & p_W(\gamma) \end{bmatrix} = H^{-1}[\mathrm{Exp}(HQ_WH)]H^{-1}, \tag{2.8}$$

where $Q_W = \sum_{e_A \in W} Q_{e_A}$.

The edge-length parameters for the full set of edges of $T$ can be collected in three vectors $q_\alpha$, $q_\beta$ and $q_\gamma$, whose components are indexed by the $2^{n-1}$ subsets of $X'$. For $A \subseteq X'$ not an edge split, we set $q_A(\theta) = 0$ ($\theta \in \{\alpha, \beta, \gamma\}$), except for $A = \emptyset$, where $q_\emptyset(\theta) = -K(\theta)$ and $K(\theta)$ is the sum of all other $q_A(\theta)$ values.

Figure 2.3(a) shows the edge length spectra for the tree $T_{13}$ on $n = 4$ taxa illustrated in Figure 2.1. Corresponding components of the vectors $q_\alpha$, $q_\beta$, $q_\gamma$ give the three edge-length parameters for the corresponding edge. The value 0 indicates that there is no corresponding edge in $T$ (e.g. $q_{12}$ in $T_{13}$).

We will find it convenient to put these three vectors into a matrix $Q\,(= Q_T) = [q_{A,B}]$ of $2^{n-1}$ rows and columns indexed by the subsets of $X'$, with $q_{\emptyset,\emptyset} = -(K(\alpha) + K(\beta) + K(\gamma))$, with the remaining entries of $q_\alpha$, $q_\beta$ and $q_\gamma$ becoming the leading row, leading column and main diagonal of $Q$ respectively, and all other entries set to 0. Figure 2.3(b) shows the matrix $Q = Q_{T_{13}}$ holding the vectors $q_\alpha$, $q_\beta$, $q_\gamma$ from Figure 2.3(a).
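Equation (2.8) says that the edge-length matrices add along a set of edges while the $4 \times 4$ transition matrices multiply. The following plain-Python sketch checks this numerically for two edges. The edge lengths are arbitrary illustrative values, and `edge_probabilities` and `transition_matrix` are hypothetical helper names.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]

H1: Matrix = [[1, 1], [1, -1]]  # the 2 x 2 Hadamard matrix; its inverse is H1 / 2


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def edge_probabilities(q_alpha: float, q_beta: float, q_gamma: float) -> Matrix:
    """P_e = H^{-1} Exp(H Q_e H) H^{-1}, returned as [[p_eps, p_alpha], [p_beta, p_gamma]]."""
    total = q_alpha + q_beta + q_gamma
    Q = [[-total, q_alpha], [q_beta, q_gamma]]
    inner = matmul(matmul(H1, Q), H1)
    exp_inner = [[math.exp(x) for x in row] for row in inner]
    outer = matmul(matmul(H1, exp_inner), H1)
    return [[x / 4 for x in row] for row in outer]  # two factors of H^{-1} = H / 2


def transition_matrix(P: Matrix) -> Matrix:
    """4 x 4 matrix over (eps, alpha, beta, gamma); entry (i, j) is p of the type i XOR j."""
    flat = [P[0][0], P[0][1], P[1][0], P[1][1]]
    return [[flat[i ^ j] for j in range(4)] for i in range(4)]


# Illustrative edge-length parameters (q_alpha, q_beta, q_gamma) for two edges.
qa: Tuple[float, float, float] = (0.10, 0.02, 0.03)
qb: Tuple[float, float, float] = (0.05, 0.04, 0.01)

Pa = edge_probabilities(*qa)
Pb = edge_probabilities(*qb)

# Left side: multiply the transition matrices of the two edges.
product = matmul(transition_matrix(Pa), transition_matrix(Pb))

# Right side: add the edge-length parameters, then apply Equation (2.8).
P_sum = edge_probabilities(*(x + y for x, y in zip(qa, qb)))

print(product[0])
print([P_sum[0][0], P_sum[0][1], P_sum[1][0], P_sum[1][1]])
```

The first row of the product and the four entries of `P_sum` should agree up to rounding, which is the additivity that makes $q_\alpha$, $q_\beta$ and $q_\gamma$ behave as lengths.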

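(a)

$$q_\alpha = \begin{bmatrix} -K(\alpha) \\ q_1(\alpha) \\ q_2(\alpha) \\ 0 \\ q_3(\alpha) \\ q_{13}(\alpha) \\ 0 \\ q_{123}(\alpha) \end{bmatrix}, \quad q_\beta = \begin{bmatrix} -K(\beta) \\ q_1(\beta) \\ q_2(\beta) \\ 0 \\ q_3(\beta) \\ q_{13}(\beta) \\ 0 \\ q_{123}(\beta) \end{bmatrix}, \quad q_\gamma = \begin{bmatrix} -K(\gamma) \\ q_1(\gamma) \\ q_2(\gamma) \\ 0 \\ q_3(\gamma) \\ q_{13}(\gamma) \\ 0 \\ q_{123}(\gamma) \end{bmatrix}$$

(b)

$$Q_T = \begin{bmatrix} -K & q_1(\alpha) & q_2(\alpha) & 0 & q_3(\alpha) & q_{13}(\alpha) & 0 & q_{123}(\alpha) \\ q_1(\beta) & q_1(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ q_2(\beta) & \cdot & q_2(\gamma) & \cdot & \cdot & \cdot & \cdot & \cdot \\ 0 & \cdot & \cdot & 0 & \cdot & \cdot & \cdot & \cdot \\ q_3(\beta) & \cdot & \cdot & \cdot & q_3(\gamma) & \cdot & \cdot & \cdot \\ q_{13}(\beta) & \cdot & \cdot & \cdot & \cdot & q_{13}(\gamma) & \cdot & \cdot \\ 0 & \cdot & \cdot & \cdot & \cdot & \cdot & 0 & \cdot \\ q_{123}(\beta) & \cdot & \cdot & \cdot & \cdot & \cdot & \cdot & q_{123}(\gamma) \end{bmatrix}$$

Figure 2.3: (a) Example edge length spectra for the tree $T_{13}$. (b) $Q = Q_{T_{13}}$.

This means that for $A, B \subseteq X' = \{1, 2, 3\}$, $Q_{\emptyset,B} = q_B(\alpha)$, $Q_{A,\emptyset} = q_A(\beta)$, $Q_{A,A} = q_A(\gamma)$, and for all other entries $Q_{A,B} = 0$, except the first entry $Q_{\emptyset,\emptyset} = -K$, where $K = K(\alpha) + K(\beta) + K(\gamma)$. The entries indicated by "$\cdot$" are zero for every tree. The entries indicated by "0" are zero for this tree $T$, but can be non-zero for different trees. The non-zero entries (in the leading row, leading column and main diagonal) should each be in the same component, and these identify the edge splits of $T$. For general trees on $n$ taxa, the edge length spectra are vectors and matrices of order $2^{n-1}$.

For any $A \subseteq X'$, let $\Pi_A = \{e_B \mid h_{A,B} = -1\}$. It is easily shown that if $A$ contains $2m$ (or $2m - 1$) elements, $\Pi_A$ is the disjoint union of $m$ paths connecting disjoint pairs of leaves in $A$ (or in $A \cup \{n\}$). When $T$ is binary, these paths are disjoint. We will refer to $\Pi_A$ as the pathset generated by $A$. Consider the $A$th component of the product $Hq_\alpha$: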

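$$\begin{aligned} (Hq_\alpha)_A &= \sum_{D \subseteq X'} (-1)^{|A \cap D|} q_D(\alpha) = q_\emptyset(\alpha) + \sum_{D : e_D \in E(T)} (-1)^{|A \cap D|} q_D(\alpha) \\ &= -\sum_{D : e_D \in E(T)} q_D(\alpha) + \sum_{D : e_D \in E(T)} (-1)^{|A \cap D|} q_D(\alpha) \\ &= -2 \sum_{D : h_{A,D} = -1} q_D(\alpha) = -2 \sum_{D : e_D \in \Pi_A} q_D(\alpha). \end{aligned} \tag{2.9}$$

For any subset $W$ of the edge set of $T$, let $d_W(\theta) = \sum_{e \in W} q_e(\theta)$ be the $\theta$ length of $W$, for $\theta \in \{\alpha, \beta, \gamma\}$. Thus in particular, from Equation (2.9),

$$(Hq_\theta)_A = -2 d_{\Pi_A}(\theta).$$

Similarly we find, for $A, B$ subsets of $X = \{1, 2, \ldots, n\}$, that the $(A, B)$th component of $HQH$ is

$$(HQH)_{A,B} = -2\big(d_{\Pi_A}(\alpha) + d_{\Pi_B}(\beta) + d_{\Pi_C}(\gamma)\big),$$

where $C = A \triangle B$ is the symmetric difference of $A$ and $B$. Let $U = \Pi_A \cap \Pi_C$, $V = \Pi_B \cap \Pi_C$ and $W = \Pi_A \cap \Pi_B$, so $\Pi_A \cup \Pi_B$ is partitioned into the disjoint subsets $U, V, W$, with $\Pi_A = U \cup W$, $\Pi_B = V \cup W$ and $\Pi_C = U \cup V$. We see

$$e^{-2d_{\Pi_A}(\alpha)} = e^{-2(d_U(\alpha) + d_W(\alpha))},$$

so

$$\begin{aligned} (\mathrm{Exp}(HQH))_{A,B} &= e^{-2d_{\Pi_A}(\alpha)}\, e^{-2d_{\Pi_B}(\beta)}\, e^{-2d_{\Pi_C}(\gamma)} \\ &= e^{-2(d_U(\alpha) + d_W(\alpha))}\, e^{-2(d_V(\beta) + d_W(\beta))}\, e^{-2(d_U(\gamma) + d_V(\gamma))}. \end{aligned} \tag{2.10}$$

Now, from Equation (2.5) we see

$$e^{-2(d_U(\alpha) + d_U(\gamma))} = 1 - 2(p_U(\alpha) + p_U(\gamma)) = p_U(\epsilon) - p_U(\alpha) + p_U(\beta) - p_U(\gamma),$$

so that Equation (2.10) becomes

$$\begin{aligned} (\mathrm{Exp}(HQH))_{A,B} = {} & [p_U(\epsilon) - p_U(\alpha) + p_U(\beta) - p_U(\gamma)] \\ & \times [p_V(\epsilon) + p_V(\alpha) - p_V(\beta) - p_V(\gamma)] \\ & \times [p_W(\epsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma)]. \end{aligned} \tag{2.11}$$

When the factors of Equation (2.11) are expanded we obtain 64 terms, each of the form

$$\pm\, p_U(\theta_1)\, p_V(\theta_2)\, p_W(\theta_3). \tag{2.12}$$

Further we observe

$$p_{\Pi_A}(\alpha)\, p_{\Pi_B}(\beta) = \sum_{\theta} p_U(\alpha\theta)\, p_V(\beta\theta)\, p_W(\theta),$$

where $\theta$ is summed over the elements of $\{\epsilon, \alpha, \beta, \gamma\}$. The corresponding four terms of Equation (2.11) each occur with a $+$ sign. However, each of the four terms corresponding to

$$p_{\Pi_A}(\alpha)\, p_{\Pi_B}(\alpha) = \sum_{\theta} p_U(\alpha\theta)\, p_V(\alpha\theta)\, p_W(\theta)$$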

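occur in Equation (2.11) with a $-$ sign. Continuing this observation with other terms, we find that the coefficient of $p_{\Pi_A}(\psi)\, p_{\Pi_B}(\varphi)$ is $a(A, \psi)\, b(B, \varphi)$, where

$$a(A, \psi) = \begin{cases} 1 & \text{when } \psi = \epsilon, \beta, \\ -1 & \text{when } \psi = \alpha, \gamma, \end{cases} \qquad b(B, \varphi) = \begin{cases} 1 & \text{when } \varphi = \epsilon, \alpha, \\ -1 & \text{when } \varphi = \beta, \gamma. \end{cases}$$

Thus

$$(\mathrm{Exp}(HQH))_{A,B} = \sum_{\psi, \varphi} a(A, \psi)\, p_{\Pi_A}(\psi)\, b(B, \varphi)\, p_{\Pi_B}(\varphi). \tag{2.13}$$

The characters $\chi_{i1}, \ldots, \chi_{in}$ at the $i$th site partition $X$ into (up to four) subsets by their states. These patterns can be indexed by a pair $D, E \subseteq X$, if we set

$$D = \{j : \chi_{ij} = t_\theta(\chi_{i0}) \text{ for } \theta = \alpha, \gamma\}, \qquad E = \{j : \chi_{ij} = t_\theta(\chi_{i0}) \text{ for } \theta = \beta, \gamma\},$$

so that

$$\begin{aligned} \{j : \chi_{ij} = t_\epsilon(\chi_{i0})\} &= X - (D \cup E), \\ \{j : \chi_{ij} = t_\alpha(\chi_{i0})\} &= D - E, \\ \{j : \chi_{ij} = t_\beta(\chi_{i0})\} &= E - D, \\ \{j : \chi_{ij} = t_\gamma(\chi_{i0})\} &= D \cap E. \end{aligned}$$

Let $S_{DE}$ be the probability of obtaining a site with such a pattern. Then we see $a(A, \psi) = h_{AD}$ and $b(B, \varphi) = h_{BE}$, so that Equation (2.13) can be rewritten as

$$(\mathrm{Exp}(HQH))_{A,B} = \sum_{D,E} h_{DU}\, h_{EV}\, h_{(D \triangle E)W}\, S_{DE} = \sum_{D,E} h_{AD}\, h_{BE}\, S_{DE} = (HSH)_{AB},$$

which gives us our major result.

Theorem.

$$S = H^{-1}(\mathrm{Exp}(HQH))H^{-1},$$

which, provided the arguments of the $\mathrm{Ln}$ function are all positive, inverts to

$$Q = H^{-1}(\mathrm{Ln}(HSH))H^{-1}.$$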
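The theorem can be exercised numerically. The sketch below builds $Q$ for a three-taxon tree (non-reference taxa 1 and 2, so $Q$ is $4 \times 4$), applies $S = H^{-1}\,\mathrm{Exp}(HQH)\,H^{-1}$, and then recovers $Q$ with the inverse formula. The edge lengths are arbitrary illustrative values, the helper names are hypothetical, and $Q$ is laid out as described for Figure 2.3(b): $\alpha$ lengths in the leading row, $\beta$ lengths in the leading column, $\gamma$ lengths on the diagonal.

```python
import math
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]


def sylvester(n: int) -> List[List[int]]:
    """Sylvester-type Hadamard matrix H_n of order 2^n."""
    h = [[1]]
    for _ in range(n):
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def conjugate(M: Matrix, f: Callable[[float], float]) -> Matrix:
    """H^{-1} f(H M H) H^{-1}, with f applied entry-wise and H^{-1} = H / order."""
    size = len(M)
    H = sylvester(size.bit_length() - 1)
    inner = matmul(matmul(H, M), H)
    mapped = [[f(x) for x in row] for row in inner]
    outer = matmul(matmul(H, mapped), H)
    return [[x / size ** 2 for x in row] for row in outer]


def edge_length_matrix(lengths: Dict[int, Tuple[float, float, float]], m: int) -> Matrix:
    """Q with alpha in the leading row, beta in the leading column, gamma on the diagonal."""
    size = 2 ** m
    Q = [[0.0] * size for _ in range(size)]
    for subset, (q_alpha, q_beta, q_gamma) in lengths.items():
        Q[0][subset] = q_alpha
        Q[subset][0] = q_beta
        Q[subset][subset] = q_gamma
        Q[0][0] -= q_alpha + q_beta + q_gamma  # Q[0][0] accumulates -K
    return Q


# Edges e_1, e_2 and e_12 of the three-taxon tree, keyed by subset bitmask.
lengths = {
    0b01: (0.10, 0.02, 0.03),
    0b10: (0.05, 0.04, 0.01),
    0b11: (0.08, 0.03, 0.02),
}

Q = edge_length_matrix(lengths, 2)
S = conjugate(Q, math.exp)       # S = H^{-1} Exp(HQH) H^{-1}
Q_back = conjugate(S, math.log)  # Q = H^{-1} Ln(HSH) H^{-1}
total = sum(sum(row) for row in S)

print(total)
print(max(abs(Q[i][j] - Q_back[i][j]) for i in range(4) for j in range(4)))
```

The entries of `S` sum to 1 because the $(\emptyset, \emptyset)$ entry of $HQH$ is the sum of all entries of $Q$, which is 0 by the choice $Q_{\emptyset,\emptyset} = -K$. The round trip succeeds here because $HSH = \mathrm{Exp}(HQH)$ is entry-wise positive.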


Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions
arXiv:q-bio/0505054v1 [q-bio.PE] 27 May 2005
Benny Chor, Michael D. Hendy, Sagi Snir
December 21, 2017
Abstract
Complex systems of polynomial