IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 3, JULY-SEPTEMBER 2008

Hadamard Conjugation for the Kimura 3ST Model: Combinatorial Proof Using Path Sets

Michael D. Hendy and Sagi Snir

Abstract: Under a stochastic model of molecular sequence evolution, the probability of each possible pattern of characters is well defined. Kimura's three-substitution-types (K3ST) model of evolution allows an analytical expression for these probabilities, by means of the Hadamard conjugation, as a function of the phylogeny $T$ and the substitution probabilities on each edge of $T$. In this paper, we produce a direct combinatorial proof of these results using path-set distances, which generalize pairwise distances between sequences. This interpretation provides us with tools that have proved useful in related problems in the mathematical analysis of sequence evolution.

Index Terms: Hadamard conjugation, K3ST model, path-sets, phylogenetic trees, phylogenetic invariants.

1 INTRODUCTION

Hadamard conjugation is an analytic formulation of the relationship between the probabilities of expected site patterns of nucleotides for a set of homologous nucleotide sequences and the parameters of some simple models of sequence evolution on a proposed phylogeny $T$. These relations provide a theoretical tool for analyzing properties of phylogenetic inference methods such as maximum likelihood and maximum parsimony, for generating simulated data, and for determining phylogenetic invariants. Hadamard conjugation can also be used directly for phylogenetic inference, inferring either trees with the Closest Tree algorithm [11], [5] or networks using Spectronet [18]. The Hadamard conjugation was applied to maximum likelihood phylogenetic inference under Kimura's three-substitution-types (K3ST) model in [5], and, in a related problem, phylogenetic invariants were used to reconstruct quartet trees under a generalized variant of K3ST [4].
Hadamard conjugation was first introduced in 1989 [10], [1] to analyze two-state character sequences evolving under the Neyman model []. Evans and Speed [9] noted that the K3ST model [1] for 4-state characters could be modeled by the Klein group $\mathbb{Z}_2 \times \mathbb{Z}_2$. Noting this, Székely et al. [8], [9] extended the two-state analysis to a more general algebraic theory, in which substitutions belong to an arbitrary Abelian group. They then applied this to sequences evolving under the K3ST model. Current applications of Closest Tree and Spectronet [18] usually use the 4-state K3ST model or its derivatives, the K2ST and Jukes-Cantor models.

M.D. Hendy is with the Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Private Bag 11, Palmerston North 4410, New Zealand. E-mail: m.hendy@massey.ac.nz.
S. Snir is with the Mathematics Department, University of California, Berkeley, Berkeley, CA. E-mail: ssagi@math.berkeley.edu.
Manuscript received 5 Oct. 00; revised 0 Mar. 00; accepted 1 May 00; published online June 00.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org.

A path-set in a phylogenetic tree $T$ is a generalization of the concept of a path. This approach allows the concept of pairwise distances between sequences to be extended to distances connecting larger sets of taxa. It provides properties that can be related to other evolutionary phenomena such as the molecular clock hypothesis. This has, for example, proved pivotal in allowing a simpler analytic expression of the likelihood function, as developed in [5], leading to an algebraic solution for the maximum likelihood points. We demonstrate this use, as well as the relation to the molecular clock property, in our last section, which describes the application of the Hadamard conjugation used in [5].
It has also proved useful in identifying phylogenetic invariants [15], [], [4] and in introducing the projected spectra [0], which reduce both the variance in the parameter estimates and the computational complexity of the Closest Tree algorithm [11].

All the above examples rely on relationships between the phylogenetic tree and the probabilities of obtaining sequences evolved on that tree. These relationships were proved in the past by algebraic tools for more general models of evolution. However, under K3ST, these relationships can be expressed as identities between expressions in the tree parameters and expressions in the sequence probabilities. These relationships were outlined in []. Here, we provide a self-contained, more rigorous proof that bears some resemblance (Section 8) to the sketch in []. However, that outline lacks the details of the combinatorial properties of the intermediate variables, which we find to be of interest. Therefore, our proof serves as a more intuitive alternative to the presentations in [1] and [].

We model the relationship of the differences of $n$ sequences labeled by elements of $[n] = \{1, 2, \ldots, n\}$ from a reference sequence labeled 0 (note that $0 \notin [n]$). Because the models are reversible, the choice of reference sequence is arbitrary. The topology of $T$ and the model parameters are presented in a sparse matrix $Q_T$ of $2^n$ rows and columns, called the edge-length spectrum. The probabilities of each site pattern are presented in a similar sized matrix $S_T$, called the sequence probability spectrum. We also define a Hadamard matrix $H_n$ of $2^n$ rows and columns and show that the matrix products

$$H_n Q_T H_n, \qquad H_n S_T H_n$$

both relate to properties of path sets. We prove the major result by interpreting corresponding components of each entry of these matrices. In particular, we show that the $(E, F)$ entry in both matrices corresponds to certain evolutionary distances defined by path sets $\Pi_E$ and $\Pi_F$. We were motivated to provide this new proof because these variables served as the defining parameters in the likelihood equation in [5], while their biological interpretation has not been elaborated sufficiently.

We start by describing the K3ST model over a single edge and then generalize it to a set of edges in a tree. Next, we introduce the notion of a site pattern and the matrix $S_T$. In Sections 5 and 6, we introduce the Hadamard matrices and path sets and the relationship between them. Section 8 is the main part of the proof, where we show the relationship between corresponding entries of the matrices $H_n Q_T H_n$ and $H_n S_T H_n$. We end by describing the derivation of the equations used in [5], leading to an analytical solution of the ML problem. We believe this is an important contribution that can serve the burgeoning area of algebraic statistics in biology, and phylogenetics in particular (see, e.g., [1], [], [], [], [4], and []).

© 2008 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

2 KIMURA'S 3ST MODEL

In this section, we first describe the K3ST model. We then derive identities relating the substitution matrix $M$ and the matrix of expected numbers of substitutions along each edge. Finally, we encode these relationships by means of two simpler matrices $P$ and $Q$, and the Hadamard matrix $H_1$.

K3ST [1] specified independent rates for each of the substitutions between pairs of RNA or DNA nucleotides. Here, we will refer to Kimura's three substitution rates as $\alpha$, $\beta$, and $\gamma$, and use $t_\alpha$, $t_\beta$, and $t_\gamma$ to refer to the substitution types, as illustrated in Fig. 1. These are defined formally as

- $t_\alpha$: the substitutions A ↔ G, U(T) ↔ C (transitions).
- $t_\beta$: the substitutions A ↔ U(T), G ↔ C (transversions of type $\beta$).
- $t_\gamma$: the substitutions A ↔ C, U(T) ↔ G (transversions of type $\gamma$).

Fig. 1. K3ST, showing the three substitution types, $t_\alpha$, $t_\beta$, and $t_\gamma$, applied either to the RNA nucleotides A, C, G, and U or to the DNA nucleotides A, C, G, and T.

By including the identity transformation $\epsilon$, we find that the set of substitution types

$$\mathcal{T} = \{\epsilon, t_\alpha, t_\beta, t_\gamma\}$$

is a group under composition, acting on the nucleotide set {A, C, G, U(T)}. Thus, for example, $t_\alpha(t_\beta(C)) = t_\alpha(G) = A = t_\gamma(C)$, so $t_\alpha \circ t_\beta = t_\gamma$. Consider the maps $g_1, g_2 : \mathcal{T} \to C_2 = \{1, -1\}$ defined by

$$\begin{aligned} g_1 &: \epsilon \mapsto 1, \quad t_\alpha \mapsto 1, \quad t_\beta \mapsto -1, \quad t_\gamma \mapsto -1, \\ g_2 &: \epsilon \mapsto 1, \quad t_\alpha \mapsto -1, \quad t_\beta \mapsto 1, \quad t_\gamma \mapsto -1. \end{aligned} \tag{1}$$

We find that $g_1$ and $g_2$ are both homomorphisms from $(\mathcal{T}, \circ)$ onto the 2-group $(C_2, \times)$, and the map

$$g : \theta \mapsto (g_1(\theta), g_2(\theta)), \quad \theta \in \mathcal{T},$$

is an isomorphism onto the group $(C_2 \times C_2, \times)$.

Observation 1. $(\mathcal{T}, \circ)$ is isomorphic to the Klein 4-group, $(C_2 \times C_2, \times)$.

In contrast, the sets of substitutions of the K2ST model and of the Jukes-Cantor model do not form groups, as products are not well defined (for example, a product of two transversions in K2ST could be either a transition or the identity). A related but different aspect is the property of generalization/specialization between models. For example, we can specialize from K3ST down to each of these models by imposing restrictions on parameters (for example, if the expected numbers of transitions and of transversions of each type are equated, then K3ST specializes to the Jukes-Cantor model). A different restriction on the values of the model parameters is imposed by the molecular clock constraint; however, this is beyond the scope of this work (see, e.g., [1] and []).

Kimura modeled the expected differences between two sequences separated by time $t$.
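With the three specified rates, the expected numbers of substitutions of each type are therefore

$$q(t_\alpha) = \alpha t, \qquad q(t_\beta) = \beta t, \qquad q(t_\gamma) = \gamma t.$$

By setting $\beta = \gamma$, this model projects to K2ST, Kimura's better known two-substitution-type model [0]. Setting $\alpha = \beta = \gamma$ gives the simple Jukes-Cantor model [19]. The probabilities $p(t_\alpha)$, $p(t_\beta)$, and $p(t_\gamma)$ of observing differences of each type over the time period $t$ underestimate $q(t_\alpha)$, $q(t_\beta)$, and $q(t_\gamma)$, as multiple changes are not directly observed. Observed frequencies of differences can be taken as estimates of $p(\theta)$ for $\theta \in \{t_\alpha, t_\beta, t_\gamma\}$.

In [1], Kimura derived expressions for the expected numbers $q(\theta)$ as functions of the probabilities $p(\theta)$. These are equivalent to the standard expression of the stochastic matrix $M$, derived from the rate matrix

$$R = \begin{pmatrix} -(\alpha+\beta+\gamma) & \alpha & \beta & \gamma \\ \alpha & -(\alpha+\beta+\gamma) & \gamma & \beta \\ \beta & \gamma & -(\alpha+\beta+\gamma) & \alpha \\ \gamma & \beta & \alpha & -(\alpha+\beta+\gamma) \end{pmatrix}$$

over time $t$, so that (e.g., see [])

$$M = \exp(Rt), \tag{2}$$

where

$$M = \begin{pmatrix} p(\epsilon) & p(t_\alpha) & p(t_\beta) & p(t_\gamma) \\ p(t_\alpha) & p(\epsilon) & p(t_\gamma) & p(t_\beta) \\ p(t_\beta) & p(t_\gamma) & p(\epsilon) & p(t_\alpha) \\ p(t_\gamma) & p(t_\beta) & p(t_\alpha) & p(\epsilon) \end{pmatrix}, \qquad Rt = \begin{pmatrix} -K & q(t_\alpha) & q(t_\beta) & q(t_\gamma) \\ q(t_\alpha) & -K & q(t_\gamma) & q(t_\beta) \\ q(t_\beta) & q(t_\gamma) & -K & q(t_\alpha) \\ q(t_\gamma) & q(t_\beta) & q(t_\alpha) & -K \end{pmatrix},$$

$K = q(t_\alpha) + q(t_\beta) + q(t_\gamma)$ is the total expected number of substitutions, and exp is the standard exponential function for square matrices. We note that as $R$ and $t$ are fixed, so too are $M$, $p$, $q$, and $K$, as they are defined by them.

Let $H$ be the $4 \times 4$ Hadamard matrix

$$H = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}.$$

Observation 2. $H$ diagonalizes both $M$ and $Rt$. In particular,

$$H^{-1} M H = \mathrm{diag}\bigl(1,\; 1 - 2(p(t_\alpha)+p(t_\gamma)),\; 1 - 2(p(t_\beta)+p(t_\gamma)),\; 1 - 2(p(t_\alpha)+p(t_\beta))\bigr)$$

and

$$H^{-1} Rt\, H = \mathrm{diag}\bigl(0,\; -2(q(t_\alpha)+q(t_\gamma)),\; -2(q(t_\beta)+q(t_\gamma)),\; -2(q(t_\alpha)+q(t_\beta))\bigr).$$

Recall that the exponential of a matrix is a power series, so

$$\exp(H^{-1} Rt\, H) = \sum_{n \ge 0} \frac{(H^{-1} Rt\, H)^n}{n!} = H^{-1} \left( \sum_{n \ge 0} \frac{(Rt)^n}{n!} \right) H = H^{-1} \exp(Rt) H.$$

As $\exp(H^{-1} Rt\, H)$ is diagonal, so too is $H^{-1} \exp(Rt) H$, with entries

$$1 = e^0, \quad e^{-2(q(t_\alpha)+q(t_\gamma))}, \quad e^{-2(q(t_\beta)+q(t_\gamma))}, \quad e^{-2(q(t_\alpha)+q(t_\beta))}.$$

Now, using (2), we observe $H^{-1} M H = H^{-1} \exp(Rt) H$. Equating the diagonal entries shows that the eigenvalues of $M$ and $\exp(Rt)$ are

$$\begin{aligned} 1 &= p(\epsilon)+p(t_\alpha)+p(t_\beta)+p(t_\gamma) = e^0 = e^{-K+q(t_\alpha)+q(t_\beta)+q(t_\gamma)}, \\ 1 - 2(p(t_\alpha)+p(t_\gamma)) &= p(\epsilon)-p(t_\alpha)+p(t_\beta)-p(t_\gamma) = e^{-2(q(t_\alpha)+q(t_\gamma))} = e^{-K-q(t_\alpha)+q(t_\beta)-q(t_\gamma)}, \\ 1 - 2(p(t_\beta)+p(t_\gamma)) &= p(\epsilon)+p(t_\alpha)-p(t_\beta)-p(t_\gamma) = e^{-2(q(t_\beta)+q(t_\gamma))} = e^{-K+q(t_\alpha)-q(t_\beta)-q(t_\gamma)}, \\ 1 - 2(p(t_\alpha)+p(t_\beta)) &= p(\epsilon)-p(t_\alpha)-p(t_\beta)+p(t_\gamma) = e^{-2(q(t_\alpha)+q(t_\beta))} = e^{-K-q(t_\alpha)-q(t_\beta)+q(t_\gamma)}. \end{aligned}$$

These equations can be succinctly expressed (see [1]) as

$$H_1 P H_1 = \mathrm{Exp}(H_1 Q H_1), \tag{3}$$

where

$$H_1 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad P = \begin{pmatrix} p(\epsilon) & p(t_\alpha) \\ p(t_\beta) & p(t_\gamma) \end{pmatrix}, \qquad Q = \begin{pmatrix} -K & q(t_\alpha) \\ q(t_\beta) & q(t_\gamma) \end{pmatrix},$$

and Exp is the exponential function applied to each entry of the matrix. Equation (3) can be inverted (as the arguments of ln are all positive) to give

$$H_1 Q H_1 = \mathrm{Ln}(H_1 P H_1), \tag{4}$$

where Ln is the natural logarithm applied to each entry of a matrix. The invertibility of (3) and (4) means that, provided the parameters are in valid ranges, the model can be specified either by the three probabilities $p(t_\alpha)$, $p(t_\beta)$, and $p(t_\gamma)$, or by the three parameters $q(t_\alpha)$, $q(t_\beta)$, and $q(t_\gamma)$. This inversion does not rely on a rate/time specification and a Poisson process of substitution.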
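Hence, we are able to test the validity of a constant rate model in analyzing observed data.

3 SUBSTITUTIONS ACROSS THE EDGES OF A TREE

We now extend the model of the previous section to handle sets of edges. We derive the probability of a substitution of type $\theta$ along a set of edges $W$ and record it in a stochastic matrix $M_W$. We also define the path length matrix $Q_W$ and, in a similar fashion to the previous section, obtain the relationship between $M_W$ and $Q_W$. Finally, we define the edge length spectrum that records all the tree parameters.

Let $[n] = \{1, 2, \ldots, n\}$ and $[n]_0 = [n] \cup \{0\}$. Let $T$ be a tree (phylogeny) with leaf set $L(T) = [n]_0$ and edge set $e(T)$. For each edge $e \in e(T)$, we can postulate three independent Kimura probability parameters $p_e(t_\alpha)$, $p_e(t_\beta)$, and $p_e(t_\gamma)$. These are collected in a stochastic matrix

$$M_e = \begin{pmatrix} p_e(\epsilon) & p_e(t_\alpha) & p_e(t_\beta) & p_e(t_\gamma) \\ p_e(t_\alpha) & p_e(\epsilon) & p_e(t_\gamma) & p_e(t_\beta) \\ p_e(t_\beta) & p_e(t_\gamma) & p_e(\epsilon) & p_e(t_\alpha) \\ p_e(t_\gamma) & p_e(t_\beta) & p_e(t_\alpha) & p_e(\epsilon) \end{pmatrix},$$

with eigenvalues 1, $\exp(-2(q_e(t_\alpha)+q_e(t_\gamma)))$, $\exp(-2(q_e(t_\beta)+q_e(t_\gamma)))$, and $\exp(-2(q_e(t_\alpha)+q_e(t_\beta)))$. Let

$$P_e = \begin{pmatrix} p_e(\epsilon) & p_e(t_\alpha) \\ p_e(t_\beta) & p_e(t_\gamma) \end{pmatrix}, \qquad Q_e = \begin{pmatrix} -K_e & q_e(t_\alpha) \\ q_e(t_\beta) & q_e(t_\gamma) \end{pmatrix};$$

then, by (3), we see that the probabilities $p_e(t_\alpha)$, $p_e(t_\beta)$, and $p_e(t_\gamma)$ are related to the three parameters $q_e(t_\alpha)$, $q_e(t_\beta)$, and $q_e(t_\gamma)$ as

$$H_1 P_e H_1 = \mathrm{Exp}(H_1 Q_e H_1).$$

As the matrices $M_e$, for $e \in e(T)$, are each diagonalized by $H$, they commute. Hence, for any subset of edges $W \subseteq e(T)$, we can formally define the product

$$M_W = \prod_{e \in W} M_e = \begin{pmatrix} p_W(\epsilon) & p_W(t_\alpha) & p_W(t_\beta) & p_W(t_\gamma) \\ p_W(t_\alpha) & p_W(\epsilon) & p_W(t_\gamma) & p_W(t_\beta) \\ p_W(t_\beta) & p_W(t_\gamma) & p_W(\epsilon) & p_W(t_\alpha) \\ p_W(t_\gamma) & p_W(t_\beta) & p_W(t_\alpha) & p_W(\epsilon) \end{pmatrix}. \tag{5}$$

We note that the term $p_W(\theta)$ is the probability that the product of the substitutions on all the edges of $W$ is $\theta$. In particular, if $W$ is a path in $T$, then $p_W(\theta)$ is the probability that the states at the endpoints of the path differ by the substitution $\theta$. In addition, when $W = \{e\}$, we see $M_{\{e\}} = M_e$ and $p_{\{e\}}(\theta) = p_e(\theta)$.

We see that $M_W$ is diagonalized by $H$,

$$H^{-1} M_W H = \mathrm{diag}\bigl(1,\; 1 - 2(p_W(t_\alpha)+p_W(t_\gamma)),\; 1 - 2(p_W(t_\beta)+p_W(t_\gamma)),\; 1 - 2(p_W(t_\alpha)+p_W(t_\beta))\bigr),$$

as is each factor in (5), so

$$\begin{aligned} H^{-1} M_W H &= H^{-1} \left( \prod_{e \in W} M_e \right) H = \prod_{e \in W} H^{-1} M_e H \\ &= \prod_{e \in W} \mathrm{diag}\bigl(1,\; e^{-2(q_e(t_\alpha)+q_e(t_\gamma))},\; e^{-2(q_e(t_\beta)+q_e(t_\gamma))},\; e^{-2(q_e(t_\alpha)+q_e(t_\beta))}\bigr) \\ &= \mathrm{diag}\bigl(1,\; e^{-2(q_W(t_\alpha)+q_W(t_\gamma))},\; e^{-2(q_W(t_\beta)+q_W(t_\gamma))},\; e^{-2(q_W(t_\alpha)+q_W(t_\beta))}\bigr), \end{aligned}$$

where

$$q_W(t_\alpha) = \sum_{e \in W} q_e(t_\alpha), \qquad q_W(t_\beta) = \sum_{e \in W} q_e(t_\beta), \qquad q_W(t_\gamma) = \sum_{e \in W} q_e(t_\gamma).$$

Thus, equating the diagonals, we have

Observation 3.

$$\begin{aligned} 1 &= p_W(\epsilon)+p_W(t_\alpha)+p_W(t_\beta)+p_W(t_\gamma) = \exp(0), \\ 1 - 2(p_W(t_\alpha)+p_W(t_\gamma)) &= p_W(\epsilon)-p_W(t_\alpha)+p_W(t_\beta)-p_W(t_\gamma) = \exp(-2(q_W(t_\alpha)+q_W(t_\gamma))), \\ 1 - 2(p_W(t_\beta)+p_W(t_\gamma)) &= p_W(\epsilon)+p_W(t_\alpha)-p_W(t_\beta)-p_W(t_\gamma) = \exp(-2(q_W(t_\beta)+q_W(t_\gamma))), \\ 1 - 2(p_W(t_\alpha)+p_W(t_\beta)) &= p_W(\epsilon)-p_W(t_\alpha)-p_W(t_\beta)+p_W(t_\gamma) = \exp(-2(q_W(t_\alpha)+q_W(t_\beta))). \end{aligned}$$

We now define the corresponding $P$ and $Q$ matrices for the edge set $W$:

$$P_W = \begin{pmatrix} p_W(\epsilon) & p_W(t_\alpha) \\ p_W(t_\beta) & p_W(t_\gamma) \end{pmatrix}, \qquad Q_W = \begin{pmatrix} -K_W & q_W(t_\alpha) \\ q_W(t_\beta) & q_W(t_\gamma) \end{pmatrix} = \sum_{e \in W} Q_e,$$

where $K_W = q_W(t_\alpha) + q_W(t_\beta) + q_W(t_\gamma)$.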
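The diagonalization argument above can be checked numerically. The following Python sketch is illustrative only and is not part of the original derivation: the function names `edge_probabilities` and `compose`, the integer encoding of the substitution types, and the numeric edge lengths are all assumptions made for the example. It computes the four probabilities for one edge from its three edge-length parameters through the $2 \times 2$ conjugation $H_1 P_e H_1 = \mathrm{Exp}(H_1 Q_e H_1)$, and then checks that composing two independent edges gives the same probabilities as adding their edge lengths.

```python
import math

H1 = [[1, 1], [1, -1]]


def matmul(a, b):
    """Plain product of two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def edge_probabilities(q_alpha, q_beta, q_gamma):
    """Return (p_eps, p_alpha, p_beta, p_gamma) for one edge or edge set."""
    k = q_alpha + q_beta + q_gamma
    q = [[-k, q_alpha], [q_beta, q_gamma]]
    rho = matmul(matmul(H1, q), H1)                   # H1 Q H1
    r = [[math.exp(x) for x in row] for row in rho]   # entrywise Exp
    # H1^{-1} = H1 / 2, so conjugating by the inverse divides by 4.
    p = [[x / 4 for x in row] for row in matmul(matmul(H1, r), H1)]
    return (p[0][0], p[0][1], p[1][0], p[1][1])


def compose(p, r):
    """Distribution of the product of two independent substitutions.

    Types are indexed 0=identity, 1=alpha, 2=beta, 3=gamma, an encoding
    chosen here so that composition in the Klein group is XOR of indices.
    """
    out = [0.0] * 4
    for i in range(4):
        for j in range(4):
            out[i ^ j] += p[i] * r[j]
    return tuple(out)


# Hypothetical edge lengths (q_alpha, q_beta, q_gamma) for two edges.
e1 = (0.10, 0.05, 0.02)
e2 = (0.03, 0.07, 0.01)
p_w = compose(edge_probabilities(*e1), edge_probabilities(*e2))
q_w = tuple(a + b for a, b in zip(e1, e2))
print(all(math.isclose(x, y) for x, y in zip(p_w, edge_probabilities(*q_w))))
```

The final comparison reflects the additivity of the edge-length parameters over an edge set: the probabilities obtained by composing the two edges agree with those obtained from the summed lengths.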
The relationships of Observation 3, similar to (3), can now be expressed as

$$H_1 P_W H_1 = \mathrm{Exp}(H_1 Q_W H_1). \tag{6}$$

As the $Q_e$ matrices are additive over edge sets of $T$, we refer to the expected numbers $q_e(t_\alpha)$, $q_e(t_\beta)$, and $q_e(t_\gamma)$ as the three edge-length parameters for each edge $e \in e(T)$. We can thus specify our model by the set of $3|e(T)|$ independent edge-length parameters

$$\{ q_e(\theta) : \theta \in \{t_\alpha, t_\beta, t_\gamma\},\; e \in e(T) \}.$$

Given $T$ and the $3|e(T)|$ edge length parameters, we can model sequence evolution on $T$ under the K3ST model if we specify a sequence of nucleotides at one leaf and generate corresponding nucleotides at every other vertex according to the probabilities $p_e(\theta)$. We comment that the K3ST model induces a uniform base distribution under equilibrium. However, since our work deals with the probabilities along the edges, our derivation is indifferent to the base distribution.

Edge indexing. The deletion of an edge $e \in e(T)$ induces two subtrees, whose leaf label sets $A$, $A'$ (with $0 \in A'$) partition $L(T) = [n]_0$. Thus, $A$ is the set of leaves of $T$ separated from the reference leaf 0 by the edge $e$. We choose the subset $A \subseteq [n]$ (the subset not containing 0) to index $e$ as $e_A$. Thus, for $e = e_A \in e(T)$,

$$A = \{ i \in [n] : e \in \Pi_{0i} \},$$

where $\Pi_{0i}$ is the path (in $T$) connecting leaves 0 and $i$. A partition of a set $X$ into two subsets $\{A, A'\}$ (so $A \cap A' = \emptyset$ and $A \cup A' = X$) is called a split of $X$. When $X = [n]_0$, we identify each split $\{A, A'\}$ by the subset $A$ that does not contain 0; hence, the set of splits of $[n]_0 = \{0, 1, \ldots, n\}$ is in bijection with the set of subsets of $[n] = \{1, 2, \ldots, n\}$.

Now, for each $A \subseteq [n]$, we define the three values $(q_\alpha)_A$, $(q_\beta)_A$, and $(q_\gamma)_A$ by

$$(q_\theta)_A = \begin{cases} q_{e_A}(t_\theta) & \text{if } e_A \in e(T), \\ -K(\theta) = -\sum_{e_B \in e(T)} q_{e_B}(t_\theta) & \text{if } A = \emptyset, \\ 0 & \text{else}, \end{cases}$$

for $\theta = \alpha, \beta, \gamma$. We incorporate these values into three vectors $q_\alpha$, $q_\beta$, and $q_\gamma$, each of $2^n$ entries. We order the components of the vectors by the subsets of $[n]$ as follows: $\emptyset, \{1\}, \{2\}, \{1,2\}, \{3\}, \{1,3\}, \{2,3\}, \{1,2,3\}, \{4\}, \ldots, [n]$. As $(q_\theta)_\emptyset = -K(\theta)$, the sum of the components in each vector is 0. We will also find it convenient to incorporate the vectors into a $2^n \times 2^n$ matrix

$$Q_T = [q_{A,B}]_{A,B \subseteq [n]},$$

where

$$q_{A,B} = \begin{cases} q_{e_A}(t_\beta) & \text{if } e_A \in e(T),\; B = \emptyset, \\ q_{e_B}(t_\alpha) & \text{if } A = \emptyset,\; e_B \in e(T), \\ q_{e_A}(t_\gamma) & \text{if } A = B,\; e_A \in e(T), \\ -K_T & \text{if } A = B = \emptyset, \\ 0 & \text{else}, \end{cases}$$

and

$$K_T = \sum_{e \in e(T)} \bigl( q_e(t_\alpha) + q_e(t_\beta) + q_e(t_\gamma) \bigr) = \sum_{e \in e(T)} K_e.$$

Thus, the leading row of $Q_T$ is $q_\alpha$, the leading column is $q_\beta$, and the leading diagonal is $q_\gamma$; all other entries are 0, apart from $q_{\emptyset,\emptyset} = -K_T$ (hence, the sum of all entries of $Q_T$ is 0). $Q_T$ is referred to as the edge length spectrum for $T$. The positive entries of this spectrum identify the edges of $T$.

Fig. 2. Example: The edge-length spectrum of the tree $T$.

Fig. 2 shows an example of a tree $T$ on $n + 1 = 4$ taxa, and its edge-length spectrum as three vectors, and incorporated in the $8 \times 8$ matrix $Q_T$. Corresponding coordinates of the vectors $q_\alpha$, $q_\beta$, and $q_\gamma$ give the three edge length parameters for the corresponding edge. The 0 value indicates that there is no corresponding edge in $T$. These vectors are placed in the leading row, column, and main diagonal of the matrix $Q_T$. This means that, for $A, B \subseteq \{1, 2, 3\}$, $q_{\emptyset,B} = q_B(t_\alpha)$, $q_{A,\emptyset} = q_A(t_\beta)$, $q_{A,A} = q_A(t_\gamma)$, and for all other entries, $q_{A,B} = 0$, except the first entry $q_{\emptyset,\emptyset} = -K$, where $K = K(\alpha) + K(\beta) + K(\gamma)$. The entries indicated by "." are all zero and are zero for every tree. The entries indicated by 0 are zero for the topology of $T$, signifying that the splits represented by them are not part of $T$. Different topologies can have positive values for these entries. The nonzero entries (in the leading row, column, and main diagonal) should each be in the same coordinates, as they identify the edge splits of $T$. For general trees on $n + 1$ taxa, the edge length spectra are vectors and square matrices of order $2^n$.

4 SITE PATTERNS

In this section, we introduce the notion of a character. We also define the notion of a site pattern and show that each site pattern is identified by an ordered pair of splits $(C, D)$, with $C, D \subseteq [n]$, and that every character can be recovered from the site pattern and the state at taxon 0. This leads to the definition of the sequence probability spectrum, which records the probability of obtaining every site pattern.

When we propose a sequence of nucleotides$^1$ at leaf 0 and an edge-length spectrum $Q_T$ on a phylogeny $T$ with leaf set $L = [n]_0$, we can generate homologous sequences at each of the other leaves of $T$ under this stochastic model. A common position in each of these sequences is called a site. An assignment of nucleotides at a given site is called a character. Specifically,

$$\chi : L \to \{A, C, G, T\}$$

assigns a nucleotide to each leaf, with $\chi(i)$ the character state at leaf $i$. This assignment partitions $L$ into subsets $L_A$, $L_C$, $L_G$, and $L_T$, where, for $X \in \{A, C, G, T\}$,

$$L_X = \{ i \in L : \chi(i) = X \}.$$

Given the character $\chi$, we define the character substitution map $\tau : L \to \mathcal{T}$ such that

$$\tau(i)(\chi(0)) = \chi(i),$$

and a pair of sets $C(\chi), D(\chi) \subseteq [n]$, where

$$C(\chi) = \{ i \in [n] : \tau(i) \in \{t_\beta, t_\gamma\} \}, \qquad D(\chi) = \{ j \in [n] : \tau(j) \in \{t_\alpha, t_\gamma\} \}.$$

The pair of subsets $(C, D) = (C(\chi), D(\chi))$ is called the site pattern for $\chi$. Given the site pattern $(C, D)$ and the character state $\chi(0)$ at the reference leaf 0, we can recover $\chi$. For example, if $i \in D - C$ and $\chi(0) = G$, then $\tau(i) = t_\alpha$, so $\chi(i) = t_\alpha(G) = A$.

There are four characters (depending on the state of $\chi(0)$) that correspond to the same site pattern $(C, D)$. Under equilibrium, and by the symmetries of the K3ST model, each has the same probability of being generated on $T$ by this model. However, the transition matrices at the tree edges do not depend on this. Let $s_{C,D}$ be the probability of obtaining the site pattern $(C, D)$ (recall that the site pattern $(C, D)$ is obtained from four characters, as $\chi(0)$ takes each character value).
We now define the $2^n \times 2^n$ matrix $S_T$, the sequence probability spectrum, with rows and columns indexed by the subsets of $[n]$:

$$S_T = [s_{C,D}]_{C,D \subseteq [n]}.$$

1. The assumption about nonuniform base frequency holds here as well.
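The correspondence between characters and site patterns can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the authors' code: the names `site_pattern` and `recover`, the nucleotide ordering `"AGTC"`, and the integer encoding of the substitution types (0 for the identity, 1, 2, 3 for types alpha, beta, gamma) are choices made here so that applying a substitution becomes a bitwise XOR.

```python
NUCLEOTIDES = "AGTC"   # assumed ordering: index XOR type gives the new state
EPS, ALPHA, BETA, GAMMA = 0, 1, 2, 3


def site_pattern(character):
    """Return (C, D) for a character given as a string over leaves 0..n."""
    ref = NUCLEOTIDES.index(character[0])
    c, d = set(), set()
    for i in range(1, len(character)):
        tau = NUCLEOTIDES.index(character[i]) ^ ref   # substitution from leaf 0
        if tau in (BETA, GAMMA):
            c.add(i)
        if tau in (ALPHA, GAMMA):
            d.add(i)
    return frozenset(c), frozenset(d)


def recover(c, d, state0, n):
    """Rebuild the character from (C, D) and the state at leaf 0."""
    ref = NUCLEOTIDES.index(state0)
    out = [state0]
    for i in range(1, n + 1):
        tau = 2 * (i in c) + (i in d)   # C only: beta, D only: alpha, both: gamma
        out.append(NUCLEOTIDES[ref ^ tau])
    return "".join(out)


pattern = site_pattern("GATC")
print(pattern)
print(recover(pattern[0], pattern[1], "G", 3))
```

In this example leaf 1 lies in $D$ but not in $C$ and the reference state is G, so its state is A, matching the worked example in the text. Applying the same substitution type to every leaf of a character leaves the site pattern unchanged, which is why four characters share each pattern.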

The main theorem of this paper (Theorem 10) links the probability $s_{C,D}$, for each $C, D \subseteq [n]$, to the edge length parameters $q_e(\theta)$, $e \in e(T)$, $\theta \in \mathcal{T}$. We will derive explicit formulas for $s_{C,D}$ as a function of the edge length parameters.

5 HADAMARD MATRICES

We define recursively the family $\{H_n : n \in \mathbb{Z}^+\}$ (known as Sylvester matrices), where, for $n \ge 2$,

$$H_n = H_1 \otimes H_{n-1} = \begin{pmatrix} H_{n-1} & H_{n-1} \\ H_{n-1} & -H_{n-1} \end{pmatrix}$$

is a symmetric Hadamard matrix of order $2^n$, with $H_1$ and $H_2 = H$ as previously defined. It is easily seen that $H_n^{-1} = 2^{-n} H_n$. It is known [10] that if we index the rows and columns of $H_n$ by the subsets of $[n]$, then, for $A, B \subseteq [n]$, we have the following observation.

Observation 4.

$$[H_n]_{A,B} = h(A, B) = (-1)^{|A \cap B|} = h(B, A).$$

Further, for $B, C \subseteq [n]$, we write their symmetric difference as $B \Delta C$ $(= (B \cup C) - (B \cap C))$, and we see

$$(-1)^{|A \cap (B \Delta C)|} = (-1)^{|A \cap B| + |A \cap C| - 2|A \cap (B \cap C)|} = (-1)^{|A \cap B|} (-1)^{|A \cap C|}.$$

Hence, we have the following.

Observation 5.

$$h(A, B \Delta C) = h(A, B)\, h(A, C).$$

6 PATH SETS

In this section, we show how to decompose a set of paths connecting an even number of leaves into a set of edge disjoint paths. We denote the latter as a path set. We then generalize the edge length into path-set distances with respect to each substitution type $\theta \in \mathcal{T}$.

For any $i, j \in L(T) = [n]_0$, we define the path $\Pi_{ij}$ to be the set of edges in $T$ connecting leaves $i$ and $j$. In particular, we note

$$\Pi_{0i} = \{ e_A \in e(T) : i \in A \}.$$

$\Pi_{ij}$ is obtained by deleting the common edges of $\Pi_{0i}$ and $\Pi_{0j}$ from their union, so

$$\Pi_{ij} = \Pi_{0i} \Delta \Pi_{0j} = \{ e_A \in e(T) : |A \cap \{i, j\}| = 1 \} = \{ e_A \in e(T) : h(A, \{i, j\}) = -1 \}.$$

For any $E \subseteq [n]$, let

$$\Pi_E = \{ e_A \in e(T) : h(A, E) = -1 \};$$

so, in particular, for $i, j \in [n]$, we see

$$\Pi_{\{i,j\}} = \Pi_{ij}, \qquad \Pi_{\{i\}} = \Pi_{0i}, \qquad \Pi_\emptyset = \emptyset.$$

Observation 6. In [14], it is shown that $\Pi_E$ is a collection of edge disjoint paths, with end-point set $E$ or $E \cup \{0\}$. $\Pi_E$ is called a path set.

Figs. 3a and 3b each show a single path, while Fig. 3c shows the path set induced by a set of four leaves that includes leaves 0 and 4.

Fig. 3. (a) A path (dashed line). (b) A second path (dashed line). (c) A path set (dashed lines). Note that, in each case, $\Pi_E = \{ e_A \in e(T) : |A \cap E| \text{ is odd} \}$. The union of the two paths can be partitioned into their intersection $W$ and the remainders $U$ and $V$, as in (21).

By similar arguments to the discussion above, we find the following.

Observation 7. The set of path sets is a group (under symmetric difference) isomorphic to $C_2^n$. In particular, $\Pi_{E \Delta F} = \Pi_E \Delta \Pi_F$.

The sum of edge lengths on a path connecting two leaves can naturally be thought of as the distance between the leaves. We extend this distance concept, for each substitution type $\theta \in \{t_\alpha, t_\beta, t_\gamma\}$, to path sets. We define the path-set distance of path set $\Pi_E$ to be the sum of the corresponding edge lengths of each edge of the path set, that is,

$$d_E(\theta) = \sum_{e_A \in \Pi_E} q_A(\theta). \tag{8}$$

7 SITE PATTERNS AND THE HADAMARD MATRIX

Here, for the sake of the explanation, we extend the notion of a character to conceptually assign values to the internal vertices of $T$. This allows us to extend the notion of the substitution function to the context of edges and, subsequently, to path sets.

Suppose we are given $\chi(0)$, the character state at leaf 0, and assign a transformation $\tau(v) \in \mathcal{T}$ to each vertex $v$ of $T$, such that the character state at $v$ is $\chi(v) = \tau(v)(\chi(0))$. (In particular, note that $\tau(0) = \epsilon$, the identity.) If we restrict ourselves to the set of leaves $[n]$, then we find that the consequent site pattern is $(C, D) = (C(\chi), D(\chi))$, where

$$C(\chi) = \{ i \in [n] : \tau(i) \in \{t_\beta, t_\gamma\} \}, \qquad D(\chi) = \{ j \in [n] : \tau(j) \in \{t_\alpha, t_\gamma\} \}$$

are subsets of $[n]$. Further, for each edge $e = (u, v)$, the transformation across $e$ is

$$\tau(e) = \tau(u)^{-1} \tau(v) = \tau(u) \tau(v)$$

(as $(\mathcal{T}, \circ)$ is a Boolean group). For the path $\Pi_{i,j}$ connecting leaves $i$ and $j$, we find

$$\prod_{e \in \Pi_{i,j}} \tau(e) = \prod_{e = (u,v) \in \Pi_{i,j}} \tau(u)^{-1} \tau(v) = \tau(i)^{-1} \tau(j) = \tau(i) \tau(j) \tag{9}$$

(as the products at each internal vertex cancel, and $\tau^{-1} = \tau$). We extend this to any path set $\Pi_E$ ($E \subseteq [n]$) and define

$$\tau(E) = \prod_{e \in \Pi_E} \tau(e).$$

Hence, as in (9), the products at all internal vertices cancel, so we find the following.

Observation 8.

$$\tau(E) = \prod_{i \in E} \tau(i). \tag{10}$$

Consider now a character $\chi$, inducing a site pattern $(C, D) = (C(\chi), D(\chi))$. Recall from (1) the homomorphisms $g_1, g_2 : \mathcal{T} \to \{1, -1\}$, where

$$g_1(\theta) = 1 \iff \theta \in \{\epsilon, t_\alpha\} \quad \text{and} \quad g_2(\theta) = 1 \iff \theta \in \{\epsilon, t_\beta\}.$$

Then, for $j = 1, 2$,

$$g_j(\tau(E)) = g_j\!\left( \prod_{i \in E} \tau(i) \right) = \prod_{i \in E} g_j(\tau(i)). \tag{11}$$

Then, $g_j(\tau(E)) = 1$ exactly when the number of factors $g_j(\tau(i)) = -1$ in the product (11) is even. Now, for $i \in E$,

$$g_1(\tau(i)) = -1 \iff \tau(i) \in \{t_\beta, t_\gamma\} \iff i \in C,$$

so

$$g_1(\tau(E)) = 1 \iff |C \cap E| \equiv 0 \pmod 2 \iff h(C, E) = 1.$$

However, as $g_1(\tau(E)) = 1 \iff \tau(E) \in \{\epsilon, t_\alpha\}$, we find

$$\tau(E) \in \{\epsilon, t_\alpha\} \iff h(C, E) = 1. \tag{12}$$

Similarly, we find

$$\tau(E) \in \{\epsilon, t_\beta\} \iff g_2(\tau(E)) = 1 \iff |D \cap E| \equiv 0 \pmod 2 \iff h(D, E) = 1. \tag{13}$$

Hence, we have shown the following.

Observation 9. Given a character $\chi$ inducing a site pattern $(C, D) = (C(\chi), D(\chi))$, and a path set $\Pi_E$ with $\tau(E) = \prod_{i \in E} \tau(i)$,

$$h(C, E) = g_1(\tau(E)), \qquad h(D, E) = g_2(\tau(E)). \tag{14}$$

8 HADAMARD CONJUGATION

This is the final section in the derivation, in which we prove the main theorem of this paper. We start with the right-hand side $H_n Q_T H_n$ and show that an entry in that matrix corresponds to a path-set distance. We decompose these possibly overlapping distances into three disjoint edge sets, use the previous identities to derive probabilities of substitutions along each of these edge sets, and then recombine them into the original path sets.
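Observations 4, 5, and 7 lend themselves to a quick numerical check. The sketch below is illustrative and rests on assumptions not found in the paper: subsets of $[n]$ are encoded as bitmasks (bit $i - 1$ set when element $i$ is present), which reproduces the subset ordering $\emptyset, \{1\}, \{2\}, \{1,2\}, \{3\}, \ldots$, and the 4-taxon tree `TREE` with edges $e_{\{1\}}, e_{\{2\}}, e_{\{3\}}, e_{\{1,2\}}, e_{\{1,2,3\}}$ is a hypothetical example, not the tree of Fig. 3.

```python
def h(a, b):
    """(-1)^{|A & B|} for subsets of [n] encoded as bitmasks."""
    return -1 if bin(a & b).count("1") % 2 else 1


def sylvester(n):
    """H_n from the recursion H_n = [[H, H], [H, -H]], starting at [[1]]."""
    m = [[1]]
    for _ in range(n):
        m = [row + row for row in m] + [row + [-x for x in row] for row in m]
    return m


def path_set(splits, e):
    """Edges e_A (given by their split bitmasks A) with h(A, E) = -1."""
    return frozenset(a for a in splits if h(a, e) == -1)


# Hypothetical tree on leaves 0..3: leaves 1 and 2 form a cherry.
TREE = [0b001, 0b010, 0b100, 0b011, 0b111]

H3 = sylvester(3)
print(all(H3[a][b] == h(a, b) for a in range(8) for b in range(8)))
print(sorted(path_set(TREE, 0b011)))   # path between leaves 1 and 2
print(sorted(path_set(TREE, 0b111)))   # two edge-disjoint paths
```

For $E = \{1, 2, 3\}$ the selected edges split into the path between leaves 1 and 2 and the path between leaves 0 and 3, with end-point set $E \cup \{0\}$, as Observation 6 describes. Symmetric difference of path sets corresponds to XOR of the bitmasks that index them.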
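$Q_T$ is the matrix containing the edge-length parameters across $T$. $S_T$ is the matrix of probabilities of patterns at the leaves of $T$. The link between these is given by the rotations $H_n S_T H_n$ and $H_n Q_T H_n$. These both relate to path-set properties and enable us to state our major result.

Theorem 10.

$$S_T = H_n^{-1} \bigl( \mathrm{Exp}(H_n Q_T H_n) \bigr) H_n^{-1}, \tag{15}$$

which, provided the arguments of the logarithm are positive, is invertible and gives

$$Q_T = H_n^{-1} \bigl( \mathrm{Ln}(H_n S_T H_n) \bigr) H_n^{-1}. \tag{16}$$

Proof. The proof of this theorem is based on interpreting the corresponding components, for $E, F \subseteq [n]$,

$$[H_n S_T H_n]_{E,F} \quad \text{and} \quad [H_n Q_T H_n]_{E,F}.$$

Expanding the second term, we find

$$[H_n Q_T H_n]_{E,F} = \sum_{A,B \subseteq [n]} h(A, E)\, h(B, F)\, q_{A,B} = q_{\emptyset,\emptyset} + \sum_{e_A \in e(T)} \bigl( h(A, E)\, q_{A,\emptyset} + h(A, F)\, q_{\emptyset,A} + h(A, E)\, h(A, F)\, q_{A,A} \bigr), \tag{17}$$

as the only nonzero entries in $Q_T$ are $q_{A,\emptyset}$, $q_{\emptyset,A}$, $q_{A,A}$ for $e_A \in e(T)$, and $q_{\emptyset,\emptyset}$. Recall that

$$q_{\emptyset,\emptyset} = -\sum_{e_A \in e(T)} ( q_{A,\emptyset} + q_{\emptyset,A} + q_{A,A} ),$$

and (Observation 5)

$$h(A, E)\, h(A, F) = h(A, E \Delta F);$$

hence, the RHS of (17) can be written as

$$\sum_{e_A \in e(T)} \bigl( (h(A, E) - 1)\, q_{A,\emptyset} + (h(A, F) - 1)\, q_{\emptyset,A} + (h(A, E \Delta F) - 1)\, q_{A,A} \bigr). \tag{18}$$

Now, as the terms with $h(A, E) = 1$ vanish, and by the definition of $\Pi_E$, we can write (18) as

$$[H_n Q_T H_n]_{E,F} = -2 \left( \sum_{e_A \in \Pi_E} q_{A,\emptyset} + \sum_{e_A \in \Pi_F} q_{\emptyset,A} + \sum_{e_A \in \Pi_{E \Delta F}} q_{A,A} \right).$$

By Observation 7, $\Pi_{E \Delta F} = \Pi_E \Delta \Pi_F$, and by the definition of the matrix $Q_T$,

$$q_{A,\emptyset} = q_{e_A}(t_\beta), \qquad q_{\emptyset,A} = q_{e_A}(t_\alpha), \qquad q_{A,A} = q_{e_A}(t_\gamma),$$

so

$$\sum_{e_A \in \Pi_E} q_{A,\emptyset} = d_E(t_\beta), \qquad \sum_{e_A \in \Pi_F} q_{\emptyset,A} = d_F(t_\alpha), \qquad \sum_{e_A \in \Pi_{E \Delta F}} q_{A,A} = d_{E \Delta F}(t_\gamma). \tag{19}$$

Hence,

$$[H_n Q_T H_n]_{E,F} = -2 \bigl( d_E(t_\beta) + d_F(t_\alpha) + d_{E \Delta F}(t_\gamma) \bigr). \tag{20}$$

We can partition $\Pi_E \cup \Pi_F$ into three parts

$$U = \Pi_E - \Pi_F, \qquad V = \Pi_F - \Pi_E, \qquad W = \Pi_E \cap \Pi_F, \tag{21}$$

as illustrated in Fig. 3c, with the path sets partitioned as

$$\Pi_E = U \cup W, \qquad \Pi_F = V \cup W, \qquad \Pi_{E \Delta F} = \Pi_E \Delta \Pi_F = U \cup V.$$

The path-set distances split into corresponding summands, as $d_U(\theta) = \sum_{e \in U} q_e(\theta)$, etc., so that

$$d_E(t_\beta) = d_U(t_\beta) + d_W(t_\beta), \qquad d_F(t_\alpha) = d_V(t_\alpha) + d_W(t_\alpha), \qquad d_{E \Delta F}(t_\gamma) = d_U(t_\gamma) + d_V(t_\gamma).$$

Thus,

$$\begin{aligned} [H_n Q_T H_n]_{E,F} &= -2 [ d_U(t_\beta) + d_W(t_\beta) + d_V(t_\alpha) + d_W(t_\alpha) + d_U(t_\gamma) + d_V(t_\gamma) ] \\ &= -2 [ d_U(t_\beta) + d_U(t_\gamma) ] - 2 [ d_V(t_\alpha) + d_V(t_\gamma) ] - 2 [ d_W(t_\alpha) + d_W(t_\beta) ], \end{aligned}$$

and

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = e^{-2[d_U(t_\beta) + d_U(t_\gamma)]}\, e^{-2[d_V(t_\alpha) + d_V(t_\gamma)]}\, e^{-2[d_W(t_\alpha) + d_W(t_\beta)]}.$$

Now, by Observation 3,

$$\begin{aligned} e^{-2[d_U(t_\beta) + d_U(t_\gamma)]} &= p_U(\epsilon) + p_U(t_\alpha) - p_U(t_\beta) - p_U(t_\gamma) = \sum_{\theta \in \mathcal{T}} g_1(\theta)\, p_U(\theta), \\ e^{-2[d_V(t_\alpha) + d_V(t_\gamma)]} &= p_V(\epsilon) - p_V(t_\alpha) + p_V(t_\beta) - p_V(t_\gamma) = \sum_{\theta \in \mathcal{T}} g_2(\theta)\, p_V(\theta), \\ e^{-2[d_W(t_\alpha) + d_W(t_\beta)]} &= p_W(\epsilon) - p_W(t_\alpha) - p_W(t_\beta) + p_W(t_\gamma) = \sum_{\theta \in \mathcal{T}} g_1(\theta)\, g_2(\theta)\, p_W(\theta). \end{aligned} \tag{22}$$

Substituting (22) into the product above gives

$$\begin{aligned} [\mathrm{Exp}(H_n Q_T H_n)]_{E,F} &= \left[ \sum_{\theta \in \mathcal{T}} g_1(\theta)\, p_U(\theta) \right] \left[ \sum_{\theta' \in \mathcal{T}} g_2(\theta')\, p_V(\theta') \right] \left[ \sum_{\theta'' \in \mathcal{T}} g_1(\theta'')\, g_2(\theta'')\, p_W(\theta'') \right] \\ &= \sum_{\theta, \theta', \theta'' \in \mathcal{T}} g_1(\theta \theta'')\, g_2(\theta' \theta'')\, p_U(\theta)\, p_V(\theta')\, p_W(\theta'') \quad \text{(expanding the products)} \\ &= \sum_{\mu, \nu, \theta'' \in \mathcal{T}} g_1(\mu)\, g_2(\nu)\, p_U(\mu \theta'')\, p_V(\nu \theta'')\, p_W(\theta'') \quad \text{(with } \mu = \theta \theta'' \text{ and } \nu = \theta' \theta'' \text{)} \\ &= \sum_{\mu, \nu \in \mathcal{T}} g_1(\mu)\, g_2(\nu) \left[ \sum_{\theta'' \in \mathcal{T}} p_U(\mu \theta'')\, p_V(\nu \theta'')\, p_W(\theta'') \right]. \end{aligned} \tag{23}$$

Now, as $\Pi_E$ and $\Pi_F$ are partitioned as $U \cup W$ and $V \cup W$, respectively, we have $\tau(E) = \tau(U) \tau(W)$ and $\tau(F) = \tau(V) \tau(W)$, where $\tau(U)$ denotes the product of the substitutions across the edges of $U$, and similarly for $V$ and $W$. Hence,

$$\sum_{\theta'' \in \mathcal{T}} p_U(\mu \theta'')\, p_V(\nu \theta'')\, p_W(\theta'') = \Pr[\tau(E) = \mu \wedge \tau(F) = \nu],$$

which is the joint probability that the product of substitutions across the edges of $\Pi_E$ is $\mu$, and the product across the edges of $\Pi_F$ is $\nu$. Thus, (23) implies

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = \sum_{\mu, \nu \in \mathcal{T}} g_1(\mu)\, g_2(\nu)\, \Pr[\tau(E) = \mu \wedge \tau(F) = \nu]. \tag{24}$$

By Observation 9, given a character $\chi$ inducing a site pattern $(C, D) = (C(\chi), D(\chi))$, we have

$$g_1(\tau(E)) = h(C, E), \qquad g_2(\tau(F)) = h(D, F), \tag{25}$$

that is, under $\chi$,

$$\tau(E) \in \{\epsilon, t_\alpha\} \iff h(C, E) = 1 \quad \text{and} \quad \tau(F) \in \{\epsilon, t_\beta\} \iff h(D, F) = 1.$$

Summing the probabilities $s_{C,D}$ for which both $h(C, E) = 1$ and $h(D, F) = 1$, we find

$$\Pr[\tau(E) \in \{\epsilon, t_\alpha\} \wedge \tau(F) \in \{\epsilon, t_\beta\}] = \sum_{C, D \subseteq [n] :\, h(C,E) = 1,\, h(D,F) = 1} s_{C,D}.$$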

Similarly, we see

$$\begin{aligned} \Pr[\tau(E) \in \{\epsilon, t_\alpha\} \wedge \tau(F) \in \{t_\alpha, t_\gamma\}] &= \sum_{C, D \subseteq [n] :\, h(C,E) = 1,\, h(D,F) = -1} s_{C,D}, \\ \Pr[\tau(E) \in \{t_\beta, t_\gamma\} \wedge \tau(F) \in \{\epsilon, t_\beta\}] &= \sum_{C, D \subseteq [n] :\, h(C,E) = -1,\, h(D,F) = 1} s_{C,D}, \\ \Pr[\tau(E) \in \{t_\beta, t_\gamma\} \wedge \tau(F) \in \{t_\alpha, t_\gamma\}] &= \sum_{C, D \subseteq [n] :\, h(C,E) = -1,\, h(D,F) = -1} s_{C,D}. \end{aligned}$$

Substituting these into (24), we obtain by Observation 9

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = \sum_{C, D \subseteq [n]} h(C, E)\, h(D, F)\, s_{C,D} = [H_n S_T H_n]_{E,F},$$

giving

$$\mathrm{Exp}(H_n Q_T H_n) = H_n S_T H_n, \tag{26}$$

from which (15) and (16) follow. □

9 APPLYING THE HADAMARD CONJUGATION

In this section, we provide an example of applying the Hadamard conjugation to a real biological problem. We use the application reported in [5], which obtains an analytical solution for the maximum likelihood problem of phylogenetic reconstruction.

As was shown above, a tree $T$ uniquely determines the sequence spectrum $S = S_T$. In real life, however, we do not find such a perfect $S$. Given a set of aligned input sequences, every column induces a site pattern. The matrix $\hat{S} = [\hat{s}_{C,D}]$, called the observed sequence spectrum, records the frequency of each site pattern $(C, D)$. For a tree $T$, the likelihood function is defined as

$$L(T) = \prod_{C, D \subseteq [n]} s_{C,D}^{\hat{s}_{C,D}}, \tag{27}$$

where $s_{C,D}$ comes from (15). That is, (27) expresses the probability of seeing $\hat{S}$ given $T$. The maximum likelihood problem is to find a tree such that the probability of obtaining the given data $\hat{S}$ is maximized.

In [5], the ML problem of a triplet tree under the Jukes-Cantor model and the molecular clock hypothesis was studied (see Fig. 4). The Jukes-Cantor model of evolution [19] is the simplest model for four-state DNA evolution. The assumption in this model is that when a base changes, it has equal probability of changing to each of the other three bases. This model can be derived from the more general K3ST model by setting, for each edge of $T$, each of the three edge length parameters equal to a common value, namely, setting $q_e(t_\alpha) = q_e(t_\beta) = q_e(t_\gamma) = q_e$.
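As a sanity check of (15), the following Python sketch compares the Hadamard-conjugation formula with a direct enumeration on a three-taxon star tree. It is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the bitmask indexing of subsets, the integer encoding of substitution types (0 for the identity and 1, 2, 3 for types alpha, beta, gamma, so that composition is XOR), and the numeric edge lengths are all hypothetical. The three edges are indexed by their splits $\{1\}$, $\{2\}$, and $\{1, 2\}$ (bitmasks 1, 2, 3), the last being the edge incident to the reference leaf 0.

```python
import math
from itertools import product


def h(a, b):
    """(-1)^{|A & B|} for subsets encoded as bitmasks."""
    return -1 if bin(a & b).count("1") % 2 else 1


def conjugate(m):
    """Return H_n m H_n for a square matrix indexed by subset bitmasks."""
    idx = range(len(m))
    return [[sum(h(a, e) * h(b, f) * m[a][b] for a in idx for b in idx)
             for f in idx] for e in idx]


def spectrum_from_lengths(q, n):
    """S_T = H^{-1} Exp(H Q_T H) H^{-1}; q maps split -> (q_alpha, q_beta, q_gamma)."""
    size = 2 ** n
    qt = [[0.0] * size for _ in range(size)]
    for a, (qa, qb, qg) in q.items():
        qt[0][a] = qa              # leading row: type alpha
        qt[a][0] = qb              # leading column: type beta
        qt[a][a] = qg              # diagonal: type gamma
        qt[0][0] -= qa + qb + qg   # so that all entries sum to zero
    r = [[math.exp(x) for x in row] for row in conjugate(qt)]
    return [[x / size ** 2 for x in row] for row in conjugate(r)]


def edge_probs(qa, qb, qg):
    """Closed form of the single-edge conjugation: [p_eps, p_alpha, p_beta, p_gamma]."""
    x = math.exp(-2 * (qa + qg))
    y = math.exp(-2 * (qb + qg))
    z = math.exp(-2 * (qa + qb))
    return [(1 + x + y + z) / 4, (1 - x + y - z) / 4,
            (1 + x - y - z) / 4, (1 - x - y + z) / 4]


def spectrum_by_enumeration(q):
    """Sum edge probabilities over all substitution assignments on the star tree."""
    p1, p2, p0 = edge_probs(*q[1]), edge_probs(*q[2]), edge_probs(*q[3])
    s = [[0.0] * 4 for _ in range(4)]
    for t0, t1, t2 in product(range(4), repeat=3):
        tau = {1: t0 ^ t1, 2: t0 ^ t2}   # substitution from leaf 0 to leaf i
        c = sum(2 ** (i - 1) for i in tau if tau[i] & 2)
        d = sum(2 ** (i - 1) for i in tau if tau[i] & 1)
        s[c][d] += p0[t0] * p1[t1] * p2[t2]
    return s


Q = {1: (0.10, 0.05, 0.02), 2: (0.03, 0.07, 0.01), 3: (0.04, 0.02, 0.06)}
formula = spectrum_from_lengths(Q, 2)
direct = spectrum_by_enumeration(Q)
print(max(abs(formula[c][d] - direct[c][d]) for c in range(4) for d in range(4)))
```

The printed maximum discrepancy should be at the level of floating-point rounding. Setting the three parameters of each edge to a common value specializes the same computation to the Jukes-Cantor model.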
We now look at a general tree T on three taxa {0, 1, } before determining the root is. The molecular clock hypothesis and determination of the root location are done at a later stage. T has just one topology, the star with the three edges e f1g, e fg, and e f1;g Fig. 4. (a) A general triplet tree over the species {0, 1, }. (b) A rooted triplet tree under Jukes-Cantor model and the molecular clock hypothesis. (see Fig. 4a). For simplicity we denote them as e 1, e and e 1; respectively. The edge-length spectrum of an arbitrary -tree can be expressed as Q 4 Now, we see that ðq 1 þ q þ q 1 Þ q 1 q q 1 q 1 q q 0 q 0 q q 1 5 : HQH 0 q 1 þ q 1 q þ q 1 q 1 þ q q 1 þ q 1 q 1 þ q 1 q 1 þ q þ q 1 q 1 þ q þ q q þ q 1 q 1 þ q þ q 1 q þ q 1 q 1 þ q þ q 1 5 ; q 1 þ q q 1 þ q þ q 1 q 1 þ q þ q 1 q 1 þ q and by (0), these are minus twice the sum of distances induced by every two path sets E, F for every entry ½HQHŠ E;F. For example, ½H n Q T H n Š 1;1 ðd 1 ðþþd 1 ðþþd 1 1 ðþþ ðd 1 ðþþd 1 ðþþd ðþþ 4ðq 1 þ q þ q 1 Þ: When applying the exponential function to each element of the matrix HQH, we obtain the so called path-set spectrum, R: 1 x 1 x 1 x x 1 x 1 x x R expðhqhþ 1 x 1 x 1 x 1 x 1 x x 1 x 1 x x 1 4 x x 1 x 1 x x 1 x x 1 x 1 x x 1 5 ; x 1 x x 1 x x 1 x 1 x x 1 x 1 x x i e 4q i : ð8þ ð9þ The x i values can replace the s C;D values as the defining parameters in the likelihood function (). The entries of R relate to the joint probabilities of differences between the end-points of the corresponding path sets in T, as implied by (4). By using our main Theorem 10, (15), the sequence probability spectrum equals

$$S = H^{-1} R H^{-1} \tag{30}$$

$$\phantom{S} = \begin{bmatrix} a_0 & a_1 & a_2 & a_3 \\ a_1 & a_1 & a_4 & a_4 \\ a_2 & a_4 & a_2 & a_4 \\ a_3 & a_4 & a_4 & a_3 \end{bmatrix}, \tag{31}$$

where

$$\begin{aligned}
a_0 &= \tfrac{1}{16}\,(1 + 3x_1x_2 + 3x_1x_{12} + 3x_2x_{12} + 6x_1x_2x_{12}),\\
a_1 &= \tfrac{1}{16}\,(1 - x_1x_2 - x_1x_{12} + 3x_2x_{12} - 2x_1x_2x_{12}),\\
a_2 &= \tfrac{1}{16}\,(1 - x_1x_2 + 3x_1x_{12} - x_2x_{12} - 2x_1x_2x_{12}),\\
a_3 &= \tfrac{1}{16}\,(1 + 3x_1x_2 - x_1x_{12} - x_2x_{12} - 2x_1x_2x_{12}),\\
a_4 &= \tfrac{1}{16}\,(1 - x_1x_2 - x_1x_{12} - x_2x_{12} + 2x_1x_2x_{12}).
\end{aligned} \tag{32}$$

Thus, we see that each expected sequence frequency takes one of the above values, which are functions of the three parameters $x_1$, $x_2$, and $x_{12}$. We now apply the molecular clock constraint, which asserts $q_1 = q_2 \Rightarrow x_1 = x_2$ (see Fig. 4b). From (32), it can be seen that under this constraint, $a_1 = a_2$, so the number of free variables in the likelihood equation reduces to 4, leading to a further simplification. The rest of the ML solution is orthogonal to the material discussed in this paper and can be found in [5].

We remark here that the choice of these parameters proved crucial in the derivation of the analytical solution. In former works [6], [7], [8], the defining parameters were the sequence probability variables themselves, and additional constraints were required to guarantee that they reside on a tree surface. This approach failed in this case, and as can be seen, using the path-set variables removes these constraints.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referee II who provided very helpful comments and suggestions.

REFERENCES

[1] E.S. Allman and J.A. Rhodes, Phylogenetic Invariants for the General Markov Model of Sequence Mutation, Math. Biosciences, vol. 18, pp , 00.
[2] E.S. Allman and J.A. Rhodes, Quartets and Parameter Recovery for the General Markov Model of Sequence Mutation, Applied Math. Research eXpress, vol. 4, pp , 2004.
[3] E.S. Allman and J.A. Rhodes, Phylogenetic Ideals and Varieties for the General Markov Model, Advances in Applied Math., vol. 40, no., pp , 2008.
[4] M. Casanellas and J. Fernández-Sánchez, Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees, Molecular Biology and Evolution, vol. 4, no. 1, pp. 88-9, 00.
[5] B. Chor, M.D. Hendy, and S. Snir, Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions, Molecular Biology and Evolution, vol., no., pp. -, 00.
[6] B. Chor, A. Khetan, and S. Snir, Maximum Likelihood on Four Taxa Phylogenetic Trees: Analytic Solutions, Proc. Seventh Ann. Int'l Conf. Computational Molecular Biology (RECOMB 0), pp. - 8, 00.
[7] B. Chor and S. Snir, Molecular Clock Fork Phylogenies: Closed Form Analytic Maximum Likelihood Solutions, Systematic Biology, vol. 5, no., pp. 9-9, 2004.
[8] B. Chor, M. Hendy, B. Holland, and D. Penny, Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach, Molecular Biology and Evolution, vol. 1, pp , 2000.
[9] S.N. Evans and T.P. Speed, Invariants of Some Probability Models Used in Phylogenetic Inference, Annals of Statistics, vol. 1, pp. 55-, 199.
[10] M.D. Hendy, The Relationship between Simple Evolutionary Tree Models and Observable Sequence Data, Systematic Zoology, vol. 8, pp. 10-1,
[11] M.D. Hendy, A Combinatorial Description of the Closest Tree Algorithm for Finding Evolutionary Trees, Discrete Math., vol. 9, pp ,
[12] M.D. Hendy, Hadamard Conjugation: An Analytic Tool for Phylogenetics, Math. of Evolution and Phylogeny, chapter, first ed., O. Gascuel, ed., pp. 14-1, Oxford Univ. Press, 2005.
[13] M.D. Hendy and D. Penny, A Framework for the Quantitative Study of Evolutionary Trees, Systematic Zoology, vol. 8, pp. 9-09,
[14] M.D. Hendy and D. Penny, Spectral Analysis of Phylogenetic Data, J. Classification, vol. 10, pp. 5-4, 199.
[15] M.D. Hendy and D. Penny, Complete Families of Linear Invariants for Some Stochastic Models of Sequence Evolution with and without the Molecular Clock Assumption, J. Computational Biology, vol., pp. 19-1, 199.
[16] M.D. Hendy, D. Penny, and M.A. Steel, A Discrete Fourier Analysis for Evolutionary Trees, Proc. Nat'l Academy of Sciences, vol. 91, pp. 9-4,
[17] B. Holland, D. Penny, and M. Hendy, Outgroup Misplacement and Phylogenetic Inaccuracy under a Molecular Clock: A Simulation Study, Systematic Biology, vol. 5, pp. 9-8, 00.
[18] K.T. Huber, M. Langton, D. Penny, V. Moulton, and M. Hendy, Spectronet: A Package for Computing Spectra and Median Networks, Applied Bioinformatics, vol. 1, pp , 00.
[19] T.H. Jukes and C.R. Cantor, Evolution of Protein Molecules, Mammalian Protein Metabolism III, H.N. Munro, ed., Academic Press, 199.
[20] M. Kimura, A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences, J. Molecular Evolution, vol. 1, pp ,
[21] M. Kimura, Estimation of Evolutionary Distances between Homologous Nucleotide Sequences, Proc. Nat'l Academy of Sciences, vol. 8, pp ,
[22] J.L. Neyman, Molecular Studies of Evolution: A Source of Novel Statistical Problems, Statistical Decision Theory and Related Topics, S.S. Gupta and J. Yackel, eds., Academic Press, 191.
[23] L. Pachter and B. Sturmfels, Algebraic Statistics for Computational Biology. Cambridge Univ. Press, 2005.
[24] L. Pachter and B. Sturmfels, The Mathematics of Phylogenomics, submitted for publication.
[25] M.A. Steel, M.D. Hendy, L.A. Székely, and P.L. Erdös, Spectral Analysis and a Closest Tree Method for Genetic Sequences, Applied Math. Letters, vol. 5, pp. -, 199.
[26] M.A. Steel, M.D. Hendy, and D. Penny, Reconstructing Phylogenies from Nucleotide Pattern Probabilities: A Survey and Some New Results, Discrete Applied Math., vol. 88, pp. -9,
[27] B. Sturmfels and S. Sullivant, Toric Ideals of Phylogenetic Invariants, J. Computational Biology, vol. 1, pp. 04-8, 2005.
[28] L. Székely, P.L. Erdös, M.A. Steel, and D. Penny, A Fourier Inversion Formula for Evolutionary Trees, Applied Math. Letters, vol., pp. 1-1, 199.
[29] L. Székely, M.A. Steel, and P.L. Erdös, Fourier Calculus on Evolutionary Trees, Advances in Applied Math., vol. 14, pp. 00-1, 199.
[30] P.J. Waddell and M.D. Hendy, Using Phylogenetic Invariants to Enhance Spectral Analysis of Nucleotide Sequence Data, Information and Math. Sciences Report Series B, Massey Univ., 199.


Michael D. Hendy received the PhD degree in algebraic number theory from the University of New England, NSW, Australia, in 19. He was with Massey University, where he began research into phylogenetics in collaboration with the molecular biologist David Penny. Since 199, he has been a personal chair in mathematical biology at Massey University, Palmerston North, New Zealand. He is also the executive director of the Allan Wilson Centre for Molecular Ecology and Evolution, one of the seven Centres of Research Excellence in New Zealand; the Allan Wilson Centre is a group of nearly 100 researchers in biology and mathematics across five New Zealand universities. In , he held a 10-month Mercator professorship in biomathematics at the University of Greifswald, Germany. He has published more than 100 research papers in the fields of algebraic number theory and phylogenetics.

Sagi Snir received the BA degree in computer science and economics from Bar Ilan University, Israel, and the MSc and PhD degrees in computer science from the Technion, Israel. He was with various information technology companies, including IBM Haifa Research Lab. He is currently a postdoctoral researcher at the University of California, Berkeley. His main research interest is computational biology and, in particular, phylogenetics.
