
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 5, NO. 3, JULY-SEPTEMBER 2008

Hadamard Conjugation for the Kimura 3ST Model: Combinatorial Proof Using Path Sets

Michael D. Hendy and Sagi Snir

Abstract: Under a stochastic model of molecular sequence evolution, the probability of each possible pattern of characters is well defined. Kimura's three-substitution-types (K3ST) model of evolution allows an analytical expression for these probabilities by means of the Hadamard conjugation, as a function of the phylogeny $T$ and the substitution probabilities on each edge of $T$. In this paper, we produce a direct combinatorial proof of these results using path-set distances, which generalize pairwise distances between sequences. This interpretation provides us with tools that have proved useful in related problems in the mathematical analysis of sequence evolution.

Index Terms: Hadamard conjugation, K3ST model, path sets, phylogenetic trees, phylogenetic invariants.

1 INTRODUCTION

Hadamard conjugation is an analytic formulation of the relationship between the probabilities of expected site patterns of nucleotides for a set of homologous nucleotide sequences and the parameters of some simple models of sequence evolution on a proposed phylogeny $T$. An important application of these relations is as a theoretical tool for analyzing properties of phylogenetic inference methods such as maximum likelihood and maximum parsimony, as well as for generating simulated data and determining phylogenetic invariants. Hadamard conjugation can also be used directly for phylogenetic inference, inferring either trees with the Closest Tree algorithm [11], [5] or networks using Spectronet [18]. Hadamard conjugation was applied to maximum likelihood phylogenetic inference under Kimura's three-substitution-types (K3ST) model in [5], and, in a related problem, phylogenetic invariants were used to reconstruct quartet trees under a generalized variant of K3ST [4].
Hadamard conjugation was first introduced in 1989 [10], [1] to analyze two-state character sequences evolving under the Neyman model []. Evans and Speed [9] noted that the K3ST model [1] for four-state characters could be modeled by the Klein group $\mathbb{Z}_2 \times \mathbb{Z}_2$. Noting this, Székely et al. [8], [9] extended the two-state analysis to a more general algebraic theory in which substitutions belong to an arbitrary Abelian group. They then applied this to sequences evolving under the K3ST model. Current applications of Closest Tree and Spectronet [18] usually use the four-state K3ST model or its derivatives, the K2ST and Jukes-Cantor models.

M.D. Hendy is with the Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Private Bag 11, Palmerston North 4410, New Zealand. E-mail: m.hendy@massey.ac.nz.
S. Snir is with the Mathematics Department, University of California, Berkeley, Berkeley, CA. E-mail: ssagi@math.berkeley.edu.
Manuscript received 5 Oct. 00; revised 0 Mar. 00; accepted 1 May 00; published online June 00.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org.

A path set in a phylogenetic tree $T$ is a generalization of the concept of a path. This approach allows the concept of pairwise distances between sequences to be extended to distances connecting larger sets of taxa. It provides properties that can be related to other evolutionary phenomena, such as the molecular clock hypothesis. This has, for example, proved pivotal in allowing a simpler analytic expression of the likelihood function, as developed in [5], leading to an algebraic solution for the maximum likelihood points. We demonstrate this use, as well as the relation to the molecular clock property, in our last section, which describes the application of the Hadamard conjugation used in [5].
It has also proved useful in identifying phylogenetic invariants [15], [], [4] and in introducing the projected spectra [0], which reduce both the variance in the parameter estimates and the computational complexity of the Closest Tree algorithm [11].

All the above examples rely on relationships between the phylogenetic tree and the probabilities of obtaining sequences evolved under that tree. These relationships were proved in the past by algebraic tools on more general models of evolution. However, for K3ST, these relationships can be expressed as identities between expressions in the tree parameters and expressions in the sequence probabilities. These relationships were outlined in []. Here, we provide a self-contained, more rigorous proof that bears some resemblance (Section 8) to the sketch in []. However, that outline lacks the details of the combinatorial properties of the intermediate variables, which we find to be of interest. Therefore, our proof serves as a more intuitive alternative to the presentations in [1] and [].

We model the relationship of the differences of $n$ sequences, labeled by elements of $[n] = \{1, 2, \ldots, n\}$, from a reference sequence labeled 0 (note that $0 \notin [n]$). Because the models are reversible, the choice of reference sequence is arbitrary. The topology of $T$ and the model parameters are presented in a sparse matrix $Q_T$ of $2^n$ rows and columns, called the edge-length spectrum. The probabilities of each site pattern are presented in a similarly sized matrix $S_T$, called the sequence probability spectrum. We also define a Hadamard matrix $H_n$ of $2^n$ rows and columns and show that the matrix products

$$H_n Q_T H_n, \qquad H_n S_T H_n,$$

both relate to properties of path sets. We prove the major result by interpreting corresponding components of each entry of these matrices. In particular, we show that the $(E, F)$ entry in both matrices corresponds to certain evolutionary distances defined by the path sets $\Pi_E$ and $\Pi_F$. We note that we were motivated to provide this new proof because these variables served as the defining parameters in the likelihood equation in [5], while their biological interpretation has not been elaborated sufficiently.

We start by describing the K3ST model over a single edge and then generalize it to a set of edges in a tree. Next, we introduce the notion of a site pattern and the matrix $S_T$. In Sections 5 and 6, we introduce the Hadamard matrices and path sets and the relationship between them. Section 8 is the main part of the proof, where we show the relationship between corresponding entries of the matrices $H_n Q_T H_n$ and $H_n S_T H_n$. We end by describing the derivation of the equations used in [5], leading to an analytical solution of the ML problem. We believe this is an important contribution that can serve the burgeoning area of algebraic statistics in biology, and in phylogenetics in particular (see, e.g., [1], [], [], [], [4], and []).

© 2008 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

2 KIMURA'S 3ST MODEL

In this section, we first describe the K3ST model. We then derive identities relating the substitution matrix $M$ and the matrix of expected numbers of substitutions along each edge. Finally, we encode these relationships by means of two simpler matrices $P$ and $Q$ and the Hadamard matrix $H_1$.

K3ST [1] specifies independent rates for each of the substitutions between pairs of RNA or DNA nucleotides. Here, we refer to Kimura's three substitution rates as $r_\alpha$, $r_\beta$, and $r_\gamma$, and use $\alpha$, $\beta$, and $\gamma$ to refer to the substitution types, as illustrated in Fig. 1. These are defined formally as

$\alpha$: the substitutions $A \leftrightarrow G$, $U(T) \leftrightarrow C$ (transitions);
$\beta$: the substitutions $A \leftrightarrow U(T)$, $G \leftrightarrow C$ (one type of transversion);
$\gamma$: the substitutions $A \leftrightarrow C$, $U(T) \leftrightarrow G$ (the other type of transversion).

Fig. 1. K3ST, showing the three substitution types $\alpha$, $\beta$, and $\gamma$, applied either to the RNA nucleotides A, C, G, and U or to the DNA nucleotides A, C, G, and T.

By including the identity transformation $\varepsilon$, we find that the set of substitution types

$$\mathcal{T} = \{\varepsilon, \alpha, \beta, \gamma\}$$

is a group under composition, acting on the nucleotide set $\{A, C, G, U(T)\}$. Thus, for example, $\alpha(\beta(C)) = \alpha(G) = A = \gamma(C)$, so $\alpha \circ \beta = \gamma$. Consider the maps $g_1, g_2 : \mathcal{T} \to \mathcal{C}_2 = \{1, -1\}$ defined by

$$g_1 : \varepsilon \mapsto 1,\ \alpha \mapsto 1,\ \beta \mapsto -1,\ \gamma \mapsto -1; \qquad g_2 : \varepsilon \mapsto 1,\ \alpha \mapsto -1,\ \beta \mapsto 1,\ \gamma \mapsto -1. \tag{1}$$

We find that $g_1$ and $g_2$ are both homomorphisms from $(\mathcal{T}, \circ)$ onto the 2-group $(\mathcal{C}_2, \times)$, and the map $g : \theta \mapsto (g_1(\theta), g_2(\theta))$, $\theta \in \mathcal{T}$, is an isomorphism onto the group $(\mathcal{C}_2 \times \mathcal{C}_2, \times)$.

Observation 1. $(\mathcal{T}, \circ)$ is isomorphic to the Klein 4-group, $(\mathcal{C}_2 \times \mathcal{C}_2, \times)$.

In contrast, the sets of substitution types of the K2ST model and of the Jukes-Cantor model do not form groups, as products are not well defined (for example, a product of two transversions in K2ST could be either a transition or the identity). A related but different aspect is the property of generalization/specialization between models. For example, we can specialize from K3ST down to each of these models by imposing restrictions on parameters (for example, if the expected numbers of transitions and of transversions of each type are equated, then K3ST specializes to the Jukes-Cantor model). A different restriction on the values of the model parameters is imposed by the molecular clock constraint; however, this is beyond the scope of this work (see, e.g., [1] and []).

Kimura modeled the expected differences between two sequences separated by time $t$.
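The group structure described above can be checked mechanically. The following is a minimal sketch, not part of the original derivation: it encodes each substitution type as a permutation of the RNA nucleotides and tests the composition rule, the homomorphism property of the two sign maps, and the Klein 4-group properties. The labels `eps`, `alpha`, `beta`, `gamma` and all function names are assumptions chosen for this illustration.

```python
from itertools import product

# Each substitution type as a permutation of the RNA nucleotides.
# The keys "eps", "alpha", "beta", "gamma" are labels assumed for this sketch.
TYPES = {
    "eps":   {"A": "A", "C": "C", "G": "G", "U": "U"},  # identity
    "alpha": {"A": "G", "G": "A", "U": "C", "C": "U"},  # transitions
    "beta":  {"A": "U", "U": "A", "G": "C", "C": "G"},  # one transversion type
    "gamma": {"A": "C", "C": "A", "U": "G", "G": "U"},  # other transversion type
}


def compose(s, t):
    """Label of the type obtained by applying t first and then s."""
    perm = {x: TYPES[s][TYPES[t][x]] for x in "ACGU"}
    for name, candidate in TYPES.items():
        if candidate == perm:
            return name
    raise ValueError("composition is not one of the four types")


# The two sign maps g_1 and g_2 on the four types.
G1 = {"eps": 1, "alpha": 1, "beta": -1, "gamma": -1}
G2 = {"eps": 1, "alpha": -1, "beta": 1, "gamma": -1}


def is_homomorphism(g):
    """True if g(s o t) == g(s) * g(t) for all pairs of types."""
    return all(g[compose(s, t)] == g[s] * g[t]
               for s, t in product(TYPES, repeat=2))


def is_klein_four_group():
    """Commutative, every element self-inverse, and (g_1, g_2) injective."""
    commutative = all(compose(s, t) == compose(t, s)
                      for s, t in product(TYPES, repeat=2))
    self_inverse = all(compose(s, s) == "eps" for s in TYPES)
    injective = len({(G1[s], G2[s]) for s in TYPES}) == 4
    return commutative and self_inverse and injective
```

Calling `compose("alpha", "beta")` reproduces the worked example in the text (a transition after a transversion of one type gives a transversion of the other type).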
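With the three specified rates, the expected numbers of substitutions of each type are therefore

$$q(\alpha) = r_\alpha t, \qquad q(\beta) = r_\beta t, \qquad q(\gamma) = r_\gamma t.$$

By setting $r_\beta = r_\gamma$, this model projects to K2ST, Kimura's better known two-substitution-type model [0]. Setting $r_\alpha = r_\beta = r_\gamma$ gives the simple Jukes-Cantor model [19]. The probabilities $p(\alpha)$, $p(\beta)$, and $p(\gamma)$ of observing differences of each type over the time period $t$ underestimate $q(\alpha)$, $q(\beta)$, and $q(\gamma)$, as multiple changes are not directly observed. Observed frequencies of differences can be taken as estimates of $p(\theta)$ for $\theta \in \{\alpha, \beta, \gamma\}$. In [1], Kimura derived expressions for the expected numbers $q(\theta)$ as functions of the probabilities $p(\theta)$. These are equivalent to the standard expression of the stochastic matrix $M$, derived from the rate matrix

$$R = \begin{bmatrix} -(r_\alpha + r_\beta + r_\gamma) & r_\alpha & r_\beta & r_\gamma \\ r_\alpha & -(r_\alpha + r_\beta + r_\gamma) & r_\gamma & r_\beta \\ r_\beta & r_\gamma & -(r_\alpha + r_\beta + r_\gamma) & r_\alpha \\ r_\gamma & r_\beta & r_\alpha & -(r_\alpha + r_\beta + r_\gamma) \end{bmatrix},$$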

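over time $t$, so that (e.g., see [])

$$M = \exp(Rt), \tag{2}$$

where

$$M = \begin{bmatrix} p(\varepsilon) & p(\alpha) & p(\beta) & p(\gamma) \\ p(\alpha) & p(\varepsilon) & p(\gamma) & p(\beta) \\ p(\beta) & p(\gamma) & p(\varepsilon) & p(\alpha) \\ p(\gamma) & p(\beta) & p(\alpha) & p(\varepsilon) \end{bmatrix}, \qquad Rt = \begin{bmatrix} -K & q(\alpha) & q(\beta) & q(\gamma) \\ q(\alpha) & -K & q(\gamma) & q(\beta) \\ q(\beta) & q(\gamma) & -K & q(\alpha) \\ q(\gamma) & q(\beta) & q(\alpha) & -K \end{bmatrix},$$

$K = q(\alpha) + q(\beta) + q(\gamma)$ is the expected total number of substitutions, and $\exp$ is the standard exponential function for square matrices. We note that as $R$ and $t$ are fixed, so too are $M$, $p$, $q$, and $K$, as they are defined by them.

Let $H_2$ be the $4 \times 4$ Hadamard matrix:

$$H_2 = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & -1 & 1 \end{bmatrix}.$$

Observation 2. $H_2$ diagonalizes both $M$ and $Rt$. In particular,

$$H_2^{-1} M H_2 = \mathrm{diag}\big(1,\ 1 - 2(p(\alpha) + p(\gamma)),\ 1 - 2(p(\beta) + p(\gamma)),\ 1 - 2(p(\alpha) + p(\beta))\big)$$

and

$$H_2^{-1} Rt\, H_2 = -2\,\mathrm{diag}\big(0,\ q(\alpha) + q(\gamma),\ q(\beta) + q(\gamma),\ q(\alpha) + q(\beta)\big).$$

Recall that the exponential of a matrix is a power series, so

$$\exp(H_2^{-1} Rt\, H_2) = \sum_{n \ge 0} \frac{(H_2^{-1} Rt\, H_2)^n}{n!} = \sum_{n \ge 0} \frac{H_2^{-1} (Rt)^n H_2}{n!} = H_2^{-1} \exp(Rt) H_2.$$

As $\exp(H_2^{-1} Rt\, H_2)$ is diagonal, so too is $H_2^{-1} \exp(Rt) H_2$, with entries

$$1 = e^0, \quad e^{-2(q(\alpha) + q(\gamma))}, \quad e^{-2(q(\beta) + q(\gamma))}, \quad e^{-2(q(\alpha) + q(\beta))}.$$

Now, using (2), we observe

$$H_2^{-1} M H_2 = H_2^{-1} (\exp(Rt)) H_2.$$

Equating the diagonal entries shows that the eigenvalues of $M$ and $\exp(Rt)$ are

$$\begin{aligned}
1 &= p(\varepsilon) + p(\alpha) + p(\beta) + p(\gamma) = e^0 = e^{-K + q(\alpha) + q(\beta) + q(\gamma)}, \\
1 - 2(p(\alpha) + p(\gamma)) &= p(\varepsilon) - p(\alpha) + p(\beta) - p(\gamma) = e^{-2(q(\alpha) + q(\gamma))} = e^{-K - q(\alpha) + q(\beta) - q(\gamma)}, \\
1 - 2(p(\beta) + p(\gamma)) &= p(\varepsilon) + p(\alpha) - p(\beta) - p(\gamma) = e^{-2(q(\beta) + q(\gamma))} = e^{-K + q(\alpha) - q(\beta) - q(\gamma)}, \\
1 - 2(p(\alpha) + p(\beta)) &= p(\varepsilon) - p(\alpha) - p(\beta) + p(\gamma) = e^{-2(q(\alpha) + q(\beta))} = e^{-K - q(\alpha) - q(\beta) + q(\gamma)}.
\end{aligned}$$

These equations can be succinctly expressed (see [1]) as

$$H_1 P H_1 = \mathrm{Exp}(H_1 Q H_1), \tag{3}$$

where

$$H_1 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad P = \begin{bmatrix} p(\varepsilon) & p(\alpha) \\ p(\beta) & p(\gamma) \end{bmatrix}, \qquad Q = \begin{bmatrix} -K & q(\alpha) \\ q(\beta) & q(\gamma) \end{bmatrix},$$

and $\mathrm{Exp}$ is the exponential function applied to each entry of the matrix. Equation (3) can be inverted (as the arguments of $\ln$ are all positive) to give

$$H_1 Q H_1 = \mathrm{Ln}(H_1 P H_1), \tag{4}$$

where $\mathrm{Ln}$ is the natural logarithm applied to each entry of a matrix. The invertibility of (3) and (4) means that, provided the parameters are in valid ranges, the model could be specified either by the three probabilities $p(\alpha)$, $p(\beta)$, and $p(\gamma)$ or by the three parameters $q(\alpha)$, $q(\beta)$, and $q(\gamma)$. This inversion does not rely on a rate/time specification and a Poisson process of substitution.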
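The single-edge relation between the substitution probabilities and the expected numbers of substitutions can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it builds $Rt$ for hypothetical values `qa`, `qb`, `qg` (standing for $q$ of the three substitution types), exponentiates it with a truncated power series, and compares $H_1 P H_1$ with the entrywise exponential of $H_1 Q H_1$. It assumes the arrangement of $P$ and $Q$ with the identity-type entry top left, the transition entry top right, and the two transversion entries on the bottom row. All function names are invented for this example.

```python
import math


def mat_mul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def mat_exp(A, terms=40):
    """Matrix exponential by a truncated power series (fine for small entries)."""
    n = len(A)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    return result


H1 = [[1, 1], [1, -1]]


def k3st_edge(qa, qb, qg):
    """Return (P, Q) for one edge with expected substitution numbers qa, qb, qg."""
    K = qa + qb + qg
    Rt = [[-K, qa, qb, qg],
          [qa, -K, qg, qb],
          [qb, qg, -K, qa],
          [qg, qb, qa, -K]]
    # First row of M = exp(Rt): probabilities of identity and the three types.
    p_eps, p_alpha, p_beta, p_gamma = mat_exp(Rt)[0]
    P = [[p_eps, p_alpha], [p_beta, p_gamma]]
    Q = [[-K, qa], [qb, qg]]
    return P, Q


def conjugate(X):
    """H_1 X H_1 for a 2 x 2 matrix X."""
    return mat_mul(mat_mul(H1, X), H1)


def recover_q(P):
    """Invert the relation: Q = H_1^{-1} Ln(H_1 P H_1) H_1^{-1}, H_1^{-1} = H_1 / 2."""
    logs = [[math.log(x) for x in row] for row in conjugate(P)]
    return [[x / 4 for x in row] for row in conjugate(logs)]
```

For example, `P, Q = k3st_edge(0.10, 0.05, 0.02)` followed by `recover_q(P)` should return a matrix numerically equal to `Q`, which illustrates that either parameter set specifies the model.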
Hence, we are able to test the validity of a constant rate model in analyzing observed data.

3 SUBSTITUTIONS ACROSS THE EDGES OF A TREE

We now extend the model of the previous section to handle sets of edges. We derive the probability of a substitution of type $\theta$ along a set of edges $W$ and record it in a stochastic matrix $M_W$. We also define the path length matrix $Q_W$ and, in a similar fashion to the previous section, obtain the relationship between $M_W$ and $Q_W$. Finally, we define the edge-length spectrum that records all the tree parameters.

Let $[n] = \{1, 2, \ldots, n\}$ and $[n]_0 = [n] \cup \{0\}$. Let $T$ be a tree (phylogeny) with leaf set $L(T) = [n]_0$ and edge set $e(T)$. For each edge $e \in e(T)$, we can postulate three independent Kimura probability parameters $p_e(\alpha)$, $p_e(\beta)$, and $p_e(\gamma)$. These are collected in a stochastic matrix:

$$M_e = \begin{bmatrix} p_e(\varepsilon) & p_e(\alpha) & p_e(\beta) & p_e(\gamma) \\ p_e(\alpha) & p_e(\varepsilon) & p_e(\gamma) & p_e(\beta) \\ p_e(\beta) & p_e(\gamma) & p_e(\varepsilon) & p_e(\alpha) \\ p_e(\gamma) & p_e(\beta) & p_e(\alpha) & p_e(\varepsilon) \end{bmatrix},$$

with eigenvalues $1$, $\exp(-2(q_e(\alpha) + q_e(\gamma)))$, $\exp(-2(q_e(\beta) + q_e(\gamma)))$, and $\exp(-2(q_e(\alpha) + q_e(\beta)))$. Let

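$$P_e = \begin{bmatrix} p_e(\varepsilon) & p_e(\alpha) \\ p_e(\beta) & p_e(\gamma) \end{bmatrix}, \qquad Q_e = \begin{bmatrix} -K_e & q_e(\alpha) \\ q_e(\beta) & q_e(\gamma) \end{bmatrix},$$

with $K_e = q_e(\alpha) + q_e(\beta) + q_e(\gamma)$; then by (3), we see that the probabilities $p_e(\alpha)$, $p_e(\beta)$, and $p_e(\gamma)$ are related to the three parameters $q_e(\alpha)$, $q_e(\beta)$, and $q_e(\gamma)$ as

$$H_1 P_e H_1 = \mathrm{Exp}(H_1 Q_e H_1).$$

As the matrices $M_e$, for $e \in e(T)$, are each diagonalized by $H_2$, they commute. Hence, for any subset of edges $W \subseteq e(T)$, we can formally define the product:

$$M_W = \prod_{e \in W} M_e = \begin{bmatrix} p_W(\varepsilon) & p_W(\alpha) & p_W(\beta) & p_W(\gamma) \\ p_W(\alpha) & p_W(\varepsilon) & p_W(\gamma) & p_W(\beta) \\ p_W(\beta) & p_W(\gamma) & p_W(\varepsilon) & p_W(\alpha) \\ p_W(\gamma) & p_W(\beta) & p_W(\alpha) & p_W(\varepsilon) \end{bmatrix}. \tag{5}$$

We note that the term $p_W(\theta)$ is the probability that the product of the substitutions on all the edges of $W$ is $\theta$. In particular, if $W$ is a path in $T$, then $p_W(\theta)$ is the probability that the states at the endpoints of the path differ by the substitution $\theta$. In addition, when $W = \{e\}$, we see $M_{\{e\}} = M_e$ and $p_{\{e\}}(\theta) = p_e(\theta)$. We see that $M_W$ is diagonalized by $H_2$:

$$H_2^{-1} M_W H_2 = \mathrm{diag}\big(1,\ 1 - 2(p_W(\alpha) + p_W(\gamma)),\ 1 - 2(p_W(\beta) + p_W(\gamma)),\ 1 - 2(p_W(\alpha) + p_W(\beta))\big),$$

as is each factor in (5), so

$$\begin{aligned}
H_2^{-1} M_W H_2 &= H_2^{-1} \Big( \prod_{e \in W} M_e \Big) H_2 = \prod_{e \in W} H_2^{-1} M_e H_2 \\
&= \prod_{e \in W} \mathrm{diag}\big(1,\ e^{-2(q_e(\alpha) + q_e(\gamma))},\ e^{-2(q_e(\beta) + q_e(\gamma))},\ e^{-2(q_e(\alpha) + q_e(\beta))}\big) \\
&= \mathrm{diag}\big(1,\ e^{-2(q_W(\alpha) + q_W(\gamma))},\ e^{-2(q_W(\beta) + q_W(\gamma))},\ e^{-2(q_W(\alpha) + q_W(\beta))}\big),
\end{aligned}$$

where

$$q_W(\alpha) = \sum_{e \in W} q_e(\alpha), \qquad q_W(\beta) = \sum_{e \in W} q_e(\beta), \qquad q_W(\gamma) = \sum_{e \in W} q_e(\gamma).$$

Thus, equating the diagonals, we have the following.

Observation 3.

$$\begin{aligned}
1 &= p_W(\varepsilon) + p_W(\alpha) + p_W(\beta) + p_W(\gamma) = \exp(0), \\
1 - 2(p_W(\alpha) + p_W(\gamma)) &= p_W(\varepsilon) - p_W(\alpha) + p_W(\beta) - p_W(\gamma) = \exp(-2(q_W(\alpha) + q_W(\gamma))), \\
1 - 2(p_W(\beta) + p_W(\gamma)) &= p_W(\varepsilon) + p_W(\alpha) - p_W(\beta) - p_W(\gamma) = \exp(-2(q_W(\beta) + q_W(\gamma))), \\
1 - 2(p_W(\alpha) + p_W(\beta)) &= p_W(\varepsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma) = \exp(-2(q_W(\alpha) + q_W(\beta))).
\end{aligned}$$

We now define the corresponding $P$ and $Q$ matrices for the edge set $W$:

$$P_W = \begin{bmatrix} p_W(\varepsilon) & p_W(\alpha) \\ p_W(\beta) & p_W(\gamma) \end{bmatrix}, \qquad Q_W = \begin{bmatrix} -K_W & q_W(\alpha) \\ q_W(\beta) & q_W(\gamma) \end{bmatrix} = \sum_{e \in W} Q_e,$$

where $K_W = q_W(\alpha) + q_W(\beta) + q_W(\gamma)$.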
The relationships of Observation 3, similar to (3), can now be expressed as

$$H_1 P_W H_1 = \mathrm{Exp}(H_1 Q_W H_1). \tag{6}$$

As the $Q_e$ matrices are additive over edge sets of $T$, we refer to the expected numbers $q_e(\alpha)$, $q_e(\beta)$, and $q_e(\gamma)$ as the three edge-length parameters for each edge $e \in e(T)$. We can thus specify our model by the set of $3|e(T)|$ independent edge-length parameters

$$\{q_e(\theta) : \theta \in \{\alpha, \beta, \gamma\},\ e \in e(T)\}.$$

Given $T$ and the $3|e(T)|$ edge-length parameters, we can model sequence evolution on $T$ under the K3ST model if we specify a sequence of nucleotides at one leaf and generate corresponding nucleotides at every other vertex according to the probabilities $p_e(\theta)$. We comment that the K3ST model induces a uniform base distribution under equilibrium. However, since our work deals with the probabilities along the edges, our derivation is indifferent to the base distribution.

Edge indexing. The deletion of an edge $e \in e(T)$ induces two subtrees, whose leaf label sets $A$, $A'$ (with $0 \in A'$) partition $L(T) = [n]_0$. Thus, $A$ is the set of leaves of $T$ separated from the reference leaf 0 by the edge $e$. We choose the subset $A \subseteq [n]$ (the subset not containing 0) to index $e$ as $e_A$. Thus, for $e = e_A \in e(T)$:

$$A = \{i \in [n] : e \in \Pi_{0i}\},$$

where $\Pi_{0i}$ is the path (in $T$) connecting leaves 0 and $i$. A partition of a set $X$ into two subsets $\{A, A'\}$ (so $A \cap A' = \emptyset$ and $A \cup A' = X$) is called a split of $X$. When $X = [n]_0$, we will identify each split $\{A, A'\}$ by the subset $A$ that does not contain 0 and, hence, we see that the set of splits of $[n]_0 = \{0, 1, \ldots, n\}$ is bijective with the set of subsets of $[n] = \{1, 2, \ldots, n\}$.

Now, for each $A \subseteq [n]$, we define the three values $(q_\alpha)_A$, $(q_\beta)_A$, and $(q_\gamma)_A$ by

$$(q_\theta)_A = \begin{cases} q_{e_A}(\theta) & \text{if } e_A \in e(T); \\ -K(\theta) = -\sum_{e_B \in e(T)} q_{e_B}(\theta) & \text{if } A = \emptyset; \\ 0 & \text{else}; \end{cases}$$

for $\theta = \alpha, \beta, \gamma$. We incorporate these values into three vectors $q_\alpha$, $q_\beta$, and $q_\gamma$, each of $2^n$ entries. We order the

components of the vectors by the subsets of $[n]$ as follows: $\emptyset, \{1\}, \{2\}, \{1, 2\}, \{3\}, \{1, 3\}, \{2, 3\}, \{1, 2, 3\}, \{4\}, \ldots, [n]$. As $(q_\theta)_\emptyset = -K(\theta)$, the sum of the components in each vector is 0. We will also find it convenient to incorporate the vectors into a $2^n \times 2^n$ matrix:

$$Q_T = [q_{A,B}]_{A,B \subseteq [n]},$$

where

$$q_{A,B} = \begin{cases} q_{e_A}(\beta) & \text{if } e_A \in e(T),\ B = \emptyset; \\ q_{e_B}(\alpha) & \text{if } A = \emptyset,\ e_B \in e(T); \\ q_{e_A}(\gamma) & \text{if } A = B,\ e_A \in e(T); \\ -K_T & \text{if } A = B = \emptyset; \\ 0 & \text{else}; \end{cases} \tag{7}$$

and

$$K_T = \sum_{e \in e(T)} (q_e(\alpha) + q_e(\beta) + q_e(\gamma)) = \sum_{e \in e(T)} K_e.$$

Thus, the leading row of $Q_T$ is $q_\alpha$, the leading column is $q_\beta$, and the leading diagonal is $q_\gamma$; all other entries are 0, apart from $q_{\emptyset,\emptyset} = -K_T$ (hence, the sum of all entries of $Q_T$ is 0). $Q_T$ is referred to as the edge-length spectrum for $T$. The positive entries of this spectrum identify the edges of $T$.

Fig. 2 shows an example of a tree $T$ on $n + 1 = 4$ taxa and its edge-length spectrum, both as three vectors and incorporated in the $8 \times 8$ matrix $Q_T$. Corresponding coordinates of the vectors $q_\alpha$, $q_\beta$, and $q_\gamma$ give the three edge-length parameters for the corresponding edge. The 0 value indicates that there is no corresponding edge in $T$. These vectors are placed in the leading row, column, and main diagonal of the matrix $Q_T$. This means that for $A, B \subseteq \{1, 2, 3\}$, $q_{\emptyset,B} = q_B(\alpha)$, $q_{A,\emptyset} = q_A(\beta)$, $q_{A,A} = q_A(\gamma)$, and for all other entries $q_{A,B} = 0$, except the first entry $q_{\emptyset,\emptyset} = -K$, where $K = K(\alpha) + K(\beta) + K(\gamma)$. The entries indicated by "." are all zero and are zero for every tree. The entries indicated by 0 are zero for the topology of $T$, signifying that the splits represented by them are not part of $T$. Different topologies can have positive values for these entries. The nonzero entries (in the leading row, column, and main diagonal) should each be in the same coordinates, as they identify the edge splits of $T$. For general trees on $n + 1$ taxa, the edge-length spectra are vectors of $2^n$ entries and square matrices of order $2^n$.

Fig. 2. Example: The edge-length spectrum of the tree $T$.
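The construction of the edge-length spectrum can be sketched in a few lines. This is a minimal illustration, assuming subsets of $[n]$ are encoded as bitmasks (leaf $i$ sets bit $i - 1$), so that integer order coincides with the subset ordering described above. It also assumes the layout in which one transversion-type length sits in the leading column, the transition length in the leading row, and the remaining transversion-type length on the diagonal. The tree and its edge lengths are hypothetical values chosen for the example, and the function names are not from the paper.

```python
def mask(*leaves):
    """Encode a subset of [n] as a bitmask: leaf i sets bit i - 1."""
    m = 0
    for i in leaves:
        m |= 1 << (i - 1)
    return m


def edge_length_spectrum(n, edges):
    """Build Q_T from {split bitmask A: (q_alpha, q_beta, q_gamma)}."""
    size = 1 << n
    Q = [[0.0] * size for _ in range(size)]
    for A, (qa, qb, qg) in edges.items():
        Q[0][A] = qa               # leading row holds the alpha lengths
        Q[A][0] = qb               # leading column holds the beta lengths
        Q[A][A] = qg               # main diagonal holds the gamma lengths
        Q[0][0] -= qa + qb + qg    # accumulates -K_T in the first entry
    return Q


# Hypothetical tree on taxa {0, 1, 2, 3}: four pendant edges and one internal
# edge separating {1, 2} from {0, 3}. The edge of leaf 0 is indexed by {1, 2, 3}.
EDGES = {
    mask(1): (0.10, 0.05, 0.02),
    mask(2): (0.08, 0.04, 0.01),
    mask(3): (0.12, 0.06, 0.03),
    mask(1, 2): (0.03, 0.02, 0.01),
    mask(1, 2, 3): (0.07, 0.03, 0.02),
}
Q_T = edge_length_spectrum(3, EDGES)
```

Rows and columns of `Q_T` for splits that are not edges of the tree, such as `mask(1, 3)`, stay zero, which is how the spectrum records the topology.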
4 SITE PATTERNS

In this section, we introduce the notion of a character. We also define the notion of a site pattern and show that each site pattern is identified by an ordered pair of splits $(C, D)$, with $C, D \subseteq [n]$, and that every character can be recovered from the site pattern and the state at taxon 0. This leads to the definition of the sequence probability spectrum that records the probability of obtaining every site pattern.

When we propose a sequence of nucleotides (see footnote 1) at leaf 0 and an edge-length spectrum $Q_T$ on a phylogeny $T$ with leaf set $L = [n]_0$, we can generate homologous sequences at each of the other leaves of $T$ under this stochastic model. A common position in each of these sequences is called a site. An assignment of nucleotides at a given site is called a character. Specifically, $\chi : L \to \{A, C, G, T\}$ assigns a nucleotide to each leaf, with $\chi(i)$ the character state at leaf $i$. This assignment partitions $L$ into subsets $L_A$, $L_C$, $L_G$, and $L_T$, where for $X \in \{A, C, G, T\}$,

$$L_X = \{i \in L : \chi(i) = X\}.$$

Given the character $\chi$, we define the character substitution map $\bar{\chi} : L \to \mathcal{T}$ such that

$$\bar{\chi}(i)(\chi(0)) = \chi(i),$$

and a pair of sets $C(\chi), D(\chi) \subseteq [n]$, where

$$C(\chi) = \{i \in [n] : \bar{\chi}(i) \in \{\beta, \gamma\}\}, \qquad D(\chi) = \{j \in [n] : \bar{\chi}(j) \in \{\alpha, \gamma\}\}.$$

The pair of subsets $(C, D) = (C(\chi), D(\chi))$ is called the site pattern for $\chi$. Given the site pattern $(C, D)$ and the character state $\chi(0)$ at the reference leaf 0, we can recover $\chi$. For example, if $i \in D - C$ and $\chi(0) = G$, then $\bar{\chi}(i) = \alpha$, so $\chi(i) = \alpha(G) = A$. There are four characters (depending on the state of $\chi(0)$) that correspond to the same site pattern $(C, D)$. Under equilibrium and by the symmetries of the K3ST model, each has the same probability of being generated on $T$ by this model. However, the transition matrices at the tree edges do not depend on this.

Let $s_{C,D}$ be the probability of obtaining the site pattern $(C, D)$ (recall that the site pattern $(C, D)$ is obtained from four characters, as $\chi(0)$ takes each character value).
We now define the $2^n \times 2^n$ matrix $S_T$, the sequence probability spectrum, with rows and columns indexed by the subsets of $[n]$:

$$S_T = [s_{C,D}]_{C,D \subseteq [n]}.$$

1. The assumption about nonuniform base frequency holds here as well.
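The passage from a character to its site pattern, and back, can be illustrated with a short sketch. It assumes the convention that the first set of the pattern collects the leaves differing from the reference by either transversion type and the second collects the leaves differing by a transition or by the second transversion type. The type labels and function names are hypothetical, introduced only for this example.

```python
# Substitution types as nucleotide pairings over the DNA alphabet.
# The labels "alpha", "beta", "gamma", "eps" are assumed for this sketch.
PAIRING = {
    "alpha": {"A": "G", "G": "A", "T": "C", "C": "T"},  # transitions
    "beta":  {"A": "T", "T": "A", "G": "C", "C": "G"},  # one transversion type
    "gamma": {"A": "C", "C": "A", "T": "G", "G": "T"},  # other transversion type
}


def substitution_type(x, y):
    """Label of the type that maps nucleotide x to nucleotide y."""
    if x == y:
        return "eps"
    for name, perm in PAIRING.items():
        if perm.get(x) == y:
            return name
    raise ValueError(f"unknown nucleotides: {x!r}, {y!r}")


def site_pattern(character):
    """character[i] is the state at leaf i; leaf 0 is the reference."""
    C, D = set(), set()
    for i in range(1, len(character)):
        t = substitution_type(character[0], character[i])
        if t in ("beta", "gamma"):
            C.add(i)
        if t in ("alpha", "gamma"):
            D.add(i)
    return C, D


def recover_character(ref, C, D, n):
    """Rebuild the character from the reference state and the pattern (C, D)."""
    states = [ref]
    for i in range(1, n + 1):
        if i in C and i in D:
            t = "gamma"
        elif i in C:
            t = "beta"
        elif i in D:
            t = "alpha"
        else:
            t = "eps"
        states.append(ref if t == "eps" else PAIRING[t][ref])
    return "".join(states)
```

For instance, `site_pattern("GA")` places leaf 1 in the second set only, matching the example in the text where the reference state G and a transition give A. The four characters obtained from `recover_character` with each reference state share the same pattern.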

The main theorem of this paper (Theorem 10) links the probability $s_{C,D}$, for each $C, D \subseteq [n]$, and the edge-length parameters $q_e(\theta)$, $e \in e(T)$, $\theta \in \{\alpha, \beta, \gamma\}$. We will derive explicit formulas for $s_{C,D}$ as a function of the edge-length parameters.

5 HADAMARD MATRICES

We define recursively the family $\{H_n : n \in \mathbb{Z}^+\}$ (known as Sylvester matrices), where for $n \ge 2$,

$$H_n = H_1 \otimes H_{n-1} = \begin{bmatrix} H_{n-1} & H_{n-1} \\ H_{n-1} & -H_{n-1} \end{bmatrix}$$

is a symmetric Hadamard matrix of order $2^n$, with $H_1$ and $H_2$ as previously defined. It is easily seen that $H_n^{-1} = 2^{-n} H_n$. It is known [10] that if we index the rows and columns of $H_n$ by the subsets of $[n]$, then, for $A, B \subseteq [n]$, we have the following observation.

Observation 4.

$$[H_n]_{A,B} = h(A, B) = (-1)^{|A \cap B|} = h(B, A).$$

Further, for $B, C \subseteq [n]$, we write their symmetric difference as $B \triangle C$ $(= (B \cup C) - (B \cap C))$, and we see

$$(-1)^{|A \cap (B \triangle C)|} = (-1)^{|A \cap B| + |A \cap C| - 2|A \cap (B \cap C)|} = (-1)^{|A \cap B|} (-1)^{|A \cap C|}.$$

Hence, we have the following.

Observation 5.

$$h(A, B \triangle C) = h(A, B)\, h(A, C).$$

6 PATH SETS

In this section, we show how to decompose a set of paths connecting an even number of leaves into a set of edge-disjoint paths. We denote the latter as a path set. We then generalize the edge length into path-set distances with respect to each substitution type $\theta \in \{\alpha, \beta, \gamma\}$.

For any $i, j \in L(T) = [n]_0$, we define the path $\Pi_{ij}$ to be the set of edges in $T$ connecting leaves $i$ and $j$. In particular, we note

$$\Pi_{0i} = \{e_A \in e(T) : i \in A\}.$$

$\Pi_{ij}$ is obtained by deleting the common edges of $\Pi_{0i}$ and $\Pi_{0j}$ from their union, so

$$\Pi_{ij} = \Pi_{0i} \triangle \Pi_{0j} = \{e_A \in e(T) : |A \cap \{i, j\}| = 1\} = \{e_A \in e(T) : h(A, \{i, j\}) = -1\}.$$

For any $E \subseteq [n]$, let

$$\Pi_E = \{e_A \in e(T) : h(A, E) = -1\},$$

so, in particular, for $i, j \in [n]$, we see

$$\Pi_{\{i,j\}} = \Pi_{ij}, \qquad \Pi_{\{i\}} = \Pi_{0i}, \qquad \Pi_\emptyset = \emptyset.$$

Observation 6. In [14], it is shown that $\Pi_E$ is a collection of edge-disjoint paths, with end-point set $E$ or $E \cup \{0\}$.

$\Pi_E$ is called a path set. Figs. 3a and 3b show two paths, while Fig. 3c shows the path set induced by the set $\{0, 2, 3, 4\}$. By similar arguments to the discussion above, we find the following.

Fig. 3. (a) A path from leaf 0 (dashed line). (b) A path to leaf 4 (dashed line). (c) The path set with end-point set $\{0, 2, 3, 4\}$ (dashed lines). Note that, in each case, $\Pi_E = \{e_A \in e(T) : |A \cap E| \text{ is odd}\}$. The union of the two paths in (a) and (b) can be partitioned into their common edges $W$ and the remaining edges $U$ and $V$ of each path, as in (21).

Observation 7. The set of path sets is a group (under symmetric difference) isomorphic to $\mathcal{C}_2^n$. In particular,

$$\Pi_{E \triangle F} = \Pi_E \triangle \Pi_F.$$

The sum of edge lengths on a path connecting two leaves can naturally be thought of as the distance between the leaves. We extend this distance concept, for each substitution type $\theta \in \{\alpha, \beta, \gamma\}$, to path sets. We define the path-set distance of the path set $\Pi_E$ to be the sum of the corresponding edge lengths of each edge of the path set, that is,

$$d_E(\theta) = \sum_{e_A \in \Pi_E} q_A(\theta). \tag{8}$$

7 SITE PATTERNS AND THE HADAMARD MATRIX

Here, for the sake of the explanation, we extend the notion of a character to conceptually assign values to the internal vertices of $T$. This allows us to extend the notion of the substitution function to the context of edges and, subsequently, to path sets.

Suppose we are given $\chi(0)$, the character state at leaf 0, and assign a transformation $\bar{\chi}(v) \in \mathcal{T}$ to each vertex $v$ of $T$, such that the character state at $v$ is $\chi(v) = \bar{\chi}(v)(\chi(0))$. (In particular, note that $\bar{\chi}(0) = \varepsilon$, the identity.) If we restrict

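ourselves to the set of leaves $[n]$, then we find that the consequent site pattern is $(C, D) = (C(\chi), D(\chi))$, where

$$C(\chi) = \{i \in [n] : \bar{\chi}(i) \in \{\beta, \gamma\}\}, \qquad D(\chi) = \{j \in [n] : \bar{\chi}(j) \in \{\alpha, \gamma\}\},$$

are subsets of $[n]$. Further, for each edge $e = (u, v)$, the transformation across $e$ is

$$\bar{\chi}(e) = \bar{\chi}(u)^{-1} \bar{\chi}(v) = \bar{\chi}(u) \bar{\chi}(v)$$

(as $(\mathcal{T}, \circ)$ is a Boolean group). For the path $\Pi_{i,j}$ connecting leaves $i$ and $j$, we find

$$\prod_{e \in \Pi_{i,j}} \bar{\chi}(e) = \prod_{e = (u,v) \in \Pi_{i,j}} \bar{\chi}(u)^{-1} \bar{\chi}(v) = \bar{\chi}(i)^{-1} \bar{\chi}(j) = \bar{\chi}(i) \bar{\chi}(j) \tag{9}$$

(as the products at each internal vertex cancel, and $\theta^{-1} = \theta$). We extend this to any path set $\Pi_E$ ($E \subseteq [n]$) and define

$$\bar{\chi}(E) = \prod_{e \in \Pi_E} \bar{\chi}(e).$$

Hence, as in (9), the products at all internal vertices cancel, so we find the following.

Observation 8.

$$\bar{\chi}(E) = \prod_{i \in E} \bar{\chi}(i). \tag{10}$$

Consider now a character $\chi$, inducing a site pattern $(C, D) = (C(\chi), D(\chi))$. Recall from (1) the homomorphisms $g_1, g_2 : \mathcal{T} \to \{1, -1\}$, where

$$g_1(\theta) = 1 \iff \theta \in \{\varepsilon, \alpha\} \qquad \text{and} \qquad g_2(\theta) = 1 \iff \theta \in \{\varepsilon, \beta\}.$$

Then, for $j = 1, 2$,

$$g_j(\bar{\chi}(E)) = g_j\Big( \prod_{i \in E} \bar{\chi}(i) \Big) = \prod_{i \in E} g_j(\bar{\chi}(i)). \tag{11}$$

Then, $g_j(\bar{\chi}(E)) = 1$ exactly when the number of factors $g_j(\bar{\chi}(i)) = -1$ in the product (11) is even. Now, for $i \in E$,

$$g_1(\bar{\chi}(i)) = -1 \iff \bar{\chi}(i) \in \{\beta, \gamma\} \iff i \in C,$$

so

$$g_1(\bar{\chi}(E)) = 1 \iff |C \cap E| \equiv 0 \pmod 2 \iff h(C, E) = 1.$$

However, as $g_1(\bar{\chi}(E)) = 1 \iff \bar{\chi}(E) \in \{\varepsilon, \alpha\}$, we find

$$\bar{\chi}(E) \in \{\varepsilon, \alpha\} \iff h(C, E) = 1. \tag{12}$$

Similarly, we find

$$\bar{\chi}(E) \in \{\varepsilon, \beta\} \iff g_2(\bar{\chi}(E)) = 1 \iff |D \cap E| \equiv 0 \pmod 2 \iff h(D, E) = 1. \tag{13}$$

Hence, we have shown the following.

Observation 9. Given a character $\chi$ inducing a site pattern $(C, D) = (C(\chi), D(\chi))$ and a path set $\Pi_E$ with $\bar{\chi}(E) = \prod_{i \in E} \bar{\chi}(i)$,

$$h(C, E) = g_1(\bar{\chi}(E)), \qquad h(D, E) = g_2(\bar{\chi}(E)). \tag{14}$$

8 HADAMARD CONJUGATION

This is the final section of the derivation, in which we prove the main theorem of this paper. We start with the right-hand side $H_n Q_T H_n$ and show that an entry in that matrix corresponds to a path-set distance. We decompose these possibly overlapping distances into three disjoint edge sets, use previous identities to derive probabilities of substitutions along each of these edge sets, and then recombine them to the original path sets.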
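The index identities used so far (the entries of the Sylvester matrix, the multiplicativity over symmetric differences, and the path-set definition) lend themselves to a quick mechanical check. The sketch below assumes subsets of $[n]$ are encoded as bitmasks, so that symmetric difference is bitwise XOR; the example tree and the function names are hypothetical.

```python
def sylvester(n):
    """Sylvester Hadamard matrix of order 2**n, built by repeated doubling."""
    H = [[1]]
    for _ in range(n):
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H


def h(A, B):
    """(-1)**|A & B| for subsets of [n] encoded as bitmasks."""
    return -1 if bin(A & B).count("1") % 2 else 1


def path_set(E, edge_masks):
    """Edges e_A of the tree with h(A, E) == -1, i.e. |A & E| odd."""
    return {A for A in edge_masks if h(A, E) == -1}


# Hypothetical tree on taxa {0, 1, 2, 3}: pendant edges of leaves 1, 2, 3,
# an internal edge for the split {1, 2}, and the pendant edge of leaf 0,
# which is indexed by {1, 2, 3}.
EDGE_MASKS = [0b001, 0b010, 0b100, 0b011, 0b111]
```

With this tree, `path_set(0b001, EDGE_MASKS)` returns the three edges on the path from leaf 0 to leaf 1, and `path_set(0b011, EDGE_MASKS)` returns the two pendant edges joining leaves 1 and 2. Because symmetric difference is XOR on bitmasks, the group property of path sets reduces to `path_set(E ^ F, ...) == path_set(E, ...) ^ path_set(F, ...)`.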
$Q_T$ is the matrix containing the edge-length parameters across $T$. $S_T$ is the matrix of probabilities of patterns at the leaves of $T$. The link between them is given by the rotations $H_n S_T H_n$ and $H_n Q_T H_n$. These both relate to path-set properties and enable us to state our major result.

Theorem 10.

$$S_T = H_n^{-1} \big( \mathrm{Exp}(H_n Q_T H_n) \big) H_n^{-1}, \tag{15}$$

which, provided the arguments of the logarithm are positive, is invertible and gives

$$Q_T = H_n^{-1} \big( \mathrm{Ln}(H_n S_T H_n) \big) H_n^{-1}. \tag{16}$$

Proof. The proof of this theorem is based on interpreting the corresponding components, for $E, F \subseteq [n]$,

$$[H_n S_T H_n]_{E,F} \qquad \text{and} \qquad [H_n Q_T H_n]_{E,F}.$$

Expanding the second term, we find

$$[H_n Q_T H_n]_{E,F} = \sum_{A,B \subseteq [n]} h(A, E)\, h(B, F)\, q_{A,B} = q_{\emptyset,\emptyset} + \sum_{e_A \in e(T)} \big( h(A, E)\, q_{A,\emptyset} + h(A, F)\, q_{\emptyset,A} + h(A, E)\, h(A, F)\, q_{A,A} \big), \tag{17}$$

as the only nonzero entries in $Q_T$ are $q_{A,\emptyset}$, $q_{\emptyset,A}$, $q_{A,A}$ for $e_A \in e(T)$, and $q_{\emptyset,\emptyset}$. Recall

$$q_{\emptyset,\emptyset} = -\sum_{e_A \in e(T)} (q_{A,\emptyset} + q_{\emptyset,A} + q_{A,A}),$$

and (Observation 5)

$$h(A, E)\, h(A, F) = h(A, E \triangle F);$$

hence, the RHS of (17) can be written as

$$\sum_{e_A \in e(T)} \big( (h(A, E) - 1)\, q_{A,\emptyset} + (h(A, F) - 1)\, q_{\emptyset,A} + (h(A, E \triangle F) - 1)\, q_{A,A} \big). \tag{18}$$

Now, as the terms with $h(A, E) = 1$ cancel, and by the definition of $\Pi_E$, we can write (18) as

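$$[H_n Q_T H_n]_{E,F} = -2 \Big( \sum_{e_A \in \Pi_E} q_{A,\emptyset} + \sum_{e_A \in \Pi_F} q_{\emptyset,A} + \sum_{e_A \in \Pi_{E \triangle F}} q_{A,A} \Big).$$

By Observation 7,

$$\Pi_{E \triangle F} = \Pi_E \triangle \Pi_F,$$

and by the definition of the matrix $Q_T$,

$$q_{A,\emptyset} = q_{e_A}(\beta), \qquad q_{\emptyset,A} = q_{e_A}(\alpha), \qquad q_{A,A} = q_{e_A}(\gamma),$$

so

$$\sum_{e_A \in \Pi_E} q_{A,\emptyset} = d_E(\beta), \qquad \sum_{e_A \in \Pi_F} q_{\emptyset,A} = d_F(\alpha), \qquad \sum_{e_A \in \Pi_{E \triangle F}} q_{A,A} = d_{E \triangle F}(\gamma). \tag{19}$$

Hence,

$$[H_n Q_T H_n]_{E,F} = -2 \big( d_E(\beta) + d_F(\alpha) + d_{E \triangle F}(\gamma) \big). \tag{20}$$

We can partition $\Pi_E \cup \Pi_F$ into three parts

$$U = \Pi_E - \Pi_F, \qquad V = \Pi_F - \Pi_E, \qquad W = \Pi_E \cap \Pi_F, \tag{21}$$

as illustrated in Fig. 3c, with the path sets partitioned as

$$\Pi_E = U \cup W, \qquad \Pi_F = V \cup W, \qquad \Pi_{E \triangle F} = \Pi_E \triangle \Pi_F = U \cup V.$$

The path-set distances split into corresponding summands, as $d_U(\theta) = \sum_{e \in U} q_e(\theta)$, etc., so that

$$d_E(\beta) = d_U(\beta) + d_W(\beta), \qquad d_F(\alpha) = d_V(\alpha) + d_W(\alpha), \qquad d_{E \triangle F}(\gamma) = d_U(\gamma) + d_V(\gamma).$$

Thus,

$$\begin{aligned}
[H_n Q_T H_n]_{E,F} &= -2 [d_U(\beta) + d_W(\beta) + d_V(\alpha) + d_W(\alpha) + d_U(\gamma) + d_V(\gamma)] \\
&= -2 [d_U(\beta) + d_U(\gamma)] - 2 [d_V(\alpha) + d_V(\gamma)] - 2 [d_W(\alpha) + d_W(\beta)],
\end{aligned}$$

and

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = e^{-2[d_U(\beta) + d_U(\gamma)]}\, e^{-2[d_V(\alpha) + d_V(\gamma)]}\, e^{-2[d_W(\alpha) + d_W(\beta)]}.$$

Now, by Observation 3,

$$\begin{aligned}
e^{-2[d_U(\beta) + d_U(\gamma)]} &= p_U(\varepsilon) + p_U(\alpha) - p_U(\beta) - p_U(\gamma) = \sum_{\theta \in \mathcal{T}} g_1(\theta)\, p_U(\theta), \\
e^{-2[d_V(\alpha) + d_V(\gamma)]} &= p_V(\varepsilon) - p_V(\alpha) + p_V(\beta) - p_V(\gamma) = \sum_{\theta \in \mathcal{T}} g_2(\theta)\, p_V(\theta), \\
e^{-2[d_W(\alpha) + d_W(\beta)]} &= p_W(\varepsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma) = \sum_{\theta \in \mathcal{T}} g_1(\theta)\, g_2(\theta)\, p_W(\theta).
\end{aligned} \tag{22}$$

Hence, using (22), this becomes

$$\begin{aligned}
[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} &= [p_U(\varepsilon) + p_U(\alpha) - p_U(\beta) - p_U(\gamma)] \times [p_V(\varepsilon) - p_V(\alpha) + p_V(\beta) - p_V(\gamma)] \times [p_W(\varepsilon) - p_W(\alpha) - p_W(\beta) + p_W(\gamma)] \\
&= \Big[ \sum_{\theta \in \mathcal{T}} g_1(\theta)\, p_U(\theta) \Big] \Big[ \sum_{\phi \in \mathcal{T}} g_2(\phi)\, p_V(\phi) \Big] \Big[ \sum_{\psi \in \mathcal{T}} g_1(\psi)\, g_2(\psi)\, p_W(\psi) \Big] \\
&= \sum_{\theta, \phi, \psi \in \mathcal{T}} g_1(\theta\psi)\, g_2(\phi\psi)\, p_U(\theta)\, p_V(\phi)\, p_W(\psi) \quad \text{(expanding the products)} \\
&= \sum_{\lambda, \mu, \psi \in \mathcal{T}} g_1(\lambda)\, g_2(\mu)\, p_U(\lambda\psi)\, p_V(\mu\psi)\, p_W(\psi) \quad \text{(with } \lambda = \theta\psi \text{ and } \mu = \phi\psi\text{)} \\
&= \sum_{\lambda, \mu \in \mathcal{T}} g_1(\lambda)\, g_2(\mu) \Big[ \sum_{\psi \in \mathcal{T}} p_U(\lambda\psi)\, p_V(\mu\psi)\, p_W(\psi) \Big].
\end{aligned} \tag{23}$$

Now, as $\Pi_E$ and $\Pi_F$ are partitioned as $U \cup W$ and $V \cup W$, respectively, then $\bar{\chi}(E) = \bar{\chi}(U)\bar{\chi}(W)$ and $\bar{\chi}(F) = \bar{\chi}(V)\bar{\chi}(W)$. Hence,

$$\sum_{\psi \in \mathcal{T}} p_U(\lambda\psi)\, p_V(\mu\psi)\, p_W(\psi) = \Pr[\bar{\chi}(E) = \lambda \wedge \bar{\chi}(F) = \mu],$$

which is the joint probability that the product of substitutions across the edges of $\Pi_E$ is $\lambda$ and the product across the edges of $\Pi_F$ is $\mu$. Thus, (23) implies

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = \sum_{\lambda, \mu \in \mathcal{T}} g_1(\lambda)\, g_2(\mu)\, \Pr[\bar{\chi}(E) = \lambda \wedge \bar{\chi}(F) = \mu]. \tag{24}$$

By Observation 9, given a character $\chi$ inducing a site pattern $(C, D) = (C(\chi), D(\chi))$, we have

$$g_1(\bar{\chi}(E)) = h(C, E), \qquad g_2(\bar{\chi}(F)) = h(D, F), \tag{25}$$

that is, under $\chi$,

$$\bar{\chi}(E) \in \{\varepsilon, \alpha\} \iff h(C, E) = 1 \qquad \text{and} \qquad \bar{\chi}(F) \in \{\varepsilon, \beta\} \iff h(D, F) = 1.$$

Summing the probabilities $s_{C,D}$ for which both $h(C, E) = 1$ and $h(D, F) = 1$, we find

$$\Pr[\bar{\chi}(E) \in \{\varepsilon, \alpha\} \wedge \bar{\chi}(F) \in \{\varepsilon, \beta\}] = \sum_{C,D \subseteq [n] :\, h(C,E) = 1,\, h(D,F) = 1} s_{C,D}.$$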

Similarly, we see

$$\begin{aligned}
\Pr[\bar{\chi}(E) \in \{\varepsilon, \alpha\} \wedge \bar{\chi}(F) \in \{\alpha, \gamma\}] &= \sum_{C,D \subseteq [n] :\, h(C,E) = 1,\, h(D,F) = -1} s_{C,D}, \\
\Pr[\bar{\chi}(E) \in \{\beta, \gamma\} \wedge \bar{\chi}(F) \in \{\varepsilon, \beta\}] &= \sum_{C,D \subseteq [n] :\, h(C,E) = -1,\, h(D,F) = 1} s_{C,D}, \\
\Pr[\bar{\chi}(E) \in \{\beta, \gamma\} \wedge \bar{\chi}(F) \in \{\alpha, \gamma\}] &= \sum_{C,D \subseteq [n] :\, h(C,E) = -1,\, h(D,F) = -1} s_{C,D}.
\end{aligned}$$

Substituting these into (24), we obtain by Observation 9

$$[\mathrm{Exp}(H_n Q_T H_n)]_{E,F} = \sum_{C,D \subseteq [n]} h(C, E)\, h(D, F)\, s_{C,D} = [H_n S_T H_n]_{E,F},$$

giving

$$\mathrm{Exp}(H_n Q_T H_n) = H_n S_T H_n, \tag{26}$$

from which (15) and (16) follow. □

9 APPLYING THE HADAMARD CONJUGATION

In this section, we provide an example application of the Hadamard conjugation to a real biological problem. We use the application reported in [5] of obtaining an analytical solution for the maximum likelihood problem of a phylogenetic reconstruction.

As was shown above, a tree $T$ uniquely determines the sequence spectrum $S = S_T$. In real life, however, we do not find such a perfect $S$. Given a set of aligned input sequences, every column induces a site pattern. The matrix $\hat{S} = [\hat{s}_{C,D}]$, called the observed sequence spectrum, records the frequency of each site pattern $(C, D)$. For a tree $T$, the likelihood function is defined as

$$L(T) = \prod_{C,D \subseteq [n]} s_{C,D}^{\hat{s}_{C,D}}, \tag{27}$$

where $s_{C,D}$ comes from (15). That is, (27) expresses the probability of seeing $\hat{S}$ given $T$. The maximum likelihood problem is to find a tree such that the probability of obtaining the given data $\hat{S}$ is maximized.

In [5], the ML problem of a triplet tree under the Jukes-Cantor model and the molecular clock hypothesis was studied (see Fig. 4). The Jukes-Cantor model of evolution [19] is the simplest model for four-state DNA evolution. The assumption in this model is that when a base changes, it has equal probabilities of changing to each of the other three bases. This model can be derived from the more general K3ST model by setting, for each edge of $T$, each of the three edge-length parameters equal to a common value, namely, setting $q_e(\alpha) = q_e(\beta) = q_e(\gamma) = q_e$.
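The main identity and the likelihood can be illustrated numerically on the smallest nontrivial case. The sketch below is an assumption-laden illustration rather than the procedure of [5]: it takes a star tree on the taxa 0, 1, 2 with hypothetical edge lengths, computes the sequence spectrum by Hadamard conjugation of the edge-length spectrum, and compares it with a brute-force sum over the substitution types on the three edges. Subsets are bitmasks; a substitution type is coded as `2*c + d`, where `c` and `d` say whether the type belongs to the row set and the column set of the site pattern. The per-edge probabilities are themselves obtained from the single-edge relation, so the check concerns only the tree-level statement. All function names are invented here.

```python
import math
from itertools import product


def sylvester(n):
    """Sylvester Hadamard matrix of order 2**n."""
    H = [[1]]
    for _ in range(n):
        H = [r + r for r in H] + [r + [-x for x in r] for r in H]
    return H


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def hadamard_conjugation(Q, n):
    """H_n^{-1} Exp(H_n Q H_n) H_n^{-1}, with Exp applied entrywise."""
    H = sylvester(n)
    R = [[math.exp(x) for x in row] for row in mat_mul(mat_mul(H, Q), H)]
    # H_n^{-1} = 2**(-n) H_n is applied on both sides, hence the factor 4**n.
    return [[x / 4 ** n for x in row] for row in mat_mul(mat_mul(H, R), H)]


def spectrum(n, edges):
    """Edge-length spectrum from {split bitmask: (q_alpha, q_beta, q_gamma)}."""
    Q = [[0.0] * (1 << n) for _ in range(1 << n)]
    for A, (qa, qb, qg) in edges.items():
        Q[0][A], Q[A][0], Q[A][A] = qa, qb, qg
        Q[0][0] -= qa + qb + qg
    return Q


def brute_force_triplet(edges):
    """Site-pattern probabilities for the star tree on leaves 0, 1, 2."""
    P = {}
    for A, (qa, qb, qg) in edges.items():
        Pe = hadamard_conjugation([[-(qa + qb + qg), qa], [qb, qg]], 1)
        P[A] = [Pe[c][d] for c in (0, 1) for d in (0, 1)]  # index 2*c + d
    S = [[0.0] * 4 for _ in range(4)]
    for t0, t1, t2 in product(range(4), repeat=3):
        prob = P[0b11][t0] * P[0b01][t1] * P[0b10][t2]
        x1, x2 = t0 ^ t1, t0 ^ t2   # types separating leaf 0 from leaves 1, 2
        C = (x1 >> 1) | ((x2 >> 1) << 1)
        D = (x1 & 1) | ((x2 & 1) << 1)
        S[C][D] += prob
    return S


def log_likelihood(S, counts):
    """Logarithm of the likelihood for observed site-pattern counts."""
    return sum(counts[C][D] * math.log(S[C][D])
               for C in range(len(S)) for D in range(len(S)) if counts[C][D])
```

A Jukes-Cantor instance is obtained by giving each edge three equal lengths, for example `{0b01: (0.1, 0.1, 0.1), 0b10: (0.2, 0.2, 0.2), 0b11: (0.05, 0.05, 0.05)}`; `log_likelihood` can then be maximized over the edge lengths by any numerical optimizer, which is a generic alternative to the analytical route taken in [5].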
We now look at a general tree $T$ on the three taxa $\{0, 1, 2\}$ before determining where the root is. The molecular clock hypothesis and the determination of the root location are handled at a later stage. $T$ has just one topology, the star with the three edges $e_{\{1\}}$, $e_{\{2\}}$, and $e_{\{1,2\}}$ (see Fig. 4a). For simplicity, we denote them as $e_1$, $e_2$, and $e_{12}$, respectively.

Fig. 4. (a) A general triplet tree over the species $\{0, 1, 2\}$. (b) A rooted triplet tree under the Jukes-Cantor model and the molecular clock hypothesis.

The edge-length spectrum of an arbitrary 3-tree can be expressed as

$$Q = \begin{bmatrix} -3(q_1 + q_2 + q_{12}) & q_1 & q_2 & q_{12} \\ q_1 & q_1 & 0 & 0 \\ q_2 & 0 & q_2 & 0 \\ q_{12} & 0 & 0 & q_{12} \end{bmatrix}.$$

Now, we see that

$$H_2 Q H_2 = -4 \begin{bmatrix} 0 & q_1 + q_{12} & q_2 + q_{12} & q_1 + q_2 \\ q_1 + q_{12} & q_1 + q_{12} & q_1 + q_2 + q_{12} & q_1 + q_2 + q_{12} \\ q_2 + q_{12} & q_1 + q_2 + q_{12} & q_2 + q_{12} & q_1 + q_2 + q_{12} \\ q_1 + q_2 & q_1 + q_2 + q_{12} & q_1 + q_2 + q_{12} & q_1 + q_2 \end{bmatrix},$$

and by (20), these are minus twice the sum of distances induced by every two path sets $\Pi_E$, $\Pi_F$ for every entry $[H_2 Q H_2]_{E,F}$. For example,

$$[H_n Q_T H_n]_{\{1\},\{1,2\}} = -2 \big( d_{\{1\}}(\beta) + d_{\{1,2\}}(\alpha) + d_{\{1\} \triangle \{1,2\}}(\gamma) \big) = -2 \big( d_{\{1\}}(\beta) + d_{\{1,2\}}(\alpha) + d_{\{2\}}(\gamma) \big) = -4 (q_1 + q_2 + q_{12}).$$

When applying the exponential function to each element of the matrix $H_2 Q H_2$, we obtain the so-called path-set spectrum $R$:

$$R = \mathrm{Exp}(H_2 Q H_2) = \begin{bmatrix} 1 & x_1 x_{12} & x_2 x_{12} & x_1 x_2 \\ x_1 x_{12} & x_1 x_{12} & x_1 x_2 x_{12} & x_1 x_2 x_{12} \\ x_2 x_{12} & x_1 x_2 x_{12} & x_2 x_{12} & x_1 x_2 x_{12} \\ x_1 x_2 & x_1 x_2 x_{12} & x_1 x_2 x_{12} & x_1 x_2 \end{bmatrix}, \tag{28}$$

where

$$x_i = e^{-4 q_i}. \tag{29}$$

The $x_i$ values can replace the $s_{C,D}$ values as the defining parameters in the likelihood function (27). The entries of $R$ relate to the joint probabilities of differences between the end-points of the corresponding path sets in $T$, as implied by (24). By using our main Theorem 10, (15), the sequence probability spectrum equals

\[ S = H^{-1} R H^{-1} \tag{30} \]

\[ \phantom{S} = \begin{bmatrix} a_0 & a_1 & a_2 & a_3 \\ a_1 & a_1 & a_4 & a_4 \\ a_2 & a_4 & a_2 & a_4 \\ a_3 & a_4 & a_4 & a_3 \end{bmatrix}, \tag{31} \]

where

\[ \begin{aligned} a_0 &= \tfrac{1}{16}\,(1 + 3x_1 x_2 + 3x_1 x_{12} + 3x_2 x_{12} + 6x_1 x_2 x_{12}), \\ a_1 &= \tfrac{1}{16}\,(1 - x_1 x_2 - x_1 x_{12} + 3x_2 x_{12} - 2x_1 x_2 x_{12}), \\ a_2 &= \tfrac{1}{16}\,(1 - x_1 x_2 + 3x_1 x_{12} - x_2 x_{12} - 2x_1 x_2 x_{12}), \\ a_3 &= \tfrac{1}{16}\,(1 + 3x_1 x_2 - x_1 x_{12} - x_2 x_{12} - 2x_1 x_2 x_{12}), \\ a_4 &= \tfrac{1}{16}\,(1 - x_1 x_2 - x_1 x_{12} - x_2 x_{12} + 2x_1 x_2 x_{12}). \end{aligned} \tag{32} \]

Thus, each expected sequence frequency takes one of the above values, which are functions of the three parameters $x_1$, $x_2$, and $x_{12}$. We now apply the molecular clock constraint, which asserts $q_1 = q_2 \Rightarrow x_1 = x_2$ (see Fig. 4b). From (32), it can be seen that under this constraint $a_1 = a_2$, so the number of free variables in the likelihood equation reduces to 4, leading to a further simplification. The rest of the ML solution is orthogonal to the material discussed in this paper and can be found in [5].

We remark that the choice of these parameters proved crucial in the derivation of the analytical solution. In former works [6], [7], [8], the defining parameters were the sequence probability variables themselves, and additional constraints were required to guarantee that they reside on a tree surface. That approach failed in this case; as can be seen, using the path-set variables removes these constraints.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referee II, who provided very helpful comments and suggestions.

REFERENCES

[1] E.S. Allman and J.A. Rhodes, "Phylogenetic Invariants for the General Markov Model of Sequence Mutation," Math. Biosciences, vol. 18, pp , 00.
[2] E.S. Allman and J.A. Rhodes, "Quartets and Parameter Recovery for the General Markov Model of Sequence Mutation," Applied Math. Research epress, vol. 4, pp , 004.
[3] E.S. Allman and J.A. Rhodes, "Phylogenetic Ideals and Varieties for the General Markov Model," Advances in Applied Math., vol. 40, no., pp , 008.
[4] M. Casanellas and J. Fernández-Sánchez, "Performance of a New Invariants Method on Homogeneous and Nonhomogeneous Quartet Trees," Molecular Biology and Evolution, vol. 4, no. 1, pp. 88-9, 00.
[5] B. Chor, M.D. Hendy, and S. Snir, "Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions," Molecular Biology and Evolution, vol., no., pp. -, 00.
[6] B. Chor, A. Khetan, and S. Snir, "Maximum Likelihood on Four Taxa Phylogenetic Trees: Analytic Solutions," Proc. Seventh Ann. Int'l Conf. Computational Molecular Biology (RECOMB 0), pp. - 8, 00.
[7] B. Chor and S. Snir, "Molecular Clock Fork Phylogenies: Closed Form Analytic Maximum Likelihood Solutions," Systematic Biology, vol. 5, no., pp. 9-9, 004.
[8] B. Chor, M. Hendy, B. Holland, and D. Penny, "Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach," Molecular Biology and Evolution, vol. 1, pp , 000.
[9] S.N. Evans and T.P. Speed, "Invariants of Some Probability Models Used in Phylogenetic Inference," Annals of Statistics, vol. 1, pp. 55-, 199.
[10] M.D. Hendy, "The Relationship between Simple Evolutionary Tree Models and Observable Sequence Data," Systematic Zoology, vol. 8, pp. 10-1,
[11] M.D. Hendy, "A Combinatorial Description of the Closest Tree Algorithm for Finding Evolutionary Trees," Discrete Math., vol. 9, pp ,
[12] M.D. Hendy, "Hadamard Conjugation: An Analytic Tool for Phylogenetics," Math. of Evolution and Phylogeny, chapter, first ed., O. Gascuel, ed., pp. 14-1, Oxford Univ. Press, 005.
[13] M.D. Hendy and D. Penny, "A Framework for the Quantitative Study of Evolutionary Trees," Systematic Zoology, vol. 8, pp. 9-09,
[14] M.D. Hendy and D. Penny, "Spectral Analysis of Phylogenetic Data," J. Classification, vol. 10, pp. 5-4, 199.
[15] M.D. Hendy and D. Penny, "Complete Families of Linear Invariants for Some Stochastic Models of Sequence Evolution with and without the Molecular Clock Assumption," J. Computational Biology, vol., pp. 19-1, 199.
[16] M.D. Hendy, D. Penny, and M.A. Steel, "A Discrete Fourier Analysis for Evolutionary Trees," Proc. Nat'l Academy of Sciences, vol. 91, pp. 9-4,
[17] B. Holland, D. Penny, and M. Hendy, "Outgroup Misplacement and Phylogenetic Inaccuracy under a Molecular Clock: A Simulation Study," Systematic Biology, vol. 5, pp. 9-8, 00.
[18] K.T. Huber, M. Langton, D. Penny, V. Moulton, and M. Hendy, "Spectronet: A Package for Computing Spectra and Median Networks," Applied Bioinformatics, vol. 1, pp , 00.
[19] T.H. Jukes and C.R. Cantor, "Evolution of Protein Molecules," Mammalian Protein Metabolism III, H.N. Munro, ed., Academic Press, 199.
[20] M. Kimura, "A Simple Method for Estimating Evolutionary Rates of Base Substitutions through Comparative Studies of Nucleotide Sequences," J. Molecular Evolution, vol. 1, pp ,
[21] M. Kimura, "Estimation of Evolutionary Distances between Homologous Nucleotide Sequences," Proc. Nat'l Academy of Sciences, vol. 8, pp ,
[22] J.L. Neyman, "Molecular Studies of Evolution: A Source of Novel Statistical Problems," Statistical Decision Theory and Related Topics, S.S. Gupta and J. Yackel, eds., Academic Press, 191.
[23] L. Pachter and B. Sturmfels, Algebraic Statistics for Computational Biology. Cambridge Univ. Press, 005.
[24] L. Pachter and B. Sturmfels, "The Mathematics of Phylogenomics," submitted for publication.
[25] M.A. Steel, M.D. Hendy, L.A. Székely, and P.L. Erdös, "Spectral Analysis and a Closest Tree Method for Genetic Sequences," Applied Math. Letters, vol. 5, pp. -, 199.
[26] M.A. Steel, M.D. Hendy, and D. Penny, "Reconstructing Phylogenies from Nucleotide Pattern Probabilities: A Survey and Some New Results," Discrete Applied Math., vol. 88, pp. -9,
[27] B. Sturmfels and S. Sullivant, "Toric Ideals of Phylogenetic Invariants," J. Computational Biology, vol. 1, pp. 04-8, 005.
[28] L. Székely, P.L. Erdös, M.A. Steel, and D. Penny, "A Fourier Inversion Formula for Evolutionary Trees," Applied Math. Letters, vol., pp. 1-1, 199.
[29] L. Székely, M.A. Steel, and P.L. Erdös, "Fourier Calculus on Evolutionary Trees," Advances in Applied Math., vol. 14, pp. 00-1, 199.
[30] P.J. Waddell and M.D. Hendy, "Using Phylogenetic Invariants to Enhance Spectral Analysis of Nucleotide Sequence Data," Information and Math. Sciences Report Series B, Massey Univ., 199.

Michael D. Hendy received the PhD degree in algebraic number theory from the University of New England, NSW, Australia, in 19. After joining Massey University, he began research into phylogenetics in collaboration with molecular biologist David Penny. Since 199, he has been a personal chair in mathematical biology at Massey University, Palmerston North, New Zealand. He is also the executive director of the Allan Wilson Centre for Molecular Ecology and Evolution, one of the seven Centres of Research Excellence in New Zealand; the Allan Wilson Centre is a group of nearly 100 researchers in biology and mathematics across five New Zealand universities. In , he held a 10-month Mercator professorship in biomathematics at the University of Greifswald, Germany. He has published more than 100 research papers in the fields of algebraic number theory and phylogenetics.

Sagi Snir received the BA degree in computer science and economics from Bar Ilan University, Israel, and the MSc and PhD degrees in computer science from the Technion, Israel. He has worked for various information technology companies, including IBM Haifa Research Lab. He is currently a postdoctoral researcher at the University of California, Berkeley. His main research interest is computational biology, in particular phylogenetics.

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites

Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Analytic Solutions for Three Taxon ML MC Trees with Variable Rates Across Sites Benny Chor Michael Hendy David Penny Abstract We consider the problem of finding the maximum likelihood rooted tree under

More information

Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions

Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions Benny Chor,* Michael D. Hendy, and Sagi Snirà *School of Computer Science, Tel-Aviv University, Israel; Allan Wilson Centre for Molecular Ecology

More information

arxiv:q-bio/ v1 [q-bio.pe] 27 May 2005

arxiv:q-bio/ v1 [q-bio.pe] 27 May 2005 Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions arxiv:q-bio/0505054v1 [q-bio.pe] 27 May 2005 Benny Chor Michael D. Hendy Sagi Snir December 21, 2017 Abstract Complex systems of polynomial

More information

Algebraic Statistics Tutorial I

Algebraic Statistics Tutorial I Algebraic Statistics Tutorial I Seth Sullivant North Carolina State University June 9, 2012 Seth Sullivant (NCSU) Algebraic Statistics June 9, 2012 1 / 34 Introduction to Algebraic Geometry Let R[p] =

More information

Phylogenetic Algebraic Geometry

Phylogenetic Algebraic Geometry Phylogenetic Algebraic Geometry Seth Sullivant North Carolina State University January 4, 2012 Seth Sullivant (NCSU) Phylogenetic Algebraic Geometry January 4, 2012 1 / 28 Phylogenetics Problem Given a

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION

ELIZABETH S. ALLMAN and JOHN A. RHODES ABSTRACT 1. INTRODUCTION JOURNAL OF COMPUTATIONAL BIOLOGY Volume 13, Number 5, 2006 Mary Ann Liebert, Inc. Pp. 1101 1113 The Identifiability of Tree Topology for Phylogenetic Models, Including Covarion and Mixture Models ELIZABETH

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Using algebraic geometry for phylogenetic reconstruction

Using algebraic geometry for phylogenetic reconstruction Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA

More information

The Generalized Neighbor Joining method

The Generalized Neighbor Joining method The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1 Challenge We would like to

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2009 International Press Vol. 9, No. 4, pp. 295-302, 2009 001 THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT DAN GUSFIELD AND YUFENG WU Abstract.

More information

Reconstructing Trees from Subtree Weights

Reconstructing Trees from Subtree Weights Reconstructing Trees from Subtree Weights Lior Pachter David E Speyer October 7, 2003 Abstract The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

arxiv: v1 [q-bio.pe] 27 Oct 2011

arxiv: v1 [q-bio.pe] 27 Oct 2011 INVARIANT BASED QUARTET PUZZLING JOE RUSINKO AND BRIAN HIPP arxiv:1110.6194v1 [q-bio.pe] 27 Oct 2011 Abstract. Traditional Quartet Puzzling algorithms use maximum likelihood methods to reconstruct quartet

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution

Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution Identifiability of the GTR+Γ substitution model (and other models) of DNA evolution Elizabeth S. Allman Dept. of Mathematics and Statistics University of Alaska Fairbanks TM Current Challenges and Problems

More information

Distances that Perfectly Mislead

Distances that Perfectly Mislead Syst. Biol. 53(2):327 332, 2004 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150490423809 Distances that Perfectly Mislead DANIEL H. HUSON 1 AND

More information

INVARIANTS STEVEN N. EVANS AND XIAOWEN ZHOU. Abstract. The method of invariants is an approach to the problem of reconstructing

INVARIANTS STEVEN N. EVANS AND XIAOWEN ZHOU. Abstract. The method of invariants is an approach to the problem of reconstructing DIFFERENT TREES HAVE DISTINCT PHLOGENETIC INVARIANTS STEVEN N. EVANS AND XIAOWEN ZHOU Abstract. The method of invariants is an approach to the problem of reconstructing the phylogenetic tree of a collection

More information

A 3-APPROXIMATION ALGORITHM FOR THE SUBTREE DISTANCE BETWEEN PHYLOGENIES. 1. Introduction

A 3-APPROXIMATION ALGORITHM FOR THE SUBTREE DISTANCE BETWEEN PHYLOGENIES. 1. Introduction A 3-APPROXIMATION ALGORITHM FOR THE SUBTREE DISTANCE BETWEEN PHYLOGENIES MAGNUS BORDEWICH 1, CATHERINE MCCARTIN 2, AND CHARLES SEMPLE 3 Abstract. In this paper, we give a (polynomial-time) 3-approximation

More information

Phylogenetic invariants versus classical phylogenetics

Phylogenetic invariants versus classical phylogenetics Phylogenetic invariants versus classical phylogenetics Marta Casanellas Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya Algebraic

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

Non-independence in Statistical Tests for Discrete Cross-species Data

Non-independence in Statistical Tests for Discrete Cross-species Data J. theor. Biol. (1997) 188, 507514 Non-independence in Statistical Tests for Discrete Cross-species Data ALAN GRAFEN* AND MARK RIDLEY * St. John s College, Oxford OX1 3JP, and the Department of Zoology,

More information

Geometry of Phylogenetic Inference

Geometry of Phylogenetic Inference Geometry of Phylogenetic Inference Matilde Marcolli CS101: Mathematical and Computational Linguistics Winter 2015 References N. Eriksson, K. Ranestad, B. Sturmfels, S. Sullivant, Phylogenetic algebraic

More information

series. Utilize the methods of calculus to solve applied problems that require computational or algebraic techniques..

series. Utilize the methods of calculus to solve applied problems that require computational or algebraic techniques.. 1 Use computational techniques and algebraic skills essential for success in an academic, personal, or workplace setting. (Computational and Algebraic Skills) MAT 203 MAT 204 MAT 205 MAT 206 Calculus I

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

arxiv: v1 [q-bio.pe] 1 Jun 2014

arxiv: v1 [q-bio.pe] 1 Jun 2014 THE MOST PARSIMONIOUS TREE FOR RANDOM DATA MAREIKE FISCHER, MICHELLE GALLA, LINA HERBST AND MIKE STEEL arxiv:46.27v [q-bio.pe] Jun 24 Abstract. Applying a method to reconstruct a phylogenetic tree from

More information

arxiv: v1 [math.ra] 13 Jan 2009

arxiv: v1 [math.ra] 13 Jan 2009 A CONCISE PROOF OF KRUSKAL S THEOREM ON TENSOR DECOMPOSITION arxiv:0901.1796v1 [math.ra] 13 Jan 2009 JOHN A. RHODES Abstract. A theorem of J. Kruskal from 1977, motivated by a latent-class statistical

More information

Metric learning for phylogenetic invariants

Metric learning for phylogenetic invariants Metric learning for phylogenetic invariants Nicholas Eriksson nke@stanford.edu Department of Statistics, Stanford University, Stanford, CA 94305-4065 Yuan Yao yuany@math.stanford.edu Department of Mathematics,

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22 Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

A concise proof of Kruskal s theorem on tensor decomposition

A concise proof of Kruskal s theorem on tensor decomposition A concise proof of Kruskal s theorem on tensor decomposition John A. Rhodes 1 Department of Mathematics and Statistics University of Alaska Fairbanks PO Box 756660 Fairbanks, AK 99775 Abstract A theorem

More information

What Is Conservation?

What Is Conservation? What Is Conservation? Lee A. Newberg February 22, 2005 A Central Dogma Junk DNA mutates at a background rate, but functional DNA exhibits conservation. Today s Question What is this conservation? Lee A.

More information

Tree-average distances on certain phylogenetic networks have their weights uniquely determined

Tree-average distances on certain phylogenetic networks have their weights uniquely determined Tree-average distances on certain phylogenetic networks have their weights uniquely determined Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Minimum evolution using ordinary least-squares is less robust than neighbor-joining Minimum evolution using ordinary least-squares is less robust than neighbor-joining Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA email: swillson@iastate.edu November

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

More information

Decomposition of random graphs into complete bipartite graphs

Decomposition of random graphs into complete bipartite graphs Decomposition of random graphs into complete bipartite graphs Fan Chung Xing Peng Abstract We consider the problem of partitioning the edge set of a graph G into the minimum number τg) of edge-disjoint

More information

Remarks on Hadamard conjugation and combinatorial phylogenetics

Remarks on Hadamard conjugation and combinatorial phylogenetics AUSTRALASIAN JOURNAL OF COMBINATORICS Volume 66() (06), Pages 77 9 Remarks on Hadamard conjugation and combinatorial phylogenetics Cayla D. McBee Department of Mathematics and Computer Science Providence

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

This article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author s institution, sharing

More information

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection

The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection The statistical and informatics challenges posed by ascertainment biases in phylogenetic data collection Mark T. Holder and Jordan M. Koch Department of Ecology and Evolutionary Biology, University of

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe? How should we go about modeling this? gorilla GAAGTCCTTGAGAAATAAACTGCACACACTGG orangutan GGACTCCTTGAGAAATAAACTGCACACACTGG Model parameters? Time Substitution rate Can we observe time or subst. rate? What

More information

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26 Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

More information

Combinatorial semigroups and induced/deduced operators

Combinatorial semigroups and induced/deduced operators Combinatorial semigroups and induced/deduced operators G. Stacey Staples Department of Mathematics and Statistics Southern Illinois University Edwardsville Modified Hypercubes Particular groups & semigroups

More information

On the Logarithmic Calculus and Sidorenko s Conjecture

On the Logarithmic Calculus and Sidorenko s Conjecture On the Logarithmic Calculus and Sidorenko s Conjecture by Xiang Li A thesis submitted in conformity with the requirements for the degree of Msc. Mathematics Graduate Department of Mathematics University

More information

Boolean Inner-Product Spaces and Boolean Matrices

Boolean Inner-Product Spaces and Boolean Matrices Boolean Inner-Product Spaces and Boolean Matrices Stan Gudder Department of Mathematics, University of Denver, Denver CO 80208 Frédéric Latrémolière Department of Mathematics, University of Denver, Denver

More information

Computational Issues in Phylogenetic. Reconstruction: Analytic Maximum. Likelihood Solutions, and Convex. Recoloring

Computational Issues in Phylogenetic. Reconstruction: Analytic Maximum. Likelihood Solutions, and Convex. Recoloring Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood Solutions, and Convex Recoloring Sagi Snir Computational Issues in Phylogenetic Reconstruction: Analytic Maximum Likelihood

More information

RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS

RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS RECOVERING A PHYLOGENETIC TREE USING PAIRWISE CLOSURE OPERATIONS KT Huber, V Moulton, C Semple, and M Steel Department of Mathematics and Statistics University of Canterbury Private Bag 4800 Christchurch,

More information

REPRESENTATION THEORY OF S n

REPRESENTATION THEORY OF S n REPRESENTATION THEORY OF S n EVAN JENKINS Abstract. These are notes from three lectures given in MATH 26700, Introduction to Representation Theory of Finite Groups, at the University of Chicago in November

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

More information

Transversal and cotransversal matroids via their representations.

Transversal and cotransversal matroids via their representations. Transversal and cotransversal matroids via their representations. Federico Ardila Submitted: May, 006; Accepted: Feb. 7, 007 Mathematics Subject Classification: 05B5; 05C8; 05A99 Abstract. It is known

More information

Spectral Generative Models for Graphs

Spectral Generative Models for Graphs Spectral Generative Models for Graphs David White and Richard C. Wilson Department of Computer Science University of York Heslington, York, UK wilson@cs.york.ac.uk Abstract Generative models are well known

More information

Background on Chevalley Groups Constructed from a Root System

Background on Chevalley Groups Constructed from a Root System Background on Chevalley Groups Constructed from a Root System Paul Tokorcheck Department of Mathematics University of California, Santa Cruz 10 October 2011 Abstract In 1955, Claude Chevalley described

More information

Decomposition of random graphs into complete bipartite graphs
