Maximum Likelihood Jukes-Cantor Triplets: Analytic Solutions

Benny Chor,* Michael D. Hendy,† and Sagi Snir‡

*School of Computer Science, Tel-Aviv University, Israel; †Allan Wilson Centre for Molecular Ecology and Evolution, Massey University, Palmerston North, New Zealand; and ‡Mathematics Department, University of California, Berkeley

Maximum likelihood (ML) is a popular method for inferring a phylogenetic tree describing the evolutionary relationships among a set of taxa from their observed, aligned homologous genetic sequences. The ML tree is generally computed by numerical methods, which in a few cases are known to converge to a suboptimal local maximum on a tree. The extent of this problem is unknown. One approach is to derive algebraic equations for the likelihood function and find its maximum points analytically. This approach has so far succeeded only in the very simplest cases, of three or four taxa under the Neyman model of evolution of two-state characters. In this paper we extend this approach, for the first time, to four-state characters: the Jukes-Cantor model under a molecular clock, on a tree T on three taxa (a rooted triple). We employ spectral methods (Hadamard conjugation) to express the likelihood function parameterized by the path-length spectrum. Taking partial derivatives, we derive a set of polynomial equations whose simultaneous solution contains all critical points of the likelihood function. Using a tool of algebraic geometry (the resultant of two polynomials) in the computer algebra package Maple, we are able to find all turning points analytically. We then apply this method to real sequence data and obtain realistic results on the primate-rodent divergence time.
Introduction

Maximum likelihood (ML) is increasingly used as an optimality criterion for selecting evolutionary trees (Felsenstein 1981), but finding the global optimum is a hard computational task, so mostly numerical methods are used. So far, analytical solutions have been derived only for the simplest models (Yang 2000; Chor, Hendy, and Penny 2001; Chor, Khetan, and Snir 2003): three and four taxa under a molecular clock, with just two-state characters (Neyman 1971). In this work we present, for the first time, analytic solutions for a general family of trees with four-state characters, such as DNA or RNA. The model of substitution we use is the Jukes-Cantor model (Jukes and Cantor 1969) under a molecular clock, the simplest four-state model, in which all substitutions have the same rate. The trees we deal with are on just three taxa, namely, rooted triplets (see fig. 1b). The change from two to four character states incurs a major increase in the complexity of the underlying algebraic system.

Like previous works, our starting point is to present the general ML problem on phylogenetic trees as a constrained optimization problem. This gives rise to a complex system of polynomial equations, and the goal is to solve them. The complexity of this system prevents any manual solution, so we relied heavily on Maple, a symbolic mathematical system. However, even with Maple, a simple attempt to solve the system (e.g., just typing solve) fails, and additional tools are required.

Spectral analysis (Hendy and Penny 1993; Hendy, Penny, and Steel 1994; Hendy and Snir 2005) uses Hadamard conjugation to express the expected frequencies of site patterns among sequences as a function of an evolutionary tree T and a model of sequence evolution along the edges of T. As in previous works, we use the Hadamard conjugation as a basic building block in our method of solution.
Key words: maximum likelihood, phylogenetic trees, Jukes-Cantor, Hadamard conjugation, analytical solutions, symbolic algebra.
E-mail: m.hendy@massey.ac.nz.
Mol. Biol. Evol. 23(3):626–632. 2006. doi:10.1093/molbev/msj069. Advance Access publication November, 2005

However, it turns out that the specific way we represent the system, which is determined by the choice of variables, plays a crucial role in the ability to solve it. In previous works (Chor, Hendy, and Penny 2001; Chor, Khetan, and Snir 2003), the variables in the polynomials were based on the expected sequence spectrum (Hendy and Penny 1993). This representation leads to a system with two polynomials of degrees 11 and 1. This system is too complex to resolve with the available analytic and computational tools. By employing a different set of variables, based on the path-set spectrum (Hendy and Snir 2005), we were able to arrive at a simpler system of two polynomials whose degrees are 1 and 8. To finesse the construction, we use algebraic geometry combined with Maple. Specifically, we compute the resultant of the two polynomials, which yields a single, degree 11 polynomial in one variable. The roots of this polynomial yield the desired ML solution.

We remark that results similar to those shown here were obtained by Hosten, Khetan, and Sturmfels (2004), using somewhat different techniques. A set of computational algebraic tools for analyzing similar models on small trees, based on the methods reported in Catanese et al. (2004); Hosten, Khetan, and Sturmfels (2004); and Sturmfels and Sullivant (2005), is detailed on the web page http://math.tamu.edu/~lgp/small-trees/.

It is evident that the currently available heuristic methods can fail to infer the correct tree, even for a small number of taxa. This is true not only in the presence of multiple ML points but also in cases where the neighborhood of the (single) ML point is relatively flat.
Therefore, a practical consequence of this work is the use of rooted triplets in supertree methods (constructing a big tree from small subtrees). For unrooted trees, the subtrees must have at least four leaves (e.g., the tree-from-quartets problem). For rooted trees, it is sufficient to amalgamate a set of rooted triplets (Aho et al. 1981). Our work enables such approaches to rely on rooted ML triplets based on four character states rather than two. Additionally, analytic solutions can reveal properties of the ML points that are not obtainable numerically. For small trees, our result can serve as a method for checking the accuracy of the heuristic methods used in practice.

Downloaded from https://academic.oup.com/mbe/article-abstract////1115 by guest on 1 January 19
© The Author 2005. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org

FIG. 1. (a) A tree T on three leaves, 1, 2, and 3, with edges labeled. (b) A rooted tree on three leaves, with root r and edge weights $q_1$, $q_2$, and $q_3$. If $q_1 = q_2 \le q_3$, then this tree satisfies the molecular clock, as the path lengths from r to each leaf are the same.

The remainder of this work is organized as follows. In the next section we describe the materials and methods we used in this work. Specifically, in Definitions, Notations, and the Hadamard Conjugation we provide the definitions and notation used throughout the rest of this work and develop the identities and structures induced by the Jukes-Cantor model, while in Obtaining the ML Solution we formulate ML on phylogenetic trees and subsequently solve the system for the special case of three species under Jukes-Cantor and a molecular clock. In Results and Discussion we give results and discuss future research directions: Results on Genomic Sequences reports experimental results of applying our method to real genomic sequences, while in Directions for Future Research we conclude with three open problems.

Materials and Methods

Here we detail the methods and tools we employed to obtain our results. These are developed specifically for the tree T on three taxa referred to as 1, 2, and 3. We label the leaves of T as 1, 2, and 3 and the edges (branches) as $e_1$, $e_2$, and $e_3$, as illustrated in figure 1a. Taxon 3 is chosen to be the reference taxon. Our analysis focuses on the site substitutions required to transform the reference sequence into those of taxa 1 and 2 under Kimura's 3-substitution model (K3ST) (Kimura 1981). We will subsequently impose the constraints of the Jukes-Cantor model (Jukes and Cantor 1969), under a molecular clock, on the rooted tree of figure 1b.
Definitions, Notations, and the Hadamard Conjugation

Given aligned nucleotide sequences for the three taxa 1, 2, and 3, there are $4^3 = 64$ nucleotide combinations possible at each site. We refer to each of these as a character pattern. A character pattern $v = (v_1, v_2, v_3)^T$ can be described by the state $v_3$ at taxon 3, together with the substitutions $(v_3 \to v_1,\ v_3 \to v_2)$.

FIG. 2. The K3ST model. Substitution types $\alpha$ and $\gamma$ cross the line $\alpha\alpha'$, and $\beta$ and $\gamma$ cross the line $\beta\beta'$.

Kimura (1981), in his K3ST model, describes three classes of substitution: the transitions and two types of transversions. Following Hendy and Snir (2005), we use $\alpha$ to denote a transition, $\beta$ and $\gamma$ to denote the two transversion types, and $\epsilon$ to indicate no substitution, as described in figure 2. For example, the character pattern $(T, A, C)^T$ is obtained with the assignment of C to taxon 3 and the substitution pattern $(C \to T,\ C \to A) = (\alpha, \gamma)$ for taxa 1 and 2.

In Hendy, Penny, and Steel (1994) it is shown that under the K3ST model, the probability of a substitution pattern is independent of the character state at taxon 3. We index each site pattern by a pair $(X, Y)$ of subsets of $\{1, 2\}$, where $X$ is the set of taxa with substitutions that cross the line $\alpha\alpha'$ (fig. 2), and $Y$ is the set of taxa with substitutions that cross the line $\beta\beta'$. We denote the probability of site pattern $(X, Y)$ as $s_{X,Y}$, which, when not ambiguous, we index by the lists of elements of $X$ and of $Y$, for brevity. Thus, for example, the substitution pattern $(\alpha, \gamma)$ is indexed by the pair of subsets $(\{1, 2\}, \{2\})$, and the probability of obtaining this substitution pattern is written as $s_{12,2}$. The 16 site pattern probabilities are arranged as a matrix

\[
S = \begin{pmatrix}
s_{\emptyset,\emptyset} & s_{\emptyset,1} & s_{\emptyset,2} & s_{\emptyset,12} \\
s_{1,\emptyset} & s_{1,1} & s_{1,2} & s_{1,12} \\
s_{2,\emptyset} & s_{2,1} & s_{2,2} & s_{2,12} \\
s_{12,\emptyset} & s_{12,1} & s_{12,2} & s_{12,12}
\end{pmatrix},
\]

called the sequence spectrum. These probabilities can be parameterized in several ways; in Hendy and Penny (1993) and Hendy, Penny, and Steel (1994), for the K3ST model,

they are given as functions of the expected numbers of substitutions of each type on each edge. In the Jukes-Cantor model, the expected numbers of the three substitution types on an edge $e_i$ are the same, that is,

\[ q_\alpha(e_i) = q_\beta(e_i) = q_\gamma(e_i). \]

We will write this common value as $q_i$. Thus, $q_1$, $q_2$, and $q_3$ are the expected numbers of substitutions of each type on edges $e_1$, $e_2$, and $e_3$, respectively. Following the results of Hendy, Penny, and Steel (1994); Steel, Hendy, and Penny (1998); and Hendy and Snir (2005), specialized to the Jukes-Cantor model on T for three taxa, we express these expected numbers in the matrix

\[
Q = \begin{pmatrix}
-3(q_1 + q_2 + q_3) & q_1 & q_2 & q_3 \\
q_1 & q_1 & 0 & 0 \\
q_2 & 0 & q_2 & 0 \\
q_3 & 0 & 0 & q_3
\end{pmatrix},
\]

called the edge-length spectrum. The major result of Hendy, Penny, and Steel (1994) then becomes the following.

Theorem 1.

\[ S = H^{-1} \exp(HQH)\, H^{-1}, \tag{1} \]

where

\[
H = \begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & -1 & 1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & -1 & 1
\end{pmatrix}, \tag{2}
\]

$H^{-1} = (1/4)H$, and the exponential function is applied to each component of the matrix $HQH$.

Equation (1) in Theorem 1 is equivalent to earlier expressions (Steel et al. 1992; Székely et al. 1993) of Hadamard conjugations for the K3ST model, which used Hadamard matrices of n rows and columns applied to vectors of n entries, with $q_i = q_\alpha(e_i) = q_\beta(e_i) = q_\gamma(e_i)$. The current expression has recently been developed in Hendy and Snir (2005). A full proof of this result follows directly from applying the theorem of Hendy and Snir (2005) to the tree T. This approach has been followed in order to clarify the role of path sets, which explain the intermediate terms in the conjugations, enabling us to derive and interpret the values in equation (8) directly.
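As a numerical illustration of Theorem 1, the following minimal sketch evaluates $S = H^{-1}\exp(HQH)H^{-1}$ for the Jukes-Cantor edge-length spectrum on three taxa, with rows and columns ordered $\emptyset, 1, 2, 12$. The function names and the sample edge lengths are illustrative assumptions, not part of the original work; only the Python standard library is used.

```python
import math

# Hadamard matrix H of equation (2); its inverse is H / 4.
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]


def matmul(a, b):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def edge_length_spectrum(q1, q2, q3):
    """Jukes-Cantor edge-length spectrum Q for the three-taxon tree."""
    return [[-3 * (q1 + q2 + q3), q1, q2, q3],
            [q1, q1, 0, 0],
            [q2, 0, q2, 0],
            [q3, 0, 0, q3]]


def sequence_spectrum(q1, q2, q3):
    """S = H^-1 exp(HQH) H^-1, with exp applied to each component."""
    hqh = matmul(matmul(H, edge_length_spectrum(q1, q2, q3)), H)
    r = [[math.exp(v) for v in row] for row in hqh]
    hrh = matmul(matmul(H, r), H)
    return [[v / 16 for v in row] for row in hrh]  # H^-1 = H/4 on each side


# Hypothetical edge lengths, chosen only for illustration.
S = sequence_spectrum(0.05, 0.05, 0.12)
total = sum(sum(row) for row in S)  # the 16 pattern probabilities sum to 1
```

With all edge lengths set to zero the spectrum collapses onto the single pattern $(\emptyset, \emptyset)$, which is a convenient sanity check on the sign conventions.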
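We now define several auxiliary matrices that will be useful in the sequel:

\[
J = \begin{pmatrix} 1&1&1&1\\ 1&1&1&1\\ 1&1&1&1\\ 1&1&1&1 \end{pmatrix},\quad
A_0 = \begin{pmatrix} 1&0&0&0\\ 0&0&0&0\\ 0&0&0&0\\ 0&0&0&0 \end{pmatrix},\quad
A_1 = \begin{pmatrix} 0&1&0&0\\ 1&1&0&0\\ 0&0&0&0\\ 0&0&0&0 \end{pmatrix},
\]
\[
A_2 = \begin{pmatrix} 0&0&1&0\\ 0&0&0&0\\ 1&0&1&0\\ 0&0&0&0 \end{pmatrix},\quad
A_3 = \begin{pmatrix} 0&0&0&1\\ 0&0&0&0\\ 0&0&0&0\\ 1&0&0&1 \end{pmatrix},\quad
A_4 = \begin{pmatrix} 0&0&0&0\\ 0&0&1&1\\ 0&1&0&1\\ 0&1&1&0 \end{pmatrix}.
\]

The following identities relating H and these six matrices hold:

\[
\begin{aligned}
A_0 + A_1 + A_2 + A_3 + A_4 &= J,\\
HJH &= 16A_0, \qquad HA_0H = J,\\
HA_1H &= 4(A_0 + A_2) - J,\\
HA_2H &= 4(A_0 + A_1) - J,\\
HA_3H &= 4(A_0 + A_3) - J.
\end{aligned}
\tag{3–7}
\]

The edge-length spectrum of an arbitrary 3-tree can be expressed as the matrix

\[
Q = \begin{pmatrix}
-3(q_1 + q_2 + q_3) & q_1 & q_2 & q_3 \\
q_1 & q_1 & 0 & 0 \\
q_2 & 0 & q_2 & 0 \\
q_3 & 0 & 0 & q_3
\end{pmatrix}
= q_1A_1 + q_2A_2 + q_3A_3 - 3(q_1 + q_2 + q_3)A_0.
\]

Now, from equations (3–7), we see

\[
\begin{aligned}
HQH &= -4\left[(q_1 + q_3)A_1 + (q_2 + q_3)A_2 + (q_1 + q_2)A_3 + (q_1 + q_2 + q_3)A_4\right]\\
&= -4\begin{pmatrix}
0 & q_1 + q_3 & q_2 + q_3 & q_1 + q_2 \\
q_1 + q_3 & q_1 + q_3 & q_1 + q_2 + q_3 & q_1 + q_2 + q_3 \\
q_2 + q_3 & q_1 + q_2 + q_3 & q_2 + q_3 & q_1 + q_2 + q_3 \\
q_1 + q_2 & q_1 + q_2 + q_3 & q_1 + q_2 + q_3 & q_1 + q_2
\end{pmatrix},
\end{aligned}
\]

so applying the exponential function to each element of the matrix $HQH$, we obtain the so-called path-set spectrum, R:

\[
R = \exp(HQH) = \begin{pmatrix}
1 & x_1x_3 & x_2x_3 & x_1x_2 \\
x_1x_3 & x_1x_3 & x_1x_2x_3 & x_1x_2x_3 \\
x_2x_3 & x_1x_2x_3 & x_2x_3 & x_1x_2x_3 \\
x_1x_2 & x_1x_2x_3 & x_1x_2x_3 & x_1x_2
\end{pmatrix} \tag{8}
\]
\[
= A_0 + x_1x_3A_1 + x_2x_3A_2 + x_1x_2A_3 + x_1x_2x_3A_4, \qquad x_i = e^{-4q_i}. \tag{9}
\]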
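The identities relating $H$ to the auxiliary matrices involve only small integer matrices, so they can be confirmed mechanically. The sketch below does this with exact integer arithmetic; the helper names (`basis`, `conj`, `checks`) are hypothetical and introduced only for this illustration.

```python
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]


def matmul(a, b):
    """Product of two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def combine(terms):
    """Linear combination sum(c * M) over (c, M) pairs of 4x4 matrices."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(4)]
            for i in range(4)]


def basis(k):
    """A_k for k = 1, 2, 3: ones at positions (0, k), (k, 0), and (k, k)."""
    m = [[0] * 4 for _ in range(4)]
    m[0][k] = m[k][0] = m[k][k] = 1
    return m


def conj(m):
    """Hadamard conjugation H m H."""
    return matmul(matmul(H, m), H)


J = [[1] * 4 for _ in range(4)]
A0 = [[1 if (i, j) == (0, 0) else 0 for j in range(4)] for i in range(4)]
A1, A2, A3 = basis(1), basis(2), basis(3)
# A4 holds the six remaining positions, so that A0 + ... + A4 = J.
A4 = combine([(1, J), (-1, A0), (-1, A1), (-1, A2), (-1, A3)])

checks = {
    "HJH = 16 A0": conj(J) == combine([(16, A0)]),
    "H A0 H = J": conj(A0) == J,
    "H A1 H = 4(A0 + A2) - J": conj(A1) == combine([(4, A0), (4, A2), (-1, J)]),
    "H A2 H = 4(A0 + A1) - J": conj(A2) == combine([(4, A0), (4, A1), (-1, J)]),
    "H A3 H = 4(A0 + A3) - J": conj(A3) == combine([(4, A0), (4, A3), (-1, J)]),
}
```

Each entry of `checks` is a boolean, so a failed identity would point directly at a transcription error in one of the matrices.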

The $x_i$ values can replace the $q_i$ values as the defining parameters and are called the path-set variables. The entries of R relate to the probabilities of differences between the end points of paths in T (Hendy and Snir 2005). From Theorem 1, we find the expected sequence spectrum is

\[ S = H^{-1} R H^{-1} \tag{10} \]
\[
= \begin{pmatrix}
a_0 & a_1 & a_2 & a_3 \\
a_1 & a_1 & a_4 & a_4 \\
a_2 & a_4 & a_2 & a_4 \\
a_3 & a_4 & a_4 & a_3
\end{pmatrix}, \tag{11}
\]

where

\[
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}
= \frac{1}{16}
\begin{pmatrix}
1 & 3 & 3 & 3 & 6 \\
1 & 3 & -1 & -1 & -2 \\
1 & -1 & 3 & -1 & -2 \\
1 & -1 & -1 & 3 & -2 \\
1 & -1 & -1 & -1 & 2
\end{pmatrix}
\begin{pmatrix} 1 \\ x_2x_3 \\ x_1x_3 \\ x_1x_2 \\ x_1x_2x_3 \end{pmatrix}. \tag{12}
\]

Thus, we see that the expected frequency of each site pattern takes one of the five $a_i$ values, each of which is a function of the three parameters $x_1$, $x_2$, and $x_3$. In particular, when we impose the molecular clock condition $q_1 = q_2$, and hence $x_1 = x_2$, equation (12) reduces to

\[
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}
= \frac{1}{16}
\begin{pmatrix}
1 & 6 & 3 & 6 \\
1 & 2 & -1 & -2 \\
1 & 2 & -1 & -2 \\
1 & -2 & 3 & -2 \\
1 & -2 & -1 & 2
\end{pmatrix}
\begin{pmatrix} 1 \\ x_1x_3 \\ x_1^2 \\ x_1^2x_3 \end{pmatrix}, \tag{13}
\]

and we observe $a_1 = a_2$.

Obtaining the ML Solution

When we have an aligned sequence of length c for each of the three taxa 1, 2, and 3, we can count the relative frequencies (normalized to sum to 1) of the substitution patterns among the sites. For each pair $(X, Y)$ with $X, Y \subseteq \{1, 2\}$, we set $f_{X,Y}$ to be the relative frequency of observing the site pattern $(X, Y)$, and let F be the matrix with entries $F_{X,Y} = f_{X,Y}$, indexed as in the sequence spectrum S. If the sequences were generated under the Jukes-Cantor model on the tree T with edge weights $q_1$, $q_2$, $q_3$, then the expected count of substitution pattern $(X, Y)$ is $c\,s_{X,Y}$, where c is the number of sites. However, in c independent samples, the observed frequencies will include sampling error, so we cannot directly conclude $S = F$. ML provides a method of estimating the parameters $q_1$, $q_2$, and $q_3$ from the observed frequencies F. Given a set of edge lengths $q_1$, $q_2$, and $q_3$, the likelihood of observing F under the Jukes-Cantor model on T is

\[ L(T) = \prod_{X,Y \subseteq \{1,2\}} s_{X,Y}^{\,f_{X,Y}}, \tag{14} \]

where the entries in S are obtained from equation (11).
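The counting step and the likelihood of equation (14) can be sketched as follows. The two-bit nucleotide encoding is an assumption chosen to agree with the worked example in the text (C to T is a transition, C to A is the transversion type that crosses both lines); the function names and the toy sequences are hypothetical, and the sketch assumes gap-free sequences over A, C, G, T.

```python
import math

# Assumed encoding: first bit flips when the line alpha-alpha' is crossed,
# second bit flips when the line beta-beta' is crossed.
BITS = {"C": (0, 0), "T": (1, 0), "G": (0, 1), "A": (1, 1)}


def pattern_index(n1, n2, n3):
    """(row, col) of site pattern (X, Y); taxon 3 is the reference.

    Subsets of {1, 2} are ordered as: empty, {1}, {2}, {1, 2}."""
    row = (BITS[n1][0] ^ BITS[n3][0]) + 2 * (BITS[n2][0] ^ BITS[n3][0])
    col = (BITS[n1][1] ^ BITS[n3][1]) + 2 * (BITS[n2][1] ^ BITS[n3][1])
    return row, col


def observed_spectrum(seq1, seq2, seq3):
    """Relative frequencies F of the 16 site patterns."""
    counts = [[0] * 4 for _ in range(4)]
    for n1, n2, n3 in zip(seq1, seq2, seq3):
        row, col = pattern_index(n1, n2, n3)
        counts[row][col] += 1
    c = len(seq1)
    return [[v / c for v in r] for r in counts]


def pattern_probabilities(x1, x2, x3):
    """a_0..a_4 of equation (12), in the path-set variables."""
    p23, p13, p12, p123 = x2 * x3, x1 * x3, x1 * x2, x1 * x2 * x3
    return [(1 + 3 * p23 + 3 * p13 + 3 * p12 + 6 * p123) / 16,
            (1 + 3 * p23 - p13 - p12 - 2 * p123) / 16,
            (1 - p23 + 3 * p13 - p12 - 2 * p123) / 16,
            (1 - p23 - p13 + 3 * p12 - 2 * p123) / 16,
            (1 - p23 - p13 - p12 + 2 * p123) / 16]


def probability_class(row, col):
    """Index i of the a_i value that sits at position (row, col) of S."""
    nonzero = {row, col} - {0}
    if not nonzero:
        return 0
    return nonzero.pop() if len(nonzero) == 1 else 4


def log_likelihood(F, x1, x2, x3):
    """Logarithm of equation (14); zero-frequency patterns are skipped."""
    a = pattern_probabilities(x1, x2, x3)
    return sum(F[r][c] * math.log(a[probability_class(r, c)])
               for r in range(4) for c in range(4) if F[r][c] > 0)


# Toy alignment, for illustration only.
F = observed_spectrum("ACGTACGTAC", "ACGTACGTTC", "ACGAACGTTA")
value = log_likelihood(F, 0.9, 0.9, 0.8)
```

Working with the logarithm avoids underflow for long alignments and leaves the location of the maxima unchanged.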
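This gives identities among the pattern probabilities $s_{X,Y}$, so grouping the common factors in equation (14) gives

\[
L(T) = \prod_{j=0}^{4} a_j^{\,f_j}, \tag{15}
\]

where

\[
\begin{aligned}
f_0 &= F_{\emptyset,\emptyset},\\
f_1 &= F_{\emptyset,1} + F_{1,\emptyset} + F_{1,1},\\
f_2 &= F_{\emptyset,2} + F_{2,\emptyset} + F_{2,2},\\
f_3 &= F_{\emptyset,12} + F_{12,\emptyset} + F_{12,12},\\
f_4 &= F_{1,2} + F_{1,12} + F_{2,1} + F_{2,12} + F_{12,1} + F_{12,2}.
\end{aligned}
\]

The expected sequence spectrum S can be expressed as a function of the three variables $x_1$, $x_2$, and $x_3$, so the values that maximize the likelihood L are obtained when each of the three partial derivatives vanishes: $\partial L/\partial x_j = 0$ $(j = 1, 2, 3)$. In contrast to previous works (Chor et al. 2000; Chor, Hendy, and Penny 2001; Chor, Khetan, and Snir 2003; Chor and Snir 2004), which operated in the space of the expected sequence variables $s_{D,E}$, here we are operating in the space of the path-set variables $x = [x_1\ x_2\ x_3]$. This eliminates the need to introduce the constraints of the ML points being on a tree surface. By the chain rule, we get

\[
\frac{\partial L}{\partial x_j} = L \sum_{i=0}^{4} \frac{f_i}{a_i}\,\frac{\partial a_i}{\partial x_j}, \tag{16}
\]

which must be 0 at each ML point. From equation (12), we can determine the matrix of partial derivatives,

\[
\frac{\partial a}{\partial x} = \left[\frac{\partial a_i}{\partial x_j}\right]
= \frac{1}{16}
\begin{pmatrix}
3 & 3 & 3 & 6 \\
3 & -1 & -1 & -2 \\
-1 & 3 & -1 & -2 \\
-1 & -1 & 3 & -2 \\
-1 & -1 & -1 & 2
\end{pmatrix}
\begin{pmatrix}
0 & x_3 & x_2 \\
x_3 & 0 & x_1 \\
x_2 & x_1 & 0 \\
x_2x_3 & x_1x_3 & x_1x_2
\end{pmatrix}; \tag{17}
\]

hence, at each turning point ($\partial L/\partial x = 0$) we have from equation (16)

\[
\begin{pmatrix} \dfrac{f_0}{a_0} & \dfrac{f_1}{a_1} & \dfrac{f_2}{a_2} & \dfrac{f_3}{a_3} & \dfrac{f_4}{a_4} \end{pmatrix}
\begin{pmatrix}
3 & 3 & 3 & 6 \\
3 & -1 & -1 & -2 \\
-1 & 3 & -1 & -2 \\
-1 & -1 & 3 & -2 \\
-1 & -1 & -1 & 2
\end{pmatrix}
\begin{pmatrix}
0 & x_3 & x_2 \\
x_3 & 0 & x_1 \\
x_2 & x_1 & 0 \\
x_2x_3 & x_1x_3 & x_1x_2
\end{pmatrix}
= \begin{pmatrix} 0 & 0 & 0 \end{pmatrix}. \tag{18}
\]

The three relations of equation (18) are rational functions in the three variables $x_i$. If we multiply by the common denominators, we obtain three polynomials, each of degree 14 in the variables. Solving these simultaneously might be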

feasible; however, we can simplify the system by imposing the additional constraints $q_1 = q_2 \le q_3$, which constrain the edge lengths to satisfy the molecular clock, as illustrated in figure 1b. These constraints imply $x_1 = x_2 \ge x_3$. In our analysis below, we will explicitly impose the equality $x_1 = x_2$ to find the turning points. The inequality will need to be tested on any potential solution and, if it is not satisfied, a maximum can be sought on the boundary of the valid tree domain, $x_1 = x_2 = x_3$. The constraint $x_1 = x_2$ implies $a_1 = a_2$, so by replacing $x_1$ and $a_1$, equation (18) reduces to

\[
\begin{pmatrix} \dfrac{f_0}{a_0} & \dfrac{f_1 + f_2}{a_2} & \dfrac{f_3}{a_3} & \dfrac{f_4}{a_4} \end{pmatrix}
\begin{pmatrix}
6 & 3 & 6 \\
2 & -1 & -2 \\
-2 & 3 & -2 \\
-2 & -1 & 2
\end{pmatrix}
\begin{pmatrix}
x_3 & x_2 \\
2x_2 & 0 \\
2x_2x_3 & x_2^2
\end{pmatrix}
= \begin{pmatrix} 0 & 0 \end{pmatrix}. \tag{19}
\]

From equation (13) with $x_1 = x_2$, each $a_i$ is a polynomial in $x_2$, $x_3$, so multiplying equation (19) by $a_0a_2a_3a_4$ gives two polynomial equations in the two variables $x_2$, $x_3$ and the observed frequencies $f_j$. We refer to these polynomial equations as $E_1$ and $E_2$.

We now show that the system of two resulting polynomials $\{E_1, E_2\}$ has only finitely many solutions, all of which we can find. The major tool used here is the resultant of two polynomials. Let $f(x) = \sum_{i=0}^{d} a_i x^i$ and $g(x) = \sum_{j=0}^{d} b_j x^j$ be two polynomials in one variable, x. The resultant of f and g, denoted Res(f, g, x), is a polynomial in the coefficients $a_i$ and $b_j$ of f and g, which is 0 whenever f and g have a common zero. The coefficients can themselves be unknowns or functions of other variables, in which case the resultant replaces the two polynomials f and g by a single polynomial in one fewer variable. Computing the resultant is a classical technique for eliminating one variable from two equations. There is an elegant formula for computing it due to Sylvester and another due to Bezout, both of which have been implemented in the computer algebra package Maple.

We can compute the resultant $ER = \mathrm{Res}(E_1, E_2, x_2)$ of $E_1$ and $E_2$ with respect to $x_2$. This eliminates $x_2$ from the equations and yields a single polynomial ER, in just $x_3$ and the parameters.
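To make the elimination step concrete, here is a minimal sketch of Sylvester's formula: the resultant is the determinant of a matrix built from shifted copies of the two coefficient lists. This is not the Maple computation used in the paper; it handles only small univariate examples with numeric coefficients, and the function names are hypothetical.

```python
from fractions import Fraction


def sylvester_matrix(f, g):
    """Sylvester matrix of f and g, given as coefficient lists, highest degree first."""
    m, n = len(f) - 1, len(g) - 1
    size = m + n
    rows = []
    for i in range(n):  # n shifted copies of f
        rows.append([0] * i + list(f) + [0] * (size - m - 1 - i))
    for i in range(m):  # m shifted copies of g
        rows.append([0] * i + list(g) + [0] * (size - n - 1 - i))
    return rows


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(v) for v in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def resultant(f, g):
    """Res(f, g, x); it is zero exactly when f and g share a root."""
    return determinant(sylvester_matrix(f, g))


shared = resultant([1, -3, 2], [1, 0, -1])  # (x-1)(x-2) and x^2 - 1 share x = 1
disjoint = resultant([1, -3, 2], [1, 0, 1])  # (x-1)(x-2) and x^2 + 1 share no root
```

When the coefficients depend on a second variable, as those of $E_1$ and $E_2$ depend on $x_3$, the same determinant becomes a polynomial in that variable, and its roots are the values at which a common solution can exist.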
The polynomial ER has the form: ER 5 kf f 1 f f x 1 ðx 1 1Þðx 1 x 1 1Þðx 1 1Þ ðx 1 x 1 Þðx ÿ 1Þ ðx 1 1Þ P ; (20) where P is the degree 11 polynomial with 88 monomials displayed in the appendix and k is some large constant.

Theorem 2. The turning points of L (eq. 14) corresponding to realistic trees (namely, trees with positive edge lengths) are exactly the roots of P.

Proof. The only factors in ER (eq. 20) that admit positive real roots are P and $(x_3 - 1)$. However, $x_3 = 1$ implies $q_3 = 0$, which is not realistic. ∎

Corollary 1. The Jukes-Cantor triplet has a finite number of ML points.

Proof. P has at most 11 different solutions and, for each such solution, we can back substitute to obtain all the values of $x_2$. ∎

Maxima on the Boundaries of the Parameter Space

The realistic parameter space $\mathcal{R}$ is bounded by the constraints $0 \le q_1 = q_2 \le q_3 \le \infty$, which imply $0 \le x_3 \le x_1 = x_2 \le 1$. The analysis above finds maxima at turning points in the interior of $\mathcal{R}$; it is also possible that the maximum value of L lies on the boundary, at a point that is not a turning point. To find these we must consider the turning points of L when the parameters are constrained to the boundary. We identify three boundary subspaces:

I: $0 = x_3 \le x_1 = x_2 \le 1$;
II: $0 \le x_3 = x_1 = x_2 \le 1$;
III: $0 \le x_3 \le x_1 = x_2 = 1$.

We must independently examine maxima on each boundary subspace. In each case, the likelihood function on the boundary can be described by a single parameter $x_i$, and solving $dL/dx_i = 0$ gives a single polynomial constraint.

In case I, with $x_3 = 0$ and $x = x_1 = x_2$, the condition $dL/dx = 0$ becomes

\[ L\,x\left(3(f_0 + f_3)(1 - x^2) - (f_1 + f_2 + f_4)(1 + 3x^2)\right) = 0, \]

which has a single positive root

\[ x = \sqrt{1 - \tfrac{4}{3}(f_1 + f_2 + f_4)}, \]

and a zero at $x = 0$.

In case II, with $x = x_1 = x_2 = x_3$, the condition $dL/dx = 0$ becomes

\[
L\,x(1 - x)\left(9f_0(1 + 2x)(1 - x^2)(1 + x + 2x^2) + (f_1 + f_2 + f_3)(1 + 9x^2 + 6x^3)(1 - 3x)(1 + 2x) - 3f_4(1 + 9x^2 + 6x^3)(1 + x + 2x^2)\right) = 0,
\]

which factorizes into a polynomial of degree 5 and $x(1 - x)$, which has zeros at $x = 0, 1$.

In case III, with $x = x_3$ and $x_1 = x_2 = 1$, $L = 0$ (unless $f_1 = f_2 = f_4 = 0$, whence L is undefined).
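The closed form for boundary case I can be checked numerically. On that boundary ($x_3 = 0$, $x_1 = x_2 = x$) equation (12) gives $a_0 = a_3 = (1 + 3x^2)/16$ and $a_1 = a_2 = a_4 = (1 - x^2)/16$, so the likelihood depends on $x$ alone. The sketch below assumes the grouped frequencies sum to 1; the function names and the sample frequencies are hypothetical.

```python
import math


def case1_root(f0, f1, f2, f3, f4):
    """Positive root x = sqrt(1 - (4/3)(f1 + f2 + f4)), or None if it does not exist."""
    value = 1 - 4 * (f1 + f2 + f4) / 3
    return math.sqrt(value) if value > 0 else None


def case1_log_likelihood(x, f):
    """Logarithm of L on boundary I, as a function of x = x1 = x2 (with x3 = 0)."""
    f0, f1, f2, f3, f4 = f
    a_same = (1 + 3 * x * x) / 16  # a_0 = a_3 on this boundary
    a_diff = (1 - x * x) / 16      # a_1 = a_2 = a_4 on this boundary
    return (f0 + f3) * math.log(a_same) + (f1 + f2 + f4) * math.log(a_diff)


# Hypothetical grouped frequencies (f0, f1, f2, f3, f4), summing to 1.
f = (0.70, 0.05, 0.05, 0.10, 0.10)
x_star = case1_root(*f)
at_root = case1_log_likelihood(x_star, f)
nearby = [case1_log_likelihood(x_star + step, f) for step in (-0.01, 0.01)]
```

Comparing `at_root` with the two neighbouring values is a quick way to confirm that the root is a maximum rather than some other turning point.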
Results and Discussion

Here we report experimental results obtained by applying our method to genomic sequences. We conclude with a discussion that points out future research directions.

Results on Genomic Sequences

In order to evaluate our method, we tested it on three sets of real genomic sequences. In each case, we found a single solution to $P = 0$ in the realistic parameter space,

for each of the three rooted triples. By evaluating the likelihood at each of these turning points, we found each was a maximum.

We looked at the natural killer cell receptor D gene on human, mouse, and rat (accession numbers AF15, AF1 and AF9511, respectively). We aligned the sequences using ClustalW (Thompson, Higgins, and Gibson 1994). Next, we computed the observed sequence spectrum F (eq. 1, as explained in Obtaining the ML Solution). F 5 18 18 8 1 1 5 : ð1þ We calculated the ML value for each of the three rooted trees under the model for the three species. The ((rat, mouse), human) tree was maximal, with edge lengths $q_1 = q_2 = .19$ to rat and mouse and $q_3 = .11$ to human, giving the log likelihood ln L = −8.. This result suggests the human-rodent split at MYA and the mouse-rat split at MYA, a result consistent with commonly accepted dates.

We also calculated the ML value for each of the three rooted trees for the beta actin gene, for the three species guinea pig, goose, and Caenorhabditis elegans (accession numbers AF589, M111, and NM_, respectively), finding the ((guinea pig, goose), C. elegans) tree maximal, with $q_1 = q_2 = .1819$ and $q_3 = .5188$, giving ln L = −11.5.

Finally, we calculated the ML value for each of the three rooted trees for the histone gene of Drosophila melanogaster, Hydra vulgaris, and human (accession numbers AY851, AY85, and NM_1, respectively), finding the ((D. melanogaster, H. vulgaris), human) tree maximal, with $q_1 = q_2 = .1555$ and $q_3 = .1$, with ln L = −8.851.

Each of the results above agrees closely with the numerical values obtained using the popular phylogenetic reconstruction packages PHYLIP (Felsenstein 1995) and PAUP* (Swofford 1998), which use iterative methods to estimate the maxima. In each case, the likelihood function had a unique maximum in the parameter range.
Directions for Future Research

The progress made here brings up a number of open problems:

Our ML solutions are derived from the roots of a univariate, degree 11 polynomial. This implies that the number of ML solutions is finite. It would be interesting to explore the question of uniqueness of the solution. If the solution is unique, this will most likely follow from the existence of a single solution corresponding to a realistic tree, as in Chor, Khetan, and Snir (2003).

The Jukes-Cantor substitution model is a special case of the family of Kimura substitution models. It would be interesting to extend the result of this paper to the other models (two and three parameters) of the Kimura family.

It would be interesting to extend these results to rooted trees with four leaves under the Jukes-Cantor model and a molecular clock. Here we have two different topologies, the fork and the comb (Chor, Khetan, and Snir 2003). It is expected that such an extension will face substantial technical difficulties.

Acknowledgments

Thanks to Joseph Felsenstein for fruitful discussions, to Bernd Sturmfels for enlightening comments on this manuscript and for informing us about his manuscript (Hosten, Khetan, and Sturmfels 2004), and to the two other reviewers for their constructive comments. This research was supported by the Israel Science Foundation grant 18/.

Appendix

The polynomial P is presented below; its coefficients are functions of the observed pattern frequencies $f_i$, with $f_{12} = f_1 + f_2$.
Using Maple we find:

$$P = c_{11}x^{11} + c_{10}x^{10} + c_{9}x^{9} + \cdots + c_{1}x + c_{0},$$

where each of the twelve coefficients $c_{11}, \ldots, c_{0}$ (labels introduced here) is a polynomial in the observed pattern frequencies. [The explicit coefficient expressions are not recoverable from this copy of the text.]

Literature Cited

Aho, A. V., Y. Sagiv, T. G. Szymanski, and J. D. Ullman. 1981. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J. Comput. 1:5 1.
Catanese, F., S. Hosten, A. Khetan, and B. Sturmfels.. The maximum likelihood degree. (http://front.math.ucdavis.edu/math.ag/5).
Chor, B., M. Hendy, B. Holland, and D. Penny.. Multiple maxima of likelihood in phylogenetic trees: an analytic approach. Mol. Biol. Evol. 1:159 151. (Earlier version appeared in RECOMB ).
Chor, B., M. Hendy, and D. Penny. 1. Analytic solutions for three taxon ML MC trees with variable rates across sites. Pp. 1 in B. Moret and O. Gascuel, eds. Lecture Notes in Computer Science 19 in Workshop on Algorithms in BioInformatics 1. Springer-Verlag, Heidelberg, Germany.
Chor, B., A. Khetan, and S. Snir.. Maximum likelihood on four taxa phylogenetic trees: analytic solutions. Pp. 8 in W. Miller, M. Vingron, S. Istrail, P. Pevzner, and M. Waterman, eds. Proceedings of the Seventh Annual International Conference on Computational Molecular Biology (RECOMB). ACM Press, Berlin.
Chor, B., and S. Snir.. Molecular clock fork phylogenies: closed form analytic maximum likelihood solutions. Syst. Biol. 5():9 9.
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 1: 8.
Felsenstein, J. 1995. PHYLIP (phylogeny inference package). Distributed by the author, Department of Genetics, University of Washington, Seattle.
Hendy, M. D., and D. Penny. 199. Spectral analysis of phylogenetic data. J. Classif. 1:5.
Hendy, M. D., D. Penny, and M. A. Steel. 199. Discrete Fourier analysis for evolutionary trees. Proc. Natl. Acad. Sci. USA 91:9.
Hendy, M. D., and S. Snir. 5. Hadamard conjugation for the Kimura 3ST model: combinatorial proof using pathsets. (http://arxiv.org/abs/q-bio.pe/5555).
Hosten, S., A. Khetan, and B. Sturmfels.. Solving the likelihood equations. (http://front.math.ucdavis.edu/math.st/8).
Jukes, T. H., and C. R. Cantor. 199. Evolution of protein molecules. Pp. 1 1 in H. N. Munro, ed. Mammalian protein metabolism. Academic Press, New York.
Kimura, M. 1981. Estimation of evolutionary distances between homologous nucleotide sequences. Proc. Natl. Acad. Sci. USA 8:5 58.
Neyman, J. 191. Molecular studies of evolution: a source of novel statistical problems. Pp. 1 in S. Gupta and Y. Jackel, eds. Statistical decision theory and related topics. Academic Press, New York.
Steel, M., M. D. Hendy, and D. Penny. 1998. Reconstructing phylogenies from nucleotide pattern probabilities: a survey and some new results. Discrete Appl. Math. 88: 9.
Steel, M. A., M. D. Hendy, L. A. Székely, and P. L. Erdös. 199. Spectral analysis and a closest tree method for genetic sequences. Appl. Math. Lett. 5:.
Sturmfels, B., and S. Sullivant. 5. Toric ideals of phylogenetic invariants. J. Comput. Biol. 1: 8.
Swofford, D. L. 1998. PAUP*beta. Sinauer Associates, Sunderland, Mass.
Székely, L., P. L. Erdös, M. A. Steel, and D. Penny. 199. A Fourier inversion formula for evolutionary trees. Appl. Math. Lett. :1 1.
Thompson, J. D., D. G. Higgins, and T. J. Gibson. 199. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalty and weight matrix choice. Nucleic Acids Res. : 8.
Yang, Z.. Complexity of the simplest phylogenetic estimation problem. Proc. R. Soc. Lond. B :19 11.

William Martin, Associate Editor
Accepted November 1, 5

Downloaded from https://academic.oup.com/mbe/article-abstract////1115 by guest on 1 January 19