A COMMON APPROACH TO FINDING THE OPTIMAL SCENARIOS OF A MARKOV STOCHASTIC PROCESS OVER A PHYLOGENETIC TREE

Article
BIOINFORMATICS
DOI:.554/bbeq.22.7

Petar Konovski
University of London, Department of Computer Science and Information Systems, Birkbeck, London, UK
petar@dcs.bbk.ac.uk

ABSTRACT
Inferring phylogenetic trees is a general approach to reconstructing the evolutionary histories of organisms. Several criteria and algorithms are used to estimate the events over a phylogenetic tree. The present work proposes a common approach together with an effective algorithm. The approach aims to unify the application of the Maximum parsimony and Maximum likelihood criteria over a phylogenetic tree with varying edge lengths. The events over an edge of the phylogenetic tree are described by a Markov stochastic process. Explicit formulae for the simplest case are given. Maximum likelihood is reformulated from the point of view of Information Theory as Minimum surprisal. Additionally, a new scoring criterion, Minimum entropy, is proposed.

Biotechnol. & Biotechnol. Eq. 2012, 26(5)

Keywords: phylogenetic tree, maximum parsimony, maximum likelihood, Markov stochastic process, minimum entropy

Introduction
Mapping evolutionary events onto a phylogenetic tree is a widely used approach to elucidating the inheritance of given traits in a set of species. Here we attempt to create a common framework for the various approaches to such mapping and to introduce some improvements to the existing algorithms.

Maximum parsimony
Maximum parsimony (MP) is the oldest approach to estimating the events over a phylogenetic tree. It is the simplest one and inevitably has its limitations; nevertheless, it is still widely used.
For the purpose of estimating phylogenetic events, it can be formulated as explaining the present-day distribution of hereditary traits in the investigated group of species using minimal assumptions about the events that happened in the common ancestors. In our case, we consider only two events that can happen in the nodes of the tree: a gain of the trait and a loss of the trait. Mere inheritance is not considered an event: it happens by default whenever the trait is present in the ancestor and no loss happens. Because the loss of a trait is expected to happen more often than its gain, a gain penalty coefficient $G > 1$ is introduced, while the loss penalty is assumed to be equal to 1. Thus, we try to minimize the sum of the penalties. Formally, the scoring function over the nodes $N$ of the tree $T$ is

\[ F(N) = \begin{cases} 1 & \text{if loss happens} \\ G & \text{if gain happens} \\ 0 & \text{if nothing happens} \end{cases} \qquad \text{(Eq. 1)} \]

The Maximum parsimony goal is to find

\[ \min \sum_{N \in T} F(N) \]

The minimum is taken over all speculative event sets which can explain the distribution of the trait in the leaves. In this case, minimal penalties mean minimum expenditure, that is, maximum parsimony.

An efficient algorithm, PARS (PARsimonious Scenario), for computing the Maximum parsimony over a given phylogenetic tree is given by Mirkin et al. (6). Although the assumption in (6) is that the gain and loss penalties are uniform over the tree, the recursive structure of PARS allows the use of node-specific penalties.

Maximum parsimony was used in different areas long before the rise of contemporary evolutionary theory. In fact, it can be considered an application of Occam's razor.

An example of a phylogenetic tree
In the example shown, the protein content of 26 prokaryotic species is investigated.
Given the present-day protein content of the species (represented by the leaves of the tree), the protein content of their ancestors (represented by the internal nodes of the tree) is estimated using Maximum parsimony. The proteins are grouped in 337 COGs (Clusters of Orthologous Groups of proteins). A graphic representation of the phylogenetic tree is given in Fig. 1. For every node of the tree, the numbers of gained and lost COGs are shown. The graphic was generated by the ER package (in testing phase). The data set used is the same as in (6). The data are extracted from the COG database (12).

Fig. 1. Gains/losses of 337 COGs mapped over a phylogenetic tree.

Maximum likelihood
In phylogenetic research, Maximum likelihood (ML) has been widely used in recent years and is considered more accurate than MP. It is a general statistical method, and its application in phylogenetic research originates from Farris (4). The method relies on estimating the likelihoods of gain or loss in every node of the tree. This allows the calculations to be applied to trees with variable edge lengths or with gain and loss rates which vary from edge to edge. The variable edge lengths reflect the variable time intervals between the ancestor-child pairs represented in the tree.

Now, the probabilities of the gains and losses for different nodes are not equal. Let $\gamma_N$ and $\lambda_N$ be the probabilities of gain and loss in node $N$, and let $\bar{\gamma}_N = 1 - \gamma_N$, $\bar{\lambda}_N = 1 - \lambda_N$. The scoring function is

\[ L(N) = \begin{cases} \lambda_N & \text{if loss happens} \\ \gamma_N & \text{if gain happens} \\ \bar{\lambda}_N & \text{if no loss happens} \\ \bar{\gamma}_N & \text{if no gain happens} \end{cases} \qquad \text{(Eq. 2)} \]

Then we search for

\[ \max \prod_{N \in T} L(N) \]

over all speculative event sets which can explain the pattern in the leaves.

The main obstacle for the algorithm is the initial set of data for $\gamma_N$ and $\lambda_N$, because at the beginning the researcher has information about the contents of the leaves only. Therefore, the algorithm is initially fed with these data from an external source (e.g., the probabilities are calculated using the output from MP). After that, an iterative process can be applied, calculating new values of $\gamma_N$ and $\lambda_N$ from the newly obtained gains and losses. This iterative procedure, together with the first open problem about it (described below), originates from Mirkin et al. (7). The procedure is a part of the algorithm MALS (MAximum Likelihood Scenario) (7), on which some of the following considerations are based.

Because the gains and losses obtained for a single trait are too scarce to support reliable conclusions about the node-dependent gain and loss rates, these rates are calculated using the data for all traits (e.g., in the example cited above, for all 337 COGs). This is based on the assumption that the gain and loss rates are equal for all traits, though they can be node-specific.

Open problems: The following problems concerning the iteration procedure still need answers:

1. Is the iteration process always convergent? So far, no situation has been observed in which the iteration enters an endless cycle. Using data similar to the example given above, the iteration procedure reaches an extremum within maximum iterations.
2. Are there any local extrema in the iteration procedure? Chor et al. (1) give counterexamples of phylogenetic trees for which the Maximum likelihood function has multiple extrema, even a continuum of extrema. A similar maximum likelihood problem is investigated analytically by Vandev et al. (13).

Materials and Methods

The Phylogenetic Scenarios from the Point of View of the Information Theory
Claude Shannon, in his work (11), founded contemporary Information Theory, or the Mathematical Theory of Communication. We will use some basic elements from it. Shannon's ideas have influenced various branches of engineering and science, from communication lines through data storage and cryptography to compression algorithms.

The information content of the outcome of a random event
Let $X$ be a discrete random event and $\omega$ be one of its outcomes, with probability $P(\omega)$. Then the knowledge that the outcome is $\omega$ gives us the amount of information (or self-information, surprisal)

\[ I(\omega) = -\log(P(\omega)). \]

If the base of the logarithm is 2, the measure is in the well-known bits; if the base is $e$, the measure is in nats; if the base is 10, the measure is in hartleys.

Example: When flipping a coin, the knowledge that the outcome is heads gives us an amount of surprisal $-\log_2(1/2) = 1$ bit.

The information content of a scenario over a phylogenetic tree
Let $\omega$ be one of the alternative scenarios over the phylogenetic tree $T$. It is described by its set of specific gains and losses in the nodes of the tree, as mentioned before. The probability of the scenario is

\[ P(\omega) = \prod_{N \in T} L(N) \]

and its information content is

\[ I(\omega) = -\log \prod_{N \in T} L(N) = -\sum_{N \in T} \log L(N). \]
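The relation between the probability of a scenario and its information content can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names `surprisal` and `scenario_surprisal` and the per-node likelihood values are hypothetical and do not come from the paper.

```python
import math


def surprisal(p, base=2.0):
    """Self-information -log_base(p) of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log(p) / math.log(base)


def scenario_surprisal(node_likelihoods, base=2.0):
    """Information content of a scenario: the sum of the per-node surprisals."""
    return sum(surprisal(p, base) for p in node_likelihoods)


# Hypothetical per-node likelihoods L(N) of two competing scenarios.
scenario_a = [0.9, 0.8, 0.1]
scenario_b = [0.9, 0.2, 0.7]

# The coin example: one bit of surprisal for a fair coin.
coin_bits = surprisal(0.5)

# The scenario with the larger product of likelihoods has the smaller surprisal.
info_a = scenario_surprisal(scenario_a)
info_b = scenario_surprisal(scenario_b)
```

Summing logarithms instead of multiplying probabilities also avoids numerical underflow on large trees, which is a practical reason to prefer the additive form.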
The phylogenetic scenario which minimizes the above function maximizes the likelihood function and vice versa. This follows from the fact that the function $-\log$ is strictly decreasing. Hence, one can use the scoring function

\[ I(N) = \begin{cases} -\log(\lambda_N) & \text{if loss happens} \\ -\log(\gamma_N) & \text{if gain happens} \\ -\log(\bar{\lambda}_N) & \text{if no loss happens} \\ -\log(\bar{\gamma}_N) & \text{if no gain happens} \end{cases} \qquad \text{(Eq. 3)} \]

instead of (Eq. 2) and search for

\[ \min \sum_{N \in T} I(N). \]

It can be seen that the function which we intend to minimize is very similar to that of the maximum parsimony. This allows us to apply a common minimization algorithm.

The entropy of a random variable
Another concept from Information Theory which can be useful is the entropy of a random variable. It is a different entity from the physical concept of entropy, though they have some important similarities. Now, let $X$ be a random variable with outcomes $\{x_1, \ldots, x_n\}$ (not necessarily numbers) which happen with the corresponding probabilities $\{\lambda_1, \ldots, \lambda_n\}$. If the outcome of the experiment is $x_i$, its information content is $I(x_i) = -\log(\lambda_i)$, and the entropy of $X$ is defined as

\[ H(X) = \sum_{i=1}^{n} \lambda_i I(x_i) = -\sum_{i=1}^{n} \lambda_i \log(\lambda_i). \]

The entropy of a scenario over a phylogenetic tree
Another approach which can benefit from the common framework is to investigate the entropy of the scenarios over a phylogenetic tree. Let $\gamma_N$ and $\lambda_N$ be the probabilities of gain and loss in node $N$, and let $\bar{\gamma}_N = 1 - \gamma_N$, $\bar{\lambda}_N = 1 - \lambda_N$. The scoring function is

\[ H(N) = \begin{cases} -\lambda_N \log(\lambda_N) & \text{if loss happens} \\ -\gamma_N \log(\gamma_N) & \text{if gain happens} \\ -\bar{\lambda}_N \log(\bar{\lambda}_N) & \text{if no loss happens} \\ -\bar{\gamma}_N \log(\bar{\gamma}_N) & \text{if no gain happens} \end{cases} \qquad \text{(Eq. 4)} \]

The Phylogenetic Tree Events As Outputs of a Markov Process
It is accepted as a paradigm that, given the genetic contents of a specific organism, the genetic contents of its children depend on this information only and do not depend on the genetic contents of other ancestors and siblings. This justifies the statement that the processes of inheritance have the Markov property.
Following this point of view, the natural way to introduce lengths for the edges of a phylogenetic tree is to consider them as time intervals. So, when talking about edge lengths, we assume that they are time intervals between the considered events, often without being able to specify the measurement units of this time.

Definitions and examples
Let $A$ be a parent node, $S$ be one of its children, and the edge length be arbitrary. Let the trait of interest be a set of characters $\{c_0, c_1, \ldots, c_m\}$, only one of which can be present in $A$ or in $S$. The outcomes of the process can be described by the matrix $(P_{ij})$, where $P_{ij}$ is the probability of replacing the character $c_i$ (if present in $A$) by the character $c_j$ in $S$.

Example 1: In microevolutionary studies of Single Nucleotide Polymorphism (SNP), the characters can be chosen as {0, A, C, G, T} (empty, Adenine, Cytosine, Guanine, Thymine). In a specified position in a DNA string, all possible events which can happen are described by a 5 × 5 matrix containing the corresponding probabilities.

Example 2: When a given amino acid is substituted by another one in a polypeptide chain, the matrix describing the process is 20 × 20. If the character 0 (absence of the amino acid) is also considered, the matrix becomes 21 × 21.

Example 3: In our case, we need only two characters: 0 (the trait $k$ is absent) and 1 (the trait $k$ is present). The matrix is 2 × 2.

Formulation as a Markov process: Kolmogorov's forward equation
Further in this section, we follow Ross (9). Let $g$ be the gain rate and $l$ be the loss rate over the edge. The infinitesimal generator of the Markov process is given by the matrix

\[ M = \begin{pmatrix} -g & g \\ l & -l \end{pmatrix} \]

The probability that a given character $i$ will be replaced by the character $j$ along an edge of length $t$ is $P_{ij}(t)$, where $i$ and $j$ take the values 0 or 1. $P(t)$ is the solution of Kolmogorov's forward equations, which in matrix form can be written as

\[ P'(t) = P(t)\,M. \]

The solution is represented as $P(t) = \exp(Mt)$. Here, $\exp(Mt)$ is the matrix exponential of $Mt$.

The solutions
Luckily, the solutions in the case of two characters can be found analytically. They are as follows:

\[ P_{00}(t) = \frac{l}{l+g} + \frac{g}{l+g}\exp(-(l+g)t) \] The character is absent and will not be gained after time $t$.

\[ P_{01}(t) = \frac{g}{l+g} - \frac{g}{l+g}\exp(-(l+g)t) \] The character is absent and will be gained after time $t$.

\[ P_{10}(t) = \frac{l}{l+g} - \frac{l}{l+g}\exp(-(l+g)t) \] The character is present and will be lost after time $t$.

\[ P_{11}(t) = \frac{g}{l+g} + \frac{l}{l+g}\exp(-(l+g)t) \] The character is present and will not be lost after time $t$.

Comment
Though the infinitesimal rates $g$ and $l$ and the time $t$ are additional unknown parameters, this approach gives us a basis for evaluating the node-dependent probabilities of gain and loss in some situations.
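The closed-form solutions for the two-character case can be checked numerically. The following Python sketch is an illustration only: the function names and the rate values used in the check are assumptions made here, not part of the paper. It evaluates the four transition probabilities and compares them with an independent approximation of $\exp(Mt)$ obtained by repeated squaring of $I + Mt/2^k$.

```python
import math


def transition_probabilities(g, l, t):
    """Closed-form P(t) for the two-state gain/loss process.

    Returns [[P00, P01], [P10, P11]]; state 0 = trait absent, 1 = present.
    """
    if g < 0 or l < 0 or g + l == 0:
        raise ValueError("rates must be non-negative and not both zero")
    s = g + l
    decay = math.exp(-s * t)
    return [
        [l / s + g / s * decay, g / s - g / s * decay],
        [l / s - l / s * decay, g / s + l / s * decay],
    ]


def matmul2(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def expm_by_squaring(g, l, t, squarings=20):
    """Approximate exp(Mt) as (I + M*t/2^k)^(2^k), used as an independent check."""
    h = t / 2 ** squarings
    p = [[1.0 - g * h, g * h], [l * h, 1.0 - l * h]]
    for _ in range(squarings):
        p = matmul2(p, p)
    return p


# Arbitrary illustrative rates and edge length (assumptions, not data).
closed_form = transition_probabilities(g=0.2, l=1.0, t=1.5)
numerical = expm_by_squaring(g=0.2, l=1.0, t=1.5)
```

Each row of the result sums to one, and for large $t$ both rows approach the stationary distribution $(l/(l+g),\; g/(l+g))$, which is independent of the starting state.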
Conversely, given an estimate of the gain and loss probabilities, one can estimate $t$, $g$ and $l$. This approach also explains the four scoring coefficients which are attached to each node in the previous considerations. Although they are not the time-dependent probabilities found above, they are functions of them or their rough approximation (as in the case of Maximum parsimony). The nature of these solutions shows how to use any information about the time spent during the transition from the parent to the child.

Results and Discussion

A Generalized Minimization Algorithm

Preliminary notes
The algorithm described is a generalization of a set of algorithms which have been reinvented several times and applied in special cases, sometimes as MP and sometimes as ML reconstruction. The first idea of such an approach was developed by Fitch (5) using a set-theoretic approach and was applied to MP for nucleotide substitution reconstruction. A formal description was given by Sankoff (10) and applied to MP. An implementation for ML in the case of amino-acid substitutions was made by Pupko et al. (8).

The generalized approach aims to deal with a class of assessments which includes Maximum parsimony, Minimum surprisal (equivalent to Maximum likelihood) and Minimum entropy. The approach covers the case of trees with different edge lengths and the case when the events have different rates over different parts of the tree.

According to the Markov properties of the processes over the phylogenetic tree, a wide class of scoring approaches lead to a set of weighting coefficients per node. In our case, there are four coefficients. In the corresponding weighting function, one and only one of them appears as an addend representing the corresponding node, and its choice is made according to uniform rules.

Notations
Let the investigated phylogenetic tree $T$ be binary and rooted with root $R$, and let $L \subset T$ be the set of the leaves. For $N \in T \setminus L$ we denote the subtree with root $N$ as $U(N)$.
Let $K = \{k_1, \ldots, k_m\}$ be a set of independent hereditary characters whose presence or absence in the leaves is given as initial data. In the further presentation, only a fixed $k \in K$ will be considered. The basic building block of the tree is an ancestor node $A$ with two children $S_1$ and $S_2$. There is a set of four cases, each of which has its own weighting coefficient:

Loss: $k \in A$ and $k \notin S$: $l_S$;
Gain: $k \notin A$ and $k \in S$: $g_S$;
Not Loss (or inheritance): $k \in A$ and $k \in S$: $\bar{l}_S$;
Not Gain: $k \notin A$ and $k \notin S$: $\bar{g}_S$.

In particular, for the scoring approaches considered above, we have:
For the Maximum parsimony: $l = 1$, $g = G$, $\bar{l} = 0$, $\bar{g} = 0$.
For the Minimum surprisal: $l = -\log(\lambda)$, $g = -\log(\gamma)$, $\bar{l} = -\log(\bar{\lambda})$, $\bar{g} = -\log(\bar{\gamma})$.
For the Minimum entropy: $l = -\lambda\log(\lambda)$, $g = -\gamma\log(\gamma)$, $\bar{l} = -\bar{\lambda}\log(\bar{\lambda})$, $\bar{g} = -\bar{\gamma}\log(\bar{\gamma})$.

The elementary Score Function
For every $A \in T \setminus L$, and conditionally on the trait $k$, we define a triad of Boolean variables as follows: $(k \in A,\; k \in S_1,\; k \in S_2)$. The number of all such triads is $2^3 = 8$. The elementary score function $\Phi$ is defined on $A$ (actually, on its associated triad) as follows:

\[ \Phi(A) = \begin{cases} (0,0,0): & \bar{g}_{S_1} + \bar{g}_{S_2} \\ (0,0,1): & \bar{g}_{S_1} + g_{S_2} \\ (0,1,0): & g_{S_1} + \bar{g}_{S_2} \\ (0,1,1): & g_{S_1} + g_{S_2} \\ (1,0,0): & l_{S_1} + l_{S_2} \\ (1,0,1): & l_{S_1} + \bar{l}_{S_2} \\ (1,1,0): & \bar{l}_{S_1} + l_{S_2} \\ (1,1,1): & \bar{l}_{S_1} + \bar{l}_{S_2} \end{cases} \qquad \text{(Eq. 5)} \]

Note that it does not take into account the coefficients attached to $A$ but only those of its children.

Recursive definition of the minimizing functions
For every $N \in T \setminus (L \cup \{R\})$ (i.e., an internal node which is not the root) we define

\[ \Lambda(N) = \min \sum_{A \in U(N)} \Phi(A) \]

where the minimum is taken over all possible scenarios. This is the value of the minimum reached in the tree whose root is $N$, without taking into account the weighting coefficients of $N$ itself. The following two restrictions of $\Lambda$ will be used: $\Lambda_0(N)$, the minimum under the condition $k \notin N$, and $\Lambda_1(N)$, the minimum under the condition $k \in N$.

Note: Finally, we define

\[ \Lambda(R) = \min\bigl(\Lambda_0(R),\; \Lambda_1(R) + g_R\bigr). \]

This will be used when we calculate the global minimum as $\Lambda(R)$. Adding $g_R$ (and choosing a proper gain weight for the root node) is important from a conceptual point of view. If we do not add such a penalty for the root, the minimization procedure tends to move the origin of the trait towards the root. In other words, the root becomes a source of free gains, contrary to common evolutionary knowledge. This is true especially if the gain event is considered less likely to happen than the loss.
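The three coefficient sets and the elementary score function of Eq. 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `Coeffs`, `edge_cost` and `phi`, the penalty value G = 2 and the probabilities used in the usage lines are hypothetical and are not taken from the paper.

```python
import math
from typing import NamedTuple


class Coeffs(NamedTuple):
    """The four weighting coefficients attached to a child node S."""
    loss: float      # l:     k in A, k not in S
    gain: float      # g:     k not in A, k in S
    no_loss: float   # l-bar: k in A, k in S
    no_gain: float   # g-bar: k not in A, k not in S


def parsimony_coeffs(gain_penalty):
    """Maximum parsimony: l = 1, g = G, l-bar = 0, g-bar = 0."""
    return Coeffs(1.0, gain_penalty, 0.0, 0.0)


def surprisal_coeffs(lam, gam):
    """Minimum surprisal; lam and gam must lie strictly between 0 and 1."""
    return Coeffs(-math.log(lam), -math.log(gam),
                  -math.log(1.0 - lam), -math.log(1.0 - gam))


def entropy_coeffs(lam, gam):
    """Minimum entropy; lam and gam must lie strictly between 0 and 1."""
    def h(p):
        return -p * math.log(p)
    return Coeffs(h(lam), h(gam), h(1.0 - lam), h(1.0 - gam))


def edge_cost(c, in_parent, in_child):
    """Pick the one coefficient that applies to a parent/child state pair."""
    if in_parent:
        return c.no_loss if in_child else c.loss
    return c.gain if in_child else c.no_gain


def phi(in_a, in_s1, in_s2, c1, c2):
    """Elementary score (Eq. 5) for the triad (k in A, k in S1, k in S2)."""
    return edge_cost(c1, in_a, in_s1) + edge_cost(c2, in_a, in_s2)


# Usage with a hypothetical gain penalty G = 2 on both children.
mp = parsimony_coeffs(2.0)
double_loss = phi(True, False, False, mp, mp)   # l + l
single_gain = phi(False, True, False, mp, mp)   # g + g-bar
```

Keeping the criterion-specific numbers in one small record means that the minimization itself never needs to know which criterion is in use, which is the point of the common framework.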
The following assertions hold:

\[ \Lambda_0(A) = \min \begin{cases} \Lambda_0(S_1) + \Lambda_0(S_2) + \bar{g}_{S_1} + \bar{g}_{S_2}, \\ \Lambda_0(S_1) + \Lambda_1(S_2) + \bar{g}_{S_1} + g_{S_2}, \\ \Lambda_1(S_1) + \Lambda_0(S_2) + g_{S_1} + \bar{g}_{S_2}, \\ \Lambda_1(S_1) + \Lambda_1(S_2) + g_{S_1} + g_{S_2} \end{cases} \qquad \text{(Eq. 6)} \]

Proof: The components of the minimum expression cover the full set of events which can happen under the condition $k \notin A$. In every case, the corresponding score coefficients are added.

\[ \Lambda_1(A) = \min \begin{cases} \Lambda_0(S_1) + \Lambda_0(S_2) + l_{S_1} + l_{S_2}, \\ \Lambda_0(S_1) + \Lambda_1(S_2) + l_{S_1} + \bar{l}_{S_2}, \\ \Lambda_1(S_1) + \Lambda_0(S_2) + \bar{l}_{S_1} + l_{S_2}, \\ \Lambda_1(S_1) + \Lambda_1(S_2) + \bar{l}_{S_1} + \bar{l}_{S_2} \end{cases} \qquad \text{(Eq. 7)} \]

Proof: The components of the minimum expression cover the full set of events which can happen under the condition $k \in A$. In every case, the corresponding score coefficients are added.

Note: In real-world biological examples, the gain and loss coefficients are much bigger than the not-gain and not-loss coefficients. This comes from the fact that mutations are relatively rare events. Thus, the last case in Eq. 6 and the first case in Eq. 7 will never qualify for reaching the minimums and can be excluded. They arise naturally from the logic of the explanation, however, as they stand for scenarios which are not impossible.

The case of the leaves
The functions $\Phi$ and $\Lambda$, which we defined above, use data from the children of the corresponding node. When the node is a leaf, they must be defined explicitly. So, if $S \in L$, the following settings conform with the general definitions:

\[ \Lambda(S) = \Phi(S) = 0 \]

An overview of the implementation
If we have already calculated $\Lambda_0(S_1)$, $\Lambda_0(S_2)$, $\Lambda_1(S_1)$ and $\Lambda_1(S_2)$:
- Calculate $\Lambda_0(A)$ and $\Lambda_1(A)$;

- Remember the index of the expression in each minimum at which the minimal value is achieved;
Note: It may happen that a minimum is achieved at more than one expression. This can lead to finding alternative scenarios, all of which are extremal. For now, we do not investigate this opportunity.
- On reaching the root of the tree, calculate $\Lambda(R) = \min\bigl(\Lambda_0(R),\; \Lambda_1(R) + g_R\bigr)$;
- Remember the expression at which the minimum is achieved (this determines whether $k \in R$, that is, which of the two competing alternatives wins);
- Using the stored index of the minimal expression, determine the presence or absence of $k$ in the children of $R$ and continue recursively.

Comparison with PARS and MALS algorithms
The given minimization algorithm can achieve the same results (a particular scenario) as any other algorithm, given that the scoring coefficients are the same. Therefore, the results are expected to be identical with those obtained by PARS (Maximum parsimony) and MALS (Maximum likelihood) if the proper scoring coefficients are chosen. Unlike PARS and MALS, no lists of nodes are kept here, which should simplify the implementation and allow its application to much bigger phylogenetic trees.

The considered approach and the proposed algorithm can handle a wide class of phylogenetic trees and scoring criteria. The scoring approaches include the most popular ones, Maximum parsimony and Maximum likelihood. In the given context, Maximum parsimony is a variant of the same optimisation approach, with a simple and uniform set of scoring coefficients.

The iteration procedure described brings the mapping of the gains and losses into conformity with the corresponding probabilities used. As mentioned above, the uniqueness of the extremum for the iterative ML approach is not guaranteed. The conditions under which the problem has a single extremum remain to be elucidated in the future.

An additional scoring criterion, Minimum entropy, is proposed.
It fits the same approach as MP and ML, with differences in the set of the scoring coefficients only. Whether Minimum entropy, as formulated, is a useful approach is a subject for further research.

Some authors have used the log likelihood as a simplification step in Maximum likelihood computations, but it was not recognized as the negative of the information content of a phylogenetic scenario. One can see recent examples (3, 14) and many others. This can be considered a reinvention of the concept of information, though in a different context, and shows that useful scientific paradigms inevitably appear when there is a need for them.

Acknowledgements
This work was partially supported by DCSIS, Birkbeck, University of London, London, UK. It is a further extension of previously published results (6, 7), and the author is grateful to Prof. Boris Mirkin and Prof. Trevor Fenner from DCSIS, Birkbeck, for their guidance. The author would also like to thank the anonymous referees of the first version of the paper for their constructive criticism.

REFERENCES
1. Chor B., Hendy M.D., Holland B.R., Penny D. (2000) Mol. Biol. Evol., 17.
2. Cohen O., Pupko T. (2010) Mol. Biol. Evol., 27.
3. Guindon S., Dufayard J.-F., et al. (2010) Syst. Biol., 59.
4. Farris J.S. (1973) Syst. Zool.
5. Fitch W. (1971) Syst. Zool., 20.
6. Mirkin B.G., Fenner T.I., Galperin M.Y., Koonin E.V. (2003) BMC Evol. Biol., 3(2).
7. Mirkin B.G., Camargo R., Fenner T.I., Loizou G., Kellam P. (2006) In: Proceedings of the 2006 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (D. Ashlock, Ed.), Piscataway.
8. Pupko T., Pe'er I., Shamir R., Graur D. (2000) Mol. Biol. Evol., 17.
9. Ross S.M. (1996) Stochastic Processes, John Wiley & Sons, New York.
10. Sankoff D. (1975) SIAM J. Appl. Math., 28.
11. Shannon C. (1948) Bell Syst. Tech. J., 27.
12. Tatusov R.L., Natale D.A., et al. (2001) Nucleic Acids Res., 29.
13. Vandev D., Prodanova K., Petkov V. (2003) Application of Mathematics in Engineering and Economics, BULVEST 2000, Sofia.
14. Yang J., Benyamin B., et al. (2010) Nat. Gen.
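The generalized minimization algorithm described in Results and Discussion (Eqs. 6 and 7, the root rule, and the traceback of the stored minimal expressions) can be sketched in Python as follows. This is a minimal sketch, not the author's implementation. The names `Node`, `solve` and `best_scenario` are hypothetical. It also makes one assumption that the paper does not state explicitly: for a leaf, the value of the minimizing function for the state that contradicts the observed data is set to infinity, so that only scenarios compatible with the leaves can win.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    coeffs: tuple                      # (l, g, l_bar, g_bar) attached to this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    present: Optional[bool] = None     # observed state, leaves only


def edge_cost(child, parent_state, child_state):
    """Coefficient of `child` selected by the parent/child states."""
    l, g, l_bar, g_bar = child.coeffs
    if parent_state:
        return l_bar if child_state else l
    return g if child_state else g_bar


def solve(node, table):
    """Bottom-up pass: returns (Lambda_0, Lambda_1) and stores the best child states."""
    if node.left is None:
        # Assumption: the state contradicting the observation is forbidden.
        lam = (math.inf, 0.0) if node.present else (0.0, math.inf)
        table[node.name] = (lam, (None, None))
        return lam
    lam1 = solve(node.left, table)
    lam2 = solve(node.right, table)
    best, choice = [], []
    for a in (0, 1):                   # a = 0: Eq. 6, a = 1: Eq. 7
        value, states = min(
            (lam1[s1] + lam2[s2]
             + edge_cost(node.left, a, s1) + edge_cost(node.right, a, s2),
             (s1, s2))
            for s1 in (0, 1) for s2 in (0, 1))
        best.append(value)
        choice.append(states)
    table[node.name] = (tuple(best), tuple(choice))
    return tuple(best)


def best_scenario(root):
    """Returns the minimal score and the presence/absence of k in every node."""
    table = {}
    lam0, lam1 = solve(root, table)
    g_root = root.coeffs[1]
    total, root_state = min((lam0, 0), (lam1 + g_root, 1))   # root rule
    states = {}

    def trace(node, state):
        states[node.name] = bool(state)
        if node.left is not None:
            s1, s2 = table[node.name][1][state]
            trace(node.left, s1)
            trace(node.right, s2)

    trace(root, root_state)
    return total, states


# Hypothetical example: Maximum parsimony coefficients with G = 2 on the tree ((L1, L2)A, L3)R.
MP = (1.0, 2.0, 0.0, 0.0)
leaf1 = Node("L1", MP, present=True)
leaf2 = Node("L2", MP, present=True)
leaf3 = Node("L3", MP, present=False)
inner = Node("A", MP, leaf1, leaf2)
root = Node("R", MP, inner, leaf3)
total, states = best_scenario(root)
```

In this example the minimal score is 2: a single gain at node A. Placing the trait at the root would cost one root gain plus one loss at L3, that is 3, which illustrates why the root gain weight matters. Ties between equal minima are broken arbitrarily here, in line with the remark that alternative extremal scenarios are not investigated.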
