Phylogenetic Inference and Hypothesis Testing. Catherine Lai (92720) BSc(Hons) Department of Mathematics and Statistics University of Melbourne


Phylogenetic Inference and Hypothesis Testing Catherine Lai (92720) BSc(Hons) Department of Mathematics and Statistics University of Melbourne November 13, 2003

Contents

1 Introduction

2 Molecular Phylogenetics: The Use of Phylogenetic Trees; Traditional Approaches; Phylogenetic Trees From Genomic Data; What about the root?; How Treelike is Evolution?

3 Models of Evolution: A Simple Approach; Evolution as a stochastic process; Markov Models of Evolution; Markov Theory; Markov Models of Site Substitution; Parameterized Models of Nucleotide Evolution; Jukes-Cantor Model; Jukes-Cantor Variance; Generalisations of the Jukes-Cantor Model; Problems with Markov Models of Evolution; Modelling Rate Heterogeneity; Modelling Non-Stationarity; Summary of Nucleotide Markov Models; Empirical Models of amino acid evolution; PAM/Dayhoff Substitution Matrices; BLOSUM; Differences in PAM and BLOSUM

4 Phylogenetics Tree Reconstruction Methods: Evaluating Reconstruction Methods; Complexity; Accuracy; Consistency; Efficiency; Robustness; Usability in tests; Parsimony; Maximum Likelihood; Is MP the same as ML?; Distance Based Methods; Unweighted pair group method using arithmetic averages (UPGMA); 4.5.2 The Molecular Clock Hypothesis; Long Branch Attraction; Neighbour Joining; BIONJ; Weighbor; NJ and the minimum evolution method; Least Squares; Estimating Branch lengths; Minimum Evolution Method with Least Squares; Bayesian Tree Reconstruction; Trees from Alignments, Alignments from Trees

5 Phylogenetic Hypothesis Tests: Confidence Regions of Phylogenetic Trees; The Bootstrap; The Non-parametric Bootstrap; Testing Phylogenies using the Non-parametric Bootstrap; How well does it work?; The Parametric Bootstrap; Problems with the Parametric Bootstrap; Bootstrap Based Tests; Centering; The Kishino Hasegawa Test; The Shimodaira Hasegawa Test; The Swofford Olsen Waddell Hillis Test (SOWH); Bayesian methods; Bootstraps and Posterior Probabilities; Which Test?

6 Generalized Least Squares in Phylogenetic Hypothesis Testing: Sample Average Variance and Covariance; Motivation for Simulation of GLS test statistic; GLS Test Statistic; Simulation Method; Results; Discussion; Contribution; Further Work

Conclusion

A GLS Results: Sample Average Covariance: A.1 Four Leaf Trees; A.2 Five Leaf Trees

B GLS results: JC Covariance: B.1 Four Leaf Trees; B.2 Five Leaf Trees

C Covariance Estimation: C.1 Sample Covariance Results; C.1.1 Sample Covariance - 100bp; C.1.2 Sample Covariance bp; C.1.3 Sample Covariance bp; C.2 Jukes-Cantor Covariance; C.2.1 Jukes-Cantor Covariance - 100bp; C.2.2 Jukes-Cantor Covariance bp; C.2.3 Jukes-Cantor Covariance bp; C.3 Sample Average Covariance(Susko); C.3.1 Sample Average Covariance(Susko) - 100bp; C.3.2 Sample Average Covariance(Susko) bp; C.3.3 Sample Average Covariance(Susko) bp

Chapter 1 Introduction

Phylogenetics is a field of biology that seeks to unlock the evolutionary history of life on earth. The aim is to understand relationships between species and, through this, the process of evolution itself. These relationships can be represented with a graph structure - traditionally simplified to evolutionary trees. The current approach is to try to reconstruct these trees from the blueprint of life: DNA sequences.

Reconstruction methods are difficult to design and evaluate because the biological evidence is often ambiguous. Many approaches have been introduced to deal with the problems of estimation and hypothesis testing of phylogenetic trees. Parametric approaches exploit the elementary knowledge we have of evolution, while non-parametric approaches have been developed to avoid the possibility of inaccurate preconceptions.

Recently, Susko [40] presented an approach that applies the theory of generalized least squares to phylogenetic hypothesis testing. The generalized least squares approach has strong theoretical foundations in the theory of linear models. While the theory appears to be sound, it is based on asymptotic results with regard to sequence length. It is not clear how well the test will perform in practice, where the length of sequences is often only a few hundred nucleotides. I investigate the effect of sequence length on this approach. I also consider how Susko's approach differs from traditional parametric techniques with respect to variance-covariance estimation.

In Chapter 2 I give a general background to the problems involved in molecular phylogenetics. In Chapters 3 and 4 I review commonly used probabilistic models and tree reconstruction methods, respectively. In Chapter 5 I consider methods of evaluating confidence in results from such reconstruction methods when there may be conflict. This leads to an examination of current hypothesis testing methods and consideration of the validity of the generalized least squares approach in Chapter 6.

Chapter 2 Molecular Phylogenetics

Phylogenetic trees represent relationships between species. They tell the story of life on earth. A phylogenetic tree is a tree in the graph sense. External vertices (nodes) represent extant species while internal nodes represent speciating events. The tree topology determines the lines of evolution - which species descended from which common ancestors. The branch (edge) lengths represent the time between the speciation events of adjacent nodes. In reality, absolute time scales cannot be used and relative time scales are employed. These time scales depend on the data used to infer the tree. For example, if we use genome sequence data, a natural scale is the expected number of substitutions that have taken place at a site.

2.1 The Use of Phylogenetic Trees

The role of phylogenetics is to help us understand the process of evolution from the patterns in nature we can observe in the present. As Huelsenbeck describes [25]: [F]or any question in which history may be a confounding factor, phylogenies have a central role.

The most obvious use of phylogenetics is inferring common ancestors. This has implications for our understanding of evolution on a large scale. It can also help with more immediate problems, for example in epidemiology and in understanding the spread of viruses. Viruses such as hepatitis C have a long dormant period, meaning we can only detect the spread of the virus as it happened in the past. Understanding these processes and relationships can help in the development of biotechnology, for example by improving the design of drugs to consider host-pathogen mutual genetic variation. Phylogenies also shed light on structural biology, helping us to infer the function and functional constraints of genes [30].

2.2 Traditional Approaches

Historically, phylogenetic trees have been constructed using two principles. The phenetic approach uses similarity scores derived from measures of physical characteristics. The most similar species are clustered together. While this is an intuitive approach, results may not represent genetic or evolutionary similarity. The cladistic approach assumes that related species will share unique features that were not present in distant ancestors. All species in a group must share a common ancestor. This means that species with many similar physical traits may not be grouped together.

Both of the above approaches rely heavily on morphological and geographical data. However, this has changed with our understanding of the role of DNA in evolution. DNA sequencing has massively increased the information available about the evolutionary process. Most tree reconstruction methods now focus on examining the way DNA (or amino acid) sequences have evolved.

2.3 Phylogenetic Trees From Genomic Data

In molecular phylogenetics we search for patterns in genomic data. What we find is that evolution is a stochastic process. Mutations arise from changes to a species' genome: site substitutions, deletions, insertions and inversions. If we understand the process of mutation, we can make reasonable inferences about our past from the genomic material we have now. Sequences that are very different are likely to be less closely related than sequences showing high similarity.

Determining similarity is not an easy problem. Substitutions may be hidden from view by a number of factors. Examples include a site changing and then changing back again (a reversal), more than one mutation occurring at the same site, and parallel changes occurring on different branches of the tree (convergence or parallelism). With that in mind, there are three major components of phylogenetic inference that need to be considered.

Probabilistic Models in Phylogenetics

We need to consider what role probabilistic models can play in our understanding of the problem. Typically, it is assumed that mutations occurring at time t depend on the sequence at that time but not on its previous history. This suggests a Markov model of sequence evolution. However, whether or not traditional assumptions such as homogeneity and reversibility are valid is less clear. The role of probabilistic models in phylogenetics is discussed in Chapter 3.

Reconstructing Phylogenetic Trees from Sequence Data

We need to understand what methods can be employed to actually reconstruct a phylogenetic tree. Within this there are three distinct problems to consider: choosing a criterion, estimating the tree topology, and estimating branch lengths. Besides this, there has been a long-standing feud among biologists (and to a lesser extent mathematicians) over whether parametric methods, nonparametric methods, or something in between should be used. A review of commonly used tree reconstruction methods is contained in Chapter 4.

Hypothesis Tests of Phylogenetic Trees

Once a phylogenetic tree is decided upon, the next step is usually to try to identify which parts are well supported by the data. Hypothesis tests must be used with an understanding of what they actually test and what information they can provide the user. To complicate matters, the hypothesis being tested is often a tree topology, and it is still unclear how the usual statistical measures, such as variance, can be applied to such a structure.

The debate between parametric and non-parametric testing continues. The non-parametric bootstrap has been used extensively in phylogenetics to provide a measure of confidence. However, the parametric bootstrap and Bayesian methods are also becoming popular. All have their advantages and disadvantages. In any case, inferred trees need to be compared to traditional biological data (e.g. morphological data). No matter how well a method works for simulated data, the aim of the game is to understand the process in reality. These issues are considered in greater detail in Chapter 5 and (with respect to a generalized least squares test) Chapter 6.

What about the root?

In theory, phylogenetic trees should be rooted to represent descent from the common ancestor. However, as we attempt to reconstruct phylogenetic trees, we have to consider all possible positions of a root with respect to the other nodes of the tree. Unfortunately, it is unlikely that sequences will contain enough information to place a root accurately. There are nevertheless some methods for rooting a tree. These include adding data from very distantly related species (outgroups) and using the molecular clock hypothesis. The use of outgroups brings its own bundle of problems, as error in distant species affects the more closely related species. This, and the validity of the molecular clock hypothesis, is discussed in Section 4.5.2.

2.5 How Treelike is Evolution?

It is worth asking whether trees are really the right structure to describe evolution. Often there is not enough information in the data to resolve the issue of parallel mutations, while gene transfer from species to species means that a gene can have more than one ancestor. There is a criterion to determine whether data are treelike: the four point condition. For any taxa $i, j, k, l$ in a phylogenetic tree with four or more taxa we have:

\[ d_{ij} + d_{kl} \le \max(d_{ik} + d_{jl},\; d_{jk} + d_{il}) \tag{2.1} \]

If the condition fails, other structures may be more appropriate. For example, see Strimmer and Moulton's work on phylogenetic networks [39].
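The four point condition (2.1) can be checked directly on a distance matrix. The sketch below is only an illustration: the function name `is_treelike`, the tolerance `tol` and the example distances are assumptions, not part of the original text. Because (2.1) must hold for every labelling of the four taxa, it is equivalent to requiring that the two largest of the three pairwise sums are equal, which is what the code tests.

```python
from itertools import combinations

def is_treelike(d, tol=1e-9):
    """Check the four point condition on a symmetric distance matrix d.

    Hypothetical helper: for every quartet, the two largest of the three
    pairwise sums must agree (up to tol).
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        if sums[2] - sums[1] > tol:
            return False
    return True

# Distances read off the tree ((a,b),(c,d)) with pendant branch lengths
# 1, 2, 3, 4 and an internal branch of length 5.
tree_distances = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# The same matrix with d(a,d) perturbed so that no tree fits exactly.
perturbed = [row[:] for row in tree_distances]
perturbed[0][3] = perturbed[3][0] = 12

print(is_treelike(tree_distances), is_treelike(perturbed))
```

With real data the estimated distances rarely satisfy the condition exactly, so in practice the tolerance (or a measure of how far the quartets deviate) matters more than the yes/no answer.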

Chapter 3 Models of Evolution

Most widely used tree reconstruction methods use probabilistic models of evolution. These can be formulated parametrically, using known (or assumed) properties of sequence evolution. They can also be derived empirically from information in the observed sequences. It makes sense to use whatever knowledge we have about the process of evolution rather than ignore it. On the other hand, evolution is very complex and biological evidence is often ambiguous. An example of a factor that needs to be taken into account, but is very hard to model, is the difference in rates of evolution between and within lineages. How well a model fits reality can affect how a testing method works. Simpler models have greater power to discriminate but may be biased. So understanding models is necessary for understanding both tree reconstruction and confidence testing [19].

A Simple Approach

Evolutionary distances represent the divergence between species, that is, branch lengths on a phylogenetic tree. The following naive approach to determining distance shows why a probabilistic model is desirable. When determining distances between sequences it is intuitive to use a measure of dissimilarity: take the distance between two sequences to be the Hamming distance. For two species $x$ and $y$, with sequence length $S$, we have:

\[ D_{xy} = \sum_{i=1}^{S} \mathbf{1}(x_i \neq y_i) \tag{3.1} \]

However, this approach does not take into account the phenomena described in Section 2.3, such as reversals and parallelism. This means the observed number of substitutions in a given sequence is a lower bound for the actual number of substitutions that have occurred. In effect, we cannot accurately look very far back in time. We need models that estimate substitution rates and correct for unseen events. An obvious first step is to try to define evolution as some type of stochastic process.

Evolution as a stochastic process

Definition A stochastic process is a collection of random variables $\{X(t)\}_{t \in T}$ with a common probability space.

We can think of the process of evolution as the stochastic process of substitutions in a sequence. The set of states is the set of nucleotides (or amino acids). That is, for a nucleotide model, $X(t) \in \{A, C, G, T\}$ is the nucleotide at that site at time $t$.^1 To develop a tractable model for evolution we need to make further simplifying assumptions. This leads us to the well studied world of Markov processes.

3.1 Markov Models of Evolution

Markov Theory

Definition A stochastic process has the Markov property if the probability of observing a new state at time $s + t$ depends only on the state at time $s$. That is,

\[ P(X(s + t) = j \mid X(s) = i_s, \ldots, X(0) = i_0) = P(X(s + t) = j \mid X(s) = i_s) \tag{3.2} \]

A Markov process is a stochastic process with the Markov property. The Markov property is also referred to as the memoryless property of Markov processes. If $t, s$ take integer values then we have a discrete time Markov process. Similarly, if $t, s \in \mathbb{R}$ we have a continuous time Markov process. To model evolution we want the latter, since evolution happens in continuous time.

Definition The transition probability, $P_{ij}(t,s)$, is the probability of changing from state $i$ at time $s$ to state $j$ at time $t + s$. For a Markov process we have:

\[ P_{ij}(t,s) = P(X(s + t) = j \mid X(s) = i) \tag{3.3} \]

^1 When deriving species trees this means the nucleotide at a site at time $t$ expressed in the majority of the population.
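As a concrete reference point for the naive dissimilarity measure of equation (3.1), the following minimal sketch counts mismatched sites between two aligned sequences. The function name and the example sequences are hypothetical and chosen only for illustration.

```python
def hamming_distance(x, y):
    """Number of aligned sites at which sequences x and y differ (equation 3.1)."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y))

# Two made-up aligned sequences differing at two of eight sites.
x = "ACGTACGT"
y = "ACGAACGG"
d = hamming_distance(x, y)
proportion_changed = d / len(x)
print(d, proportion_changed)
```

The proportion of differing sites is the quantity that model-based corrections take as input; on its own it underestimates divergence for the reasons given above.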

If $P_{ij}(t,s)$ above is independent of $s$ then the process is homogeneous. That is,

\[ P_{ij}(t,s) = P_{ij}(t) \tag{3.4} \]

We can collect these probabilities in a transition matrix $P(t)$. The transition probabilities obey the Kolmogorov-Chapman equation:

\[ P_{ik}(t) = \sum_j P_{ij}(v) P_{jk}(t - v) \tag{3.5} \]

In matrix form:

\[ P(t) = P(v) P(t - v) \tag{3.6} \]

This is equivalent to

\[ P(t + v) = P(t) P(v) \tag{3.7} \]

with initial condition

\[ P(0) = I \tag{3.8} \]

From this, we can extrapolate the transition probabilities at time $t$ as

\[ P(t) = [P(1)]^t \tag{3.9} \]

Markov Models of Site Substitution

A discrete state, continuous time Markov process of site substitution can be formulated as follows. We define the transition probabilities as the probabilities of substitution of a nucleotide or amino acid. We also assume that the process is time-homogeneous. That is, the rate of substitution is independent of time and the process is the same throughout the whole tree.

Markov chain models of site substitution are usually further constrained by other properties such as stationarity and reversibility. The assumption of stationarity is that the process is in equilibrium. This effectively assumes that nucleotide frequencies are (approximately) the same from species to species throughout time. Reversibility means that we do not distinguish the process from the process in reverse: we treat the process running from an ancestor species to a descendant, and vice versa, as the same. That is, evolution does not have a direction. This is in fact a model of the evolution of a sequence along a phylogenetic tree branch. Usually the process of substitution is assumed to be Poisson. The validity of these assumptions is discussed in Section 3.3. Before this, I examine how transition matrices for Markov models of evolution can be derived parametrically.

3.2 Parameterized Models of Nucleotide Evolution

We can derive the transition matrix $P(t)$ by estimating a rate matrix $Q$. $Q_{ij}$ is the rate at which $i$ changes to $j$ in a very small time step $\delta t$. For $i \neq j$,

\[ P_{ij}(\delta t) = Q_{ij}\,\delta t + o(\delta t) \tag{3.10} \]

In fact, if $\sum_j P_{ij} = 1$ for all states $i$ (that is, the process is honest), $Q$ defines the process (and hence the transition matrix) uniquely [9]. Let $P(1) = e^{Q}$. Then, using the Kolmogorov-Chapman equation,

\begin{align}
P(t) = P(1)^t &= e^{tQ} \tag{3.11} \\
&= \sum_{j=0}^{\infty} \frac{(tQ)^j}{j!} \tag{3.12}
\end{align}

Now,

\begin{align}
\left.\frac{d}{dt} P(t)\right|_{t=0} &= \left. Q e^{tQ} \right|_{t=0} \tag{3.13} \\
&= Q \tag{3.14}
\end{align}

Now let $v = (1,1,1,1)^T$. It is easy to see that $v$ is an eigenvector of $P(t)$ with eigenvalue 1, since $\sum_j P_{ij}(t) = 1$ as noted earlier. Hence

\begin{align}
P(t)v = e^{tQ} v &= \sum_{j=0}^{\infty} \frac{(tQ)^j v}{j!} \tag{3.15} \\
&= 1 \cdot v + \sum_{j=1}^{\infty} \frac{t^j (Q^j v)}{j!} \tag{3.16} \\
&= v \tag{3.17}
\end{align}

This is a power series in $t$ that holds for all $t$, so all coefficients $Q^j v$, $j \ge 1$, must be zero. In particular $Qv = 0$: the rows of $Q$ sum to zero. Rate matrices that satisfy this condition define honest processes.

It is clear at this point that Markov models suffer from confounding of rate and time. That is, $e^{tQ} = e^{(t/\gamma)(\gamma Q)}$, where $\gamma > 0$ scales the rate. This means that absolute time scales cannot be used. Hence, expected substitutions per site is the usual time unit stated.

Jukes-Cantor Model

Of all the Markov models of evolution, the Jukes-Cantor model makes the strongest assumptions about the process. Besides the properties of Markov chains, this model assumes:

Sites evolve independently and are identically distributed (iid).
All substitutions occur with equal probability.

With these assumptions, we need only define an appropriate rate matrix to derive transition probabilities. If we define the rate of substitution as $\alpha$, the Jukes-Cantor rate matrix is:

\[ Q_{JC} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix} \tag{3.18} \]

The diagonal entries must be $-3\alpha$, as row sums must equal zero. We can find the spectral decomposition of $e^{tQ_{JC}}$ by determining the eigenvectors and eigenvalues of $Q$:

\[ Q = S\,\mathrm{diag}(\lambda_1, \lambda_2, \lambda_3, \lambda_4)\,S^{-1} \tag{3.19} \]

where $S$ is the matrix with the eigenvectors of $Q$ as columns and the $\lambda_i$ are the corresponding eigenvalues. From further linear algebra we can write:

\[ e^{tQ} = S\,\mathrm{diag}(e^{t\lambda_1}, e^{t\lambda_2}, e^{t\lambda_3}, e^{t\lambda_4})\,S^{-1} \tag{3.20} \]

In this case it is easy to see that the matrix $S$ can be defined as follows.

\[ S = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \tag{3.21} \]

The corresponding eigenvalues are $\lambda_1 = 0$ and $\lambda_2 = \lambda_3 = \lambda_4 = -4\alpha$. Also, we can verify that $S^{-1} = \frac{1}{4} S^T$. We can now derive the transition probabilities for the Jukes-Cantor model:

\[ P_{ij}(t) = \begin{cases} \frac{1}{4}\left(1 + 3e^{-4\alpha t}\right) & \text{for diagonal elements} \\ \frac{1}{4}\left(1 - e^{-4\alpha t}\right) & \text{for off-diagonal elements} \end{cases} \tag{3.22} \]

The probability of seeing a change after time $t$ does not depend on the current state $a$:

\begin{align}
\Pr(X(t) \neq X(0)) &= \Pr(X(t) \neq a \mid X(0) = a) \tag{3.23} \\
&= \sum_{b \neq a} P_{ab}(t) \tag{3.24} \\
&= \tfrac{3}{4}\left(1 - e^{-4\alpha t}\right) \tag{3.25}
\end{align}

We can use $P_c(t)$, the proportion of changed sites after time $t$, to estimate the time of divergence by solving the above for $t$. Each site either has or has not changed, so the number of changed sites is binomially distributed: $N \hat{P}_c(t) \sim \mathrm{Bin}\left(N, \tfrac{3}{4}(1 - e^{-4\alpha t})\right)$, where $N$ is the length of the sequence. This means $\hat{P}_c(t) = (\text{no. of changes})/N$ is a maximum likelihood estimate. By the invariance property of maximum likelihood estimates, the following is a maximum likelihood estimate of the time of divergence:

\[ \hat{t} = -\frac{1}{4\alpha} \log\left(1 - \tfrac{4}{3}\hat{P}_c\right) \tag{3.26} \]

This is the Jukes-Cantor distance estimate. It is usually written as $d_{ij}$, where $i$ and $j$ represent two sequences/taxa. Since two independent (unrelated) sequences are expected to agree at 1/4 of sites, sequences are considered to be unrelated as $\hat{P}_c \to 3/4$. At this point distances tend to infinity.
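The relationship between the rate matrix, the series (3.12), the closed form (3.22) and the distance estimate (3.26) can be checked numerically. The sketch below uses only the Python standard library; the function names, the truncation at 30 terms and the values of `alpha` and `t` are assumptions made for illustration rather than anything prescribed by the text.

```python
import math

def jc_rate_matrix(alpha):
    """Jukes-Cantor rate matrix (3.18): alpha off the diagonal, -3*alpha on it."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(q, t, terms=30):
    """Approximate e^{tQ} by truncating the power series (3.12)."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # (tQ)^0 / 0! = I
    for j in range(1, terms):
        # term_j = term_{j-1} * (tQ) / j
        term = [[t * v / j for v in row] for row in mat_mul(term, q)]
        result = [[result[r][c] + term[r][c] for c in range(n)] for r in range(n)]
    return result

def jc_closed_form(alpha, t):
    """Diagonal and off-diagonal transition probabilities from (3.22)."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay), 0.25 * (1 - decay)

def jc_time(p_changed, alpha):
    """Divergence time estimate (3.26) from the proportion of changed sites."""
    if not 0 <= p_changed < 0.75:
        raise ValueError("estimate is undefined for proportions of 3/4 or more")
    return -math.log(1 - 4 * p_changed / 3) / (4 * alpha)

alpha, t = 1 / 3, 0.5
p_series = expm_series(jc_rate_matrix(alpha), t)
p_same, p_diff = jc_closed_form(alpha, t)
print(p_series[0][0], p_same)   # series and closed form should agree closely
print(p_series[0][1], p_diff)

# Round trip: the expected proportion of changed sites (3.25) maps back to t.
p_changed = 0.75 * (1 - math.exp(-4 * alpha * t))
print(jc_time(p_changed, alpha))
```

The truncated series is used here only because it mirrors (3.12) directly; for larger matrices or longer times the spectral form (3.20) is the more stable route.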

Selection of Rate Parameter

For very short time spans the total number of changes inferred by the Jukes-Cantor estimate is equal to the number of observed changes. More precisely:

\[ \lim_{t \to 0} \frac{P_c(t)}{P_{obs}(t)} = 1 \tag{3.27} \]

As both numerator and denominator tend to zero, this can be seen using l'Hôpital's rule:

\[ \left.\frac{dP_c}{dt}\right|_{t=0} = \left.\frac{dP_{obs}}{dt}\right|_{t=0} = 1 \tag{3.28} \]

\begin{align}
\frac{dP_c}{dt} &= 3\alpha e^{-4\alpha t} \tag{3.29} \\
&= 3\alpha \quad \text{at } t = 0 \tag{3.30}
\end{align}

Applying our boundary condition:

\begin{align}
3\alpha &= 1 \tag{3.31} \\
\alpha &= \tfrac{1}{3} \tag{3.32}
\end{align}

Jukes-Cantor Variance

The variance of the Jukes-Cantor estimate can be derived using the delta method:

\[ \hat{t} - E\hat{t} \approx \frac{d\hat{t}}{d\hat{p}}\,(\hat{p} - E\hat{p}) \tag{3.33} \]

\[ \sigma^2(\hat{t}) = \frac{1}{\left(1 - \tfrac{4}{3}\hat{p}\right)^2} \cdot \frac{\hat{p}(1 - \hat{p})}{n} \tag{3.34} \]

That is,

\[ \sigma^2(t) = e^{8t/3}\, T(1 - T)/S \tag{3.35} \]

where $T = \tfrac{3}{4}(1 - e^{-4t/3})$ and $S$ is the sequence length (written $n$ in (3.34)). That is, the variance grows exponentially with $t$.

The covariance of two distance estimates is derived using the tree structure. Distance estimates are represented on a phylogenetic tree by the sum of branch lengths on the unique path between the taxa in question. To calculate the covariance of two pairwise distances we simply calculate the variance of the branches common to both paths.
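The two forms of the variance, (3.34) in terms of the observed proportion and (3.35) in terms of time, can be compared numerically. This is a minimal sketch assuming the calibration $\alpha = 1/3$ derived above; the function names and the sequence length of 500 are hypothetical.

```python
import math

def jc_variance_from_p(p, n):
    """Delta-method variance (3.34) from the proportion of changed sites p."""
    return p * (1 - p) / (n * (1 - 4 * p / 3) ** 2)

def jc_variance_from_t(t, n):
    """Equivalent form (3.35), with T = 3/4 (1 - exp(-4t/3)) and alpha = 1/3."""
    big_t = 0.75 * (1 - math.exp(-4 * t / 3))
    return math.exp(8 * t / 3) * big_t * (1 - big_t) / n

n = 500  # assumed sequence length
for t in (0.1, 0.5, 1.0, 2.0):
    p = 0.75 * (1 - math.exp(-4 * t / 3))
    print(t, jc_variance_from_p(p, n), jc_variance_from_t(t, n))
```

Printing the values for increasing `t` shows how quickly the variance inflates as sequences approach saturation, which is the practical meaning of the exponential factor in (3.35).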

3.2.3 Generalisations of the Jukes-Cantor Model

It is very clear from biological evidence that the assumptions made in the Jukes-Cantor model do not generally hold. Analysis of DNA sequences shows that substitutions are not equiprobable. An example of this is transition/transversion bias. Nucleotides are grouped according to their molecular structure as purines (A, G) or pyrimidines (C, T). Purine to purine or pyrimidine to pyrimidine substitutions are called transitions. The rest are called transversions. Because of this molecular structure, a transition is much more likely to happen than a transversion.

The Kimura 2-parameter model (K2P) attempts to correct the assumption by introducing parameters to model the difference in transition and transversion rates. This approach produces a new rate matrix where $\alpha$ is the rate of transitions and $\beta$ the rate of transversions. With rows and columns ordered A, T, C, G (the order used for all matrices in this section):

\[ Q_{K2P} = \begin{pmatrix} -(2\beta + \alpha) & \beta & \beta & \alpha \\ \beta & -(2\beta + \alpha) & \alpha & \beta \\ \beta & \alpha & -(2\beta + \alpha) & \beta \\ \alpha & \beta & \beta & -(2\beta + \alpha) \end{pmatrix} \tag{3.36} \]

In 1981 Felsenstein [15] presented a model (F81) where the substitution rate depends only on the equilibrium frequency of the target nucleotide. These equilibrium frequencies are usually determined from the observed frequencies in the sequences to hand. $\mu$ represents a rate parameter and $\pi_i$ represents the frequency of nucleotide $i$. (A dot indicates the value necessary to make the row sums equal to zero.)

\[ Q_{F81} = \begin{pmatrix} \cdot & \mu\pi_T & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \cdot & \mu\pi_C & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \cdot & \mu\pi_G \\ \mu\pi_A & \mu\pi_T & \mu\pi_C & \cdot \end{pmatrix} \tag{3.37} \]

Hasegawa et al [20] further refined Felsenstein's model by introducing separate transition and transversion rates $\alpha$ and $\beta$:

\[ Q_{HKY} = \begin{pmatrix} \cdot & \beta\pi_T & \beta\pi_C & \alpha\pi_G \\ \beta\pi_A & \cdot & \alpha\pi_C & \beta\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \beta\pi_G \\ \alpha\pi_A & \beta\pi_T & \beta\pi_C & \cdot \end{pmatrix} \tag{3.38} \]

Finally, the most general time reversible model (GTR) has nine free parameters:

\[ Q_{G} = \begin{pmatrix} \cdot & \rho\pi_T & \beta\pi_C & \gamma\pi_G \\ \rho\pi_A & \cdot & \alpha\pi_C & \sigma\pi_G \\ \beta\pi_A & \alpha\pi_T & \cdot & \tau\pi_G \\ \gamma\pi_A & \sigma\pi_T & \tau\pi_C & \cdot \end{pmatrix} \tag{3.39} \]

3.3 Problems with Markov Models of Evolution

Biological evidence suggests that all the models considered so far simplify the situation too much. For example, they cannot deal with long-distance correlation due to RNA folding. A key problem appears to be the iid assumption between sites.

The assumption of rate homogeneity is contradicted by evidence that mutations are dependent on local sequence context. Protein coding genes are an example of how this assumption can be violated by very basic ideas. Because of the redundancy in the nucleotide to amino acid code, different codon positions are subject to different selectional pressures. Mutation rates appear to be dependent on structural and functional constraints as well as chromosomal positions. These are all local properties of a sequence. However, it is assumed that substitution rates are constant throughout a phylogenetic tree.

Markov models of evolution assume stationarity of base frequencies. That is, expected nucleotide frequencies remain the same with time. This is contradicted by observations that nucleotide frequencies are very different in sequences from different species. For example, GC content in mammals is much higher than in flies [30]. Lockhart et al [31] have shown that if a model that assumes stationarity is used, then breaking the assumption can lead to inaccurate distance estimates. The main problem is a tendency to group sequences with similar nucleotide frequencies, irrespective of evolutionary development.
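Returning to the rate matrices of Section 3.2.3, the HKY form (3.38) can be built programmatically, which also makes it easy to confirm two properties used throughout: rows sum to zero, and the process is reversible in the sense that $\pi_i Q_{ij} = \pi_j Q_{ji}$. The sketch below is illustrative only. The function name, the dictionary representation of $\pi$ and the numerical values are assumptions, and the base order A, T, C, G is the one inferred from the matrices above.

```python
BASES = "ATCG"  # row/column order assumed from the matrices in Section 3.2.3
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(pi, alpha, beta):
    """Build the HKY rate matrix (3.38).

    pi: dict mapping each base to its equilibrium frequency.
    alpha: transition rate, beta: transversion rate.
    """
    q = []
    for i, x in enumerate(BASES):
        row = []
        for j, y in enumerate(BASES):
            if i == j:
                row.append(0.0)  # placeholder, filled in below
            else:
                rate = alpha if (x, y) in TRANSITIONS else beta
                row.append(rate * pi[y])
        row[i] = -sum(row)  # the "dot" entry: makes the row sum to zero
        q.append(row)
    return q

# Made-up frequencies and rates, purely for illustration.
pi = {"A": 0.1, "T": 0.2, "C": 0.3, "G": 0.4}
q_hky = hky_rate_matrix(pi, alpha=2.0, beta=1.0)

# Setting alpha == beta gives the F81 form (3.37); equal frequencies with
# alpha != beta give a matrix proportional to K2P (3.36).
q_f81 = hky_rate_matrix(pi, alpha=1.0, beta=1.0)
uniform = {b: 0.25 for b in BASES}
q_k2p_like = hky_rate_matrix(uniform, alpha=2.0, beta=1.0)

for row in q_hky:
    print([round(v, 3) for v in row])
```

Treating the simpler models as special cases of one constructor mirrors the nesting of the models themselves and avoids maintaining four separate matrices.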

3.4 Modelling Rate Heterogeneity

A number of methods have been suggested to add some level of rate heterogeneity to Markov models. One approach is to set some sites invariable while others change. This is useful when one can determine conserved regions in sequences. However, it doesn't allow for more than one rate. Another approach allows sites to evolve at different rates, where the rate for a site is drawn from a Gamma distribution with shape parameter $\alpha$. A discrete Gamma model has also been developed by Yang [30] that allows much easier computation. This is, perhaps, the most popular approach at present. However, it still does not make use of the information available about local behaviour.

Recently, Steel et al [42] have presented a covariotide model of site substitution where sites are affected by different selection pressures. This model allows some sites to be invariant while others change. However, sites do not have to remain invariant. This represents the fact that constraints on sites can change over time. The activation of sites is governed by a Markov process where sites are still iid, to keep the model tractable.

Other techniques have been based on defining multiple categories of rates. This is implemented using hidden Markov models. Algorithms infer the most probable rate category for a site. These are discussed in [11].

3.5 Modelling Non-Stationarity

As mentioned previously, Markov models of evolution assume stationarity of nucleotide frequencies. However, there is strong evidence suggesting that this is not the case. The paralinear and LogDet corrections have been developed to make distance estimates more reliable when base frequencies differ from species to species. Both rely on the following lemma.

Lemma Let $t$ be a measure of evolutionary time. Then

\[ t \propto \log[\det(P(t))] \tag{3.40} \]

Proof

\[ P(t) = e^{tQ} \tag{3.41} \]

From linear algebra,

\begin{align}
\det(P(t)) &= e^{t\,\mathrm{trace}(Q)} \tag{3.42} \\
\log[\det(P(t))] &= t\,\mathrm{trace}(Q) \tag{3.43}
\end{align}

Since $Q$ remains fixed,

\[ t \propto \log[\det(P(t))] \tag{3.44} \]

Paralinear distance

Barry and Hartigan [2] suggest an asynchronous distance estimate. This is still based on a Markov process where sites are iid. However, it makes no assumption of homogeneity, reversibility or stationarity. It need not assume that base frequencies are in equilibrium, nor that the rate of substitution is constant throughout the tree. The distance estimate is taken to be:

\[ \hat{d}_{xy} = -\tfrac{1}{4} \log[\det(P_{xy})] \tag{3.45} \]

where $P_{xy}$ is the transition matrix at a particular site from species $x$ to species $y$. This is assumed to be the same across all sites. The $(i,j)$th element of the transition matrix is estimated as $\Pr(Y = j \mid X = i)$, where $X$ and $Y$ are bases that have the same position in the sequences for species $x$ and species $y$, respectively, and $i, j \in \{A, C, G, T\}$ for nucleotide sequences. The distance measure is additive and asymmetric. The latter property means that generally $d_{xy} \neq d_{yx}$, which is not a particularly desirable property. In fact, this measure can only be used to estimate the total number of substitutions along a branch when substitution rates are held constant and the model is reversible.

LogDet Transformation

This transformation method involves recording a divergence matrix, $F_{xy}$, for each pair of taxa $x$ and $y$. The $ij$th entry of $F_{xy}$ is the proportion of sites in which taxa $x$ and $y$ have states $i$ and $j$ respectively. The dissimilarity value $d_{xy}$ is calculated as:

\[ \hat{d}_{xy} = -\log[\det F_{xy}] \tag{3.46} \]

The variance can be calculated as for the paralinear method. Where $S$ is the sequence length and $r$ is 4 or 20:

\[ \hat{\sigma}^2_{xy} = \sum_{i=1}^{r} \sum_{j=1}^{r} \left[ (F_{xy}^{-1})^2_{ji}\,(F_{xy})_{ij} - 1 \right] / S \tag{3.47} \]

For models with equal nucleotide frequencies, Lockhart et al [31] show how to calculate branch lengths:

\[ \hat{d}'_{xy} = \left( \hat{d}_{xy} + [\log(\det F_{xx} F_{yy})]/2 \right)/r \tag{3.48} \]

where $F_{xx}$ is the divergence matrix of taxon $x$ with itself, that is, the diagonal matrix of its state frequencies.

Distances become treelike as sequence lengths increase, provided we reinstate our independence assumptions across sites and across the tree. This means that reconstruction methods that require treelike distances will work with corrected distances (and sufficiently long sequences). The LogDet transform has been shown to provide more realistic results where similar nucleotide frequencies might otherwise indicate false evolutionary relationships.

Summary of Nucleotide Markov Models

We can consider the relationships between these models via the following parameters [44]:

$\kappa$: The rate of transitions relative to the rate of transversions. In practice, $\kappa > 1$ reflects the evidence that transitions are more prevalent than transversions.

$\alpha$: A measure of between-site variation in the rate of nucleotide substitution. Rates are often drawn from a gamma distribution with mean 1 and variance $1/\alpha$ [44]. High values mean low amounts of rate variation.

$\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$: Base frequencies, i.e. three independent parameters.

$\pi_{MLE}, \pi_{obs}$: The maximum likelihood and observed base frequencies.

The maximum likelihood estimates of $\alpha$ and $\kappa$ are usually used.

Model          α          κ          π
Jukes-Cantor   ∞          1          0.25 each
Kimura 2-P     ∞          variable   0.25 each
Felsenstein    ∞          1          variable
HKY            ∞          variable   variable
JC+Γ           variable   1          0.25 each
K2P+Γ          variable   variable   0.25 each
Fel+Γ          variable   1          variable
HKY+Γ          variable   variable   variable
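The LogDet quantities of Section 3.5 are straightforward to compute from a pair of aligned sequences. The following sketch builds the divergence matrix $F_{xy}$, evaluates (3.46), and applies the correction (3.48). It is a hedged illustration: the function names and example sequences are invented, the determinant is computed by naive cofactor expansion (adequate for 4 x 4 matrices only), and the variance (3.47) is not implemented.

```python
import math

STATES = "ACGT"

def divergence_matrix(x, y):
    """F_xy: proportion of sites where x has state i and y has state j."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    index = {s: k for k, s in enumerate(STATES)}
    f = [[0.0] * len(STATES) for _ in STATES]
    for a, b in zip(x, y):
        f[index[a]][index[b]] += 1.0 / len(x)
    return f

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for c, v in enumerate(m[0]):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * v * det(minor)
    return total

def logdet_distance(x, y):
    """Uncorrected dissimilarity (3.46)."""
    d = det(divergence_matrix(x, y))
    if d <= 0:
        raise ValueError("determinant is not positive; distance undefined")
    return -math.log(d)

def corrected_logdet(x, y):
    """Corrected distance (3.48), using det(F_xx F_yy) = det(F_xx) det(F_yy)."""
    r = len(STATES)
    log_fxx = math.log(det(divergence_matrix(x, x)))
    log_fyy = math.log(det(divergence_matrix(y, y)))
    return (logdet_distance(x, y) + (log_fxx + log_fyy) / 2) / r

# Made-up sequences differing at the final site only.
x = "AACCGGTTACGT"
y = "AACCGGTTACGA"
print(logdet_distance(x, y), corrected_logdet(x, y), corrected_logdet(x, x))
```

Note that the uncorrected value (3.46) is not zero even for identical sequences, which is one way to see why the correction term in (3.48) is needed before the values can be read as branch lengths.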

3.6 Empirical Models of amino acid evolution

When modelling amino acid evolution, empirical models have been the preferred solution. These models specify explicit transition probabilities derived from empirical evidence. The preference for empirical models when dealing with amino acids is partially due to the increase in complexity involved in having twenty character states. The following section provides an overview of the two most common empirical methods: the PAM and the BLOSUM matrices.

PAM/Dayhoff Substitution Matrices

The PAM/Dayhoff matrices empirically estimate amino acid substitution rates based on a Markov process framework. These rates were derived from alignments of protein sequences that are at least 85% identical.

Deriving the Mutation Matrix

Let $A$ be the matrix of observed proportions of changes between two amino acids $i, j$. That is:

\[ A_{ij} = \frac{N_{ij}}{N} \tag{3.49} \]

In fact, $A_{ij}$ has the same description as $F_{xy}$ described for the LogDet transform. Let $\hat{\pi}^k$ be the vector of amino acid frequencies of sequence $k$, with entries

\[ \hat{\pi}^k_j = \frac{N^k_j}{N} \tag{3.50} \]

We want to derive substitution (transition) probabilities for the time it takes 1% of all amino acids to mutate - this is the point accepted mutation (PAM) unit.

\[ P_{ij} = \Pr(i \text{ mutates}) \Pr(i \text{ mutates to } j \mid i \text{ mutates}) \tag{3.51} \]

Now we can empirically derive the relative mutability of amino acid $i$ as $m_i$:

\begin{align}
m_i &= P(i \text{ mutates}) \tag{3.52} \\
&= \frac{\sum_j A_{ij}}{\sum_{k,j} A_{kj}} \tag{3.53}
\end{align}

Now,

\Pr(i \text{ mutates to } j \mid i \text{ mutates}) = \frac{A_{ij}}{\sum_j A_{ij}}    (3.54)

and we now have an estimate of P:

P_{ij} = m_i \frac{A_{ij}}{\sum_j A_{ij}}    (3.55)

To calibrate our matrix to the PAM measure we simply solve:

\sum_i \pi_i (1 - P_{ii}) = 0.01    (3.56)

The matrix of P_{ij}s is the PAM matrix. If π is a vector of amino acid frequencies, Pπ is the probability vector after that time period (1-PAM). To consider more distant relationships we can derive the k-PAM matrix. Because this is based on a Markov process we can theoretically achieve this by raising the 1-PAM matrix to the kth power.

P(k) = P^k    (3.57)

The log-odds form of the PAM matrices is often used for scoring sequence alignment reliability. This can be thought of as a log-likelihood ratio test with the null hypothesis being that the sites have aligned by chance.

S_{ij} = \log \frac{P_{ij}}{\pi_i}    (3.58)

The rarer the amino acids in each aligned pair, the lower the probability of a chance alignment and so the greater the significance.

Problems with the PAM model

Besides the problems inherent in Markov models of evolution, the PAM matrices suffer from other problems. Firstly, the model assumes that proteins have average amino acid composition (many don't). Secondly, rare replacements are not observed often enough to resolve relative frequencies properly. Thirdly, errors in PAM(1) are magnified when extrapolated (to, say, PAM(250)). Markov processes don't accurately model evolution. There is no theoretical justification for applying this to divergent alignments. In fact this

approach implies a large loss of information. As evolutionary distance increases, information content decreases. This means a longer region of similarity is needed to get a score high enough to distinguish it from chance. However, regions of similarity are found in narrow blocks as evolutionary distance increases, so it is difficult to find the necessary data. Attempts have been made to update the PAM matrices to make them more accurate. A particular example is the Jones Taylor Thornton model.

BLOSUM

The BLOSUM (blocks substitution matrix) matrices were introduced in 1992 by Henikoff and Henikoff [21]. They take a completely different approach to the PAM matrices. The key point is that the derivation of transition probabilities uses alignments of distantly related sequences. Blocks are conserved regions of local alignments with no gaps. The aim is to obtain a set of scores for matches and mismatches that best favours a correct alignment with each of the other segments in the block relative to an incorrect alignment. This is done by creating a table where each column contains amino acid pair frequencies for the corresponding column in the alignment. This is a 20(20 - 1)/2 × N matrix, where the first term is the number of possible pairs of amino acids and N is the length of the alignment. A score matrix is defined from a log-odds matrix derived from the frequency table. Let F_{ij} be the ij-th entry of the frequency matrix. Let q_{ij} be the observed probability of an ij pair.

q_{ij} = \frac{F_{ij}}{\sum_j F_{ij}}    (3.59)

We can estimate the expected probability of an ij pair occurring as e_{ij}. Let p_i be the probability of i occurring in an ij pair. Let e_{ij} = p_i p_j. Our odds ratio matrix takes the form:

S_{ij} = \frac{q_{ij}}{e_{ij}}    (3.60)

That is, the observed probability over the expected probability that i and j appear together at random. The base 2 logarithm of each ratio is usually multiplied by a scaling factor of 2 and then rounded to the nearest integer. This is the BLOSUM (blocks substitution matrix) in half bit units.
Unlike the PAM matrices, separate matrices have been derived for different time scales. BLOSUM matrices are referred to by the minimum percentage identity between sequences. That is, in BLOSUM 60, sequences that are at least 60% similar are treated as identical. As distances become large we expect to use a BLOSUM matrix with a lower BLOSUM parameter.
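The log-odds construction can be sketched in Python for a toy two-letter alphabet. The pair counts are invented, and `half_bit_scores` is a hypothetical helper name. Two assumptions are made to keep the arithmetic consistent with e_ij = p_i p_j: the count table is symmetric, with each mismatched pair counted once in each order, and q is normalised over the whole table.

```python
import numpy as np


def half_bit_scores(pair_counts):
    """Rounded log-odds scores in half-bit units from a symmetric table of pair counts.

    Assumes every count is positive; a zero count would give a score of minus infinity.
    """
    F = np.asarray(pair_counts, dtype=float)
    q = F / F.sum()            # observed probability of each ordered pair
    p = q.sum(axis=1)          # probability of residue i occurring in a pair
    e = np.outer(p, p)         # expected probability if residues paired at random
    return np.rint(2.0 * np.log2(q / e)).astype(int)


# Hypothetical counts for a two-letter alphabet: matches are four times
# as common as each ordered mismatch.
toy_counts = [[40, 10],
              [10, 40]]
scores = half_bit_scores(toy_counts)
```

In this toy table matches are seen more often than chance predicts and mismatches less often, so the diagonal scores come out positive and the off-diagonal scores negative.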

Problems with the BLOSUM matrices

The main problem with the BLOSUM matrices is that they can be overtrained. That is, if most of the conserved blocks are taken from just a few species then the resulting matrix isn't going to look much like reality. This is a real problem as most of the genomic data available is from very few species. To reduce the contributions from the most closely related members of a family (that is, to reduce multiple contributions of amino acid pairs), sequences are clustered within blocks. Each cluster is weighted as a single sequence. The resulting matrices are analogous to transition matrices, but are estimated without any reference to a rate matrix Q.

3.7 Differences in PAM and BLOSUM

The differences in the PAM and BLOSUM substitution matrices are a consequence of their different approaches to the problem. PAM matrices are derived from a tree-based model that uses matrix multiplication to extrapolate to larger time scales. It is based on mutations in both conserved and variable regions. BLOSUM is derived from pair frequencies in highly conserved blocks. Different weights can be given to different sequence groups. BLOSUM has an advantage in that it was derived from a more representative data set. Hardly any transitions were observed in deriving PAM whereas this was not the case for BLOSUM. This problem has been addressed by re-deriving the models with more data. This is the Jones Taylor Thornton model. The fact that BLOSUM is not tree derived does not seem to be a major disadvantage. BLOSUM generally gives better results when used to score database searches as highly conserved regions usually serve as anchor points. However, PAM-style matrices are still more widely used in phylogenetics.
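The extrapolation by matrix multiplication that distinguishes the PAM approach can be sketched as follows, again for a toy three-state alphabet. The counts, the frequencies and the names `pam1` and `pam_k` are made up for this example. The scaling constant `lam` is one assumed way of meeting the 1% calibration in (3.56), and P is stored with rows as the starting state, so each row sums to one.

```python
import numpy as np


def pam1(change_counts, pi):
    """Toy 1-PAM matrix from a zero-diagonal table of observed changes."""
    A = np.asarray(change_counts, dtype=float)
    pi = np.asarray(pi, dtype=float)
    m = A.sum(axis=1) / A.sum()                 # relative mutability, as in (3.53)
    cond = A / A.sum(axis=1, keepdims=True)     # Pr(i mutates to j | i mutates), (3.54)
    lam = 0.01 / np.dot(pi, m)                  # assumed scaling so that (3.56) holds
    P = lam * m[:, None] * cond                 # off-diagonal entries, (3.55) rescaled
    np.fill_diagonal(P, 1.0 - lam * m)          # probability that i does not change
    return P


def pam_k(P1, k):
    """k-PAM matrix as the k-th power of the 1-PAM matrix, as in (3.57)."""
    return np.linalg.matrix_power(P1, k)


counts = [[0, 6, 2],
          [6, 0, 4],
          [2, 4, 0]]          # hypothetical counts of observed changes
pi = [0.5, 0.3, 0.2]          # hypothetical state frequencies
P1 = pam1(counts, pi)
P250 = pam_k(P1, 250)
```

Any estimation error in `P1` is carried through all 250 multiplications, which is the extrapolation problem described for PAM(250) above.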

Chapter 4 Phylogenetics Tree Reconstruction Methods

This chapter surveys tree reconstruction methods. Phylogenetic reconstruction methods come in many forms. Parametric methods such as maximum likelihood rely on the specification of a model. At the other end of the scale, non-parametric methods, such as maximum parsimony, claim to make no assumptions about the underlying model, as any assumptions we do make are likely to be inadequate. In the middle there are semi-parametric methods: the distance based methods, which require a model to generate distances but then go on to reconstruct the tree by a non-parametric clustering method. Bayesian approaches have also been proposed. This chapter first outlines how the usual statistical indicators are redefined for the problem of phylogenetic inference. The rest of the chapter examines commonly used methods for phylogenetic tree reconstruction.

4.1 Evaluating Reconstruction Methods

Before examining the methods available we need to have an idea of what we want from them. Several often used criteria are discussed below. When evaluating tree reconstruction methods we would like to have available the usual bag of statistical measures. However, it will be seen that defining these with respect to phylogenetics is not straightforward.

Complexity

An important issue to consider in the design of reconstruction methods is the size of the space of trees. There are (2N - 3)!! topologically unique rooted trees with N leaves, and (2N - 5)!! unrooted trees. Clearly, algorithms that involve evaluating the entire space of N leaf trees are not going to

be computationally feasible.

Accuracy

The evaluation criterion that has been given the most attention is accuracy. When we build phylogenies we need some measure of how well the method used estimates the true tree. This is usually evaluated by examining how the method performs with respect to simulated data and biologically well supported phylogenies [25].

Consistency

The behaviour of reconstruction methods as sequences get longer is usually discussed in terms of consistency.

Definition An estimator \hat{T}_n of T is consistent if, for every \epsilon > 0,

\lim_{n \to \infty} P_T(|\hat{T}_n - T| > \epsilon) = 0    (4.1)

In phylogenetics this is interpreted as whether a reconstruction method will return the true tree if its inputs are based on infinitely long sequences. This criterion has been given much consideration as the amount of genome data available increases. In effect, this means that the barrier to success depends only on the researcher having enough sequence to hand. Steel [13] showed that if the frequencies of residues are known and are iid, then a consistent estimator can be found. If not (site rates are variable and/or frequencies are not known), then it can be impossible to estimate a phylogenetic tree consistently. However, Chang [8] has shown that if sites are iid, the correct model is being used and some other restrictions hold, then a consistent estimator can be found. There has been much debate about the usefulness of such a measure given that real sequences will always be finite. Holmes makes the point that, in reality, the longer the sequences are made, the less valid the site independence assumption is [23].

Footnote 1: Holmes provides a useful phylogenetics to statistics term conversion table in [23].

Efficiency

Definition An estimator \hat{T} is efficient if it is unbiased and

\lim_{n \to \infty} \frac{\sigma^2(\hat{T}_n)}{I(T)^{-1}} = 1    (4.2)

Where I(T) is the Fisher information of T. In phylogenetics, we want this to mean that as longer sequences are used the variance in our trees is as low as it can get. The problem with this is that the variance of a tree is not well defined. In fact, the literature generally uses efficiency to describe how quickly a reconstruction method converges to the correct solution as it is given more data. This is usually measured via simulation. Ideally, we would use an analogue of the mean squared error for trees, E(d(T, \hat{T})^2), as this gives an indication of both variance and bias. However, the problem remains of how to define distances between trees and this has not been solved.

Robustness

We know current assumptions made about sequence evolution are inadequate. With this in mind, it is very desirable to know how well a method is likely to behave when wrong assumptions are made. For example, we may want to know the effect of model misspecification on parametric methods. This is usually assessed by simulating data under a fully specified model and then reconstructing the tree with a misspecified model.

Usability in tests

We also need to consider whether a reconstruction method can reject false assumptions in our model of evolution. For example, we want to be able to determine if additional complexity is worthwhile. Since understanding the process of evolution is our primary goal, this should always be kept in mind.

4.2 Parsimony

The maximum parsimony method chooses the best tree as the one where the fewest base changes have occurred in sequences from the root to the leaves. An example of this is seen in fig 4.1. Combinatorially this is the same as finding the minimum Steiner tree for Hamming distance between sequences [23]. In theory this means that all possible assignments of sequences to internal nodes over all possible tree topologies (with the necessary number of leaves) must be evaluated. In practice, heuristics are employed to cut down the search space.
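To give a sense of why an exhaustive search over topologies is impractical, the counts quoted in the Complexity section, (2N - 3)!! rooted and (2N - 5)!! unrooted trees, can be tabulated with a few lines of Python. The function names are chosen for this example, and the sketch assumes N is at least 3.

```python
def double_factorial(n):
    """n!! = n * (n - 2) * (n - 4) * ..., taken to be 1 when n <= 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_trees(n_leaves):
    """Number of rooted binary topologies on n_leaves labelled leaves."""
    return double_factorial(2 * n_leaves - 3)


def unrooted_trees(n_leaves):
    """Number of unrooted binary topologies on n_leaves labelled leaves."""
    return double_factorial(2 * n_leaves - 5)


# (unrooted, rooted) counts for a few leaf numbers
tree_counts = {n: (unrooted_trees(n), rooted_trees(n)) for n in (4, 5, 10)}
```

Four leaves give 3 unrooted and 15 rooted topologies, but ten leaves already give 2,027,025 unrooted and 34,459,425 rooted topologies.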
Recursive algorithms and branch and bound have also been employed to avoid repeating computation (see [11]).
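For a fixed topology, the minimum number of changes can be counted recursively without enumerating every assignment of sequences to internal nodes. One standard way of doing this, which the text does not name, is Fitch's algorithm. The sketch below assumes a rooted binary tree written as nested tuples with sequences at the leaves. The leaf sequences and function names are invented for illustration.

```python
def fitch_site(tree, site):
    """Return (candidate states, minimum changes) for one site.

    A tree is either a leaf (a sequence string) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_changes = fitch_site(tree[0], site)
    right_states, right_changes = fitch_site(tree[1], site)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state, so no change is needed at this node.
        return shared, left_changes + right_changes
    # Otherwise one change is forced on one of the two child branches.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, n_sites):
    """Sum of per-site minimum changes (sites are costed independently)."""
    return sum(fitch_site(tree, site)[1] for site in range(n_sites))


# Two hypothetical topologies on the same four leaf sequences.
tree_a = (("AAA", "AAG"), ("GGA", "AGA"))
tree_b = (("AAA", "GGA"), ("AAG", "AGA"))
score_a = parsimony_score(tree_a, 3)
score_b = parsimony_score(tree_b, 3)
```

Here `tree_a` needs 3 changes and `tree_b` needs 4, so maximum parsimony would prefer the first topology. The search over topologies is still left to the heuristics mentioned above.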

Figure 4.1: An example of how changes are counted using the principle of parsimony (node labels in the figure: AAA, AAA, GGA, AAG, AAA, GGA, AGA). The paths from sequences at the leaves of the tree to the root involve 4 base changes.

Parsimony is based on the concept of Occam's razor. Solutions that make the fewest assumptions are likely to be the best. By looking only at base changes, parsimony claims to require no knowledge of the evolutionary model. Parameterized models are known to be flawed so this non-parametric approach may seem quite reasonable. However, it seems that an underlying model for parsimony exists implicitly. The assumptions are that sites are independent (we cost each substitution separately) and that the probability of substitution is equal for all bases. It has long been established that parsimony is inconsistent. The situation where this happens has been dubbed the Felsenstein zone. However, as mentioned previously, consistency is not always a necessary property for a reconstruction method. Another problem is that different trees can be equally parsimonious for a set of sequences.

4.3 Maximum Likelihood

Maximum likelihood reconstruction is, unsurprisingly, based on the likelihood principle. Given data D and a model, we calculate the likelihood of a hypothesis H as P(D | H), the probability of observing D if H is correct [38]. With respect to phylogenetics, the data D is sequence data and the model is a process of site substitution (see Chapter 3). H is a phylogenetic tree, which is defined by its topology and branch lengths. The aim is to find the tree that maximises the likelihood. We choose this tree as our best guess.
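As a small, concrete instance of the likelihood principle, take the simplest possible hypothesis: two aligned sequences separated by a single branch of length t under the Jukes-Cantor model of Chapter 3. The sketch assumes that t is measured in expected substitutions per site and that sites are independent. The example sequences and function names are invented.

```python
import math


def jc_log_likelihood(seq_x, seq_y, t):
    """Log of P(D | H) where H is a single branch of length t under Jukes-Cantor."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay    # probability a site is unchanged after time t
    p_diff = 0.25 - 0.25 * decay    # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq_x, seq_y):
        # Stationary frequency 1/4 for the first base, then the transition probability.
        log_l += math.log(0.25 * (p_same if a == b else p_diff))
    return log_l


def jc_ml_distance(seq_x, seq_y):
    """Closed-form maximiser of the likelihood (needs fewer than 3/4 of sites to differ)."""
    p = sum(a != b for a, b in zip(seq_x, seq_y)) / len(seq_x)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


x = "ACGTACGTAC"
y = "ACGTTCGTAA"                    # differs from x at 2 of 10 sites
t_hat = jc_ml_distance(x, y)

# A crude grid search over t, as a check that t_hat really maximises the likelihood.
grid = [i / 1000.0 for i in range(1, 2000)]
t_grid = max(grid, key=lambda t: jc_log_likelihood(x, y, t))
```

For a full tree the same idea applies, but the hypothesis H includes the topology as well as every branch length, which is where the computational cost discussed below comes from.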

Figure 4.2: Example ML tree T with nodes x_1, ..., x_5 and branches t_1, ..., t_4. x_i are nodes representing sequences, t_i are branch lengths.

Example Likelihood Calculation

The likelihood of the rooted tree in fig 4.2 can be calculated as follows.

P(x_1, \ldots, x_5 \mid T, t_1, \ldots, t_4) = P(x_1 \mid x_4, t_1) P(x_2 \mid x_4, t_2) P(x_4 \mid x_5, t_1, t_2, t_4)    (4.3)

Where P(x_i \mid x_j, t) = L(x_i, x_j, t), the right hand side being the likelihood of (x_i, x_j) forming a branch of length t in tree T. This can also be transformed into a recursive form.

Characteristics of maximum likelihood

Maximum likelihood estimation is consistent. It borrows its efficiency rating from the more general theory of maximum likelihood estimation. Unlike distance based methods it has been found to be robust to the presence of distant taxa [4]. The maximum likelihood tree is not necessarily unique [38], so this method may not be able to resolve completely which is the best tree. It is also extremely expensive computationally. The three taxa tree shown in fig 4.2 is a trivial example because there is only one possible unrooted tree topology for a three leaf tree. If we are dealing with models that are not reversible we have to consider every possible rooted tree. For n sequences this potentially involves evaluating the likelihood for all (2n - 3)!! rooted tree topologies and all possible assignments of sequences to the hidden internal nodes of the

tree. This is a huge computational problem! (See footnote 2.)

Simplifications

The problem can be simplified by making our usual assumptions. If we assume that sites are iid, we need only consider the evolution of individual sites with respect to the tree. The probability of the tree with respect to the sequences is then just the product of the probabilities of the sites. This provides opportunities for parallelising computation. If we assume that the model of site substitution is reversible then we can determine the probabilities of substitutions from the leaves up: a postorder traversal. In fact, we only need to consider the unrooted tree. This is the pulley principle described by Felsenstein [15].

Search heuristics

This still leaves the problem of calculating the likelihood of every unrooted tree. To cut down the search space, heuristics need to be employed. Felsenstein proposed a branch and bound method where taxa are added incrementally to maximize the likelihood at each stage. The big disadvantage of this approach is that it may not find the optimal tree.

4.4 Is MP the same as ML?

The use of maximum parsimony over maximum likelihood (and vice versa) has been the source of much division in phylogenetics. However, as Holmes aptly puts it:

"The statistical perspective sees the differences between maximum likelihood, maximum parsimony...as much more a matter of degrees of freedom allowed in a model than a matter for religious wars"

The non-parametric nature of MP means that no parameters are pinned down. In effect it needs to optimize over infinite dimensional criteria. A parametric model such as Jukes-Cantor is at the other end of the scale. Variable rate models lie somewhere in the middle. This view is well supported by the work of Steel et al [38], who have found conditions where the MP tree is the ML tree. This happens when there is no common mechanism assumed between sites or lineages.
However, the general evidence from simulations is that MP does not perform as well as ML. This is likely to be due to the implicitly restrained model involved in most parsimony

Footnote 2: In fact it has been shown that maximum likelihood for phylogeny is NP-complete.


Additive distances. Let T be a tree on leaf set S, let w : E → R+ be an edge-weighting of T, and assume T has no nodes of degree two. Let D_{ij} = \sum_{e \in P_{ij}} w(e), where P_{ij} is the path in T from i to j. Then the matrix [D_{ij}] is said to be additive.