On Estimating Topology and Divergence Times in Phylogenetics


UPPSALA DISSERTATIONS IN MATHEMATICS 55

On Estimating Topology and Divergence Times in Phylogenetics

Bodil Svennblad

Department of Mathematics, Uppsala University
UPPSALA 2008

To dad

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Erixon, P., Svennblad, B., Britton, T. and B. Oxelman (2003) Reliability of Bayesian Posterior Probabilities and Bootstrap Frequencies in Phylogenetics. Systematic Biology, 52(5):

II Svennblad, B., Erixon, P., Oxelman, B. and T. Britton (2006) Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics. Systematic Biology, 55(1):

III Svennblad, B. and T. Britton (2007) Improving Divergence Time Estimation in Phylogenetics: More Taxa vs. Longer Sequences. Statistical Applications in Genetics and Molecular Biology, vol 6: Iss. 1, Article 35.

IV Svennblad, B. (2007) Consistent estimation of divergence times in phylogenetic trees with local molecular clocks. Submitted.

V Svennblad, B. (2008) On how to specify non-informative priors on branch lengths in phylogenetics. Submitted.

Reprints were made with permission from the publishers.

Contents

Introduction
    Data
    Algorithms and Methods for Phylogenetic Inference
Likelihood based methods
    Substitution Models
    Maximum Likelihood
        Point estimate of topology
        Confidence measure
    Bayesian inference
        Point estimate of topology and confidence measure
        Priors
Divergence times estimation
    Algorithms for dating nodes
Summary of Papers
    Paper I
    Paper II
    Paper III
    Paper IV
    Paper V
Example: Phylogeny of Primates
Sammanfattning på svenska (Summary in Swedish)
Acknowledgments
Bibliography

1. Introduction

A phylogenetic tree describes the evolutionary history among organisms or entities that share a common ancestor. It is assumed that evolution proceeds in a tree-like manner, with all lineages evolving independently. Phylogenetics has long been used by biologists interested in how existing organisms are related, but it is also used to study the evolution of languages (e.g. [GJ00]), manuscript texts [SDBH04] and cultural artefacts [MH05], and to trace the origin of HIV (e.g. [GMI+90], [RRP+01]). Phylogenetic analysis has even been used in criminal prosecutions as evidence of responsibility for HIV transmission. For this purpose phylogenetics should be used with care, as the direction and timing of the HIV transmission are hard to determine from the phylogenetic analysis [BAV+07]. In this thesis we restrict the discussion to phylogenetic trees for use in the systematics of organisms.

In the biological context, phylogenetic trees were first implied by Charles Darwin in the book On the Origin of Species (1859). Ernst Haeckel coined the term phylogeny and created the first published phylogenetic tree in 1866 in the book Monophyletischer Stammbaum der Organismen.

1.1 Data

In the early days of phylogenetic reconstruction only morphological data were available. Today different genome projects have sequenced the DNA for almost the entire genome of several species (e.g. human, mice and E. coli bacteria, Genome/ hg5yp/index.shtml). DNA consists of two complementary chains twisted around each other to form a right-handed helix. Each chain consists of a sequence of four nucleotides: A (adenine), G (guanine), C (cytosine) and T (thymine). Each nucleotide in one chain is bound to a nucleotide in the other, always so that A in one chain binds to T in the other, and C binds to G. When analyzing DNA sequences, only the sequence of one chain is therefore needed.

Evolution occurs by changes in the DNA sequence. Changes can occur through [Li97, chapter 1]

i. substitution (replacement) of a single nucleotide,
ii. insertion of one or more nucleotides,
iii. deletion of one or more nucleotides,
iv. inversion, where a segment of the DNA sequence is inverted,
v. recombination, which includes crossing-over and gene conversion.

Figure 1.1 visualizes a hypothetical evolution of four species through different types of changes in the DNA sequence.

Figure 1.1: Hypothetical evolution of 4 species, where changes in the DNA sequence of the common ancestor at the top are substitutions (i), insertion (ii), deletion (iii) and inversion (iv), leading to the DNA sequences of the present species {a, b, c, d}.

The evolution in Figure 1.1 can be summarized as a matrix of the DNA sequences of the four species {a, b, c, d}. The matrix should be aligned so that it is possible to follow the evolution site by site (columnwise). Figure 1.1 is summarized in matrix (1.1).

    species   column (site)
    a         A C C T T C G - - C T T
    b         A C C T T C G G A T T C
    c         A C C A - T G G A C C T
    d         T C C A - A G G A C T T        (1.1)

The example above is presented as if evolution were actually observable, which of course is not the case. With real data, only the sequences ACCTTCGCTT, ACCTTCGGATTC, ACCATGGACCT and TCCAAGGACTT for the four species a, b, c and d respectively are observed. When trying to reconstruct the evolutionary history, matrix (1.1) has to

be estimated. How to align the sequences is outside the scope of this thesis. Here we only consider matrices with no gaps; that is, we assume we have a data matrix X where each row represents the DNA sequence of one of the species and each element of the matrix is one of the nucleotides {A, C, T, G}.

1.2 Algorithms and Methods for Phylogenetic Inference

Assuming that the species from which the data matrix X is obtained have a common ancestor and that evolution can be described in a tree-like manner, we want an estimate of the phylogeny that describes the evolutionary relationships between the species. There are several algorithms that either create a tree from X by using pairwise distances between the species, or choose the best tree among several possible ones according to some optimization criterion. We briefly describe some of them here. In the next chapter the so-called likelihood based methods are described in more detail.

Mathematically, a graph is a set of vertices and a set of edges connecting them. A tree is a connected graph with no cycles. The degree of a vertex is the number of edges connected to it. We will mainly consider binary trees, for which all vertices, except the root (if the tree is rooted) and the leaves, are of degree 3. In this thesis we adopt the language used among biologists and talk about nodes instead of vertices and branches instead of edges. The external nodes (tips, leaves, taxa) represent the living organisms for which we have DNA sequences. The internal nodes represent ancestors, for which usually no sequence data are available. The branching order of a tree is called the topology.

In phylogenetic analyses we try to estimate how far the different species are from each other based on the DNA sequences in X. For example, define the distance $d_{ij}$ between the pair of species $i$ and $j$ in X as

\[ d_{ij} = \frac{\#\{k : x_{ik} \neq x_{jk}\}}{n}, \qquad k = 1, \dots, n, \]

where $n$ is the sequence length.
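The distance defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `p_distance` and `distance_matrix` and the toy alignment are hypothetical and do not come from the thesis.

```python
def p_distance(seq_i, seq_j):
    """Proportion of sites at which two aligned, gap-free sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq_i)
    differing = sum(1 for x, y in zip(seq_i, seq_j) if x != y)
    return differing / n


def distance_matrix(X):
    """Pairwise distances d_ij for a dict {species name: sequence}."""
    names = sorted(X)
    return {(i, j): p_distance(X[i], X[j]) for i in names for j in names}


# Toy gap-free alignment (hypothetical data, not taken from the thesis).
X = {"a": "ACCTTCGG", "b": "ACCTTCGA", "c": "ACCATGGA"}
D = distance_matrix(X)
```

A matrix like `D` is the kind of input that clustering methods operate on; the sketch does not build a tree.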
From the distance matrix D a phylogenetic tree can be built by clustering methods like UPGMA [SM58] or Neighbour-joining (NJ) [SN87]. UPGMA and NJ each construct a single tree. There are other methods where all possible trees are compared and one is chosen as the estimate, based on some optimization criterion. One such method is parsimony, where the estimate of the evolutionary tree is the one that minimizes the number of substitutions needed to obtain X. The idea was introduced by Edwards and Cavalli-Sforza ([ECS63], [CSE67]) but

for allele frequencies. Fitch was the first to present an algorithm for calculating the minimal number of substitutions for a given tree [Fit71].

The number of unrooted bifurcating trees for $k$ species is $(2k-5)!! = (2k-5)(2k-7)\cdots 5 \cdot 3$. If we consider rooted trees, the number of possible topologies for $k$ species is $(2k-3)!!$ [DEKM98, p. 164]. Even for a moderate number of species, like $k = 10$, there are about 2 million unrooted, or about 34 million rooted, topologies. Calculating the parsimony lengths for all trees is then very time consuming. There are however algorithms, like branch and bound (see e.g. [Fel04, chapter 5]), that find the best tree without going through every single one.
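The counts quoted above follow directly from the double-factorial formulas. A minimal sketch (the function names are illustrative assumptions):

```python
def double_factorial(m):
    """m!! = m (m-2) (m-4) ..., taken as 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_unrooted(k):
    """Number of unrooted bifurcating topologies for k >= 3 species."""
    return double_factorial(2 * k - 5)


def n_rooted(k):
    """Number of rooted bifurcating topologies for k >= 2 species."""
    return double_factorial(2 * k - 3)


print(n_unrooted(10))  # 2027025
print(n_rooted(10))    # 34459425
```

Since $(2k-3)!! = (2k-3)\,(2k-5)!!$, the number of rooted topologies is the number of unrooted ones multiplied by $2k-3$, the number of branches on which the root can be placed.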

2. Likelihood based methods

In this chapter we explain the two probabilistic methods most commonly used today in phylogenetic inference: Maximum Likelihood and Bayesian inference. Both use the likelihood function to extract information from the data, and both choose the topology among all possible ones according to either the value of the likelihood or the posterior probability. To calculate the likelihood function, a model of evolution is needed.

2.1 Substitution Models

We assume that the sites (positions in the DNA sequence) evolve independently. At a particular site the substitutions are described by a Markov process with the four nucleotides as the states of the chain. Let $q_{ij}$ denote the rate of change from base $i$ to base $j$ ($i \neq j$) during some infinitesimal time period $dt$, that is, $P(X(t+dt) = j \mid X(t) = i) = q_{ij}\,dt$. For DNA sequences these rates can be expressed as a matrix $Q$,

\[
Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu g \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu h \pi_A & \mu j \pi_C & \cdot & \mu f \pi_T \\
\mu i \pi_A & \mu k \pi_C & \mu l \pi_G & \cdot
\end{pmatrix}, \tag{2.1}
\]

where the rows (and columns) correspond to the bases A, C, G and T respectively ([SOWH96, chapter 11]) and each dot marks a diagonal element. Each row sum must be 0, so the diagonal elements $q_{ii}$ are minus the sum of the other entries in the row, $q_{ii} = -\sum_{j \neq i} q_{ij}$. Thus $-q_{ii}$ is the rate at which the Markov chain leaves state $i$. The parameter $\mu$ is the mean instantaneous substitution rate. To allow for different rates between the states, $\mu$ is modified by the parameters $a, \dots, l$. The parameters $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ are the equilibrium frequencies of the respective nucleotide bases. The transition-probability matrix over any time $t > 0$, $P(t) = \{p_{ij}(t)\}$, where $p_{ij}(t) = P(X(t) = j \mid X(0) = i)$, is the solution to the differential equation

\[ \frac{\partial P(t)}{\partial t} = P(t)\,Q, \]

which, with $P(0) = I$, equals

\[ P(t) = e^{Qt}. \tag{2.2} \]

The matrix $Q$ in (2.1) is written in its most general form. Almost all DNA substitution models currently used are time-reversible models, assuming the same amount of change from state $i$ to $j$ as from $j$ to $i$. This assumption can be written as $\pi_i q_{ij} = \pi_j q_{ji}$, where $\pi_i$ is the proportion of time the Markov chain spends in state $i$ and $\pi_i q_{ij}$ is the amount of "flow" from state $i$ to state $j$. Time-reversibility implies that $a = g$, $b = h$, $c = i$, $d = j$, $e = k$ and $f = l$. The model is then called the general time reversible model (GTR).

Time-reversibility does not imply that $Q$ is symmetric. Two models for which $Q$ is symmetric, though, are Jukes-Cantor (JC) [JC69] and Kimura's 2-parameter model (K2P) [Kim80]. Since $q_{ij} = q_{ji}$ for those models, $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$. The simplest possible model of evolution, Jukes-Cantor, assumes that when a change occurs all bases are equally probable, that is, $a = b = \cdots = f$ in (2.1). K2P assumes that changes defined as transitions (see Figure 2.1) occur with one rate and transversions with another rate, that is, $b = e$ and $a = c = d = f$.

Figure 2.1: Transitions are substitutions between A and G or between T and C. All other substitutions are called transversions.

A generalization of K2P is the HKY85 model, where different equilibrium base frequencies are allowed [HKY85] (see also section 5). More complex models let rates vary between sites, by assuming the rate for any site to be a random variable drawn from a statistical distribution. The most commonly used distribution is the gamma distribution with shape parameter $\alpha$ and scale parameter $1/\alpha$. For more detailed descriptions of nucleotide substitution models see e.g. [Yan06, chapter 1] and [SOWH96, chapter 11].
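As a worked illustration of (2.1) and (2.2), the sketch below builds the Jukes-Cantor rate matrix and compares a numerical matrix exponential with the closed-form transition probabilities. It is a minimal sketch, assuming $a = \cdots = l = 1$ and $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$ in (2.1), so that every off-diagonal rate is $\mu/4$. The function names are illustrative, and the truncated Taylor series is chosen only to keep the example dependency-free; it is adequate for small $\mu t$ but is not a general-purpose matrix exponential.

```python
import math


def jc_rate_matrix(mu):
    """Q of (2.1) with a = ... = l = 1 and all pi = 1/4 (Jukes-Cantor)."""
    off = mu / 4.0
    return [[-3.0 * off if i == j else off for j in range(4)] for i in range(4)]


def mat_mul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def expm_taylor(Q, t, terms=40):
    """Approximate P(t) = e^{Qt} by a truncated Taylor series."""
    Qt = [[q * t for q in row] for row in Q]
    P = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    term = [row[:] for row in P]  # current term (Qt)^m / m!, starting at m = 0
    for m in range(1, terms):
        term = [[v / m for v in row] for row in mat_mul(term, Qt)]
        P = [[P[i][j] + term[i][j] for j in range(4)] for i in range(4)]
    return P


def jc_closed_form(mu, t):
    """(p_ii(t), p_ij(t)) for the Q above: the eigenvalues of Q are 0 and -mu."""
    same = 0.25 + 0.75 * math.exp(-mu * t)
    diff = 0.25 - 0.25 * math.exp(-mu * t)
    return same, diff


P = expm_taylor(jc_rate_matrix(1.0), 0.5)
same, diff = jc_closed_form(1.0, 0.5)
```

As $t$ grows, every entry of $P(t)$ tends to $1/4$, the equilibrium frequency under this model.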

2.2 Maximum Likelihood

The method of Maximum Likelihood is a well-founded method in statistical inference in general, popularized by R.A. Fisher in the 1920s. For phylogenies, Edwards and Cavalli-Sforza introduced Maximum Likelihood [ECS64], but for gene frequency data. Felsenstein [Fel81] brought the method to phylogenetic inference based on nucleotide sequences.

2.2.1 Point estimate of topology

Assume that we have a $k \times n$ matrix, X, with aligned DNA sequences from $k$ species, without insertions or deletions. For a given tree topology $\tau$, the likelihood function $L(\tau, b^{(\tau)}, \theta \mid X)$, where $b^{(\tau)}$ are the corresponding branch lengths and $\theta$ the parameters of the substitution model, can be calculated as the product over all sites of the probability of the observed data at each site. The likelihood for each topology $\tau_i$ is maximized with respect to $b^{(\tau_i)}$ and $\theta$. The estimated topology $\hat{\tau}^{(ML)}$ is the topology with the largest maximized likelihood (see Figure 2.2), that is,

\[ \hat{\tau}^{(ML)} = \arg\max_i \left\{ \max_{b, \theta} L(\tau_i, b^{(\tau_i)}, \theta \mid X) \right\}. \]

Figure 2.2: For each topology the likelihood function is maximized with respect to branch lengths and model parameters, here visualized as if branches were univariate and without model parameters (for 5 species the space of branches is 7-dimensional). The topology with the largest maximized likelihood is the one chosen as the estimate of the phylogeny, $\hat{\tau} = \tau_1$ in this example.
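To make "the product over sites" concrete, the following sketch evaluates the log-likelihood of a fixed rooted topology with fixed branch lengths under the Jukes-Cantor model, using the standard recursive (pruning) computation over the tree. It is a sketch under several assumptions that are not from the thesis: the nested-tuple tree encoding, the function names, and the toy data are all hypothetical, and branch lengths are measured in expected substitutions per site. The maximization over branch lengths and topologies is not shown.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, b):
    """JC transition probability over a branch of length b (subst. per site)."""
    e = math.exp(-4.0 * b / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def partial(node, column):
    """Conditional likelihoods of the data below `node`, one per base.

    A leaf is a species name; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [1.0 if base == column[node] else 0.0 for base in BASES]
    result = [1.0] * 4
    for child, b in node:
        child_part = partial(child, column)
        for s in range(4):
            result[s] *= sum(jc_prob(BASES[s], BASES[c], b) * child_part[c]
                             for c in range(4))
    return result


def log_likelihood(tree, X):
    """Sum over sites of the log site likelihood (sites are independent)."""
    names = list(X)
    n = len(X[names[0]])
    total = 0.0
    for site in range(n):
        column = {name: X[name][site] for name in names}
        site_lik = sum(0.25 * v for v in partial(tree, column))  # pi = 1/4
        total += math.log(site_lik)
    return total


# Hypothetical three-species example: ((a, b), c) with made-up branch lengths.
ab = (("a", 0.1), ("b", 0.2))
tree = ((ab, 0.05), ("c", 0.3))
X = {"a": "ACGT", "b": "ACGA", "c": "ACTT"}
ll = log_likelihood(tree, X)
```

For two leaves joined at the root, the site likelihood reduces to $\tfrac14\,p_{xy}(b_1 + b_2)$, which gives a simple check of the recursion.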

2.2.2 Confidence measure

The method described above gives an estimate of the topology. As always with estimates, it is interesting to know how certain the estimate is (compare with e.g. standard deviations and confidence intervals in classical statistics). Is the estimated tree outstanding in describing the data at hand, or are there other tree topologies with almost the same optimized likelihood value? (See Figure 2.3 for a visualization of the uncertainty problem.) Felsenstein suggested a nonparametric bootstrap procedure to obtain a measure of confidence [Fel85]. The procedure works with parsimony or other nonprobabilistic algorithms, but also with the Maximum Likelihood method.

Figure 2.3: The likelihood functions for two topologies, visualized as if they depended on a single branch length (in reality, if more than one topology is possible the branch space is at least 5-dimensional). In both cases $\hat{\tau} = \tau_1$, as $\tau_1$ has the largest maximized likelihood. To the left, there is another topology with almost as large a maximized likelihood, while to the right the maximized likelihood of the other topology is much smaller than that of $\tau_1$.

Efron introduced the idea of the bootstrap in The Annals of Statistics in 1979. The technique is a general method for creating measures of uncertainty and bias. Assume we have a sample from an unknown density $f$ and an unknown parameter $\theta = g(f)$ (e.g. the mean). The parameter is estimated from the sample points (e.g. $\hat{\theta} = \bar{x}$). If the density $f$ is known, the distribution of the estimator can be derived. The variance of the estimator can be calculated or estimated, and hence a confidence interval describing the uncertainty of $\hat{\theta}$ may be calculated. If $f$ is unknown but new samples could be drawn from $f$, new estimates $\hat{\theta}_i$ would be obtained. These could then be used to estimate the variance of the original estimate.

Usually, we only have our original sample and cannot draw new samples from the unknown distribution. The idea of the bootstrap

is that we hopefully have a representative sample, so that the empirical distribution $F_n$ is approximately equal to the true $F$ (see Figure 2.4). Drawing samples from $F_n$ is then approximately equivalent to drawing samples from $F$. This is done by creating a pseudo-sample by drawing data points from the original sample with replacement until the number of observations in the new sample equals the number of observations in the original one. Repeating this a large number of times gives many estimates of the parameter. These estimates can be seen as a sample of the estimator, and the variance of the estimator is estimated by the sample variance.

Figure 2.4: To the left is the unknown density from which the sample is drawn. Data points in the sample are marked with circles. To the right is the true probability distribution (solid line) and the empirical probability distribution (dashed line).

Bootstrap for phylogenetic inference with Maximum Likelihood (or another estimation method) works as follows. Denote by $\hat{\tau}^{(ML)}$ the estimate of the topology obtained from the data matrix X as described in section 2.2.1. A pseudo-matrix of the same size as X is created by drawing columns of X with replacement. The phylogenetic analysis is performed on the pseudo-matrix, giving the estimate $\hat{\tau}^{*(ML)}$. This procedure is repeated a large number of times, and the bootstrap support value is the fraction of bootstrap replicates giving the same topology as the original data.

Each column in the data matrix X of aligned DNA sequences for $k$ species is one out of $4^k$ possible ones (each entry in a column is one of the four nucleotides {A, C, T, G} and we consider all possible combinations of the $k$ entries in the column). Since we assume the sites to be independent, the order of the sites does not influence the likelihood function. Hence we

can consider $\mathbf{n} = (n_1, \dots, n_{4^k})$ instead of X, where $n_i$ is the number of columns of a specific pattern (e.g. $n_1$ is the number of columns of pattern (AAA...A)) and $\sum_i n_i = n$. Each combination of $n_i$ with $\sum_i n_i = n$ gives a Maximum Likelihood estimate (see Figure 2.5). In the figure the filled circle denotes the original data and the other circles denote other possible values of $\mathbf{n}$. When creating pseudo-matrices, each replicate results in one of the circles. The estimate of the topology does not change for data points near the original data point, but changes abruptly as the border to the next topology estimate is crossed. This situation differs from the well-behaved continuous density used above to describe the idea of the original bootstrap.

Figure 2.5: The $\mathbf{n}$-space is $4^k$-dimensional, where $k$ is the number of species, but is here visualized in two dimensions. The circles denote possible values of $\mathbf{n}$, where the original data is denoted by the filled circle. Each possible value of $\mathbf{n}$ gives an estimate of the topology $\tau$; here the regions of three possible topologies are shown.

The bootstrap support value is an estimate of the probability of getting the same estimate as the original one when considering pseudo-replicates. If the original data matrix is near the border of a region, the estimate is uncertain. The probability of getting a data point in another region when creating a pseudo-replicate is then fairly large, and hence the bootstrap frequency will be small. On the other hand, if the original data point is far away from the border to another region, the estimate is robust, which shows as a large bootstrap support value. The interpretation of the bootstrap support value as a confidence measure is however not clear.
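The column-resampling procedure described above can be sketched as follows. This is a hedged illustration, not the procedure of any particular software: `bootstrap_support` takes any estimator as an argument, and the `closest_pair` estimator used here is a deliberately trivial stand-in (the pair of species with the fewest differing sites), since a full Maximum Likelihood search would not fit in a short example. All names and data are hypothetical.

```python
import random


def bootstrap_support(X, estimator, replicates=200, seed=0):
    """Fraction of column-resampled pseudo-matrices whose estimate equals
    the estimate obtained from the original data matrix X."""
    rng = random.Random(seed)
    names = sorted(X)
    n = len(X[names[0]])
    original = estimator(X)
    hits = 0
    for _ in range(replicates):
        # Draw n column indices with replacement: same size as X.
        cols = [rng.randrange(n) for _ in range(n)]
        pseudo = {name: "".join(X[name][c] for c in cols) for name in names}
        if estimator(pseudo) == original:
            hits += 1
    return original, hits / replicates


def closest_pair(X):
    """Stand-in estimator (NOT a tree search): the pair of species with the
    fewest differing sites; ties go to the first pair in sorted order."""
    names = sorted(X)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs,
               key=lambda p: sum(x != y for x, y in zip(X[p[0]], X[p[1]])))


# Hypothetical toy alignment.
X = {"a": "ACGTACGTAC", "b": "ACGTACGTAA", "c": "ACGAACTTCA", "d": "TCGAACTTCG"}
estimate, support = bootstrap_support(X, closest_pair)
```

Because every replicate yields a discrete answer, the support is a proportion of matching replicates rather than a variance, which mirrors the abrupt changes between regions discussed above.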

2.3 Bayesian inference

Generally, Bayesian inference combines earlier knowledge or understanding (prior beliefs) with currently measured data. The prior beliefs are updated through Bayes' theorem. For an unknown continuous quantity $\theta$, Bayes' theorem can be expressed as

\[ f(\theta \mid x_1, \dots, x_n) = \frac{L(\theta \mid x_1, \dots, x_n)\, g(\theta)}{\int L(\theta' \mid x_1, \dots, x_n)\, g(\theta')\, d\theta'}, \tag{2.3} \]

where $f(\cdot)$ denotes the posterior probability density of the unknown $\theta$ after observing data $x_1, \dots, x_n$, $L(\theta \mid x_1, \dots, x_n)$ is the likelihood function of the data and $g$ denotes the prior density function of $\theta$, which is given prior to observing data. The denominator, the probability of the data at hand, is a normalizing constant where the integral is over the support of $\theta$. For a deeper understanding of Bayesian inference in general, see e.g. [BT73], [Pre03].

In phylogenetics, Bayesian inference was suggested in the last decade of the 20th century ([Mau96], [Li97], [RY96]). It became popular when Markov Chain Monte Carlo (MCMC) methods were introduced into the phylogeny problem (e.g. [LS99], [YR97]), making it possible to sample from the posterior distribution and resulting in software such as BAMBE [SL98] and MrBayes [HR01]. For the general theory of MCMC see e.g. [GRe96].

2.3.1 Point estimate of topology and confidence measure

Applied to the phylogeny problem, Bayes' theorem can be expressed as

\[ f(\tau, b^{(\tau)}, \theta \mid X) = \frac{L(\tau, b^{(\tau)}, \theta \mid X)\, g(\tau, b^{(\tau)}, \theta)}{\sum_{\tau'} \int\!\!\int L(\tau', b^{(\tau')}, \theta' \mid X)\, g(\tau', b^{(\tau')}, \theta')\, db^{(\tau')}\, d\theta'}, \tag{2.4} \]

where the sum in the denominator is over all possible topologies $\tau'$. The posterior probability density, the left-hand side of (2.4), is a joint probability density for all parameters. To obtain the marginal posterior probability for a given topology $\tau_i$, the other parameters, that is, the branch lengths $b^{(\tau_i)}$ and the model parameters $\theta$, are integrated out. Hence,

\[ f(\tau_i \mid X) = \frac{\int\!\!\int L(\tau_i, b^{(\tau_i)}, \theta \mid X)\, g(\tau_i, b^{(\tau_i)}, \theta)\, db^{(\tau_i)}\, d\theta}{\sum_{\tau'} \int\!\!\int L(\tau', b^{(\tau')}, \theta' \mid X)\, g(\tau', b^{(\tau')}, \theta')\, db^{(\tau')}\, d\theta'}. \tag{2.5} \]

The estimate of the topology is the one with the largest posterior probability, that is,

\[ \hat{\tau}^{(MPP)} = \arg\max_i \{ f(\tau_i \mid X) \}, \tag{2.6} \]

where MPP is an acronym for Maximum Posterior Probability. The interpretation of the posterior density is clear: it is the density given the model, prior and data. The posterior probability is used as the confidence measure of $\hat{\tau}^{(MPP)}$, and hence the support value is the probability of the estimate being the true phylogeny if the correct model and prior are used.

In a frequently cited paper by Efron et al. [EHH96], it is claimed that bootstrap support values can be interpreted as posterior probabilities under certain circumstances. Other authors have noticed empirically that Bayesian posterior probabilities for the best supported topology are significantly higher than the corresponding non-parametric bootstrap frequencies (e.g. [SGN02], [WZHH02], [KMCD01], [MEO+01]). In Paper I we investigate this with simulated data. In Paper II it is shown that the statement of Efron et al. is true under the conditions given there, but that those conditions are violated in general phylogenetic inference. Therefore the two support measures should not be expected to be equal. Britton et al. [BSEO07] give a mathematical argument for the Bayesian support being larger than the bootstrap support value.

2.3.2 Priors

The Bayesian approach in phylogenetics requires priors on the topology $\tau$, the corresponding branch lengths $b^{(\tau)}$ and the model parameters $\theta$. As more data are added, the priors have less influence on the posterior density. However, they still have to be specified. The problem of specifying priors is, as pointed out by Huelsenbeck et al. [HLMR02], either the strength or the weakness of the Bayesian method. If prior knowledge is available, why not use it? On the other hand, how should one specify priors when prior knowledge is not available?

If one knows that some topologies are impossible, the prior probabilities for those can be set to zero and there will be no positive posterior probability for them. If no weighting of topologies is wanted a priori, a discrete uniform prior is usually used.

When no prior knowledge is available for branch lengths and model parameters, priors representing the lack of knowledge are wanted, so-called non-informative priors. There is however no universal way to choose such non-informative priors. One reason for this is that there is no unique definition of the term in the literature. Two different priors are often used in phylogenetics for branch lengths: either a flat prior on a large interval ($b \sim U(0, M)$, where $M$ is large) or an exponential prior ($b \sim \mathrm{Exp}(\lambda)$, where $\lambda$ has to be specified or drawn from a hyperprior distribution). In Paper V we derive the so-called Jeffreys prior for branch lengths using the Jukes-Cantor model of evolution.
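The effect of the topology prior on the marginal posterior (2.5) can be illustrated with a short sketch. It assumes, purely for illustration, that the prior factorizes as $g(\tau, b^{(\tau)}, \theta) = g(\tau)\, g(b^{(\tau)}, \theta \mid \tau)$ and that the log marginal likelihood of each topology (the double integral over branch lengths and model parameters) has already been computed by some other means; the numbers below are made up, and the function name is hypothetical.

```python
import math


def topology_posterior(log_marginal_lik, prior):
    """Posterior probability of each topology from log marginal likelihoods
    and prior probabilities, normalized as in the denominator of (2.5)."""
    # Topologies with zero prior contribute nothing and get zero posterior.
    log_terms = {t: log_marginal_lik[t] + math.log(p)
                 for t, p in prior.items() if p > 0}
    m = max(log_terms.values())  # subtract the maximum to avoid underflow
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_terms.values()))
    post = {t: 0.0 for t in prior}
    for t, v in log_terms.items():
        post[t] = math.exp(v - log_norm)
    return post


# Made-up log marginal likelihoods for three topologies.
log_ml = {"tau1": -1000.0, "tau2": -1001.0, "tau3": -1003.0}

uniform = topology_posterior(log_ml, {"tau1": 1 / 3, "tau2": 1 / 3, "tau3": 1 / 3})
excluded = topology_posterior(log_ml, {"tau1": 0.0, "tau2": 0.5, "tau3": 0.5})
```

With the discrete uniform prior the prior cancels, so the posterior depends only on the marginal likelihoods; setting the prior of `tau1` to zero removes it from the posterior regardless of how well it fits the data.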

3. Divergence times estimation

Many methods in phylogenetics, including the likelihood based ones described in the previous chapter, give the estimated branch lengths in expected number of substitutions per site.

Figure 3.1: To the left is the estimated tree, $\hat{\tau}$, with branch lengths proportional to the expected number of substitutions per site. To the right is the corresponding time tree, where branch lengths are proportional to evolutionary time.

In Figure 3.1 the hypothetical estimate $\hat{\tau}$ for the four species {a, b, c, d} is shown to the left. Since the speciation between a and b, the a-lineage has evolved more than the b-lineage. The DNA of species b is more similar to the DNA of the ancestor than the DNA of species a is. Once the topology is estimated, that is, once the rooted tree $\hat{\tau}$ to the left in Figure 3.1 is obtained, the divergence times of the internal nodes may be of interest. How many years ago did the common ancestor of a and b exist? We are interested in estimating the times of the divergences $t_0$, $t_1$ and $t_2$ in the tree to the right in Figure 3.1. The topology, that is, the branching order, is obtained from $\hat{\tau}$, but now the leaves are at the same level. The length of the path from the root to a leaf is equal for all leaves. A tree with this property is called an ultrametric tree.

3.1 Algorithms for dating nodes

The left-hand tree in Figure 3.1 can be estimated consistently as the sequence length tends to infinity (assuming the model is correct). The branch lengths, proportional to the expected number of substitutions per site, are estimates of the product of rates and times. To estimate the

time tree to the right in Figure 3.1 consistently, some assumption on how the rates vary over the tree is needed [Bri05]. Different molecular clock assumptions exist, from the global molecular clock with one substitution rate over the entire tree to different local clocks with different rates in different parts of the tree. The extreme is a separate rate for every branch.

With more and more data, and with a method that consistently estimates the left-hand tree in Figure 3.1, $\hat{\tau}$ should come closer and closer to the right-hand tree if the global clock is valid, that is, if all lineages have evolved at equal rates. If the global substitution rate is known, or can be estimated from a calibration node, it is easy to estimate the divergence times consistently. The algorithm of Mean Path Length (MPL), first introduced by Bremer and Gustafsson [BG97] and further developed by Britton et al. [BOVB02], is a method that implicitly uses the global clock. It is assumed that the branch lengths are proportional to the number of substitutions between the nodes. The MPL of a node is the mean of the sum of the observations along paths from the node to its descending leaves. The algorithm allows one calibration point.

The global clock can also be assumed when phylogenetic inference is based on the ML method, enforcing an ultrametric tree. This is implemented in software such as PAUP* [Swo02], PHYLIP [Fel05] and BASEML [Yan97].

For real data, the molecular clock assumption usually does not hold [LF74]. If it is violated, the left-hand tree $\hat{\tau}$ can still be estimated consistently, but it will not tend to the right-hand tree when the sequence length $n$ increases. An algorithm that is based on the MPL method but allows several calibration points, and thereby corrects for deviations from the molecular clock, is PATHd8 [BAJ+07]. The calibration points define segments in the tree. A local molecular clock model, where the same rate is assumed within a segment but may differ between segments, is implicitly assumed. The properties of the algorithm are investigated in Paper IV.

The local molecular clock with rate constancy in parts of the tree is also implemented in e.g. BASEML and r8s [San03]. The definition of the segments of the tree that share the same local clock is a crucial step for the method. The local molecular clock described above lies between two extremes: the global clock, assuming a constant rate over the entire tree, and a model allowing independent rates for all branches. There are methods that do not require the segments of the tree to be explicitly pre-defined but instead smooth the rate change over the tree by penalizing rates that change too fast between neighbouring branches (the NPRS method implemented in r8s [San03]). For a further review of dating methods, see Rutschmann [Rut06].
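The mean path length idea with a single calibration point can be sketched as follows. This is only an illustration of the definition given above (the mean, over descending leaves, of the summed branch lengths from the node to the leaf), not the published MPL or PATHd8 implementation; the nested-tuple tree encoding, the function names, the branch lengths and the calibration age are all hypothetical.

```python
def path_lengths(node):
    """Lengths of all paths from `node` down to its descending leaves.

    A leaf is a species name; an internal node is a tuple of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [0.0]
    lengths = []
    for child, b in node:
        lengths.extend(b + d for d in path_lengths(child))
    return lengths


def mpl(node):
    """Mean path length of a node."""
    paths = path_lengths(node)
    return sum(paths) / len(paths)


def date_node(node, calibration_node, calibration_age):
    """Age of `node` under a global clock, given one calibrated node."""
    rate = mpl(calibration_node) / calibration_age  # substitutions per time unit
    return mpl(node) / rate


# Hypothetical tree ((a:0.3, b:0.1):0.2, c:0.5), root assumed to be 10 time units old.
ab = (("a", 0.3), ("b", 0.1))
root = ((ab, 0.2), ("c", 0.5))
age_ab = date_node(ab, root, 10.0)
```

Here the paths below the node `ab` have lengths 0.3 and 0.1, so its MPL is 0.2 even though the two lineages have evolved by different amounts; averaging is what lets the method return one age for the node.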

4. Summary of Papers

Many interesting areas of molecular biology can be studied once genomes have been successfully sequenced. We are interested in the evolutionary relationships between organisms. Already the introductory part of this thesis (chapters 1-3) raises issues such as the properties of the methods for reconstructing the phylogeny and for estimating the divergence times. Almost all the questions we have studied have been asked by biologists who use phylogenetics in their everyday profession, who notice certain behaviour of the methods and wonder whether there are theoretical explanations for what they observe.

In this chapter the aim and the content of the papers are summarized. My contribution to Paper I is the regression part. For the other papers I am the main author, responsible for the calculations and explanations given therein.

4.1 Paper I: Reliability of Bayesian Posterior Probabilities and Bootstrap Frequencies in Phylogenetics

Bootstrap support values (see section 2.2.2) have been used in phylogenetics for different reconstruction methods ever since Felsenstein suggested them in 1985 [Fel85]. For methods like parsimony or Maximum Likelihood the bootstrap procedure is very time consuming (except when only a few species are considered). The reason is that for every pseudo-sample all possible topologies should be studied, calculating the parsimony length or optimizing the likelihood value (see section 2.2.1). When MCMC was introduced in phylogenetics (e.g. [YR97], [LS99]), making Bayesian inference possible, it was much faster and gave a support value in the form of posterior probabilities (see section 2.3.1).

Efron et al. [EHH96] stated that bootstrap support values could be interpreted as posterior probabilities using a non-informative prior. It therefore seemed possible to use the new, faster method to obtain approximations of the measure of support that had been in use for some time. However, several papers (e.g. [WZHH02], [KMCD01], [MEO+01]) had noticed that Bayesian support values were often larger than the corresponding bootstrap values. This, together with the unclear interpretation of a bootstrap value, started the work on this paper.

24 Here we use simulated data from a fixed unrooted tree of 5 species {A, B, C, D, E}, shown in Figure 4.1, using the evolution model of Jukes- Cantor and GTR+Γ (see section 2.1). By using simulated data we actually know the true phylogeny. D A E B C Figure 4.1: The fixed unrooted tree from which data sets were simulated. The numbers along branches represent expected number of substitutions per site. The two short internal branches without numbers are both of length Support values for topologies with sequences from more than four species are often given for clades (groups of species) rather than for entire topologies. A support value for AB is the support for A and B to form a subtree. Contributions for this support value may also come from other topologies than the estimated one. E.g. by letting C and E change places in Figure 4.1 the topology is changed but A and B are still grouped as closely related. For a large number of simulated data sets the estimates of the topology were calculated using Maximum Likelihood and Bayesian inference with the same model of evolution as the data sets had been created from. For the Bayesian inference, uniform priors on topology and branch lengths (b (τ) U (0, 10)) were used, which were the default values of Mr- Bayes2.01 [HR01] used for the analysis. With ML bootstrap support values were also calculated where all parameters were reoptimized for each pseudo-sample. The different support values were paired. The Wilcoxon signed-rank test shows that Bayesian inference yields significantly higher support values than Maximum Likelihood bootstrap values for well supported clades. From the results a logistic regression was fitted expressing π, the probability that the true clade has been found as a function of the support value, x. Figure 4.2 shows the result where the logit model, π(x) x 1 x log 1 π(x) = α + β log has been used for the posterior probabilities (solid line) and the bootstrap values (dashed line). 
The data sets were also analyzed with the wrong model in order to investigate how the support values behave under model misspecification. The analysis shows that the risk of drawing erroneous conclusions is higher with Bayesian inference than with Maximum Likelihood bootstrapping.
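The bootstrap support values used in Paper I rest on pseudo-samples obtained by resampling alignment columns with replacement. The sketch below shows that resampling step only. The function `toy_ab_closest` is a deliberately simple stand-in for the full Maximum Likelihood tree estimation (it is not ML), and the alignment is invented toy data.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw one pseudo-sample: resample the columns (sites) with replacement."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(alignment, has_clade, n_replicates=200, seed=1):
    """Fraction of pseudo-samples for which `has_clade` returns True."""
    rng = random.Random(seed)
    hits = sum(has_clade(bootstrap_alignment(alignment, rng))
               for _ in range(n_replicates))
    return hits / n_replicates


def toy_ab_closest(alignment):
    """Toy stand-in for tree estimation: does A differ less from B than from C?"""
    a, b, c = alignment[0], alignment[1], alignment[2]
    d_ab = sum(x != y for x, y in zip(a, b))
    d_ac = sum(x != y for x, y in zip(a, c))
    return d_ab < d_ac


# Invented sequences for species A, B and C.
ALIGNMENT = ["ACGTACGTAC", "ACGTACGAAC", "ACTTAGGTCC"]
SUPPORT = bootstrap_support(ALIGNMENT, toy_ab_closest)
print(f"toy bootstrap support for AB: {SUPPORT:.2f}")
```

In a real analysis `has_clade` would re-estimate the tree for each pseudo-sample, reoptimizing all parameters, and report whether the clade of interest appears in it.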

Figure 4.2: Reproduced figure of the logistic regression for the probability $\pi$ that the true clades are found as a function of the support value $x$, for Bayesian posterior probabilities (BAYES, solid line) and Maximum Likelihood bootstrap support values (MLBOOT, dashed line). The help line indicates the support values that correspond to 95% of the clades being correctly estimated.

4.2 Paper II

Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics

Paper I indicated that there is a difference between the support values obtained by bootstrapping Maximum Likelihood and by Bayesian inference in phylogenetics, despite the theoretical claim of their approximate equivalence [EHH96]. The aim of this paper is to investigate the conditions needed for the statement of Efron et al. to hold, in order to understand why the systematic difference between the two support measures appears.

Consider aligned DNA sequences of length $n$ from $k$ species, with no gaps. Each column of the data matrix $X$ is then one of $4^k$ possible columns. Denote the possible columns by $X_1, \dots, X_{4^k}$, where $X_1 = (A, A, \dots, A)$, $X_2 = (C, C, \dots, C)$, etc. The proportions of $X_1, \dots, X_{4^k}$ generated from the true phylogeny $\tau$ with branch lengths $b^{(\tau)}$ are $p = (p_1, \dots, p_{4^k})$, where $\sum_i p_i = 1$. Each $\{\tau, b^{(\tau)}\}$ induces a different $p$-vector. The continuous $p$-space can, at least theoretically, be divided into regions representing different topologies $\tau_i$ (see the left-hand side of Figure 4.3).

Figure 4.3: Each $\{\tau, b^{(\tau)}\}$ induces a different $p$-vector. The continuous $p$-space can be divided into regions representing different topologies $\tau_i$. The sample space of $\mathbf{n}$ is a discrete space which can be divided into regions where the estimate $\hat\tau = \tau_i$.

The data matrix $X$ can be represented as $\mathbf{n} = (n_1, \dots, n_{4^k})$, where $n_i$ is the number of columns in $X$ that equal $X_i$. Using $\mathbf{n}$ it is possible to estimate $p$ (e.g. $\hat p_i = n_i / n$). We are, however, not primarily interested in $p$ but in $\tau$. With a consistent estimation method the regions of the $\mathbf{n}$-space giving $\hat\tau = \tau_i$ should be close to the corresponding regions of the $p$-space (see Figure 4.3).

Efron et al. stated in [EHH96] that "The bootstrap probability that $\hat\tau^* = \hat\tau$ is almost the same as the a posteriori probability that $\tau = \hat\tau$ starting from an uninformative prior on $p$", where $\hat\tau^*$ is the estimate of the topology for a bootstrap replicate. This statement is proven in this paper. For their statement to hold, the prior must be placed on $p$. In Bayesian phylogenetics this parameter is not considered; priors are instead placed on the topology $\tau$, the corresponding branch lengths $b^{(\tau)}$ and the model parameters $\theta$.
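The representation of the data matrix by column counts, and the estimate $\hat p_i = n_i / n$, can be sketched as follows. The alignment is invented toy data and the function names are ours. Only the columns that actually occur are stored; all other entries of $\mathbf{n}$ are zero.

```python
from collections import Counter


def column_counts(alignment):
    """Count how often each distinct column (site pattern) occurs."""
    return Counter(zip(*alignment))


def estimated_proportions(alignment):
    """Estimate p by the relative frequencies n_i / n of the observed columns."""
    counts = column_counts(alignment)
    n_sites = sum(counts.values())
    return {column: c / n_sites for column, c in counts.items()}


# Toy data: k = 4 species (rows), n = 5 sites (columns).
ALIGNMENT = ["AACGT", "AACGA", "AATGT", "AACGT"]
P_HAT = estimated_proportions(ALIGNMENT)
for column, p in sorted(P_HAT.items()):
    print("".join(column), p)
```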
Using the Jukes-Cantor model of evolution (which eliminates $\theta$), $k = 4$ species, a discrete uniform prior on $\tau$ and either a uniform ($U(0, M)$ for different values of $M$) or an exponential ($\mathrm{Exp}(\lambda)$ for different values of $\lambda$) prior for the branch lengths $b^{(\tau)}$, we studied the results of the likelihood based methods. Under the Jukes-Cantor model several possible data patterns contribute to the likelihood function in the same way. Hence $p$ (and $\mathbf{n}$) is, for $k = 4$ species, of dimension 15 instead of $4^4 = 256$ (see Table 4.1).

We show analytically in this paper that for Maximum Likelihood only 3 of the patterns are separately informative, in the sense that a data set consisting of only one of the patterns gives a unique estimate of the topology, while 9 of the patterns are separately informative in the Bayesian inference. Since the two methods differ, the regions of the $\mathbf{n}$-space where $\hat\tau = \tau_i$ do not always coincide. Hence the two measures of support cannot be expected to be approximately equal.

Pattern no.   Pattern   Description
1             XXYY      Groups of two nucleotides, equal within the group
2             XYXY      but not equal between groups.
3             XYYX
4             XXXY      One nucleotide differs from the rest.
5             XXYX
6             XYXX
7             YXXX
8             XXYZ      One group with two equal nucleotides; the other
9             YZXX      two differ from the group and from each other.
10            XYXZ
11            XYZX
12            YXXZ
13            YXZX
14            XYZU      All nucleotides different.
15            XXXX      All nucleotides equal.

Table 4.1: With the Jukes-Cantor model of evolution and with 4 taxa there are 15 different patterns contributing to the likelihood in different ways.

              Estimate of topology
Method        τ_1          τ_2           τ_3
ML
MPP           1, 8, 9      2, 10, 13     3, 11, 12

Table 4.2: Separately informative patterns favouring topologies $\tau_1 = \{(a, b), (c, d)\}$, $\tau_2 = \{(a, c), (b, d)\}$ and $\tau_3 = \{(a, d), (b, c)\}$ for the methods of Maximum Likelihood and Maximum Posterior Probability respectively. The enumeration of the patterns follows Table 4.1.
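The reduction from 256 columns to the 15 pattern classes of Table 4.1 can be sketched by relabelling the nucleotides of a column in order of first appearance, so that for instance AACC and GGTT both become `0011`. The lookup table below encodes the enumeration of Table 4.1; the function names and the relabelling device are ours.

```python
# Canonical form (letters relabelled by first appearance) -> pattern no. in Table 4.1
PATTERN_NUMBER = {
    "0011": 1, "0101": 2, "0110": 3,                 # two groups of two
    "0001": 4, "0010": 5, "0100": 6, "0111": 7,      # one nucleotide differs
    "0012": 8, "0122": 9, "0102": 10,                # one equal pair,
    "0120": 11, "0112": 12, "0121": 13,              # the other two differ
    "0123": 14,                                      # all different
    "0000": 15,                                      # all equal
}


def canonical_form(column):
    """Relabel the nucleotides of a 4-taxon column as 0, 1, 2, ... by first use."""
    labels = {}
    return "".join(str(labels.setdefault(base, len(labels))) for base in column)


def jc_pattern(column):
    """Pattern number (1..15, as in Table 4.1) of a column such as 'AACC'."""
    return PATTERN_NUMBER[canonical_form(column)]


def pattern_counts(alignment):
    """The 15-dimensional count vector for a 4-taxon alignment."""
    counts = [0] * 15
    for column in zip(*alignment):
        counts[jc_pattern(column) - 1] += 1
    return counts


print(jc_pattern("AACC"), jc_pattern("GATT"), jc_pattern("ACGT"))
```

The 15 classes are exactly the ways of partitioning four taxa into groups of equal nucleotides, which is why the dictionary has 15 entries.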

4.3 Paper III

Improving Divergence Time Estimation in Phylogenetics: More Taxa vs. Longer Sequences

At a phylogenetic meeting organized by NESCent (National Evolutionary Synthesis Center) in September 2006 the question was posed whether more taxa (sequences from more species) usually imply that the Maximum Likelihood estimate of the age (the divergence time) of the root gets older. The aim of this paper is to investigate the properties of ML as a method of estimating divergence times of a given rooted tree.

A difference between this paper and the two previous ones is that the rooted topology has already been estimated and is assumed known, with branch lengths proportional to the number of substitutions. Hence we have a rooted binary tree with divergence times, $t_i$, of the internal nodes and observed numbers of substitutions along branches, $y_i$ (see Figure 4.4). We restrict the study to symmetric trees, in the sense that the two subtrees of a node have equally many taxa.

Figure 4.4: A rooted symmetric binary tree with $k = 8$ taxa and $l = \log_2 k = 3$ levels. A node is at level $j$ if it has $j - 1$ nodes on the path from the root to the node. The divergence time of node $i$ is denoted $t_i$ and the observations along the branches are denoted $y_i$.

The evolution model used in this study is the Jukes-Cantor model [JC69], for which the number of substitutions along a branch is $Y_i \sim \mathrm{Po}(n r t)$, where $n$ is the sequence length, $r$ the mean substitution rate and $t$ the elapsed time between the nodes connected by the branch. The $Y_i$ are independent and the model is in the exponential family. (For more on the exponential family, see e.g. [Lin01].)

For a symmetric tree of $k$ taxa, there are $\log_2 k$ levels of internal nodes, where the root is at level 1 and a node is at level $j$ if it has $j - 1$ nodes on the path from the root to the node. Hence, at level $j$ the nodes with divergence times $t_{2^{j-1}}, \dots, t_{2^j - 1}$ are placed. With this notation the score function,

$$U_i(t) := \frac{\partial l(t)}{\partial t_i} = \frac{\partial \eta(t)}{\partial t_i} T(y) - \frac{\partial a(t)}{\partial t_i},$$

equals

$$U_i(t) = \begin{cases}
-2nr + \dfrac{y_1}{t_1 - t_2} + \dfrac{y_2}{t_1 - t_3}, & i = 1,\\[2ex]
-nr - \dfrac{y_{i-1}}{t_{[i/2]} - t_i} + \dfrac{y_{2i-1}}{t_i - t_{2i}} + \dfrac{y_{2i}}{t_i - t_{2i+1}}, & i = 2, \dots, \tfrac{k}{2} - 1,\\[2ex]
-nr - \dfrac{y_{i-1}}{t_{[i/2]} - t_i} + \dfrac{y_{2i-1} + y_{2i}}{t_i}, & i = \tfrac{k}{2}, \dots, k - 1.
\end{cases} \tag{4.1}$$

To obtain the ML estimate of $t$, the score function is set to 0 and the equations in (4.1) are solved numerically. For a fixed number of taxa $k$, since the model is in the exponential family, it follows that the estimate $\hat t^{(ML)}$ is consistent: $\hat t^{(ML)} \to t$ as $n \to \infty$. The asymptotic variance of $\hat t^{(ML)}$ equals the inverse of the expected information matrix,

$$I_{ij}(t) = -E\left(\frac{\partial U_i(t)}{\partial t_j}\right),$$

which can be shown to equal

$$I_{ij}(t) = \begin{cases}
\dfrac{nr}{t_1 - t_2} + \dfrac{nr}{t_1 - t_3}, & i = j = 1,\\[2ex]
\dfrac{nr}{t_{[i/2]} - t_i} + \dfrac{nr}{t_i - t_{2i}} + \dfrac{nr}{t_i - t_{2i+1}}, & i = j,\ i = 2, \dots, \tfrac{k}{2} - 1,\\[2ex]
\dfrac{nr}{t_{[i/2]} - t_i} + \dfrac{2nr}{t_i}, & i = j,\ i = \tfrac{k}{2}, \dots, k - 1,\\[2ex]
-\dfrac{nr}{t_j - t_i}, & j = [i/2],\\[2ex]
\dfrac{nr}{t_j - t_i}, & j \in \{2i, 2i + 1\},
\end{cases} \tag{4.2}$$

where $i = 2, \dots, k - 1$ in the last row. Each entry of the inverse of $I(t)$ will have a factor $1/n$, and hence the variance of the ML estimate will be reduced by a factor 2 if the sequence length is doubled.

The Maximum Likelihood method is time consuming. Solving the score equations numerically is a non-trivial task, and so is finding the inverse of $I(t)$ numerically. We therefore compare the method of Maximum Likelihood with the much simpler and faster method of Mean Path Length (see section 3.1), which works with the Jukes-Cantor model. The variance of the estimate of the divergence time of the root, $t_1$, can be expressed as

$$V\left(\hat t_1^{(MPL)}\right) = \frac{1}{nr}\left(\sum_{j=1}^{\log_2 k - 1} 2^{-2j} \sum_{i=2^j - 1}^{2^{j+1} - 2} \left(t_{[\frac{i+1}{2}]} - t_{i+1}\right) + \frac{1}{k^2} \sum_{i=k-1}^{2k-2} t_{[\frac{i+1}{2}]}\right), \tag{4.3}$$

where $j$ denotes the level of the node, $k$ the number of taxa and $[\,\cdot\,]$ the integer part. To compare the variance of the MPL estimate with the variance of the ML estimate of the age of the root, (4.3) should be compared to the top left element of the inverse of the expected information (4.2). From (4.3) we see that also here the variance is reduced by a factor 2 if the sequence length is doubled.

In this study we have considered three kinds of trees: (1) equidistant complete symmetric, (2) complete symmetric and (3) symmetric trees (see Figure 4.5). In the complete symmetric trees all nodes on the same level diverged at the same time. In the equidistant case we further require the times between speciation events to be equal, that is, all branches are of the same length. In the symmetric trees the two subtrees of a node have equally many nodes, but the divergence times may be different for all nodes. When doing the inference, the type of tree is of course assumed to be unknown.

Figure 4.5: The three different types of trees considered in the simulations. To the left is the equidistant complete symmetric case, where all nodes at the same level diverged at the same time and the times between the levels are equal. In the middle is the complete symmetric case, where the times between the levels do not need to be equal. To the right is the third type, which is symmetric in the sense that the two subtrees of a node have equally many nodes, but the times of divergence may differ between nodes.

For the equidistant case we have shown the following theorem.

Theorem. If the number of taxa in a rooted phylogenetic tree is $k = 2^l$ and all branches are of the same length ($t_1 / \log_2 k$), where $t_1$ is the age of the root, then

$$V\left(\hat t_1^{(MPL)}\right) = V\left(\hat t_1^{(ML)}\right) = I^{-1}_{11} = \frac{t_1}{n r \log_2 k} \cdot \frac{k - 1}{k}, \tag{4.4}$$

where $I$ is the information matrix.

Hence $V(\hat t_1^{(ML)}) = V(\hat t_1^{(MPL)})$ in the equidistant complete symmetric case and both estimators are efficient.
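One way to see the MPL half of the theorem is the short calculation below. It is a sketch under our own reading of the setup, not the proof from Paper III: we assume the estimator is $\hat t_1^{(MPL)} = p_1 / (nr)$, where $p_1$ is the mean path length of the root, and we write $\delta = t_1 / l$ for the common branch duration and $B_d$ for the set of branches ending $d$ branches below the root (both symbols are ours).

```latex
% Equidistant complete symmetric tree with k = 2^l taxa.
% Each branch in B_d lies on k / 2^d of the k root-to-leaf paths,
% so it enters the mean path length with weight 2^{-d}.
\begin{align*}
p_1 &= \sum_{d=1}^{l} 2^{-d} \sum_{b \in B_d} Y_b,
  && |B_d| = 2^{d}, \quad Y_b \sim \mathrm{Po}(n r \delta) \text{ independent},\\
V(p_1) &= \sum_{d=1}^{l} 2^{-2d} \cdot 2^{d} \cdot n r \delta
        = n r \delta \left(1 - 2^{-l}\right)
        = n r \delta \, \frac{k-1}{k},\\
V\!\left(\hat t_1^{(MPL)}\right) &= \frac{V(p_1)}{(n r)^2}
        = \frac{\delta}{n r} \cdot \frac{k-1}{k}
        = \frac{t_1}{n r \log_2 k} \cdot \frac{k-1}{k}.
\end{align*}
```

The last expression agrees with (4.4). The equality with the ML variance requires inverting the information matrix and is not reproduced here.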

For the complete symmetric case we have, in addition, verified that $V(\hat t_1^{(ML)}) = V(\hat t_1^{(MPL)})$ for the number of taxa $k \in \{4, 8, 16\}$, and simulations indicate that this holds in general also for larger $k$. For a symmetric tree, where all nodes have individual divergence times, our simulation results indicate that ML estimates the divergence time of the root with slightly higher precision than the method of MPL. The differences between the estimates of the two methods, and between the corresponding variances, are small though.

When estimating the divergence times for internal nodes, ML uses all observations of the entire tree. MPL only takes the observations along the paths from the node to the descending taxa, so MPL uses less information than ML. For a node located close to the root the two methods use almost the same amount of information, and for the root exactly the same. The estimates, as well as their precisions, should therefore be close. For a node closer to the taxa, MPL only uses part of the information and the precision is then lower than for ML.

The question posed was whether more taxa cause ML to overestimate the divergence time of the root or not. The answer is no if $k$ is increased in a nice way. From (4.4), the variance could be reduced by a factor 2 by doubling the sequence length $n$, or by roughly a factor 2 by squaring the number of taxa $k$. For fixed $k$, ML is consistent and therefore does not overestimate the age of the root systematically. However, since we only have finite sequences, we cannot let $k \to \infty$. If $k > n$ there is not enough data to estimate the number of substitutions along branches. We therefore consider only situations where $k$ is increased in a nice way and not too fast, so that the sequence length is always much larger than the number of taxa.
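The trade-off between sequence length and number of taxa in (4.4) can be checked numerically. The sketch below evaluates the variance formula for illustrative parameter values (the numbers `T1`, `N`, `R` and `K` are assumptions chosen for the example, not values from Paper III). It shows that doubling $n$ halves the variance exactly, while squaring $k$ reduces it by the factor $2k/(k+1)$, which approaches 2 as $k$ grows.

```python
import math


def root_variance(t1, n, r, k):
    """Variance (4.4) of the root-age estimate for an equidistant complete
    symmetric tree; k is assumed to be a power of two, k >= 2."""
    return t1 / (n * r * math.log2(k)) * (k - 1) / k


# Illustrative values (assumed): root age, sequence length, rate, taxa.
T1, N, R, K = 20.0, 1000, 0.03, 8

BASE = root_variance(T1, N, R, K)
RATIO_DOUBLE_N = BASE / root_variance(T1, 2 * N, R, K)
RATIO_SQUARE_K = BASE / root_variance(T1, N, R, K * K)

print(f"doubling n reduces the variance by a factor {RATIO_DOUBLE_N:.3f}")
print(f"squaring k reduces the variance by a factor {RATIO_SQUARE_K:.3f}")
```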
4.4 Paper IV

Consistent estimation of divergence times in phylogenetic trees with local molecular clocks

The Mean Path Length method (see sections 3.1 and 4.3) is the basis of the PATHd8 algorithm [BAJ+07] for estimating divergence times of a given tree. The main difference between MPL and PATHd8 is that the latter allows several calibration nodes and thereby corrects for deviations from the molecular clock assumption (see section 3.1). The purpose of this paper is to investigate for which families of models the PATHd8 algorithm estimates the divergence times consistently.

PATHd8 estimates divergence times of a given rooted, not necessarily binary, tree with branch lengths proportional to the number of substitutions for DNA sequences of length $n$ by first calculating the MPLs of all nodes. A node can be defined either as a fixed age node, with a calibration time known e.g. from fossils, or as a reference node, for which a minimum age or a maximum age or both are given, or as a usual node, for which no age constraint is set. The age of a non fixed age node is estimated by a weighted relative MPL of the node of interest, $x$, and the MPLs and ages of the closest fixed age node located closer to the root and of adjacent fixed age nodes. The weights are defined by the size of the subtree that has its root in the fixed age node. The algorithm checks the estimates of the constrained nodes and adjusts them to the given minimum age or maximum age if the original estimate is too small or too large respectively. The corrected nodes are then considered as fixed age nodes and the estimation of the ages of the non fixed age nodes is redone. In this way the algorithm corrects for deviations from the molecular clock and smooths the substitution rates for sister groups. One of its advantages is that it is very fast, even for very large trees.

The method of MPL implicitly assumes a global molecular clock. The PATHd8 algorithm divides the tree into segments, defined by the fixed age nodes. The implicit assumption of PATHd8 is a local molecular clock model, where a fixed substitution rate is assumed within a segment but the rates can differ between segments.

Figure 4.6 describes the evolutionary history of 14 present species. The ages $\{a_0, a_1, a_4 = a_3, a_8 = a_5\}$ of the root $\alpha_0$ and the nodes $\alpha_1$, $\alpha_4$ and $\alpha_8$ are known; all other divergence times $a_i$ remain to be estimated.

Figure 4.6: The true time tree, where the ages of the nodes $\alpha_0$, $\alpha_1$, $\alpha_4$ and $\alpha_8$ are assumed to be known. The aim is to estimate the divergence times of the other nodes, indicated here by node $\alpha_5$. (Legend: segment rates $r_0 = 0.03$, $r_1 = 0.02$, $r_2 = 0.01$, $r_3 = 0.04$; separate markers for fixed age nodes and for the node of interest.)
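The first step described above, calculating the MPL of every node, can be sketched with a simple recursion. We take the MPL of a node to be the average, over the leaves descending from it, of the summed branch lengths from the node to the leaf. The tree encoding (a node as a list of `(branch_length, child)` pairs, a leaf as an empty list) and the toy branch lengths are our own choices for illustration and are not taken from PATHd8.

```python
def mpl_and_leaves(node):
    """Return (mean path length, number of leaves) for the subtree at `node`.

    A node is a list of (branch_length, child) pairs; a leaf is an empty list.
    Nodes may have more than two children.
    """
    if not node:
        return 0.0, 1
    total, leaves = 0.0, 0
    for length, child in node:
        child_mpl, child_leaves = mpl_and_leaves(child)
        # Every leaf below the child passes through this branch.
        total += child_leaves * (length + child_mpl)
        leaves += child_leaves
    return total / leaves, leaves


# Toy tree with invented branch lengths: the root has an internal child
# (branch 3.0, with leaves at 1.0 and 2.0) and a leaf child (branch 4.0).
LEAF = []
INNER = [(1.0, LEAF), (2.0, LEAF)]
ROOT = [(3.0, INNER), (4.0, LEAF)]

ROOT_MPL, N_LEAVES = mpl_and_leaves(ROOT)
print(ROOT_MPL, N_LEAVES)
```

Weighting each child by its number of leaves is what makes a branch near the root count once for every taxon below it.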
In PATHd8 the fixed age nodes $\{\alpha_0, \alpha_1, \alpha_4, \alpha_8\}$ are the roots of the segments $S_0 = \{\alpha_0, \alpha_2, \alpha_5, \alpha_{11}, \alpha_{12}, t_1, t_2, t_4, t_5, t_{11}, t_{12}, t_{23}, \dots, t_{26}\}$, $S_1 = \{\alpha_1, \alpha_3, \alpha_6, \alpha_7, t_3, t_6, t_7, t_8, t_{13}, \dots, t_{16}\}$, $S_2 = \{\alpha_8, t_{17}, t_{18}\}$ and $S_3 = \{\alpha_4, \alpha_9, \alpha_{10}, t_9, t_{10}, t_{19}, \dots, t_{22}\}$. The local clock assumption assigns rates $\{r_0, r_1, r_2, r_3\}$ that are fixed, but unknown, within the segments. Let $y_i$ denote the observed number of substitutions along a branch with elapsed time $t_i$. The MPLs, $p_i$, of the root $\alpha_0$ and the node $\alpha_5$ will be

$$\frac{p_0}{n} = \frac{1}{14n}\Bigl(6y_1 + 8y_2 + 4(y_3 + y_4 + y_5) + 2(y_6 + \dots + y_{12}) + y_{13} + \dots + y_{26}\Bigr)$$
$$\xrightarrow[n \to \infty]{} \frac{1}{14}\Bigl(r_0\bigl(4(2a_0 - a_3) + 6(a_0 - a_1)\bigr) + r_1\bigl(2(3a_1 - a_5)\bigr) + 2r_2 a_5 + 4r_3 a_3\Bigr),$$

$$\frac{p_5}{n} = \frac{1}{4n}\Bigl(2(y_{11} + y_{12}) + y_{23} + \dots + y_{26}\Bigr) \xrightarrow[n \to \infty]{} r_0 a_3.$$

If a global molecular clock is valid, $r_0 = r_1 = r_2 = r_3$, then

$$\frac{p_0}{n} \to \frac{r_0}{14}\bigl(14a_0 - 4a_3 - 6a_1 + 6a_1 - 2a_5 + 2a_5 + 4a_3\bigr) = r_0 a_0.$$

The estimate of the divergence time of $\alpha_5$ is $\hat a_5 = (p_5 / p_0)\, a_0$, and with a global clock

$$\frac{p_5}{p_0}\, a_0 \xrightarrow[n \to \infty]{} \frac{r_0 a_3}{r_0 a_0}\, a_0 = a_3,$$

that is, $\hat a_5$ is a consistent estimate.

Now, if the global clock assumption is violated but the local clock is valid, so that at least one $r_i$ differs from the others, the MPL of the root, $p_0$, will not be consistent. The MPL of the root depends on all observations, those in the same segment as the root as well as those in other segments. The size and direction of the bias in $p_0$ depend on the size and directions of the differences in rates. Since all observations contributing to the MPL $p_5$ of $\alpha_5$ are within the same segment, $p_5$ will still be consistent. The estimate of the age of the node, $\hat a_5$, will hence not be unbiased.

To avoid this inconsistency we suggest a change of the algorithm: to move the weighted averaging from the age estimating part to the calculation of the MPLs. We suggest that an adjusted MPL (aMPL) is used instead, thereby only considering the observations in the same segment as the node of interest. When calculating the adjusted MPL of a node $\alpha_i$, the observations along paths within the segment are weighted according to the size of the subtrees descending from the fixed age nodes that are located further down (closer to the terminal taxa). A path from $\alpha_i$ that ends in another fixed age node is, with this weighting, blown up to about what it would have been if the path had ended in a leaf, that is, if the local rate had been a global one.
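The effect of a violated global clock on the ratio estimate $\hat a_5 = (p_5 / p_0)\, a_0$ can be illustrated by plugging numbers into the limits given above. The sketch uses the limit expressions for $p_0 / n$ and $p_5 / n$ exactly as stated in the text and the segment rates shown in Figure 4.6. The ages in `AGES` are assumed values chosen for illustration, and the function names are ours.

```python
import math


def limit_p0(a0, a1, a3, a5, r0, r1, r2, r3):
    """Large-n limit of p_0 / n for the tree in Figure 4.6 (formula from the text)."""
    return (r0 * (4 * (2 * a0 - a3) + 6 * (a0 - a1))
            + r1 * 2 * (3 * a1 - a5)
            + 2 * r2 * a5
            + 4 * r3 * a3) / 14


def limit_age_alpha5(a0, a1, a3, a5, r0, r1, r2, r3):
    """Large-n limit of (p_5 / p_0) * a_0, with p_5 / n -> r0 * a3 as in the text."""
    return r0 * a3 / limit_p0(a0, a1, a3, a5, r0, r1, r2, r3) * a0


# Assumed ages, for illustration only.
AGES = dict(a0=20.0, a1=15.0, a3=10.0, a5=5.0)

# Global clock: all segments share one rate.
GLOBAL = limit_age_alpha5(**AGES, r0=0.03, r1=0.03, r2=0.03, r3=0.03)
# Local clocks: the segment rates shown in Figure 4.6.
LOCAL = limit_age_alpha5(**AGES, r0=0.03, r1=0.02, r2=0.01, r3=0.04)

print(f"limit under a global clock: {GLOBAL:.3f}")
print(f"limit under local clocks:   {LOCAL:.3f}")
```

With equal rates the limit equals $a_3$, as in the derivation above. With the unequal rates the root MPL no longer converges to $r_0 a_0$, so the ratio estimate converges to a different value.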
The aMPL is consistent and unbiased, but the uncertainty is larger than for MPL since fewer observations are used. Hence the age estimate, using aMPL instead of
