On Estimating Topology and Divergence Times in Phylogenetics


UPPSALA DISSERTATIONS IN MATHEMATICS 55

On Estimating Topology and Divergence Times in Phylogenetics

Bodil Svennblad

Department of Mathematics
Uppsala University
UPPSALA 2008


To dad


List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Erixon, P., Svennblad, B., Britton, T. and B. Oxelman (2003) Reliability of Bayesian Posterior Probabilities and Bootstrap Frequencies in Phylogenetics. Systematic Biology, 52(5):

II Svennblad, B., Erixon, P., Oxelman, B. and T. Britton (2006) Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics. Systematic Biology, 55(1):

III Svennblad, B. and T. Britton (2007) Improving Divergence Time Estimation in Phylogenetics: More Taxa vs. Longer Sequences. Statistical Applications in Genetics and Molecular Biology, vol 6: Iss. 1, Article 35.

IV Svennblad, B. (2007) Consistent estimation of divergence times in phylogenetic trees with local molecular clocks. Submitted.

V Svennblad, B. (2008) On how to specify non-informative priors on branch lengths in phylogenetics. Submitted.

Reprints were made with permission from the publishers.


Contents

1 Introduction
1.1 Data
1.2 Algorithms and Methods for Phylogenetic Inference
2 Likelihood based methods
2.1 Substitution Models
2.2 Maximum Likelihood
2.2.1 Point estimate of topology
2.2.2 Confidence measure
2.3 Bayesian inference
2.3.1 Point estimate of topology and confidence measure
2.3.2 Priors
3 Divergence times estimation
3.1 Algorithms for dating nodes
4 Summary of Papers
4.1 Paper I
4.2 Paper II
4.3 Paper III
4.4 Paper IV
4.5 Paper V
5 Example: Phylogeny of Primates
Sammanfattning på svenska (Summary in Swedish)
Acknowledgments
Bibliography


1. Introduction

A phylogenetic tree describes the evolutionary history among organisms or entities that share a common ancestor. It is assumed that evolution proceeds in a tree-like manner, with all lineages evolving independently. Phylogenetics has long been used by biologists interested in how existing organisms are related. It is also used to study the evolution of languages (e.g. [GJ00]), manuscript texts [SDBH04] and cultural artefacts [MH05], and to trace the origin of HIV (e.g. [GMI+90], [RRP+01]). Phylogenetic analysis has even been used in criminal prosecutions as evidence of responsibility for HIV transmission. For this purpose phylogenetics should be used with care, as the direction and timing of the HIV transmission are hard to determine from the phylogenetic analysis [BAV+07]. In this thesis we restrict the discussion to phylogenetic trees for use in systematics of organisms.

In the biological context phylogenetic trees were first implied by Charles Darwin in the book On the Origin of Species in 1859. Ernst Haeckel coined the term phylogeny and created the first published phylogenetic tree in 1866 in the book Monophyletischer Stammbaum der Organismen.

1.1 Data

In the early days of phylogenetic reconstruction only morphological data were available. Today various genome projects have sequenced almost the entire genome of several species (e.g. human, mouse and the bacterium E. coli, Genome/ hg5yp/index.shtml). DNA consists of two complementary chains twisted around each other to form a right-handed helix. Each chain is a sequence built from four nucleotides: A (adenine), G (guanine), C (cytosine) and T (thymine). Each nucleotide in one chain is bound to a nucleotide in the other, always so that A in one chain binds to T in the other, and C binds to G. When analyzing DNA sequences, therefore, only the sequence of one chain is needed.

Evolution occurs by changes in the DNA sequence. Changes can occur through [Li97, chapter 1]

i. substitution (replacement) of a single nucleotide,
ii. insertion of one or more nucleotides,
iii. deletion of one or more nucleotides,
iv. inversion, where a segment of the DNA sequence is inverted,
v. recombination, which includes crossing-over and gene conversion.

Figure 1.1 visualizes a hypothetical evolution of four species through different types of changes in the DNA sequence.

Figure 1.1: Hypothetical evolution of 4 species, where changes in the DNA sequence of the common ancestor at the top are substitutions (i), insertion (ii), deletion (iii) and inversion (iv), leading to the DNA sequences of the present species {a, b, c, d}.

The evolution in Figure 1.1 can be summarized as a matrix of the DNA sequences of the four species {a, b, c, d}. The matrix should be aligned so that it is possible to follow the evolution site by site (columnwise). Figure 1.1 is summarized in matrix (1.1).

species   column (site)
a         A C C T T C G - - C T T
b         A C C T T C G G A T T C
c         A C C A - T G G A C C T
d         T C C A - A G G A C T T        (1.1)

The example above is presented as if evolution were actually observable, which of course is not the case. With real data, only the sequences ACCTTCGCTT, ACCTTCGGATTC, ACCATGGACCT and TCCAAGGACTT for the four species a, b, c and d respectively are observed. When trying to reconstruct the evolutionary history, matrix (1.1) has to be estimated. How to align the sequences is outside the scope of this thesis. Here we only consider matrices with no gaps; that is, we assume we have a data matrix X where each row represents the DNA sequence of one of the species and each element in the matrix is one of the nucleotides {A, C, T, G}.

1.2 Algorithms and Methods for Phylogenetic Inference

Assuming that the species from which the data matrix X is obtained have a common ancestor and that evolution can be described in a tree-like manner, we want an estimate of the phylogeny that describes the evolutionary relationships between the species. There are several algorithms that either create a tree from X by using pairwise distances between the species, or choose one tree among several possible ones as the best according to some optimization criterion. We will here briefly describe some of them. In the next chapter the so-called likelihood based methods are described in more detail.

Mathematically, a graph is a set of vertices and a set of edges connecting them. A tree is a connected graph with no cycles. The degree of a vertex is the number of edges connected to it. We will mainly consider binary trees, for which all vertices, except the root (if the tree is rooted) and the leaves, are of degree 3. In this thesis, we adopt the language used among biologists and talk about nodes instead of vertices and branches instead of edges. The external nodes (tips, leaves, taxa) represent the living organisms for which we have DNA sequences. The internal nodes represent ancestors for which usually no sequence data are available. The branching order of a tree is called the topology.

In phylogenetic analyses we try to estimate how far the different species are from each other based on the DNA sequences in X. For example, define the distance $d_{ij}$ between the pair $i$ and $j$ in X to be

$$d_{ij} = \frac{\#\{\text{sites } k \text{ where } x_{ik} \neq x_{jk}\}}{n},$$

where $n$ is the sequence length and $k = 1, \ldots, n$.
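The distance $d_{ij}$ defined above is simple enough to sketch directly. The following is a minimal illustration, assuming gap-free sequences of equal length stored as Python strings; the function names and the toy alignment are hypothetical and not taken from the thesis.

```python
def p_distance(seq_i, seq_j):
    """Proportion of sites at which two aligned, gap-free sequences differ."""
    if len(seq_i) != len(seq_j):
        raise ValueError("sequences must be aligned to the same length")
    n = len(seq_i)
    differing = sum(1 for a, b in zip(seq_i, seq_j) if a != b)
    return differing / n


def distance_matrix(X):
    """Matrix D = (d_ij) of pairwise distances for a list of aligned sequences."""
    return [[p_distance(row_i, row_j) for row_j in X] for row_i in X]


# Hypothetical gap-free toy alignment of four species (rows of X).
X = ["ACCTTCGG", "ACCTTCGA", "ACCATGGA", "TCCAAGGA"]
D = distance_matrix(X)
```

The matrix is symmetric with zeros on the diagonal, so in practice only the upper triangle needs to be stored.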
From the distance matrix $D = (d_{ij})$ a phylogenetic tree can be built by clustering methods like UPGMA [SM58] or Neighbour-joining (NJ) [SN87]. The UPGMA and NJ algorithms each construct a single tree. There are other methods where all possible trees are compared and one is chosen as the estimate, based on some optimization criterion. One such method is parsimony, where the estimate of the evolutionary tree is the one that minimizes the number of substitutions needed to obtain X. The idea was introduced by Edwards and Cavalli-Sforza ([ECS63], [CSE67]), but for allele frequencies. Fitch was the first to present an algorithm for calculating the minimal number of substitutions for a given tree [Fit71].

The number of unrooted bifurcating trees for $k$ species is $(2k-5)!! = (2k-5)(2k-7)\cdots 5 \cdot 3$. If we consider rooted trees, the number of possible topologies for $k$ species is $(2k-3)!!$ [DEKM98, p. 164]. For even a moderate number of species, like $k = 10$, there are about 2 million unrooted, or about 34 million rooted, topologies. Calculating the parsimony lengths for all trees is then very time consuming. There are however algorithms, like branch and bound (see e.g. [Fel04, chapter 5]), that find the best tree without going through every single one.
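The double-factorial counts quoted above can be checked numerically. This is a small sketch; the helper names are hypothetical.

```python
def double_factorial(m):
    """m!! = m (m - 2) (m - 4) ... down to 1 or 2; equals 1 for m <= 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def n_unrooted_topologies(k):
    """Number of unrooted bifurcating trees for k >= 3 species: (2k - 5)!!."""
    if k < 3:
        raise ValueError("need at least 3 species")
    return double_factorial(2 * k - 5)


def n_rooted_topologies(k):
    """Number of rooted bifurcating trees for k >= 2 species: (2k - 3)!!."""
    if k < 2:
        raise ValueError("need at least 2 species")
    return double_factorial(2 * k - 3)


print(n_unrooted_topologies(10))  # 2027025, about 2 million
print(n_rooted_topologies(10))    # 34459425, about 34 million
```

The rooted count for $k$ species equals the unrooted count for $k + 1$ species, since rooting a tree corresponds to attaching one extra leaf at the root position.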

2. Likelihood based methods

In this chapter we explain the two probabilistic methods most commonly used today in phylogenetic inference: Maximum Likelihood and Bayesian inference. Both use the likelihood function to extract information from data, and choose the topology among all possible ones according to either the value of the likelihood or the posterior probability. To calculate the likelihood function a model of evolution is needed.

2.1 Substitution Models

We assume that the sites (positions in the DNA sequence) evolve independently. At a particular site the substitutions are described by a Markov process with the four nucleotides as the states of the chain. Let $q_{ij}$ denote the rate of change from base $i$ to base $j$ during some infinitesimal time period $dt$, that is $P(X(t+dt) = j \mid X(t) = i) = q_{ij}\,dt$ for $i \neq j$. For DNA sequences these rates can be expressed as a matrix Q,

$$
Q = \begin{pmatrix}
\cdot & \mu a \pi_C & \mu b \pi_G & \mu c \pi_T \\
\mu g \pi_A & \cdot & \mu d \pi_G & \mu e \pi_T \\
\mu h \pi_A & \mu j \pi_C & \cdot & \mu f \pi_T \\
\mu i \pi_A & \mu k \pi_C & \mu l \pi_G & \cdot
\end{pmatrix}, \tag{2.1}
$$

where the rows (and columns) correspond to the bases A, C, G and T respectively ([SOWH96, chapter 11]). Each row sum should be 0, so the diagonal elements $q_{ii}$ are minus the sum of the other entries in the row, $q_{ii} = -\sum_{j \neq i} q_{ij}$. Thus $-q_{ii}$ is the rate at which the Markov chain leaves state $i$. The parameter $\mu$ is the mean instantaneous substitution rate. To allow for different rates between the states, $\mu$ is modified by the parameters $a, \ldots, l$. The parameters $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ are the equilibrium frequencies of the respective nucleotide bases. The transition-probability matrix over any time $t > 0$, $P(t) = \{p_{ij}(t)\}$, where $p_{ij}(t) = P(X(t) = j \mid X(0) = i)$, is the solution to the differential equation

$$\frac{dP(t)}{dt} = P(t)Q,$$

which equals

$$P(t) = e^{Qt}. \tag{2.2}$$

The matrix Q in (2.1) is written in the most general form. Almost all DNA substitution models currently used are time-reversible models, assuming the same amount of change from state $i$ to $j$ as from $j$ to $i$. This assumption can be written as $\pi_i q_{ij} = \pi_j q_{ji}$, where $\pi_i$ is the proportion of time the Markov chain spends in state $i$ and $\pi_i q_{ij}$ is the amount of "flow" from state $i$ to state $j$. Time-reversibility implies that $a = g$, $b = h$, $c = i$, $d = j$, $e = k$ and $f = l$. The model is then called the general time-reversible model (GTR).

Time-reversibility does not imply that Q is symmetric. Two models for which Q is symmetric, though, are Jukes-Cantor (JC) [JC69] and Kimura's 2-parameter model (K2P) [Kim80]. Since $q_{ij} = q_{ji}$ for those models, $\pi_A = \pi_C = \pi_G = \pi_T = 1/4$. The simplest possible model of evolution, Jukes-Cantor, assumes that when a change occurs all bases are equally probable, that is $a = b = \cdots = f$ in (2.1). K2P assumes that changes defined as transitions (see Figure 2.1) occur with one rate and transversions with another rate, that is $b = e$ and $a = c = d = f$.

Figure 2.1: Transitions are substitutions between A and G or between T and C. All other substitutions are called transversions.

A generalization of K2P is the HKY85 model, where different equilibrium base frequencies are allowed [HKY85] (see also section 5). More complex models let rates vary between sites, by assuming the rate for any site to be a random variable drawn from a statistical distribution. The most commonly used distribution is the gamma distribution with shape parameter $\alpha$ and scale parameter $1/\alpha$. For more detailed descriptions of nucleotide substitution models see e.g. [Yan06, chapter 1] and [SOWH96, chapter 11].
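To make (2.1) and (2.2) concrete, the following sketch builds a time-reversible rate matrix and evaluates $P(t) = e^{Qt}$ numerically. It is an illustration under stated assumptions rather than the method of any particular software: the function names are hypothetical, the matrix exponential is computed by a plain scaled Taylor series with repeated squaring, and only the standard library is used. For Jukes-Cantor with the parametrization in (2.1), all off-diagonal rates equal $\mu/4$, which gives the closed form $p_{ii}(t) = 1/4 + (3/4)e^{-\mu t}$ used as a check.

```python
import math


def gtr_rate_matrix(mu, rates, freqs):
    """Rate matrix Q of (2.1) under time-reversibility.

    rates = (a, b, c, d, e, f) for the base pairs AC, AG, AT, CG, CT, GT;
    freqs = (pi_A, pi_C, pi_G, pi_T). Rows and columns are ordered A, C, G, T.
    """
    a, b, c, d, e, f = rates
    s = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    Q = [[mu * s[i][j] * freqs[j] if i != j else 0.0 for j in range(4)]
         for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i])  # each row sums to zero
    return Q


def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, squarings=10, terms=12):
    """P(t) = exp(Qt): Taylor series for exp(Qt / 2^s), then s squarings."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]
    for m in range(1, terms + 1):
        term = [[x / m for x in row] for row in mat_mul(term, M)]  # M^m / m!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P


def jc_same_base_probability(t, mu=1.0):
    """Closed-form Jukes-Cantor p_ii(t) for Q as in (2.1) with pi = 1/4."""
    return 0.25 + 0.75 * math.exp(-mu * t)


Q_jc = gtr_rate_matrix(1.0, (1, 1, 1, 1, 1, 1), (0.25, 0.25, 0.25, 0.25))
P_jc = transition_matrix(Q_jc, 0.5)
```

Setting the rates to $(1, \kappa, 1, 1, \kappa, 1)$ with unequal base frequencies gives an HKY85-type matrix, for which the detailed-balance condition $\pi_i p_{ij}(t) = \pi_j p_{ji}(t)$ can be verified numerically in the same way.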

2.2 Maximum Likelihood

The method of Maximum Likelihood is a well-founded method in statistical inference in general, popularized by R.A. Fisher in the 1920s. For phylogenies, Edwards and Cavalli-Sforza introduced Maximum Likelihood [ECS64], but for gene frequency data. Felsenstein [Fel81] brought the method to phylogenetic inference based on nucleotide sequences.

2.2.1 Point estimate of topology

Assume that we have a $k \times n$ matrix, X, with aligned DNA sequences from $k$ species, without insertions or deletions. For a given tree topology $\tau$, the likelihood function $L(\tau, b^{(\tau)}, \theta \mid X)$, where $b^{(\tau)}$ are the corresponding branch lengths and $\theta$ the parameters of the substitution model, can be calculated as the product over sites of the probability of the observed data at each site. The likelihood for each topology $\tau_i$ is maximized with respect to $b^{(\tau_i)}$ and $\theta$. The estimated topology $\hat\tau^{(ML)}$ is the topology with the largest maximized likelihood (see Figure 2.2), that is

$$\hat\tau^{(ML)} = \arg\max_i \left\{ \max_{b, \theta} L(\tau_i, b^{(\tau_i)}, \theta \mid X) \right\}.$$

Figure 2.2: For each topology the likelihood function is maximized with respect to branch lengths and model parameters, here visualized as if branches were univariate and without model parameters (for 5 species the space of branches is 7-dimensional). The topology with the largest maximized likelihood is the one chosen as the estimate of the phylogeny, $\hat\tau = \tau_1$ in this example.
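The likelihood for one fixed topology and fixed branch lengths can be sketched as follows. This is a minimal illustration, assuming the Jukes-Cantor model with $\mu = 1$ and using the standard pruning recursion over the tree; the tree encoding (a leaf is a species name, an internal node is a list of `(child, branch_length)` pairs) and all function names are hypothetical choices made for this example. It only evaluates the likelihood; the maximization over branch lengths and the comparison between topologies described above are not implemented here.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, t, mu=1.0):
    """Jukes-Cantor p_ij(t) for Q as in (2.1) with equal rates and pi = 1/4."""
    decay = math.exp(-mu * t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def partial_likelihoods(node, site):
    """For each base s, the probability of the data below `node` given state s."""
    if isinstance(node, str):  # leaf: the observed base has probability one
        return [1.0 if s == site[node] else 0.0 for s in BASES]
    result = [1.0] * 4
    for child, length in node:
        below = partial_likelihoods(child, site)
        for si, s in enumerate(BASES):
            result[si] *= sum(jc_prob(s, c, length) * below[ci]
                              for ci, c in enumerate(BASES))
    return result


def site_likelihood(tree, site):
    """Probability of one alignment column, with root frequencies 1/4."""
    return sum(0.25 * p for p in partial_likelihoods(tree, site))


def log_likelihood(tree, alignment):
    """Sum of log site likelihoods; sites are assumed to be independent."""
    names = list(alignment)
    n = len(alignment[names[0]])
    total = 0.0
    for k in range(n):
        site = {name: alignment[name][k] for name in names}
        total += math.log(site_likelihood(tree, site))
    return total


# Hypothetical four-species tree ((a, b), (c, d)) with arbitrary branch lengths.
quartet = [([("a", 0.1), ("b", 0.2)], 0.05), ([("c", 0.1), ("d", 0.3)], 0.05)]
toy_alignment = {"a": "ACGT", "b": "ACGT", "c": "ACGA", "d": "ACTA"}
toy_log_l = log_likelihood(quartet, toy_alignment)
```

Because the model is time-reversible, the placement of the root along the branch between the two cherries does not change the value returned by `site_likelihood`.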

2.2.2 Confidence measure

The method described above gives an estimate of the topology. As always with estimates, it is of interest to know how certain the estimate is (compare with e.g. standard deviations and confidence intervals in classical statistics). Is the estimated tree outstanding in describing the data at hand, or are there other tree topologies with almost the same maximized likelihood value? (See Figure 2.3 for a visualization of the uncertainty problem.) Felsenstein suggested a nonparametric bootstrap procedure to obtain a measure of confidence [Fel85]. The procedure works with parsimony and other nonprobabilistic algorithms, but also with the Maximum Likelihood method.

Figure 2.3: The likelihood functions for two topologies, visualized as if dependent on a univariate branch (in reality, if more than one topology is possible the branch space is at least 5-dimensional). In both cases $\hat\tau = \tau_1$, as $\tau_1$ has the largest maximized likelihood. To the left, there is another topology with almost as large a maximized likelihood, while at the right-hand side the maximized likelihood of the other topology is much smaller than for $\tau_1$.

Efron introduced the idea of the bootstrap in the Annals of Statistics in 1979. The technique is a general method to create measures of uncertainty and bias. Assume we have a sample from an unknown density $f$ and an unknown parameter $\theta = g(f)$ (e.g. the mean). The parameter is estimated from the sample points (e.g. $\hat\theta = \bar{x}$). If the density $f$ is known, the distribution of the estimator can be derived. The variance of the estimator can be calculated or estimated, and hence a confidence interval describing the uncertainty of $\hat\theta$ may be calculated. If $f$ is unknown but new samples could be drawn from $f$, new estimates $\hat\theta_i$ would be obtained. These could then be used to estimate the variance of the original estimate.

Usually, we have only our original sample and cannot draw new samples from the unknown distribution. The idea of the bootstrap is that we hopefully have a representative sample, so that the empirical distribution $F_n$ is approximately equal to the true $F$ (see Figure 2.4). Drawing samples from $F_n$ is then approximately equal to drawing samples from $F$. This is done by creating a pseudo-sample by drawing data points from the original sample with replacement until the number of observations in the new sample equals the number of observations in the original one. Repeating this a large number of times gives many estimates of the parameter. These estimates can be seen as a sample of the estimator, and the variance of the estimator is estimated by the sample variance.

Figure 2.4: To the left is the unknown density from which the sample is drawn. Data points in the sample are marked with circles. To the right is the true probability distribution (solid line) and the empirical probability distribution (dashed line).

Bootstrap for phylogenetic inference with Maximum Likelihood (or another estimation method) works as follows. Denote the estimate of the topology by $\hat\tau^{(ML)}$, which is obtained as described in section 2.2.1 from the data matrix X. A pseudo-matrix of the same size as X is created by drawing columns of X with replacement. The phylogenetic analysis is performed for the pseudo-matrix, giving the estimate $\hat\tau^{*(ML)}$. This procedure is repeated a large number of times, and the bootstrap support value is the fraction of bootstrap replicates giving the same topology as the original data.

Each column in the data matrix X of aligned DNA sequences for $k$ species is one out of $4^k$ possible ones (each entry in a column is one of the four nucleotides {A, C, T, G} and we consider all possible combinations of the $k$ entries in the column). Since we assume the sites to be independent, the order of the sites does not influence the likelihood function. Hence we can consider $\mathbf{n} = (n_1, \ldots, n_{4^k})$ instead of X, where $n_i$ is the number of columns of a specific pattern (e.g. $n_1$ is the number of columns of pattern (AAA...A)) and $\sum_i n_i = n$. Each combination of $n_i$ with $\sum_i n_i = n$ gives a Maximum Likelihood estimate (see Figure 2.5). In the figure the filled circle denotes the original data and the other circles denote other possible $\mathbf{n}$-values. When creating pseudo-matrices, each replicate results in one of the circles. The estimate of the topology does not change for data points near the original data point, but changes abruptly as the border to the next topology estimate is crossed. This is a different situation than with the well-behaved continuous density used in the description of the original bootstrap idea.

Figure 2.5: The $\mathbf{n}$-space is $4^k$-dimensional, where $k$ is the number of species, but is here visualized in two dimensions. The circles denote possible values of $\mathbf{n}$, where the original data is denoted by the filled circle. Each possible value of $\mathbf{n}$ gives an estimate of the topology $\tau$; here the regions of three possible topologies $\tau_1$, $\tau_2$, $\tau_3$ are shown.

The bootstrap support value is an estimate of the probability of getting the same estimate as the original one when considering pseudo-replicates. If the original data matrix is near the border of a region, the estimate is uncertain. The probability of getting a data point in another region when creating a pseudo-replicate is then fairly large, and hence the bootstrap frequency will be small. On the other hand, if the original data point is far away from the border to another region, the estimate is robust, and that will show in a large bootstrap support value. The interpretation of the bootstrap support value as a confidence measure is however not clear.
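The column-resampling procedure described above can be sketched in a few lines. This is an illustration under explicit assumptions: the bootstrap loop follows the description in the text, but the topology estimator plugged into it is a simple distance-based rule for exactly four sequences (pick the split with the smallest sum of within-pair differences), swapped in for Maximum Likelihood only to keep the example self-contained. All function names and the toy data are hypothetical.

```python
import random


def resample_columns(X, rng):
    """Pseudo-matrix of the same size as X, columns drawn with replacement."""
    n = len(X[0])
    chosen = [rng.randrange(n) for _ in range(n)]
    return ["".join(row[c] for c in chosen) for row in X]


def quartet_topology(X):
    """Toy estimator for four sequences a, b, c, d (NOT Maximum Likelihood).

    Returns the split whose two within-pair difference counts sum to the
    smallest value; ties are broken alphabetically.
    """
    def diff(i, j):
        return sum(x != y for x, y in zip(X[i], X[j]))

    splits = {
        "ab|cd": diff(0, 1) + diff(2, 3),
        "ac|bd": diff(0, 2) + diff(1, 3),
        "ad|bc": diff(0, 3) + diff(1, 2),
    }
    return min(sorted(splits), key=splits.get)


def bootstrap_support(X, estimator, replicates=200, seed=0):
    """Fraction of pseudo-matrices whose estimate equals the original one."""
    rng = random.Random(seed)
    original = estimator(X)
    hits = sum(estimator(resample_columns(X, rng)) == original
               for _ in range(replicates))
    return original, hits / replicates


# Hypothetical data: rows are the species a, b, c, d.
X = ["ACGTACGTTT", "ACGTACGATT", "TCGAACCTTA", "TCGAAGCTTA"]
topology, support = bootstrap_support(X, quartet_topology)
```

Replacing `quartet_topology` by a function that performs a full Maximum Likelihood search would give the ML bootstrap discussed in the text, at a much higher computational cost per replicate.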

2.3 Bayesian inference

Generally, Bayesian inference combines earlier knowledge or understanding (prior beliefs) with currently measured data. The prior beliefs are updated through Bayes' theorem. For an unknown continuous quantity $\theta$, Bayes' theorem can be expressed as

$$f(\theta \mid x_1, \ldots, x_n) = \frac{L(\theta \mid x_1, \ldots, x_n)\, g(\theta)}{\int L(\theta' \mid x_1, \ldots, x_n)\, g(\theta')\, d\theta'}, \tag{2.3}$$

where $f(\cdot)$ denotes the posterior probability density of the unknown $\theta$ after observing data $x_1, \ldots, x_n$, $L(\theta \mid x_1, \ldots, x_n)$ is the likelihood function of the data and $g$ denotes the prior density function of $\theta$, which is given prior to observing data. The denominator, the probability of the data at hand, is a normalizing constant where the integral is over the support of $\theta$. For a deeper understanding of Bayesian inference in general, see e.g. [BT73], [Pre03].

In phylogenetics, Bayesian inference was suggested in the last decade of the 20th century ([Mau96], [Li97], [RY96]). It became popular when Markov Chain Monte Carlo (MCMC) methods were introduced into the phylogeny problem (e.g. [LS99], [YR97]), making it possible to sample from the posterior distribution and resulting in software such as BAMBE [SL98] and MrBayes [HR01]. For general theory of MCMC see e.g. [GRe96].

2.3.1 Point estimate of topology and confidence measure

Applied to the phylogeny problem, Bayes' theorem can be expressed as

$$f(\tau, b^{(\tau)}, \theta \mid X) = \frac{L(\tau, b^{(\tau)}, \theta \mid X)\, g(\tau, b^{(\tau)}, \theta)}{\sum_{\tau'} \iint L(\tau', b^{(\tau')}, \theta' \mid X)\, g(\tau', b^{(\tau')}, \theta')\, db^{(\tau')}\, d\theta'}, \tag{2.4}$$

where the sum in the denominator is over all possible topologies $\tau'$. The posterior probability density, the left-hand side of (2.4), is a joint probability density for all parameters. To obtain the marginal posterior probability for a given topology $\tau_i$, the other parameters, the branch lengths $b^{(\tau_i)}$ and the model parameters $\theta$, are integrated out. Hence,

$$f(\tau_i \mid X) = \frac{\iint L(\tau_i, b^{(\tau_i)}, \theta \mid X)\, g(\tau_i, b^{(\tau_i)}, \theta)\, db^{(\tau_i)}\, d\theta}{\sum_{\tau'} \iint L(\tau', b^{(\tau')}, \theta' \mid X)\, g(\tau', b^{(\tau')}, \theta')\, db^{(\tau')}\, d\theta'}. \tag{2.5}$$

The estimate of the topology is the one with the largest posterior probability, that is

$$\hat\tau^{(MPP)} = \arg\max_i \{ f(\tau_i \mid X) \}, \tag{2.6}$$

where MPP is an acronym for Maximum Posterior Probability. The interpretation of the posterior density is clear: it is the density given the model, prior and data. The posterior probability is used as the confidence measure of $\hat\tau^{(MPP)}$, and hence the support value is the probability of the estimate being the true phylogeny if the correct model and prior are used.

In a frequently cited paper by Efron et al. [EHH96], it is claimed that bootstrap support values can be interpreted as posterior probabilities under certain circumstances. Other authors have noticed empirically that Bayesian posterior probabilities for the best supported topology are significantly higher than the corresponding non-parametric bootstrap frequencies (e.g. [SGN02], [WZHH02], [KMCD01], [MEO+01]). In Paper I we investigate this with simulated data. In Paper II it is shown that the statement of Efron et al. is true under the conditions given there, but that those conditions are violated in general phylogenetic inference. Therefore the two support measures should not be expected to be equal. Britton et al. [BSEO07] give a mathematical argument for the Bayesian support being larger than the bootstrap support value.

2.3.2 Priors

The Bayesian approach in phylogenetics requires priors on the topology $\tau$, the corresponding branch lengths $b^{(\tau)}$ and the model parameters $\theta$. As more data are added, the priors have less influence on the posterior density. However, they still have to be specified. The problem of specifying priors is, as pointed out by Huelsenbeck et al. [HLMR02], either the strength or the weakness of the Bayesian method. If prior knowledge is available, why not use it? On the other hand, how should one specify priors when prior knowledge is not available? If one knows that some topologies are impossible, the prior probabilities for those can be set to zero, and there will be no positive posterior probability for them. If no weighting of topologies is wanted a priori, a discrete uniform prior is usually used.
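The role of the topology prior in (2.5) and (2.6) can be illustrated with a short sketch. It assumes that the prior factorizes as $g(\tau_i, b^{(\tau_i)}, \theta) = g(\tau_i)\, g(b^{(\tau_i)}, \theta \mid \tau_i)$, so that the numerator of (2.5) is $g(\tau_i)$ times a marginal likelihood $m_i$ for topology $\tau_i$. The marginal likelihoods are taken as given on the log scale (in practice they are not computed directly; MCMC samples from the posterior instead), and the numbers and function name below are hypothetical.

```python
import math


def topology_posterior(log_marginal_likelihoods, prior):
    """Posterior probabilities f(tau_i | X) proportional to prior_i * m_i.

    Works on the log scale and subtracts the largest term before
    exponentiating, so that very small likelihoods do not underflow.
    """
    terms = [lm + math.log(p) if p > 0 else -math.inf
             for lm, p in zip(log_marginal_likelihoods, prior)]
    top = max(terms)
    if top == -math.inf:
        raise ValueError("every topology has prior probability zero")
    weights = [math.exp(term - top) for term in terms]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log marginal likelihoods for three topologies.
log_m = [-1000.0, -1001.0, -1003.0]

uniform = topology_posterior(log_m, [1 / 3, 1 / 3, 1 / 3])
excluded = topology_posterior(log_m, [0.0, 0.5, 0.5])  # first topology ruled out

# MPP estimate (2.6): index of the topology with the largest posterior.
mpp_index = max(range(len(uniform)), key=uniform.__getitem__)
```

With the uniform prior the posterior is proportional to the marginal likelihoods alone, while a topology given prior probability zero keeps posterior probability zero however well it fits the data.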
When there is no prior knowledge available for branch lengths and model parameters, priors representing the lack of knowledge are wanted, so-called non-informative priors. There is however no universal way to choose such non-informative priors. One reason for this is that there exists no unique definition of the term in the literature. Two different priors are often used in phylogenetics for branch lengths: either a flat prior on a large interval ($b \sim U(0, M)$, where $M$ is large) or an exponential prior ($b \sim \mathrm{Exp}(\lambda)$, where $\lambda$ has to be specified or drawn from a hyperprior distribution). In Paper V we derive the so-called Jeffreys prior for branch lengths using the Jukes-Cantor model of evolution.

3. Divergence times estimation

Many methods in phylogenetics, including the likelihood based ones described in the previous chapter, give the estimated branch lengths in expected number of substitutions per site.

Figure 3.1: To the left is the estimated tree, $\hat\tau$, with branch lengths proportional to the expected number of substitutions per site. To the right is the corresponding time tree, with divergence times $t_0$, $t_1$ and $t_2$, where branch lengths are proportional to evolutionary time.

In Figure 3.1 the hypothetical estimate $\hat\tau$ for the four species {a, b, c, d} is shown to the left. Since the speciation between a and b, the a-lineage has evolved more than the b-lineage. The DNA of species b is more similar to the DNA of the ancestor than the DNA of species a is. Once the topology is estimated, that is, once the rooted tree $\hat\tau$ to the left in Figure 3.1 is obtained, the divergence times of the internal nodes may be of interest. How many years ago did the common ancestor of a and b exist? We are interested in estimating the divergence times $t_0$, $t_1$ and $t_2$ in the tree to the right in Figure 3.1. The topology, that is, the branching order, is obtained from $\hat\tau$, but now the leaves are at the same level. The length of the path from the root to a leaf is equal for all leaves. A tree with this property is called an ultrametric tree.

3.1 Algorithms for dating nodes

The left-hand tree in Figure 3.1 can be estimated consistently as the sequence length tends to infinity (assuming the model is correct). The branch lengths, proportional to the expected number of substitutions per site, are estimates of the product of rates and times. To estimate the time tree to the right in Figure 3.1 consistently, some assumption on how the rates vary over the tree is needed [Bri05]. Different molecular clock assumptions exist, from the global molecular clock with one substitution rate over the entire tree to local clocks with different rates in different parts of the tree. At the other extreme is a separate rate for every branch.

With more and more data, and with a method that consistently estimates the left-hand tree in Figure 3.1, $\hat\tau$ should be closer and closer to the right-hand tree if the global clock is valid, that is, if all lineages have evolved at equal rates. If the global substitution rate is known, or can be estimated from a calibration node, it is easy to estimate the divergence times consistently. The algorithm of Mean Path Length (MPL), first introduced by Bremer and Gustafsson [BG97] and further developed by Britton et al. [BOVB02], is a method implicitly using the global clock. It is assumed that the branch lengths are proportional to the number of substitutions between the nodes. The MPL of a node is the mean of the sum of the observations along paths from the node to descending leaves. The algorithm allows one calibration point. The global clock can also be assumed when phylogenetic inference is based on the ML method, enforcing an ultrametric tree. This is implemented in software such as PAUP* [Swo02], PHYLIP [Fel05] and BASEML [Yan97].

For real data, the molecular clock assumption usually does not hold [LF74]. If it is violated, the left-hand tree $\hat\tau$ can still be estimated consistently, but it will not tend to the right-hand tree when the sequence length $n$ increases. An algorithm that is based on the MPL method but allows several calibration points, and thereby corrects for deviations from the molecular clock, is PATHd8 [BAJ+07]. The calibration points define segments in the tree. A local molecular clock model, where the same rate is assumed within a segment but may differ between segments, is implicitly assumed. The properties of the algorithm are investigated in Paper IV.

The local molecular clock with rate constancy in parts of the tree is also implemented e.g. in BASEML and r8s [San03]. The definition of the segments of the tree that share the same local clock is a crucial step for the method. The local molecular clock described above lies between two extremes: the global clock, assuming a constant rate over the entire tree, and a model allowing independent rates for all branches. There are methods that do not require the segments of the tree to be explicitly pre-defined but instead smooth the rate change over the tree by penalizing rates that change too fast between neighbouring branches (the NPRS method implemented in r8s, [San03]). For a further review of dating methods, see Rutschmann [Rut06].
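The Mean Path Length idea described above can be sketched as follows. This is a minimal illustration, not the published MPL or PATHd8 implementation: it computes the MPL of a node as the mean of the path lengths from the node to its descending leaves, and then converts MPLs to ages using a single calibration node under a global clock (rate = MPL of the calibration node divided by its known age). The tree encoding, function names, branch lengths and the calibration age are hypothetical.

```python
def leaf_path_lengths(node):
    """Path lengths from `node` down to each of its descendant leaves.

    A leaf is a species name (str); an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return [0.0]
    lengths = []
    for child, branch in node:
        lengths.extend(branch + below for below in leaf_path_lengths(child))
    return lengths


def mean_path_length(node):
    """MPL: mean of the path lengths from the node to its descending leaves."""
    paths = leaf_path_lengths(node)
    return sum(paths) / len(paths)


def node_ages(nodes, calibration_node, calibration_age):
    """Ages of `nodes` under a global clock fixed by one calibration point."""
    rate = mean_path_length(calibration_node) / calibration_age
    return [mean_path_length(node) / rate for node in nodes]


# Hypothetical rooted tree (((a, b), c), d), branch lengths in substitutions per site.
ab = [("a", 0.3), ("b", 0.1)]
abc = [(ab, 0.2), ("c", 0.4)]
root = [(abc, 0.2), ("d", 0.6)]

# Assume the root is known to be 30 time units old.
ages = node_ages([ab, abc, root], calibration_node=root, calibration_age=30.0)
```

In this toy tree the a-lineage is longer than the b-lineage, as in the left-hand tree of Figure 3.1, and averaging over the paths is what turns the unequal path lengths into a single node height.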

4. Summary of Papers

Many interesting areas of molecular biology can be studied once genomes have been successfully sequenced. We are interested in the evolutionary relationships between organisms. Already the introductory part of this thesis (chapters 1-3) raises issues such as the properties of the methods for reconstructing the phylogeny and estimating the divergence times. Almost all the questions we have studied were asked by biologists who use phylogenetics in their everyday work, who noticed certain qualities of the methods and wondered whether there are theoretical explanations for what they observed.

In this chapter the aim and the content of the papers are summarized. My contribution to Paper I is the regression part. For the other papers I am the main author, responsible for the calculations and explanations given therein.

4.1 Paper I
Reliability of Bayesian Posterior Probabilities and Bootstrap Frequencies in Phylogenetics

Bootstrap support values (see section 2.2.2) have been used in phylogenetics for different reconstruction methods ever since Felsenstein suggested them in 1985 [Fel85]. For methods like parsimony or Maximum Likelihood the bootstrap procedure is very time consuming (except when only a few species are considered). The reason for this is that for every pseudo-sample all possible topologies should be studied, calculating the parsimony length or optimizing the likelihood value (see section 2.2.1). When the use of MCMC was introduced in phylogenetics (e.g. [YR97], [LS99]), making Bayesian inference possible, it was much faster and gave a support value in the form of posterior probabilities (see section 2.3.1). Efron et al. [EHH96] stated that bootstrap support values could be interpreted as posterior probabilities using a non-informative prior. It seemed that it was possible to use the new, faster method to obtain approximations of the measure of support that had been used for some time. However, several papers (e.g. [WZHH02], [KMCD01], [MEO+01]) had noticed that Bayesian support values were often larger than the corresponding bootstrap values. This, together with the unclear interpretation of a bootstrap value, started the work on this paper.

24 Here we use simulated data from a fixed unrooted tree of 5 species {A, B, C, D, E}, shown in Figure 4.1, using the evolution model of Jukes- Cantor and GTR+Γ (see section 2.1). By using simulated data we actually know the true phylogeny. D A E B C Figure 4.1: The fixed unrooted tree from which data sets were simulated. The numbers along branches represent expected number of substitutions per site. The two short internal branches without numbers are both of length Support values for topologies with sequences from more than four species are often given for clades (groups of species) rather than for entire topologies. A support value for AB is the support for A and B to form a subtree. Contributions for this support value may also come from other topologies than the estimated one. E.g. by letting C and E change places in Figure 4.1 the topology is changed but A and B are still grouped as closely related. For a large number of simulated data sets the estimates of the topology were calculated using Maximum Likelihood and Bayesian inference with the same model of evolution as the data sets had been created from. For the Bayesian inference, uniform priors on topology and branch lengths (b (τ) U (0, 10)) were used, which were the default values of Mr- Bayes2.01 [HR01] used for the analysis. With ML bootstrap support values were also calculated where all parameters were reoptimized for each pseudo-sample. The different support values were paired. The Wilcoxon signed-rank test shows that Bayesian inference yields significantly higher support values than Maximum Likelihood bootstrap values for well supported clades. From the results a logistic regression was fitted expressing π, the probability that the true clade has been found as a function of the support value, x. Figure 4.2 shows the result where the logit model, π(x) x 1 x log 1 π(x) = α + β log has been used for the posterior probabilities (solid line) and the bootstrap values (dashed line). 
The data sets were also analyzed with the wrong model in order to investigate how the support values behave under model misspecification. The analysis shows that the risk of drawing erroneous conclusions is higher with Bayesian inference than with Maximum Likelihood bootstrapping.

[Figure 4.2: Reproduced figure of the logistic regression for the probability $\pi(x)$ that the true clades are found as a function of the support value $x$, for Bayesian posterior probabilities (BAYES, solid line) and Maximum Likelihood bootstrap support values (MLBOOT, dashed line). The help line indicates support values that correspond to 95% of the clades being correctly estimated.]

4.2 Paper II

Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics

Paper I indicated that there is a difference between the support values obtained by bootstrapping Maximum Likelihood and by Bayesian inference in phylogenetics, despite the theoretical claim of their approximate equivalence [EHH96]. The aim of this paper is to investigate the conditions needed for the statement of Efron et al. to hold, in order to understand why the systematic difference between the two support measures appears.

Consider aligned DNA sequences of length $n$ of $k$ species with no gaps. Each column of the data matrix $X$ is then one out of $4^k$ possible ones. Denote the possible columns by $X_1, \dots, X_{4^k}$, where $X_1 = (A, A, \dots, A)$, $X_2 = (C, C, \dots, C)$, etc. The proportions of $X_1, \dots, X_{4^k}$ generated from the true phylogeny $\tau$ with branch lengths $b^{(\tau)}$ are $p = (p_1, \dots, p_{4^k})$, where $\sum_i p_i = 1$. Each $\{\tau, b^{(\tau)}\}$ induces a different $p$-vector. The continuous $p$-space can, at least theoretically, be divided into regions representing different topologies $\tau_i$ (see the left-hand side of Figure 4.3).

[Figure 4.3: Each $\{\tau, b^{(\tau)}\}$ induces a different $p$-vector. The continuous $p$-space can be divided into regions representing different topologies $\tau_i$. The sample space of $\mathbf{n}$ is a discrete space which can be divided into regions where the estimate $\hat\tau = \tau_i$.]

The data matrix $X$ can be represented as $\mathbf{n} = (n_1, \dots, n_{4^k})$, where $n_i$ is the number of columns in $X$ that equal $X_i$. Using $\mathbf{n}$ it is possible to estimate $p$ (e.g. $\hat p_i = n_i / n$). We are, however, not primarily interested in $p$ but in $\tau$. With a consistent estimation method the regions of the $\mathbf{n}$-space giving $\hat\tau = \tau_i$ should be close to the corresponding regions of the $p$-space (see Figure 4.3).

Efron et al. stated in [EHH96] that "The bootstrap probability that $\hat\tau^* = \hat\tau$ is almost the same as the a posteriori probability that $\tau = \hat\tau$ starting from an uninformative prior on $p$", where $\hat\tau^*$ is the estimate of the topology for a bootstrap replicate. This statement is proven in this paper. For the statement to hold, the prior must be placed on $p$. In Bayesian phylogenetics, however, priors are not placed on $p$ but on the topology $\tau$, the corresponding branch lengths $b^{(\tau)}$ and the model parameters $\theta$.
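The representation of the data matrix by column counts, and the estimate $\hat p_i = n_i / n$, can be sketched as follows. This is only an illustration: the function name and the toy alignment are made up, and only columns that actually occur in the data are stored.

```python
from collections import Counter

def column_proportions(sequences):
    """Return {column: n_i / n} for an ungapped alignment given as equal-length strings."""
    n = len(sequences[0])
    if any(len(s) != n for s in sequences):
        raise ValueError("all sequences must have the same length")
    counts = Counter(zip(*sequences))  # n_i for each observed column X_i
    return {"".join(col): c / n for col, c in counts.items()}

# Toy alignment (hypothetical data): k = 4 sequences of length n = 5.
alignment = ["AACGT", "AACGA", "ACCGT", "ACCGT"]
p_hat = column_proportions(alignment)
print(p_hat)
```

Columns that never occur simply get $\hat p_i = 0$ and are left out of the dictionary, which keeps the representation small even though there are $4^k$ possible columns.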
Using the Jukes-Cantor model of evolution (which eliminates $\theta$), $k = 4$ species, a discrete uniform prior on $\tau$ and either a uniform ($U(0, M)$ for different values of $M$) or an exponential ($\mathrm{Exp}(\lambda)$ for different values of $\lambda$) prior for the branch lengths $b^{(\tau)}$, we studied the results of the likelihood based methods. Under the Jukes-Cantor model several possible data patterns contribute to the likelihood function in the same way. Hence $p$ (and $\mathbf{n}$) is, for $k = 4$ species, of dimension 15 instead of $4^4 = 256$ (see Table 4.1).

We show analytically in this paper that for Maximum Likelihood only 3 of the patterns are separately informative, in the sense that a data set consisting of only one of the patterns gives a unique estimate of the topology, while 9 of the patterns are separately informative in Bayesian inference. Since the two methods differ, the regions of the $\mathbf{n}$-space where $\hat\tau = \tau_i$ do not always coincide. Hence the two measures of support cannot be expected to be approximately equal.

| pattern no | pattern | description |
|---|---|---|
| 1 | XXYY | Groups of two nucleotides, equal within the group but not equal between groups. |
| 2 | XYXY | |
| 3 | XYYX | |
| 4 | XXXY | One nucleotide differs from the rest. |
| 5 | XXYX | |
| 6 | XYXX | |
| 7 | YXXX | |
| 8 | XXYZ | One group with two equal nucleotides; the other two differ from the group and from each other. |
| 9 | YZXX | |
| 10 | XYXZ | |
| 11 | XYZX | |
| 12 | YXXZ | |
| 13 | YXZX | |
| 14 | XYZU | All nucleotides different. |
| 15 | XXXX | All nucleotides equal. |

Table 4.1: With the Jukes-Cantor model of evolution and with 4 taxa there are 15 different patterns contributing to the likelihood in different ways.

| method | $\tau_1$ | $\tau_2$ | $\tau_3$ |
|---|---|---|---|
| ML | 1 | 2 | 3 |
| MPP | 1, 8, 9 | 2, 10, 13 | 3, 11, 12 |

Table 4.2: Separately informative patterns favouring the topologies $\tau_1 = \{(a, b), (c, d)\}$, $\tau_2 = \{(a, c), (b, d)\}$ and $\tau_3 = \{(a, d), (b, c)\}$ as estimate of topology, for the methods of Maximum Likelihood and Maximum Posterior Probability respectively. The enumeration of the patterns follows Table 4.1.
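The reduction from 256 columns to 15 patterns can be checked by brute force. The sketch below is an illustration, not taken from Paper II: it relabels the nucleotides of a column by order of first appearance and maps the result to the numbering of Table 4.1 (the dictionary and function names are hypothetical).

```python
from collections import Counter
from itertools import product

# Canonical form (nucleotides relabelled by order of first appearance)
# mapped to the pattern number used in Table 4.1.
PATTERN_NUMBER = {
    "0011": 1, "0101": 2, "0110": 3,
    "0001": 4, "0010": 5, "0100": 6, "0111": 7,
    "0012": 8, "0122": 9, "0102": 10, "0120": 11, "0112": 12, "0121": 13,
    "0123": 14, "0000": 15,
}

def pattern_number(column):
    """Return the Table 4.1 pattern number of a column of 4 nucleotides."""
    labels = {}
    canonical = "".join(str(labels.setdefault(base, len(labels))) for base in column)
    return PATTERN_NUMBER[canonical]

# Classify all 4^4 = 256 possible columns.
counts = Counter(pattern_number(col) for col in product("ACGT", repeat=4))
print(len(counts), sum(counts.values()))
```

Every one of the 256 columns falls into exactly one of the 15 classes, for example 4 columns in pattern 15 (all equal) and 24 columns in pattern 14 (all different).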

4.3 Paper III

Improving Divergence Time Estimation in Phylogenetics: More Taxa vs. Longer Sequences

At a phylogenetic meeting organized by NESCent (National Evolutionary Synthesis Center) in September 2006 the question was posed whether more taxa (sequences from more species) usually imply that the Maximum Likelihood estimate of the age (the divergence time) of the root gets older. The aim of this paper is to investigate the properties of ML as a method of estimating divergence times of a given rooted tree.

A difference between this paper and the two previous ones is that the rooted topology has already been estimated and is assumed known, with branch lengths proportional to the number of substitutions. Hence we have a rooted binary tree with divergence times $t_i$ of the internal nodes and observed numbers of substitutions $y_i$ along the branches (see Figure 4.4). We restrict the study to symmetric trees, in the sense that the two subtrees of a node have equally many taxa.

[Figure 4.4: A rooted symmetric binary tree with $k = 8$ taxa and $l = \log_2 k = 3$ levels. A node is at level $j$ if it has $j - 1$ nodes on the path from the root to the node. The divergence time of node $i$ is denoted $t_i$ and the observations along the branches are denoted $y_i$.]

The evolution model used in this study is the Jukes-Cantor model [JC69], for which the number of substitutions along a branch is $Y_i \sim \mathrm{Po}(n r t)$, where $n$ is the sequence length, $r$ the mean substitution rate and $t$ the elapsed time between the nodes connected by the branch. The $Y_i$ are independent and the model is in the exponential family. (For more on the exponential family, see e.g. [Lin01].)

For a symmetric tree of $k$ taxa there are $\log_2 k$ levels of internal nodes, where the root is at level 1 and a node is at level $j$ if it has $j - 1$ nodes on the path from the root to the node. Hence, at level $j$ the nodes with divergence times $t_{2^{j-1}}, \dots, t_{2^j - 1}$ are placed. With this notation the score function, $U_i(t) := \partial l(t) / \partial t_i$, equals

$$U_i(t) = \frac{\partial \eta(t)^T}{\partial t_i} T(y) - \frac{\partial a(t)}{\partial t_i} =
\begin{cases}
-2nr + \dfrac{y_1}{t_1 - t_2} + \dfrac{y_2}{t_1 - t_3}, & i = 1, \\[2ex]
-nr - \dfrac{y_{i-1}}{t_{[i/2]} - t_i} + \dfrac{y_{2i-1}}{t_i - t_{2i}} + \dfrac{y_{2i}}{t_i - t_{2i+1}}, & i = 2, \dots, \frac{k}{2} - 1, \\[2ex]
-nr - \dfrac{y_{i-1}}{t_{[i/2]} - t_i} + \dfrac{y_{2i-1} + y_{2i}}{t_i}, & i = \frac{k}{2}, \dots, k - 1.
\end{cases} \tag{4.1}$$

To obtain the ML estimate of $t$, the score function is set to 0 and the equations in (4.1) are solved numerically. For a fixed number of taxa $k$, the model being in the exponential family, it follows that the estimate $\hat t^{(ML)}$ is consistent, $\hat t^{(ML)} \to t$ when $n \to \infty$. The asymptotic variance of $\hat t^{(ML)}$ equals the inverse of the expected information matrix,

$$I_{ij}(t) = -E\left(\frac{\partial U_i(t)}{\partial t_j}\right),$$

which can be shown to equal

$$I_{ij}(t) =
\begin{cases}
\dfrac{nr}{t_1 - t_2} + \dfrac{nr}{t_1 - t_3}, & i = j = 1, \\[2ex]
\dfrac{nr}{t_{[i/2]} - t_i} + \dfrac{nr}{t_i - t_{2i}} + \dfrac{nr}{t_i - t_{2i+1}}, & i = j,\ i = 2, \dots, \frac{k}{2} - 1, \\[2ex]
\dfrac{nr}{t_{[i/2]} - t_i} + \dfrac{2nr}{t_i}, & i = j,\ i = \frac{k}{2}, \dots, k - 1, \\[2ex]
-\dfrac{nr}{t_j - t_i}, & j = [i/2],\ i = 2, \dots, k - 1, \\[2ex]
-\dfrac{nr}{t_i - t_j}, & j \in \{2i, 2i + 1\},\ i = 1, \dots, \frac{k}{2} - 1,
\end{cases} \tag{4.2}$$

with all other entries equal to zero. Each entry of the inverse of $I(t)$ will have a factor $1/n$, and hence the variance of the ML estimate will be reduced by a factor 2 if the sequence length is doubled.

The Maximum Likelihood method is time consuming. Solving the score equations numerically is a non-trivial task and so is finding the inverse of $I(t)$ numerically. We therefore compare the method of Maximum Likelihood with the much simpler and faster method of Mean Path Length (see section 3.1), which works with the model of Jukes-Cantor. The variance of the estimate of the divergence time of the root, $t_1$, can be expressed as

$$V(\hat t_1^{(MPL)}) = \frac{1}{nr}\left(\sum_{j=1}^{\log_2 k - 1} 2^{-2j} \sum_{i=2^j - 1}^{2^{j+1} - 2} \left(t_{[\frac{i+1}{2}]} - t_{i+1}\right) + \frac{1}{k^2} \sum_{i=k-1}^{2k-2} t_{[\frac{i+1}{2}]}\right), \tag{4.3}$$

where $j$ denotes the level of the node, $k$ the number of taxa and $[\,\cdot\,]$ the integer part. To compare the MPL estimate with the ML estimate of the age of the root, (4.3) should be compared to the top left element of the inverse of the expected information (4.2). From (4.3) we see that here, too, the variance is reduced by a factor 2 if the sequence length is doubled.

In this study we have considered three kinds of trees: (1) equidistant complete symmetric, (2) complete symmetric and (3) symmetric trees (see Figure 4.5). In the complete symmetric trees all nodes on the same level diverged at the same time. In the equidistant case we further require the times between speciations to be equal, that is, all branches are of the same length. In the symmetric trees the two subtrees of a node have equally many nodes, but the divergence times may be different for all nodes. When doing the inference, the type of tree is of course assumed to be unknown.

[Figure 4.5: The three different types of trees considered in the simulations. To the left is the equidistant complete symmetric case, where all nodes at the same level diverged at the same time and the times between the levels are equal. In the middle is the complete symmetric case, where the times between the levels do not need to be equal. At the right is the third type, which is symmetric in the sense that the two subtrees of a node have equally many nodes, but the times of divergence may differ between nodes.]

For the equidistant case we have shown the following theorem.

Theorem. If the number of taxa in a rooted phylogenetic tree is $k = 2^l$ and all branches are of the same length ($t_1 / \log_2 k$), where $t_1$ is the age of the root, then

$$V(\hat t_1^{(MPL)}) = V(\hat t_1^{(ML)}) = (I^{-1})_{11} = \frac{t_1}{nr \log_2 k} \cdot \frac{k - 1}{k}, \tag{4.4}$$

where $I$ is the information matrix.

Hence $V(\hat t_1^{(ML)}) = V(\hat t_1^{(MPL)})$ in the equidistant complete symmetric case and both estimators are efficient.
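The theorem can be checked numerically for a small tree. The sketch below is an illustration under stated assumptions, not the code used in Paper III: it uses heap numbering (node $i$ has children $2i$ and $2i+1$, leaves are numbered $k, \dots, 2k-1$ and have age 0), a per-branch form of the MPL variance (4.3), and a plain Gauss-Jordan elimination to obtain the top left element of the inverse of (4.2). All function names and the parameter values are hypothetical.

```python
from fractions import Fraction

def equidistant_times(k, t1):
    """Ages t[i] for nodes 1..2k-1 (heap numbering; leaves k..2k-1 have age 0)."""
    levels = k.bit_length() - 1  # l = log2(k)
    return {i: t1 * (levels - i.bit_length() + 1) / levels for i in range(1, 2 * k)}

def mpl_root_variance(t, k, n, r):
    """Per-branch form of (4.3): branch y_i enters node i+1 from node (i+1)//2."""
    total = Fraction(0)
    for i in range(1, 2 * k - 1):
        child, parent = i + 1, (i + 1) // 2
        weight = Fraction(1, 2 ** (child.bit_length() - 1))  # share of taxa below the branch
        total += weight ** 2 * (t[parent] - t[child])
    return total / (n * r)

def ml_root_variance(t, k, n, r):
    """Top left element of the inverse of the expected information matrix (4.2)."""
    m = k - 1
    info = [[Fraction(0)] * m for _ in range(m)]
    for child in range(2, 2 * k):
        parent = child // 2
        w = n * r / (t[parent] - t[child])
        info[parent - 1][parent - 1] += w
        if child < k:  # the child is an internal node
            info[child - 1][child - 1] += w
            info[parent - 1][child - 1] -= w
            info[child - 1][parent - 1] -= w
    # Gauss-Jordan elimination on [info | e_1]; the matrix is positive definite.
    rhs = [Fraction(1)] + [Fraction(0)] * (m - 1)
    for c in range(m):
        piv = info[c][c]
        info[c] = [v / piv for v in info[c]]
        rhs[c] /= piv
        for row in range(m):
            if row != c and info[row][c] != 0:
                f = info[row][c]
                info[row] = [a - f * b for a, b in zip(info[row], info[c])]
                rhs[row] -= f * rhs[c]
    return rhs[0]

# Hypothetical parameter values, for illustration only.
k, n, r, t1 = 8, 1000, Fraction(1, 100), Fraction(12)
t = equidistant_times(k, t1)
closed_form = t1 * (k - 1) / (n * r * (k.bit_length() - 1) * k)  # right-hand side of (4.4)
print(mpl_root_variance(t, k, n, r), ml_root_variance(t, k, n, r), closed_form)
```

Exact rational arithmetic is used so that the three quantities can be compared with `==` rather than with a floating-point tolerance.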

For the complete symmetric case we have, in addition, verified that $V(\hat t_1^{(ML)}) = V(\hat t_1^{(MPL)})$ for the number of taxa $k \in \{4, 8, 16\}$, and simulations indicate that this holds in general also for larger $k$. For a symmetric tree, where all nodes have individual divergence times, our simulation results indicate that ML estimates the divergence time of the root with slightly higher precision than the method of MPL. The differences between the estimates of the two methods, and between the corresponding variances, are small though.

When estimating the divergence times for internal nodes, ML uses all observations of the entire tree. MPL only takes the observations along the paths from the node to the descending taxa, so MPL uses less information than ML. For a node located close to the root the two methods use almost the same amount of information, and for the root exactly the same. The estimates, as well as the precisions thereof, should therefore be close. For a node closer to the taxa, MPL only uses part of the information and the precision is then lower than for ML.

The question posed was whether more taxa cause ML to overestimate the divergence time of the root or not. The answer is no, if $k$ is increased in a nice way. From (4.4) the variance can be reduced by a factor 2 by doubling the sequence length $n$, and by approximately a factor 2 by squaring the number of taxa $k$ (which doubles $\log_2 k$ while the factor $(k-1)/k$ stays close to 1). For fixed $k$, ML is consistent and therefore does not overestimate the age of the root systematically. However, since we only have finite sequences, we cannot let $k \to \infty$. If $k > n$ there is not enough data to estimate the number of substitutions along the branches. We therefore consider only situations where $k$ is increased in a nice way and not too fast, so that the sequence length always is much larger than the number of taxa.
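The effect of squaring the number of taxa can be made explicit from (4.4). The short derivation below is a sketch that assumes the equidistant complete symmetric tree of the theorem, with the root age $t_1$, the rate $r$ and the sequence length $n$ held fixed; writing $V(k)$ for the variance in (4.4) is notation introduced here.

```latex
\frac{V(k^2)}{V(k)}
  = \frac{t_1}{n r \,\log_2 k^2}\cdot\frac{k^2-1}{k^2}
    \cdot \frac{n r \,\log_2 k}{t_1}\cdot\frac{k}{k-1}
  = \frac{1}{2}\cdot\frac{(k-1)(k+1)}{k\,(k-1)}
  = \frac{k+1}{2k}.
```

For $k = 4$ the ratio is $5/8$ and for $k = 16$ it is $17/32$, so the reduction approaches a factor 2 as $k$ grows, whereas doubling $n$ halves the variance exactly.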
4.4 Paper IV

Consistent estimation of divergence times in phylogenetic trees with local molecular clocks

The Mean Path Length (see sections 3.1 and 4.3) is the basis of the PATHd8 algorithm [BAJ+07] for estimating divergence times of a given tree. The main difference between MPL and PATHd8 is that the latter allows several calibration nodes and thereby corrects for deviations from the molecular clock assumption (see section 3.1). The purpose of this paper is to investigate for what families of models the PATHd8 algorithm estimates the divergence times consistently.

PATHd8 estimates divergence times of a given rooted, not necessarily binary, tree with branch lengths proportional to the number of substitutions for DNA sequences of length $n$ by first calculating the MPLs of all nodes. A node can be defined either as a fixed age node, with a calibration time known e.g. from fossils, or as a reference node, for which a minimum age or a maximum age or both are given, or as a usual node, for which no age constraint is set. The age of a non fixed age node is estimated by the weighted relative MPL of the node of interest, $x$, and the MPLs and ages of the closest fixed age node located closer to the root and of adjacent fixed age nodes. The weights are defined by the size of the subtree that has its root in the fixed age node. The algorithm checks the estimates of the constrained nodes and adjusts them to the given minimum age or maximum age if the original estimate is too small or too large respectively. The corrected nodes are then considered as fixed age nodes and the estimation of the ages of the non fixed age nodes is redone. In this way the algorithm corrects for deviations from the molecular clock and smooths the substitution rates for sister groups. One of its advantages is that it is very fast, even for very large trees.

The method of MPL implicitly assumes a global molecular clock. The PATHd8 algorithm divides the tree into segments, defined by the fixed age nodes. The implicit assumption of PATHd8 is a local molecular clock model, where a fixed substitution rate is assumed within a segment but the rates can differ between segments.

Figure 4.6 describes the evolutionary history of 14 present species. The ages $\{a_0, a_1, a_4 = a_3, a_8 = a_5\}$ of the root $\alpha_0$ and the nodes $\alpha_1$, $\alpha_4$ and $\alpha_8$ are known; all other divergence times $a_i$ remain to be estimated.

[Figure 4.6: The true time tree, where the ages of the nodes $\alpha_0$, $\alpha_1$, $\alpha_4$ and $\alpha_8$ (fixed age nodes) are assumed to be known. The segment rates shown in the figure are $r_0 = 0.03$, $r_1 = 0.02$, $r_2 = 0.01$ and $r_3 = 0.04$. The aim is to estimate the divergence times of the other nodes, indicated here by the node of interest $\alpha_5$.]
In PATHd8 the fixed age nodes $\{\alpha_0, \alpha_1, \alpha_8, \alpha_4\}$ are the roots of the segments $S_0 = \{\alpha_0, \alpha_2, \alpha_5, \alpha_{11}, \alpha_{12}, t_1, t_2, t_4, t_5, t_{11}, t_{12}, t_{23}, \dots, t_{26}\}$, $S_1 = \{\alpha_1, \alpha_3, \alpha_6, \alpha_7, t_3, t_6, t_7, t_8, t_{13}, \dots, t_{16}\}$, $S_2 = \{\alpha_8, t_{17}, t_{18}\}$ and $S_3 = \{\alpha_4, \alpha_9, \alpha_{10}, t_9, t_{10}, t_{19}, \dots, t_{22}\}$ respectively. The local clock assumption takes the rates $\{r_0, r_1, r_2, r_3\}$ to be fixed, but unknown, within the segments. Let $y_i$ denote the observed number of substitutions along a branch with elapsed time $t_i$. The MPLs, $p_i$, of the root $\alpha_0$ and the node $\alpha_5$ will be

$$p_0 = \frac{1}{14n}\Bigl(6y_1 + 8y_2 + 4(y_3 + y_4 + y_5) + 2(y_6 + \dots + y_{12}) + y_{13} + \dots + y_{26}\Bigr) \xrightarrow[n \to \infty]{} \frac{1}{14}\Bigl(r_0\bigl(4(2a_0 - a_3) + 6(a_0 - a_1)\bigr) + r_1\bigl(2(3a_1 - a_5)\bigr) + 2r_2 a_5 + 4r_3 a_3\Bigr),$$

$$p_5 = \frac{1}{4n}\Bigl(2(y_{11} + y_{12}) + y_{23} + \dots + y_{26}\Bigr) \xrightarrow[n \to \infty]{} r_0 a_3.$$

If a global molecular clock is valid, $r_0 = r_1 = r_2 = r_3$, then $p_0 \to \frac{r_0}{14}(14a_0 - 4a_3 - 6a_1 + 6a_1 - 2a_5 + 2a_5 + 4a_3) = r_0 a_0$. The estimate of the divergence time of $\alpha_5$ is $\hat a_5 = (p_5 / p_0)\, a_0$, and with a global clock

$$\frac{p_5}{p_0}\, a_0 \xrightarrow[n \to \infty]{} \frac{r_0 a_3}{r_0 a_0}\, a_0 = a_3,$$

that is, $\hat a_5$ is a consistent estimate.

Now, if the global clock assumption is violated but the local clock is valid, so that at least one $r_i$ differs from the others, the MPL of the root, $p_0$, will not be consistent. The MPL of the root depends on all observations, those in the same segment as the root as well as those in other segments. The size and direction of the bias in $p_0$ depend on the sizes and directions of the differences in rates. Since all observations contributing to the MPL $p_5$ of $\alpha_5$ are within the same segment, $p_5$ will still be consistent. The estimate of the age of the node, $\hat a_5$, will hence not be unbiased.

To avoid this inconsistency we suggest a change of the algorithm: to move the weighted averaging from the age estimating part to the calculation of the MPLs. We suggest that an adjusted MPL (aMPL) is used instead, thereby only considering the observations in the same segment as the node of interest. When calculating the adjusted MPL of a node $\alpha_i$, the observations along paths within the segment are weighted according to the size of the subtrees descending from the fixed age nodes that are located further down (closer to the terminal taxa). A path from $\alpha_i$ that ends in another fixed age node is, with this weighting, blown up to about what it would have been if the path had ended in a leaf, that is, if the local rate had been a global one.
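The bias of the plain MPL-based estimate $\hat a_5 = (p_5 / p_0)\, a_0$ under a local clock, discussed above, can be illustrated with the large-sample limits of the MPLs. The sketch below rests on several assumptions that are not stated in this summary: the topology is reconstructed from the segment lists and the weights in the expression for $p_0$, the rates and the ages $a_0 = 20$ and $a_1 = 15$ are read from the remains of Figure 4.6, and the ages $a_3 = 10$ and $a_5 = 5$ are assumed values. All names in the code are hypothetical.

```python
# Assumed topology: CHILDREN lists the internal children of each internal node alpha_i;
# nodes that are not listed have two terminal taxa as children.
CHILDREN = {0: [1, 2], 1: [3, 8], 2: [4, 5], 3: [6, 7], 4: [9, 10], 5: [11, 12]}

A0, A1, A3, A5 = 20.0, 15.0, 10.0, 5.0  # A3 and A5 are assumed values
NODE_AGE = {0: A0, 1: A1, 2: A1, 3: A3, 4: A3, 5: A3, **{i: A5 for i in range(6, 13)}}

# Segment of each internal node; a branch evolves at the rate of its parent's segment.
SEGMENT = {0: 0, 2: 0, 5: 0, 11: 0, 12: 0,  # S_0
           1: 1, 3: 1, 6: 1, 7: 1,          # S_1
           8: 2,                            # S_2
           4: 3, 9: 3, 10: 3}               # S_3

def expected_mpl(node, rates):
    """Return (number of taxa below node, large-sample limit of its MPL per site)."""
    rate = rates[SEGMENT[node]]
    if node not in CHILDREN:  # both children are terminal taxa
        return 2, rate * NODE_AGE[node]
    taxa, weighted = 0, 0.0
    for child in CHILDREN[node]:
        m, p = expected_mpl(child, rates)
        taxa += m
        weighted += m * (p + rate * (NODE_AGE[node] - NODE_AGE[child]))
    return taxa, weighted / taxa

def age_limit(node, rates):
    """Large-sample limit of the estimate (p_node / p_0) * a_0."""
    _, p_node = expected_mpl(node, rates)
    _, p_root = expected_mpl(0, rates)
    return p_node / p_root * NODE_AGE[0]

global_clock = [0.03, 0.03, 0.03, 0.03]
local_clock = [0.03, 0.02, 0.01, 0.04]  # r_0, ..., r_3 as in Figure 4.6
print(age_limit(5, global_clock), age_limit(5, local_clock))
```

Under the global clock the limit equals the assumed true age of $\alpha_5$, while under the local clock the limit of $p_0$ agrees with the closed form given above and the age estimate converges to a different value, which is the inconsistency that the adjusted MPL is meant to remove.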
The aMPL is consistent and unbiased, but the uncertainty is larger than for MPL since fewer observations are used. Hence the age estimate, using aMPL instead of

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Phylogenetic trees 07/10/13

Phylogenetic trees 07/10/13 Phylogenetic trees 07/10/13 A tree is the only figure to occur in On the Origin of Species by Charles Darwin. It is a graphical representation of the evolutionary relationships among entities that share

More information

Bayesian support is larger than bootstrap support in phylogenetic inference: a mathematical argument

Bayesian support is larger than bootstrap support in phylogenetic inference: a mathematical argument Bayesian support is larger than bootstrap support in phylogenetic inference: a mathematical argument Tom Britton Bodil Svennblad Per Erixon Bengt Oxelman June 20, 2007 Abstract In phylogenetic inference

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Mutation models I: basic nucleotide sequence mutation models

Mutation models I: basic nucleotide sequence mutation models Mutation models I: basic nucleotide sequence mutation models Peter Beerli September 3, 009 Mutations are irreversible changes in the DNA. This changes may be introduced by chance, by chemical agents, or

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed 5:T C by Fixation GAAATT 4:A C 1:G A AAAATT GAAATT GAGCTC ACGACC Chimp Human Gorilla Gibbon AAAATT GAAATT GAGCTC ACGACC

More information

Week 5: Distance methods, DNA and protein models

Week 5: Distance methods, DNA and protein models Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 Week 5: Distance methods, DNA and protein models p.1/69 A tree and the expected distances it predicts E A 0.08 0.05 0.06 0.03

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft] Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Molecular Evolution and Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction 1 4 Molecular Evolution and Phylogenetic Tree Reconstruction 3 2 5 1 4 2 3 5 Orthology, Paralogy, Inparalogs, Outparalogs Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics

Fundamental Differences Between the Methods of Maximum Likelihood and Maximum Posterior Probability in Phylogenetics Syst. Biol. 55(1):116 121, 2006 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150500481648 Fundamental Differences Between the Methods of Maximum

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)
