Evolutionary Genomics: Phylogenetics

Size: px
Start display at page:

Download "Evolutionary Genomics: Phylogenetics"

Transcription

1 Evolutionary Genomics: Phylogenetics Nicolas Salamin Department of Computational Biology, University of Lausanne Swiss Institute of Bioinformatics March 16th, 2017

2 Tree of Life and Evolution Phylogeny is the evolutionary history of living and extinct organisms. Darwin, 1857 Some key dates: 1964 Cavalli-Sforza and Edwards introduced parsimony and likelihood 1966 Hennig and the theory of cladistics 1977 Fitch used parsimony on DNA sequences 1978 Felsenstein worked out maximum likelihood method 1996 Rannala and Yang proposed Bayesian inference since 2000 Computational phylogenetics Haekel, 1866

3 New uses for phylogenetics Beside their use in systematics, trees are tools to study macroevolutionary processes detect evolution of genes and their function assess conservation issues drug design epidemiology evolution of languages

4 Macroevolution Serrano-Serrano et al. (2017) Proc. R. Soc. B

5 Genes evolution Christin et al. (2007) Cur. Biol.

6 Conservation issues Extinction risks between plant lineages Davies et al. (2011) PLoS Biol A=Cypereae, B=Disa, C=Indiogofera, D=Lachnaea, E=Muraltia, F=Podalyrieae, H=Restionaceae, I=Zygophyllum, J=Protea, K=Moraea

7 Drug design Cardoso et al. (2009) J. Clin. Virol.

8 Epidemiology H5N1 influenza epidemics Lemey et al. (2009) PLoS Comput. Biol.

9 Phylogenetic trees: definitions.

10 Phylogeny and phylogenetic tree The true and unobservable evolutionary history of a group of organisms is called a phylogeny. A phylogenetic tree is an estimate of the true unobservable evolutionary history derived from morphology, molecular data, etc. A cladogram or topology is the hierarchical structure of a phylogenetic tree, while a phylogram is a cladogram with explicit branch lengths and a chronogram has branch length as unit of time.

11 Hard and soft polytomies A clade on a phylogenetic tree is unresolved if there is a node with three or more direct descendants. from biology.fullerton.edu This creates polytomies, which are called soft if it s due to lack of data or a conflict between two plausible trees hard if they represent real two or more consecutive events of speciation

12 Orthologs, paralogs, xenologs Two homologous genes are ortholog if derived from a gene present in their common ancestor paralog if resulted from the duplication of the same ancestral gene xenolog if arose by lateral gene transfer from When inferring species phylogenetic trees, it is essential to have orthologous DNA sequences!

13 Branches of a tree A branch can have a length that represents the evolutionary time elapsed between its endpoints. If the edge set of a tree have no lengths assigned, the shape of the tree is called topology. Suppose that d xy is the length of the path separating x and y. Suppose that all leaves x i of a tree T are equally distant from the root r, that is, there exist a constant c such that d xi r = c. Then T is called ultrametric

14 The root of a tree In evolutionary studies, phylogenetic trees are drawn as branching trees deriving from a single ancestral species. This species is know as the root of the tree. A rooted tree is a tree to which a special internal node r is added with degrees = 2. Labeled histories: rooted trees with interior nodes ordered according to their age.

15 Consensus tree A tree that summarized the common features of a set of trees is called a consensus tree. strict consensus: shows only the groups that are shared among all trees. semi-strict consensus: shows only the resolved groups that are shared among all trees. n% majority-rule consensus: shows the groups that are shared by n% of the trees.

16 How do we measure evolution?

17 Type of models Depending on the question, several type of models of molecular evolution: 1. infinite allele model each mutation creates a new allele and allele types are all equally different from each other 2. infinite site model each mutation happens at a site that has not previously experienced mutation since common ancestor of sample 3. finite site model allow the possibility that each sequence site has experienced multiple mutations

18 Infinite allele model A B C D E Ancestral allele of type 0 A has the ancestral type 0 (C,D,E) have allele type 1 B has allele type 3 allele type 2 not observed Diversity typically measured as θ = 4Nµ where N = population size (diploid) µ = rate of mutation of new alleles Typically used with microsats, SSRs, AFLP.

19 Infinite site model A B C D E Assuming an ancestral state at the root A has the same state as the root (C,D,E) are identical and differ from A and Root by 1 site B differs from A and Root by two sites B differs from C, D and E by 3 sites Expected number of sites between 2 random sequences is θ = 4Nµ Root Average number of sites at which a sequence pair differs denoted as ˆπ. Tajima s D is known as an estimate of θ. Typically used with SNPs and sequence data representing haplotypes.

20 Finite site model AGTA AGTA AGTC AGTC AGTC 2: A to G 2: G to A Given an ancestral sequence at the root A has the same sequence as the root (C,D,E) are identical and differ from A and Root by 1 mutation B is identical from A and Root despite two mutations Each sequence site evolves on the same gene tree as each other but mutations are assumed to occur independently. AGTA 4: A to C Most often used for nucleotide substitution process (mutation plus fixation) rather than mutation process alone. Basic model to analyse sequence data.

21 Finite site model and evolutionary trees.

22 Let s assume that you have 5 sequences: Sequence data: a small example one two three four five ACC ACG TCT TCA CTC How would you classify these sequences based on share ancestry??????

23 Let s assume that you have 5 sequences: Sequence data: a small example one two three four five ACC ACG TCT TCA CTC How would you classify these sequences based on share ancestry? one two four three five

24 Let s assume that you have 5 sequences: A small example, extended one two three four five ACCACCATGGTACCAGTGCG ACGAAATACGTACCAACGAG TCTAACTACGGTACTACGCA TCAACCATCGGATACACGAA CTCACTACCTTACTTACGCT How would you classify these sequences based on share ancestry??????

25 Evolutionary trees Finding the evolutionary trees is essential to answer most of the biological questions of interest. There are four major methods for this. Distance Methods Evolutionary distances are computed for all OTUs and these are used to construct trees. Maximum Parsimony Trees are chosen to minimize the number of changes required to explain the data. Maximum Likelihood Tree which gives the highest likelihood under the given model of sequence evolution. Bayesian Methods Tree with the highest probability to give rise to the sequence data observed. List of most tree reconstruction packages available at

26 Differences between methods The end result of the four methods is an evolutionary tree, but they differ in the steps to reach it. Distance methods are clustering methods and find quickly a tree. You use bootstrapping to assess the reliability of the tree found in a second set of analyses Maximum parsimony and likelihood are optimality criteria. You need algorithm to sample the trees and select the best one based on these criteria. You use bootstrapping to assess the reliability of the tree found in a second set of analyses. Bayesian methods use likelihood criteria to sample trees and assess their reliability all in once.

27 Modeling evolution through time and lineages?

28 Likelihood and models Maximum likelihood relies on explicit probabilistic models of evolution. But, the process of evolution is so complex and multifaceted that basic models involve assumption built upon assumption. This reliance is often seen as a weakness of the likelihood framework, but the need to make explicit assumptions is a strength enables both inferences about evolutionary history and assessments of the accuracy of the assumptions made this leads to a better understanding of evolution The purpose of models is not to fit the data, but to sharpen the questions (S. Karlin)

29 Description of Maximum Likelihood Given an hypothesis H and some data D, the likelihood of H is L(H) = Prob(D H) = Prob(D 1 H)Prob(D 2 H) Prob(D n H) if the D can be split in n independent parts. Note that L(H) is not the probability of the hypothesis, but the probability of the data, given the hypothesis. Maximum likelihood properties (Fisher, 1922) consistency converge to correct value of the parameter efficiency has the smallest possible variance around true parameter value

30 Measuring evolution The problem that we are dealing with here is how to calculate the probability of evolution in sequences (whether DNA, AA at the inter- or intra-specific level).

31 What do we try to model From the alignment, it is easy to come up with a distance between species ˆp = n d /n, which is the proportion of different nucleotides. However, if p is small, ˆp is a good approximation of the number of substitutions per site. if p is large, ˆp will underestimate p because of multiple substitutions. distance methods could be used to do that, but they are approximations. We thus need a better way to model how substitutions/mutations are appearing during evolution. Markov processes can help us a lot in this.

32 Derivation of the process The simplest model assumes the following: 1 3α 1 3α α A C α α α G T α 1 3α 1 3α Each nucleotide has an equal chance of changing and when it changes, it changes to one of the three other nucleotide with equal probability (α). The parameter α is the probability of a substitution (i.e. mutation + fixation). This will create a Markov chain explaining the evolution of the sequence.

33 Probabilities of the Markov chain Imagine that ϕ is the stationary distribution of the Markov chain. The probability of observing A at site i after one time step is ϕ t+1 A = (1 3α)ϕ t A + αϕ t C + αϕ t G + αϕ t T = (1 3α)ϕ t A + α [ 1 ϕ t ] A The model is symmetric and it applies equally well to C, G or T, thus ϕ t+1 y = (1 4α)ϕ t y + α

34 Differential equation We can substract ϕ t y from both sides ϕ t+1 y ϕ t y = α ( 1 4ϕ t ) y and apply a continuous time approximation to get a differential equation dϕ t y dt = α ( 1 4ϕ t ) y with solution ϕ t y = 1 ( 4 + ϕ 0 y 1 ) e 4αt 4

35 Probabilities of [no] change We can estimate the probability of observing a nucleotide A at site i after an arbitrarily long time t. We have however two cases 1 the ancestral nucleotide was also A (ϕ 0 x=y = 1): p A A (t) = e 4αt 2 the ancestral nucleotide was, say, a C (ϕ 0 x y = 0): p C A (t) = e 4αt but 3 ways descendant with A could differ from the ancestor p [CGT ] A (t) = e 4αt

36 The process in R pxx <- function (r=1/3,t=0.1) { return ( * exp ( -4*r*t )); } pyx <- function (r=1/3,t=0.1) { return ( * exp ( -4*r*t )); } time <- seq (0,1.0,0.01) plot ( time, pyx (r=1/3,t=time ), type="l", xlab=" Time ", ylab=" Differences per site ", ylim=c (0,0.75)) abline (a=0,b=1, lty =2) lines ( time, pyx (r=1/2,t=time ), type="l", col=" red ") lines ( time, pyx (r=1/5,t=time ), type="l", col=" blue ") How do you explain the plot? What is the effect of r and t?

37 Generalizing the model of evolution The initial condition P(0) of the probability matrix P gives the probability of change for a time of 0. It is obviously lim P(t) = I t 0 For continuous Markov chains, it exists an infinitesimal or instantaneous rate matrix that fully describes the process of the chain P(t) P(0) lim = Q t 0 t For a tiny time h, the transition probabilities are approximated by (as shown before) P(h) I + Qh

38 Instantaneous matrix For DNA, Q is with r xy = µ x y π y and r AC r AG r AT Q = r CA r CG r CT r GA r GC r GT r TA r TC r TG from theory of finite Markov process, q ii = 4 j=0,j i q ij, thus 4 q ij = 0. j=0

39 Differential equation We want the probability P(t) of substitutions over a time t. Let s estimate first P(t + h), with h being an tiny time interval. The definition of Markov chains tells us that P(t + h) = P(t)P(h). Replacing P(h) by its approximation gives P(t + h) = P(t)(I + Qh) = P(t) + P(t)Qh Rearranging, we get P(t + h) P(t) h = P(t)Q Taking the limit as h 0 gives P = PQ, with intial condition P(0) = I, which leads to P(t) = e Qt

40 Rate of substitutions As Q and t occur only in the form of a product in P(t) = e Qt, it is conventional to scale Q so that the average rate is 1. We do this by normalizing Q by the total amount of change involved in the Markov chain: 4 4 π i q ij i=0 j=0,j i The branch length is therefore measured in expected substitutions per site. A long branch can therefore either by due to long evolutionary time a rapid rate of substitution a combination of both

41 Matrix representation in R JC <- function () { freq <- matrix (0, nrow =4, ncol =4); diag ( freq ) <- c (.25,.25,.25,.25); R <- matrix (c(0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0), nrow =4); Q <- (R %*% freq ) - diag ( apply (R %*% freq, 1, sum )) scaleq <- sum ( apply ( freq %*% R %*% freq, 1, sum )) return (Q * scaleq ); } library ( Matrix ) calcp <- function (R, t=0.1) { return ( sum ( expm (R * t)[1, -1]));} time <- seq (0, 1, 0.01) plot ( time, pyx (a=1, t=time ), type="l") lines ( time, sapply ( time / , calcp, JC ()), col=" red ", lty =2) Is it the same as before? Where is the coming from? Try with more complex models (see below).

42 Jukes-Cantor, 1969 with π A = π C = π G = π T = π A Q = π C π G π T Transition probabilities are P(t) = e Qt p ij (t) = { 1/4 + 3 exp( 4µt)/4, if i = j, 3/4 3 exp( 4µt)/4, if i j,

43 Kimura 2-parameters, κ 1 π A Q = 1 1 κ κ 1 1 π C π G 1 κ 1 π T Here, κ = α/β, where α is the transition rate β is the transversion rate. and π A = π C = π G = π T = It simplifies to Jukes-Cantor, 1969 if α = β.

44 Matrix representation in R K2P <- function (k) { freq <- matrix (0, nrow =4, ncol =4); diag ( freq ) <- c (.25,.25,.25,.25); R <- matrix (c (0,1,k,1,1,0,1,k,k,1,0,1,1,k,1,0), nrow =4); Q <- (R %*% freq ) - diag ( apply (R %*% freq, 1, sum )) scaleq <- sum ( apply ( freq %*% R %*% freq, 1, sum )) return (Q * scaleq ); } library ( Matrix ) calcp <- function (R, t=0.1) { return ( sum ( expm (R * t)[1, -1]));} time <- seq (0, 1, 0.01) plot ( time, sapply ( time, calcp, JC ()), type="l") lines (time, sapply (time, calcp, K2P (2), col =" red ", lty =2) What is the difference between JC69 and K2P?

45 Felsenstein, 1981 π C π G π T Q = π A π G π T π A π C π T π A π C π G where π i is the equilibrium frequency of nucleotide i. It simplifies to Jukes-Cantor, 1969 if π A = π C = π G = π T.

46 Matrix representation in R F81 <- function (f) { freq <- matrix (0, nrow =4, ncol =4); diag ( freq ) <- f; R <- matrix (c(0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,0), nrow =4); Q <- (R %*% freq ) - diag ( apply (R %*% freq, 1, sum )) scaleq <- sum ( apply ( freq %*% R %*% freq, 1, sum )) return (Q * scaleq ); } library ( Matrix ) calcp <- function (R, t=0.1) { return ( sum ( expm (R * t)[1, -1]));} time <- seq (0, 1, 0.01) plot ( time, sapply ( time, calcp, JC ()), type="l") lines (time, sapply (time, calcp, K2P (2), col =" blue ", lty =2) lines ( time, sapply ( time, calcp, F81 (c (0.23,0.27,0.28,0.22)), col=" red ", lty =2) What is the difference between JC69 and F81?

47 Hasegawa-Kishino-Yano, 1985 π C κπ G π T Q = π A π G κπ T κπ A π C π T π A κπ C π G Here, π i is the equilibrium frequency of nucleotide i and κ = α/β, where α is the transition rate β is the transversion rate. It simplifies to Kimura, 1980 if π A = π C = π G = π T. Felsenstein, 1981 if κ = 1.

48 Tamura-Nei, 1993 π C κ 2 π G π T Q = π A π G κ 1 π T κ 2 π A π C π T π A κ 1 π C π G Here, π i is the equilibrium frequency of nucleotide i, κ 1 = α Y /β and κ 2 = α R /β where α R is purine transition rate (A+G). α Y is pyrimidine transition rate (C+T). β is the transversion rate. It simplifies to Hasegawa, Kishino, Yano, 1985 if α R /α Y = π R /π Y. Felsenstein, 1984 if α R = α Y.

49 General-time reversible απ C βπ G γπ T Q = απ A δπ G ɛπ T βπ A δπ C ηπ T γπ A ɛπ C ηπ G where α η are rates of changes from one nucleotide to another π i are frequencies of nucleotides It simplifies to all other models if parameters are correctly constrained

50 How to model rate variation among sites? Variation of substitution rates The idea is to use a probability distribution to model changes in rates of substitution among sites, e.g. Gamma distribution mean of distribution is αβ, and variance αβ 2 set the mean rate of substitution to 1, so assume β = 1/α α parameter allows to change characteristics of distribution Felsenstein, 2004

51 Number of site classes Yang, 1994

52 How to calculate the likelihood of a tree?

53 Prerequisite Suppose we have a data set D of DNA sequences with m sites. We are given a topology T with branch lengths and a model of evolution, Q that allow us to compute P ij (t). Assumptions made to compute likelihood L(T, Q) = Prob(D T, Q) evolution in different sites is independent evolution in different lineages is independent the rate of evolution is the same between site and along lineages

54 Independence of sites Felsenstein, 2004 The assumption of independence of sites along a sequence allow us to decompose the likelihood in a product of likelihood for each site L(T, Q) = Prob(D T, Q) = The likelihood for this tree for site i is Prob(D i T, Q) = ACGT x ACGT y m Prob(D i T, Q) i=1 ACGT z ACGT Prob(A, C, C, C, G, x, y, z, w T, Q) w

55 Independence of lineages Felsenstein, 2004 With the independence of lineages assumptions, we can decompose the right hand side of the equation a bit further. Prob(A, C, C, C, G, x, y, z, w T, Q) = Prob(x)P xy (t 6 )P ya (t 1 )P yc (t 2 ) P xz (t 8 )P zc (t 3 ) P zw (t 7 )P wc (t 4 )P wg (t 3 ) ACGT x ACGT y ACGT z ACGT w

56 Pruning algorithm Goal: render the likelihood computation practicable using dynamic programming Idea: move summation signs as far right as possible and enclose them in parentheses where possible Prob(A, C, C, C, G, x, y, z, w T ) = ACGT x Prob(x) (ACGT (ACGT z (ACGT w y P xz (t 8 )P zc (t 3 ) P xy (t 6 )P ya (t 1 )P yc (t 2 ) ) P zw (t 7 )P wc (t 4 )P wg (t 5 ) )) The parentheses pattern and terms for tips has an exact correspondence to the structure of the tree.

57 Numerical integration Up to now we assumed that both branch lengths and model parameters were fixed at a given value. Of course, we don t know these values so to get these values, start from a random guess make small changes and compare the improvement in likelihood do so until the likelihood do not change any more every time the topology is changed, reestimate these parameters Drawback: very very very computationally intensive, but there are solutions to that.

58 Consistency and overparameterisation Maximum likelihood can be proved to be consistent (see Felsenstein, 2004 pp. 271) true if we use the correct model of evolution but what happen if model in not correct? No guarantee can be given less problematic if what we want to infer is just the topology Beware of overparameterisation problems adding more and more parameters to the model will result in a better fit to the data but this will lead to inconsistency a few parameters are more important than others, in particular the gamma distribution

59 How to choose the best tree?

60 Number of rooted trees?

61 More trees than electrons in Universe Branching process n-th taxa can be added to 2n 3 places (2n 3) (2n 3)! 2 n 1 (n 1)! with 50 taxa, approaching Eddington s number of electons in the visible universe

62 Exhaustive search To find the best tree, enumerate all the possible trees from the tree space: But you can quickly see that it is impossible for trees larger than 10 taxa... We need to take shortcuts

63 Swapping algorithms: shortcuts Algorithms to speed up even more the search through the tree space take an initial tree T i make small rearrangements (i.e. swap branches) to reach neighboring trees if you find a better, keep it and start again strategy will find local optimum in the tree space They are called greedy algorithm because they sieze the first improvement they see

64 Greedy algorithms The problem with greedy algorithms: you can miss the global optimum! but we don t have any other choice...

65 Nearest-Neighbour Interchange

66 Subtree Pruning and Regrafting

67 Tree Bisection and Reconnection

68 Swapping is not all Number of neighbor searched NNI 2(n 3) SPR 4(n 3)(n 2), although some may be the same neighour TBR (2n 1 3)(2n 2 3) possible ways to reconnect the two trees Previous methods just make rearrangements to a particular tree by swapping branches, which is not enough to fully cover the tree space. Additional techniques have to be used stepwise taxon addition: start several heuristic searches by starting from a different initial tree each time allow suboptimal trees to be swapped on in order to cross valleys in the tree space

69 Other heuristic searches You can invent all kinds of heuristic algorithm, and many have been proposed tree-fusing: exchange subtrees from two equally optimal trees genetic algorithms: let populations of trees evolve by setting fitness functions and mutation, migration, etc... tree windowing: do extensive rearrangement but on a local region of the tree search by reweighting: change the landscape of the tree space by reweighting characters simulated annealing: iteratively change the tree scoring function to lower the difference between optimal and suboptimal trees

70 How to know if the tree is correct?

71 Bootstrap on phylogenetic tree Allow us to infer the variability of parameters in models that are too complex for easy calculation of their variance. This is the case of topologies! Procedure sample whole columns of data with replacement recreate n pseudomatrices with the same number of species and sites than original one build n phylogenetic trees from these n pseudoreplicates weigh each tree in replicate i by the number of trees obtained

72 Summarizing results The end result of a bootstrap is a cloud of trees. What is the best way to summarize this given that trees have discrete topologies and continuous branch lengths? We could make a histogram of the length of a particular branches it will give a lower limit on the branch length then check if 0 is in the 95% interval, we would assert the existence of the branch A simpler solution is to count how many times a particular branch appears in the list of trees estimated by bootstrap. A majority-rule consensus tree containing clades appearing in more than 50% of them can then be built

73 Simple example

74 Multiple tests We don t (or rarely) know in advance which group interest us. If we look for the most supported group on the tree and report its p-value, we have a multiple-tests problem if no significant evidence for existence of any groups on a tree 5% of branches are expected to be above 0.95 so one out of every 20 branches of a tree would be significant furthermore, branches are not independent... The p-value cannot be interpreted as statistical test.

75 Biases in bootstrap Estimated p-value is conservative: Hillis and Bull (1993) One source of conservatism with bootstrap, we make statement about branch lengths µ then we reduce that to statements about tree topology, i.e. µ > 0 or µ < 0 generalisation of Hillis and Bull (1993) results is not clear

76 How to test models in phylogenetics?

77 Uncertainty assessment Likelihood does not only allow to make point estimate of the topology and branch length, it also gives information about the uncertainty of our estimate. It is possible to use the likelihood curve to test hypothesis and to make interval estimates. Asymptotically (i.e. when the number of data point tend towards ), the ML estimate ˆθ is normally distributed around its true value θ 0.

78 Likelihood ratio test At the maximum ˆθ S(ˆθ) = lnl(ˆθ) θ = 0 I(ˆθ) = 2 lnl(ˆθ) θ 2 < 0 Then the log likelihood of the true parameter value can be linked with the maximum value estimated by a Taylor series: lnl(θ 0 ) = lnl(ˆθ) + S(ˆθ)(θ 0 ˆθ) 1 2 I(ˆθ)(θ 0 ˆθ) 2 Subtracting lnl(ˆθ) from both sides, setting I(ˆθ) = 1/σ 2 and rearranging leads to 2 [ lnl(θ 0 ) lnl(ˆθ) ] (θ 0 ˆθ) 2 σ 2 = χ 2 1 For p parameters, twice the difference in likelihood is χ 2 p distributed.

79 Nested models Assumptions of the likelihood ratio test null hypothesis should be in the interior space that contains the alternative hypotheses if q parameters have been constrained, they must be able to vary in both sense if L 0 restricts one parameter to the end of its range, distribution of twice log likelihood ratio has half its mass at 0 and the other half in the usual χ 2 distribution halve the tail probability obtained with usual χ 2 Valid only asymptotically should be close enough to the true values true with very large amounts of data

80 Akaike Information Criterion More general model will always have higher likelihood than restricted models. So, choosing model with highest likelihood will lead to one that is unnecessarily complex. We should therefore compromise goodness of fit with complexity of model. AIC for hypothesis i with p i parameters: AIC i = 2lnL i + 2p i Hypothesis with the lowest AIC is preferred.

81 Adding more complexity All the different models seen so far are special cases of the GTR+Γ +I model setting Γ and I to 0 leads to GTR setting transversion rates to β and transition rates to α leads to F84 setting β = α leads to K2P setting all nucleotide frequencies to 1 4 lead to JC69

82 Examples Possible comparisons GTR+Γ vs GTR 2 [lnl GTR+Γ lnl GTR ] difference in df = 1 GTR+Γ vs F84+Γ 2 [lnl GTR+Γ lnl F84+Γ ] difference in df = 4 F84+Γ vs JC69 2 [lnl F84+Γ lnl J69 ] difference in df = 5 But not JC69+Γ vs GTR because they are not nested!

83 Are two topologies different? LRT or AIC not possible on topologies because they are discrete entities, and not quantitative parameters. We therefore have to rely on paired-sites tests expected lnl is average lnl per site as number of sites grows without limit if sites are independent and two trees have equal expected lnl, then differences in lnl at each site is drawn independently with expectation zero statistical test of mean of these differences is zero valid for likelihood and parsimony

84 Example 232-sites mtdna for 7 mammals Felsenstein, 2004

85 Different form of tests possible Possible tests z assumes the differences at each site are normally distributed and estimate the variance of differences of the scores KH use bootstrap sampling to infer distribution of sum of differences of scores and see whether 0 lay in the tails of distribution Results z sum of lnl = 3.18, σ 2 = 0.04, so variance of sum of differences is 11.31; z statistic is 3.18/3.36 = 0.94 with p-value of 0.34 KH 10,000 bootstrap samples of sites; 8,326 favored tree II, which yields a one-tailed probability of 0.16, and a 0.33 two-tailed probability

86 Multiple tests, again If we want to test more than two trees compare each tree to best tree accept all trees that cannot be rejected by KH test multiple tests setting, but no reduction of nominal rejection level possible need to correct for all different ways the data can vary, ways that support different trees When two trees are compared, but one of them is the actual best tree, we should do a one-tailed test.

87 SH test Resampling technique that approximately corrects for testing multiple trees 1 make R bootstrap of the N sites 2 for each tree, normalize resampled lnl so they have same expectation 3 for jth bootstrap, calculate S ij for ith tree how far normalized value is below maximum across all trees for that replicate 4 for each tree i, the tail probability is proportion of bootstrap replicates in which S ij is less than the actual difference between ML and lnl of that tree Resampling build a least-favorable case in which the trees show some patterns of covariation of site as in actual data but do not differ in overall lnl. One limitation: assume that all proposed trees are possibly equal in likelihood

88 Parametric bootstrap Use computer simulations to create pseudorandom data sets Advantages and disadvantages can sample from the desired distribution, even with small data sets flexible hypothesis testing framework close reliance on the correctness of the model of evolution

89 Measuring, searching and assessing trees all in one!

90 Bayesian methods The likelihood calculation is used as well by Bayesian methods. However, another component is added to the method: the prior distributions. Before observing any data, each parameter will be assigned a prior distribution topologies branch lengths each parameter of the model of evolution The prior distributions are then combined with the likelihood of the data to give the posterior distribution. This is a highly attractive quantity because it computes what we most need: the probabilities of different hypotheses in the light of the data.

91 Bayes theorem To combine all this together, we use the Bayes theorem Prob(T D) = Prob(T D) Prob(D) where Prob(T D) = Prob(T )Prob(D T ) so that Prob(T D) = Prob(T )Prob(D T ) Prob(D)

92 Normalizing constant The denominator Prob(D) is the sum of the numerator Prob(T )Prob(D T ) over all possible hypotheses T. This quantity is needed to normalize the probabilities of all T so that they add up to 1. This leads to Prob(T D) = Prob(T )Prob(D T ) Prob(T )Prob(D T ) T In words: posterior probability = prior probability likelihood normalizing constant

93 Estimate normalizing constant Posterior distribution expression has a denominator, i.e. Prob(T )Prob(D T ), that is often impossible to compute. T Fortunately, samples from the posterior distribution can be drawn using a Markov chain that does not need to know the denominator draw a random sample from posterior distribution of trees becomes possible to make probability statements about true tree e.g. if 96% of the samples from posterior distribution have (human,chimp) as monophyletic group, probability of this group is 96%

94 Makov chain Monte Carlo Idea: to wander randomly through tree space by sampling trees until we settle down into an equilibrium distribution of trees that has the desired distribution, i.e. posterior distribution. Markov chain: the new proposed tree will depend only on the previous one to reach equilibrium distribution, the Markov chain must be aperiodic no cycles should be present in the Markov chain irreducible every trees must be accessible from any other tree probability of proposing T j when we are at T i is the same as probability of proposing T i when we are at T j the Markov chain has no end

95 MCMC in practice Metropolis algorithm start with a random tree T i select a new tree T j by modifying T i in some way compute R = Prob(T j D) Prob(T i D) The beauty of MCMC: the normalizing constant being the same, it simplifies R = Prob(T j)prob(d T j ) Prob(T i )Prob(D T i ) if R 1, accept T j if R < 1, draw a random number n between [0, 1] and accept T j if R > n, otherwise keep T i

96 Schematic view P. Lewis, 2006

97 Putting it all together Start with random tree and arbitrary initial values for branch lengths and model parameters Each generations, chose at random to propose either: a new tree new branch length new model parameter values Either accept or reject the move Every k generations, save tree topology, branch lengths and model parameters (i.e. sample the chain) After n generations, summarize the sample using histograms, means, credibility intervals, etc.

98 How to propose a new tree We could invent any type of proposal distribution to wander through the tree space e.g. NNI by selecting a node at random should be able to reach all trees from any starting tree at least after sufficient running, but impossible to know how much running is enough Should be careful because if trees proposed are too different these trees will be rejected too often if trees proposed are too similar tree space won t be sampled well enough

99 MCMCMC Metropolis-coupled Markov-chain Monte Carlo or MC 3 involves running several MCMC chains simultaneously the cold chain is the one that counts, i.e. trees are sampled from that one the rest are called heated chains and are there to help the convergence chain is heated by raising densities to a power less than 1.0

100 MCMCMC P. Lewis, 2006

101 MCMC trace plot P. Lewis, 2006

102 Slow mixing P. Lewis, 2006

103 Convergence in MCMC Greatest practical problem: how long to run a chain to obtain a good approximation of the posterior probabilities The most important is to check that independent runs lead to the same posterior distribution.

104 Type of prior distributions Prior distributions for topologies have been proposed stochastic process of random speciation and extinction uniform distribution of all possible rooted trees Prior distributions on branch lengths exponential distribution uniform distribution Prior on model parameters Dirichlet distribution on nucleotide frequency uniform or exponential distribution on shape of γ distribution uniform distribution on proportion of invariant sites

105 Problems with priors As shown before, we have to be careful with prior because they can exclude possible estimated values. Other problematic aspects: universality of priors use of uninformative flat priors issues of scale unbounded quantities

106 Jukes-Cantor model Sequence disimilarity and branch length under this model: D = 3 4 (1 e 4 3 t ) different scale unbounded priors

107 Effect of priors Yang and Rannala, 2006

108 What can we do with Bayesian Beside their use in building phylogenetic trees, Bayesian methods are useful to deal with complex biological problems testing hypotheses about rates of host switching and cospeciation in host/parasite systems dating phylogenetic trees using autocorrelated prior distribution on rates of evolution infer rate of change of states of a character and the bias in the rate of gain of the character infer accuracy of inference of ancestral states infer position of the root of the tree testing rates of speciation and assess key innovations associated with changes in rates

109 Today s practicals We will look at the maximum likelihood and Bayesian inference today using the following software: phyml MrBayes R (package: ape, phangorn) R, FigTree or SeaView (to visualize the trees) Tracer or R (to visualize MrBayes output) The questions and data can be found here. A description of the format used in phylogenetics is available here.

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018 Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018 Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree Nicolas Salamin Department of Ecology and Evolution University of Lausanne

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

Consensus Methods. * You are only responsible for the first two

Consensus Methods. * You are only responsible for the first two Consensus Trees * consensus trees reconcile clades from different trees * consensus is a conservative estimate of phylogeny that emphasizes points of agreement * philosophy: agreement among data sets is

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Lab 9: Maximum Likelihood and Modeltest

Lab 9: Maximum Likelihood and Modeltest Integrative Biology 200A University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS" Spring 2010 Updated by Nick Matzke Lab 9: Maximum Likelihood and Modeltest In this lab we re going to use PAUP*

More information

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057 Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2018 University of California, Berkeley Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley B.D. Mishler Feb. 14, 2018. Phylogenetic trees VI: Dating in the 21st century: clocks, & calibrations;

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Inferring Molecular Phylogeny

Inferring Molecular Phylogeny Dr. Walter Salzburger he tree of life, ustav Klimt (1907) Inferring Molecular Phylogeny Inferring Molecular Phylogeny 55 Maximum Parsimony (MP): objections long branches I!! B D long branch attraction

More information

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise

(Stevens 1991) 1. morphological characters should be assumed to be quantitative unless demonstrated otherwise Bot 421/521 PHYLOGENETIC ANALYSIS I. Origins A. Hennig 1950 (German edition) Phylogenetic Systematics 1966 B. Zimmerman (Germany, 1930 s) C. Wagner (Michigan, 1920-2000) II. Characters and character states

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200B Spring 2009 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2009 University of California, Berkeley B.D. Mishler Jan. 22, 2009. Trees I. Summary of previous lecture: Hennigian

More information

8/23/2014. Phylogeny and the Tree of Life

8/23/2014. Phylogeny and the Tree of Life Phylogeny and the Tree of Life Chapter 26 Objectives Explain the following characteristics of the Linnaean system of classification: a. binomial nomenclature b. hierarchical classification List the major

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004, Tracing the Evolution of Numerical Phylogenetics: History, Philosophy, and Significance Adam W. Ferguson Phylogenetic Systematics 26 January 2009 Inferring Phylogenies Historical endeavor Darwin- 1837

More information

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22 Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 24. Phylogeny methods, part 4 (Models of DNA and

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26 Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 4 (Models of DNA and

More information

Bayesian Models for Phylogenetic Trees

Bayesian Models for Phylogenetic Trees Bayesian Models for Phylogenetic Trees Clarence Leung* 1 1 McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada ABSTRACT Introduction: Inferring genetic ancestry of different species

More information

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies Week 8: Testing trees, ootstraps, jackknifes, gene frequencies Genome 570 ebruary, 2016 Week 8: Testing trees, ootstraps, jackknifes, gene frequencies p.1/69 density e log (density) Normal distribution:

More information

Lecture 6 Phylogenetic Inference

Lecture 6 Phylogenetic Inference Lecture 6 Phylogenetic Inference From Darwin s notebook in 1837 Charles Darwin Willi Hennig From The Origin in 1859 Cladistics Phylogenetic inference Willi Hennig, Cladistics 1. Clade, Monophyletic group,

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft] Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

Inferring Speciation Times under an Episodic Molecular Clock

Inferring Speciation Times under an Episodic Molecular Clock Syst. Biol. 56(3):453 466, 2007 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150701420643 Inferring Speciation Times under an Episodic Molecular

More information

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30 A non-phylogeny

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

7. Tests for selection

7. Tests for selection Sequence analysis and genomics 7. Tests for selection Dr. Katja Nowick Group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute for Brain Research www. nowicklab.info

More information

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE CELLULAR & MOLECULAR BIOLOGY LETTERS http://www.cmbl.org.pl Received: 16 August 2009 Volume 15 (2010) pp 311-341 Final form accepted: 01 March 2010 DOI: 10.2478/s11658-010-0010-8 Published online: 19 March

More information

C3020 Molecular Evolution. Exercises #3: Phylogenetics

C3020 Molecular Evolution. Exercises #3: Phylogenetics C3020 Molecular Evolution Exercises #3: Phylogenetics Consider the following sequences for five taxa 1-5 and the known outgroup O, which has the ancestral states (note that sequence 3 has changed from

More information

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed. http://www.bioinf.org/molsys/data/idiots.pdf

More information

Week 5: Distance methods, DNA and protein models

Week 5: Distance methods, DNA and protein models Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 Week 5: Distance methods, DNA and protein models p.1/69 A tree and the expected distances it predicts E A 0.08 0.05 0.06 0.03

More information

C.DARWIN ( )

C.DARWIN ( ) C.DARWIN (1809-1882) LAMARCK Each evolutionary lineage has evolved, transforming itself, from a ancestor appeared by spontaneous generation DARWIN All organisms are historically interconnected. Their relationships

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996

Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Bootstrap confidence levels for phylogenetic trees B. Efron, E. Halloran, and S. Holmes, 1996 Following Confidence limits on phylogenies: an approach using the bootstrap, J. Felsenstein, 1985 1 I. Short

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe? How should we go about modeling this? gorilla GAAGTCCTTGAGAAATAAACTGCACACACTGG orangutan GGACTCCTTGAGAAATAAACTGCACACACTGG Model parameters? Time Substitution rate Can we observe time or subst. rate? What

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles

Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles Supplemental Information Likelihood-based inference in isolation-by-distance models using the spatial distribution of low-frequency alleles John Novembre and Montgomery Slatkin Supplementary Methods To

More information

How to read and make phylogenetic trees Zuzana Starostová

How to read and make phylogenetic trees Zuzana Starostová How to read and make phylogenetic trees Zuzana Starostová How to make phylogenetic trees? Workflow: obtain DNA sequence quality check sequence alignment calculating genetic distances phylogeny estimation

More information

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogeny? - Systematics? The Phylogenetic Systematics (Phylogeny and Systematics) - Phylogenetic systematics? Connection between phylogeny and classification. - Phylogenetic systematics informs the

More information

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

More information

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Preliminaries. Download PAUP* from:   Tuesday, July 19, 16 Preliminaries Download PAUP* from: http://people.sc.fsu.edu/~dswofford/paup_test 1 A model of the Boston T System 1 Idea from Paul Lewis A simpler model? 2 Why do models matter? Model-based methods including

More information

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29): Statistical estimation of models of sequence evolution Phylogenetic inference using maximum likelihood:

More information

Molecular Evolution and Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction 1 4 Molecular Evolution and Phylogenetic Tree Reconstruction 3 2 5 1 4 2 3 5 Orthology, Paralogy, Inparalogs, Outparalogs Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length

More information

Phylogenetic methods in molecular systematics

Phylogenetic methods in molecular systematics Phylogenetic methods in molecular systematics Niklas Wahlberg Stockholm University Acknowledgement Many of the slides in this lecture series modified from slides by others www.dbbm.fiocruz.br/james/lectures.html

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

More information

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships Chapter 26: Phylogeny and the Tree of Life You Must Know The taxonomic categories and how they indicate relatedness. How systematics is used to develop phylogenetic trees. How to construct a phylogenetic

More information

Bayesian Phylogenetics

Bayesian Phylogenetics Bayesian Phylogenetics Paul O. Lewis Department of Ecology & Evolutionary Biology University of Connecticut Woods Hole Molecular Evolution Workshop, July 27, 2006 2006 Paul O. Lewis Bayesian Phylogenetics

More information

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed 5:T C by Fixation GAAATT 4:A C 1:G A AAAATT GAAATT GAGCTC ACGACC Chimp Human Gorilla Gibbon AAAATT GAAATT GAGCTC ACGACC

More information

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Why are these disciplines important in evolutionary biology and how are they related to each other? Phylogeny and systematics Phylogeny: the evolutionary history of a species

More information

Mutation models I: basic nucleotide sequence mutation models

Mutation models I: basic nucleotide sequence mutation models Mutation models I: basic nucleotide sequence mutation models Peter Beerli September 3, 009 Mutations are irreversible changes in the DNA. This changes may be introduced by chance, by chemical agents, or

More information

Finding the best tree by heuristic search

Finding the best tree by heuristic search Chapter 4 Finding the best tree by heuristic search If we cannot find the best trees by examining all possible trees, we could imagine searching in the space of possible trees. In this chapter we will

More information

Phylogeny. November 7, 2017

Phylogeny. November 7, 2017 Phylogeny November 7, 2017 Phylogenetics Phylon = tribe/race, genetikos = relative to birth Phylogenetics: study of evolutionary relationships among organisms, sequences, or anything in between Related

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Bioinformatics tools for phylogeny and visualization. Yanbin Yin Bioinformatics tools for phylogeny and visualization Yanbin Yin 1 Homework assignment 5 1. Take the MAFFT alignment http://cys.bios.niu.edu/yyin/teach/pbb/purdue.cellwall.list.lignin.f a.aln as input and

More information

Phylogenetic trees 07/10/13

Phylogenetic trees 07/10/13 Phylogenetic trees 07/10/13 A tree is the only figure to occur in On the Origin of Species by Charles Darwin. It is a graphical representation of the evolutionary relationships among entities that share

More information

Biology 211 (2) Week 1 KEY!

Biology 211 (2) Week 1 KEY! Biology 211 (2) Week 1 KEY Chapter 1 KEY FIGURES: 1.2, 1.3, 1.4, 1.5, 1.6, 1.7 VOCABULARY: Adaptation: a trait that increases the fitness Cells: a developed, system bound with a thin outer layer made of

More information

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics Bayesian phylogenetics the one true tree? the methods we ve learned so far try to get a single tree that best describes the data however, they admit that they don t search everywhere, and that it is difficult

More information

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics CS5263 Bioinformatics Guest Lecture Part II Phylogenetics Up to now we have focused on finding similarities, now we start focusing on differences (dissimilarities leading to distance measures). Identifying

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition David D. Pollock* and William J. Bruno* *Theoretical Biology and Biophysics, Los Alamos National

More information

Chapter 7: Models of discrete character evolution

Chapter 7: Models of discrete character evolution Chapter 7: Models of discrete character evolution pdf version R markdown to recreate analyses Biological motivation: Limblessness as a discrete trait Squamates, the clade that includes all living species

More information

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia Lie Markov models Jeremy Sumner School of Physical Sciences University of Tasmania, Australia Stochastic Modelling Meets Phylogenetics, UTAS, November 2015 Jeremy Sumner Lie Markov models 1 / 23 The theory

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

How should we organize the diversity of animal life?

How should we organize the diversity of animal life? How should we organize the diversity of animal life? The difference between Taxonomy Linneaus, and Cladistics Darwin What are phylogenies? How do we read them? How do we estimate them? Classification (Taxonomy)

More information

Phylogenetic Assumptions

Phylogenetic Assumptions Substitution Models and the Phylogenetic Assumptions Vivek Jayaswal Lars S. Jermiin COMMONWEALTH OF AUSTRALIA Copyright htregulation WARNING This material has been reproduced and communicated to you by

More information

Molecular Evolution, course # Final Exam, May 3, 2006

Molecular Evolution, course # Final Exam, May 3, 2006 Molecular Evolution, course #27615 Final Exam, May 3, 2006 This exam includes a total of 12 problems on 7 pages (including this cover page). The maximum number of points obtainable is 150, and at least

More information

Reading for Lecture 13 Release v10

Reading for Lecture 13 Release v10 Reading for Lecture 13 Release v10 Christopher Lee November 15, 2011 Contents 1 Evolutionary Trees i 1.1 Evolution as a Markov Process...................................... ii 1.2 Rooted vs. Unrooted Trees........................................

More information

Phylogenetic analyses. Kirsi Kostamo

Phylogenetic analyses. Kirsi Kostamo Phylogenetic analyses Kirsi Kostamo The aim: To construct a visual representation (a tree) to describe the assumed evolution occurring between and among different groups (individuals, populations, species,

More information

AP Biology. Cladistics

AP Biology. Cladistics Cladistics Kingdom Summary Review slide Review slide Classification Old 5 Kingdom system Eukaryote Monera, Protists, Plants, Fungi, Animals New 3 Domain system reflects a greater understanding of evolution

More information

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION Integrative Biology 200 Spring 2018 University of California, Berkeley "PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley D.D. Ackerly Feb. 26, 2018 Maximum Likelihood Principles, and Applications to

More information

Consensus methods. Strict consensus methods

Consensus methods. Strict consensus methods Consensus methods A consensus tree is a summary of the agreement among a set of fundamental trees There are many consensus methods that differ in: 1. the kind of agreement 2. the level of agreement Consensus

More information

Processes of Evolution

Processes of Evolution 15 Processes of Evolution Forces of Evolution Concept 15.4 Selection Can Be Stabilizing, Directional, or Disruptive Natural selection can act on quantitative traits in three ways: Stabilizing selection

More information

Systematics - Bio 615

Systematics - Bio 615 Bayesian Phylogenetic Inference 1. Introduction, history 2. Advantages over ML 3. Bayes Rule 4. The Priors 5. Marginal vs Joint estimation 6. MCMC Derek S. Sikes University of Alaska 7. Posteriors vs Bootstrap

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information