66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

Size: px
Start display at page:

Download "66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866"

Transcription

1 66 Bioinformatics I, WS 09-10, D. Huson, December 1, Phylogeny Evolutionary tree of organisms, Ernst Haeckel, References J. Felsenstein, Inferring Phylogenies, Sinauer, C. Semple and M. Steel, Phylogenetics, Oxford University Press, R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998 D.L. Swofford, G.J. Olsen, P.J.Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis (ed.), Molecular Systematics, 2 ed., Sunderland Mass., M. Salemi and A-M. Vandamme, The Phylogenetic Handbook Cambridge University Press 5.2 Why phylogenetic analysis? Phylogenetic analysis has many applications, including understanding the emergence and development of diseases:

2 Bioinformatics I, WS 09-10, D. Huson, December 1, Outline of the chapter 1. Enumerating phylogenetic trees 2. Computer representation of trees 3. Distance-based reconstruction methods 4. Parsimony-based reconstruction methods 5. Jukes-Cantor model of DNA sequences evolution 6. Maximum likelihood reconstruction methods 7. Bayesian reconstruction methods 8. Phylogenetic Networks 1, 3, 4 will cover some material already presented in Grundlagen der Bioinformatik 5.4 Phylogenetic analysis The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a givens set of species. In particular, this involves determining the order of speciation events and their approximate timing. It is generally assumed that speciation is a branching process: a population of organisms becomes separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed. Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species, showing how the species evolved from a common ancestor.

3 68 Bioinformatics I, WS 09-10, D. Huson, December 1, Early Evolutionary Studies Before the early 1960s: Anatomical/Morphological features were the dominant criteria used to derive evolutionary relationships between species 1958: Francis Crick suggested to use molecular sequences for phylogenetic tree reconstruction. 1962: Emile Zuckerkandl and Linus Pauling built the first phylogenetic tree from aligned amino acid sequences. They also proposed the theory of the molecular clock. 1967: Walter Fitch and Emanuel Margoliash published the first algorithm for phylogenetic tree reconstruction (least-squares algorithm). 1977: Carl Woese uses phylogenies based on SSU rrna sequences to show that are are three Domains of Life: Eukaryotes, Bacteria and Archaea Phylogenetic trees In the following, we will use X = {x 1, x 2,..., x n } to denote a set of taxa, where a taxon x i is simply a representative of a group of individuals defined in some way. Definition (Phylogenetic tree) A phylogenetic tree (on X) is a system T = (V, E, λ) consisting of a connected graph (V, E) without cycles, together with a labeling λ of the leaves by elements of X, such that: 1. every leaf is labeled by exactly one taxon, and 2. every taxon appears exactly once as a label Unrooted trees An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree: Pan_panisc Gorilla Homo_sap Mus_mouse Rattus_norv harbor_sel Bos_ta(cow) fin_whale blue whale Gorilla Pan_panisc Homo_sap Mus mouse harbor_sel Rattus norv Bos_ta(cow) Taxa X + tree phylogenetic tree T on X fin_whale blue_whale

4 Bioinformatics I, WS 09-10, D. Huson, December 1, Unrooted and rooted trees Enumerating phylogenetic trees Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves. We assume that T is bifurcating, i.e. that each internal node has degree 3. Any non-bifurcating tree on n taxa will have less nodes and edges. Lemma (Number of nodes and edges) A bifurcating phylogenetic tree T on n taxa has 2n 2 nodes, 2n 3 edges and n 3 internal edges for all n 3. Proof: by induction. For n = 3 there is exactly one tree: it has 3 outer edges and zero internal edges. It has 3 leaves and one internal node. Assume that the statement holds for n. Any tree T with n + 1 leaves can be obtained from some tree T with n leaves by inserting a new node v into some edge e of T and connecting v to a new leaf w. This increases both the number of nodes and the number of edges by 2. Thus T has 2n = 2(n + 1) 2 nodes and 2n = 2n 1 = 2(n + 1) 3 edges. The number of internal edges is equal to the total number of edges minus the number of edges leading to leaves: (2n 3) n = n 3. Lemma (Number of trees) There are unrooted trees on n leaves, and rooted trees. U(n) = (2n 5)!! := (2n 5) R(n) = (2n 3)!! = U(n) (2n 3) = (2n 3) The number of possible trees grows extremely rapidly for increasing values of n. For example, for n = 10 we have U(n) = and R(n) = Distances in phylogenetic trees Let T = (V, E) be a phylogenetic tree with node set V and edge set E. An edge weight function ω : E R 0 assigns a non-negative number ω(e) to every edge e. These weights (or lengths) often represent

5 70 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 the number of mutations on evolutionary path from one species to another, or a time estimate for evolution of one species into another. If T is a phylogenetic tree with edge weights ω, then the tree distance (or patristic distance) between i and j is given by d T (i, j) = ω(e), e P (i,j) where the sum is taken over all edges e that are contained in the path P (i, j) from i to j. Example: A B 8 9 D C E The tree distance between B and C is d T (B, C) = = Computer representation of a phylogenetic tree Let X be a set of taxa and T a phylogenetic tree on X. To represent a phylogenetic tree in a computer we maintain a set of nodes V and a set of edges E. Each edge e E maintains a reference to its source node s(e) and target node t(e). Each node v V maintains a list of references to all incident edges. Each leaf node v V maintains a reference λ(v) to the taxon that it is labeled by, and, vice versa, ν(x) maps a taxon x to the node that it labels. Each edge e E maintains its length ω(e) Nested structure A rooted phylogenetic tree T = (V, E, λ) is a nested structure: Consider any node v V. Let T v denote the subtree rooted at v. Let v 1, v 2,..., v k denote the children of v. Then T v is obtainable from its subtrees T v1, T v2,..., T vk by connecting v to each of their roots. Such a nested structure can be written using nested brackets:

6 Bioinformatics I, WS 09-10, D. Huson, December 1, v9 taxon1 Phylogenetic tree v2 v1 v3 v4 v5 v6 v7 v8 taxon3 v10 taxon2 Description: (((taxon 1, taxon 2 ), taxon 3 ), (taxon 4, taxon 5, taxon 6 )) taxon4 taxon5 taxon6 nested description v 1 (v 2, v 3 ) ((v 4, v 5 ), (v 6, v 7, v 8 )) (((v 9, v 10 ), v 5 ), (v 6, v 7, v 8 )) Newick format Phylogenetic trees are usually read and printed using the so-called Newick format. The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. The tree ends with a semicolon. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. Leaves are represented by their names. Trees can be multifurcating at any level. Branch lengths can be incorporated into a tree by putting a real number after a node and preceded by a colon. This represents the length of the branch immediately below that node. Examples: Unrooted tree Rooted tree with edge lengths a e b d c Newick string: ((a,b),c,(d,e)); Newick string: (((A:0.5,B:0.5):1.5,C:2.0):1.0,(D:1.0,E:1.0):2.0); In the Newick format, each pair of matching brackets corresponds to an internal node and each taxon label corresponds to an external node. To add edge lengths to the format, specify the length len of the edge along which given node we visited by adding :len behind the closing bracket, in the case of an internal node, and behind the label, in the case of a leaf. For example, consider the tree (a,b,(c,d)). If all edges have the same length 1, then this tree is written as (a:1,b:1,(c:1,d:1):1) Algorithm for Tree to Newick This algorithm recursively prints a tree in Newick format. Initialisation with e = null and v = ρ, if the tree is rooted, or v set to an arbitrary internal node, else. Algorithm (tonewick(e, v)) Input: Phylogenetic tree T = (V, E) on X, with labeling λ : V X

7 72 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Output: Newick description of tree begin if v is a leaf then print λ(v) else // v is an internal node print ( for each edge f e incident to v do If this is not the first pass of the loop, print, Let w v be the other node incident to f call tonewick(f, w) print ) print ; end Algorithm for parsing Newick Algorithm parses a tree in Newick format from a (space-free) string s = s 1 s 2... s m. Initialisation: parsenewick(1, m, null). Algorithm (parsenewick(i,j,v)) Input: a substring s i... s j and a root node v Output: a phylogenetic tree T = (V, E, λ) begin while i j do Add a new node w to V if v null then add a new edge {v, w} to E if s i = ( then // hit a subtree Set k to the position of the balancing close-bracket for i call parsenewick(i + 1, k 1, w) // strip brackets and recurse else // hit a leaf Set k = min{k i s k+1 =, or k + 1 = j} Set λ(w) = s i... s k // grab label for leaf end Set i = k + 1 // advance to next, or j if i < j then Check that i + 1 j and s i+1 =, Increment i // advance to next token 5.6 Constructing phylogenetic trees There are four main approaches to constructing phylogenetic trees: 1. Using a distance method, one first computes a distance matrix for a given set of biological data and then one computes a tree that represents these distances as closely as possible. 2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and a labeling of its internal nodes by auxiliary sequences such that the number of mutations along the tree is minimum. 3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences. 4. Bayesian reconstruction methods estimate a posterior probability distribution of trees.

8 Bioinformatics I, WS 09-10, D. Huson, December 1, Distance-based tree reconstruction Assume we are given a set X = {x 1, x 2,..., x n } of taxa. The input to a distance method is a distance matrix D : X X R 0 that associates a distance d(x i, x j ) with every pair of taxa x i, x j X. We require that D satisfies at least the first two properties of a metric: 1. symmetry, d(x, y) = d(y, x), 2. the triangle inequality, d(x, y) + d(y, z) (dx, z) and 3. identity, x y d(x, y) > 0. Distance-Based Phylogeny Problem: Reconstruct an evolutionary tree on n leaves from an n n distance matrix Input: A distance matrix D = (d ij ) on X. Output: A phylogenetic tree T (rooted or unrooted) with edge lengths on X such that the distances in T approximate D in some prescribed way. The most commonly distance-based methods used are 1. UPGMA 2. Neighbor joining 3. Minimum evolution and related methods 5.8 The neighbor-joining algorithm The most widely used distance method is neighbor joining (NJ), originally introduced by Saitou and Nei (1987) 1 and modified by Studier and Keppler (1988). Given a distance matrix D, neighbor joining produces an unrooted phylogenetic tree T with edge lengths. It is more widely applicable than UPGMA, as it does not assume a molecular clock. Starting from a star tree topology the algorithm proceeds to build a tree based on D by repeatedly pairing neighboring taxa: c a d c a d b g b g e f e f 1 Saitou, N., Nei, Y. (1987) SIAM J. on Comp. 10:

9 74 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Algorithm (Neighbor joining) Input: Distance matrix D on X Output: Unrooted phylogenetic tree T Initialization: Define L to be the set of leaf nodes, one for each taxon x X Set T = L while L > 2 do Compute the neighbor-joining matrix N: For all i, j set N ij = d ij (r i + r j ), where r i = 1 L 2 k L d ik Pick a pair i, j L for which N ij is minimum Create a new node k in T Update D: For all m L set d km = 1 2 (d im + d jm d ij ) Connect k to i and to j using two new edges of lengths d ik = 1 2 (d ij + r i r j ) and d jk = d ij d ik, respectively. Remove i and j from L and add k to L. if L = {i, j}, then create a new edge between i and j of length d ij. Let i and j be two neighboring leaves that have the same parent node, k. Remove i, j from the list of nodes L and add k to the current list of nodes. How do we update its distance to any given leaf m? Consider: i i j m j k m We must have: thus which implies d im = d ik + d km, d jm = d jk + d km and d ij = d ik + d jk, d im + d jm = d ik + d km + d jk + d km = d ij + 2d km, d km = 1 2 (d im + d jm d ij ). Why do we set the length of the new edges to d ik = 1 2 (d ij + r i r j ) and d jk = d ij d ik? By definition, r i = 1 L 2 k L d ik equals the average distance q i = 1 L 2 other nodes m i, j, plus d ij L 2 : i j dij k q i m 1 average distance q j m 5 m m 2 4 m 3 k L,k i,j d ik from i to all

10 Bioinformatics I, WS 09-10, D. Huson, December 1, As we see from this figure: 2d ik = d ij + q i q j = d ij + q i q j Example Taxa A B C D A B C - 11 D - d ij L 2 Initialisation: L = X = {A, B, C, D} Iteration: 1. Compute N: first compute column sums r i : d ij L 2 = d ij + r i r j. r A = = 27, r B = = 31, r C = = 27, r D = = 37 N = = T axa A B C D A B 8 (ra + rb)/ C 7 (ra + rc)/2 9 (rb + rc)/2 11 D 12 (ra + rd)/2 14 (rb + rd)/2 11 (rc + rd)/2 T axa A B C D A B C D One minimum is N(A, B) = 21. Define new internal node k to join A and B and compute edge lengths of k to A and of k to B: d(a, k) = 1 2 (d(a, B) + r A r B ) = 1 (8 + 27/2 31/2) = 3 2 d(b, k) = d(a, B) d(a, k) = 8 3 = 5 Compute new D : where T axa k C D k d(k, C) d(k, D) C 11 D d(k, C) = 1 (d(a, C) + d(b, C) d(a, B)) = 4 2 d(k, D) = 1 (d(a, D) + d(b, D) d(a, B)) = 9 2 Second iteration: Compute N: we have L = {AB, C, D}, L = 3, L 2 = 1 r k = = 13, r C = = 15, r D = = 20 N = = T axa AB C D AB 4 9 C 4 ( )/1 11 D 9 ( )/1 11 ( )/1 T axa AB C D AB 4 9 C D 24 24

11 76 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 A minimum is given by N(AB, C). Define a new node k to join AB and C: d(c, k) = 1 2 (d(ab, C) + rc rab) = 1 ( ) = 3 2 d(ab, k) = d(ab, C) d(c, k) = 4 3 = 1 Termination: Only two nodes left, D and ABC. Compute length of edge d(abc, D): d(abc, D) = d(c, D) d(abc, C) = 11 3 = 8 This is the resulting tree: B 5 D A 3 C Additivity and the four-point condition A distance matrix D is called additive, or a tree metric, if it can be represented by a phylogenetic tree. How to tell whether D is additive? Definition (Four-point condition) Assume we are given a set of taxa X. A distance matrix D on X is said to fulfill the four-point condition if for all i, j, k, l X holds. d(i, j) + d(k, l) max{d(i, k) + d(j, l), d(i, l) + d(j, k)} (That is, two of the three expressions d(x i, x j )+d(x k, x l ), d(x i, x k )+d(x j, x l ), and d(x i, x l )+d(x j, x k ) are equal and are not smaller than the third.) We have the following classical result (Buneman, 1971): Theorem (Additivity metrics and the 4PC) A distance matrix D on X is additive, if and only if it fulfills the four-point condition. We will prove one direction of the theorem. Proof : Assume that D is additive. Then there exists a phylogenetic tree T that represents D. Let i, j, k, l be elements of X. If they are all different, then we can assume w.l.o.g. that (i, j) and (k, l) are neighbors in the induced tree T {i,j,k,l}, respectively. It follows that d(i, j) + d(k, l) d(i, k) + d(j, l) = d(i, l) + d(j, k),

12 Bioinformatics I, WS 09-10, D. Huson, December 1, as can be seen from the following figure: and thus D fulfills the four-point condition. Example additive metric: B C D D = A B 3 6 C 5 Check the four-point condition: d(a, B) + d(c, D) = = 12 d(a, C) + d(b, D) = = 12 d(a, D) + d(b, C) = = 8 the four-point condition holds. D = A non-additive metric: B C D A B 4 7 C 5 Check the four-point condition: d(a, B) + d(c, D) = = 12 d(a, C) + d(b, D) = = 14 d(a, D) + d(b, C) = = 10 d(a, C) + d(b, D) max (d(a, B) + d(c, D), d(a, D) + d(b, C)) 4-point condition doesn t hold. D D C C A B A B 5.9 The Jukes-Cantor model of evolution T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution: Definition (Jukes-Cantor model) The Jukes-Cantor model of evolution makes the following assumptions: 1. We are given a rooted phylogenetic tree T along which sequences evolve. 2. The sequence length is an input parameter and possible states for each site are A, C, G and T. 3. The possible states for each site are A, C, G and T. 4. For each site of the sequence at the root of T, the state is drawn from a given distribution (typically uniform). 5. The sites evolve identically and independently (i.i.d.) down the edges of the tree from the root at a fixed rate u. 6. With each edge e E we associate a duration t = t(e) and the expected number of mutations per site along e is uτ(e). The probabilities of change to each of the 3 remaining states are equal.

13 78 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 How do we evolve a sequence down an edge e under the Jukes-Cantor model? Let v = (v 1,..., v n ) and w = (w 1,..., w n ) denote the source and target sequences associated with e. We assume that v has already been determined and we want to determine w. Under the Jukes-Cantor model, the evolutionary event C 3 = nucleotide changes to one of the other three bases occurs at a fixed rate of u. Now consider the event C 4 = nucleotide changes to any base (including back to the same base). The rate of occurrences of this event is taken to be 4 3 u. The probability that event C 4 does not occur within time t is: e 4 3 ut (Poisson distribution with k = 0). Given that the event C 4 occurs, the probability of ending in any particular base is 1 4. Let P and Q denote the base pair present at a given site before and after a given time period t. The probability that an event of type C 4 occurs and changes P {A, C, G, T} to base Q within time t is: Prob(Q P, t) = 1 ( 1 e 4 ut) 3. 4 Now, adding the probability that the event C 4 does not occur, we obtain the probability that the state does not change, i.e. P = Q: Prob(Q = P P, t) = e 4 3 ut + 1 ( 1 e 4 ut) 3 = 1 ( 1 + 3e 4 ut) Thus, we obtain a probability-of-change formula for the probability of an observable change occurring at any given site in time t: Prob(change t) = 1 Prob(Q = P P, t) = (1 + 3e 4 ut) 3 ( 1 e 4 ut) 3. = 3 4 We can use this model to generate artificial sequences along a model tree. E.g., set the sequence length to 10 and the mutation rate to u = 0.1. Initially, the root node is assigned a random sequence. Then the sequences are evolved down the tree, at each edge using the probability-of-change formula to decide whether a given base is to change: 1 a ACGTTTCGAG ACGTTTAGAG b 2 1 c * 3 ACCTTGCGGG 1 d 1 e 1 f E.g., the probability of change along the edge marked is 0.75(1 e ) = 0.75(1 e 0.4 ) The Jukes-Cantor distance transformation The Hamming distance between two sequences underestimates their true evolutionary distance (average number of mutations that took place per site over the elapsed time), because it doesn t count back mutations and multiple mutations at the same site.

14 Bioinformatics I, WS 09-10, D. Huson, December 1, Thus, a correction formula is required. For the Jukes-Cantor model, inversion of the probability-ofchange formula leads to the following: Definition (Jukes-Cantor transformation) The Jukes-Cantor distance transformation for a pair of sequences a and b is defined as: formula: JC(a, b) = 3 ( 4 ln 1 4 ) 3 Ham(a, b). (5.1) 5.11 More general transformations The Jukes-Cantor model assumes that all nucleotides occur with the same frequency. This assumption is relaxed under the Felsenstein 81 model introduced by Joe Felsenstein (1981), which has the following distance transformation: F 81(a i, a j ) = B ln(1 Ham(a i, a j )/B), (5.2) where B = 1 (π 2 A + π 2 C + π 2 G + π 2 T) and π Q is the frequency of base Q. The base frequency is obtained from the pair of sequences to be compared, or better, from the complete set of given sequences. Note that this contains the Jukes-Cantor transformation as a special case with equal base frequencies π A = π C = π G = π T = 0.25, thus B = 3 4. Unlike the Jukes-Cantor transformation, this transformation can also be applied to protein sequences, setting B = if all amino acids are generated with the same frequency, and B = 1 20 i=1 π2 i, in the proportional case. Given equal base frequencies, but different proportions P and Q of transitions and transversions between a i and a j, the distance for the Kimura 2 parameter model is computed as: K2P (a i, a j ) = 1 ( ) 2 ln ( ) 1 1 2P Q 4 ln. (5.3) 1 2Q If we drop the assumption of equal base frequencies, then we obtain the Felsenstein 84 transformation: ( F 84(a i, a j ) = 2A ln 1 P ) ( (A B)Q + 2(A B C) ln 1 Q ), (5.4) 2A 2AC 2C where A = π C π T /π Y +π A π G /π R, B = π C π T +π A π G, and C = π R π Y with π Y = π C +π T, π R = π A +π G. Even more general models exist, but we will skip the details. The following figure summarizes the complete overview of the time reversible models:

15 80 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 GTR 3 substitution parameters (2 f. Transitions, 1 f. Transv.) Equal base frequency 2 substitution parametera (1 f. Transitions, 1 f. Transv.) 1 substitution parameter TrN HKY =F84 Equal base frequency SYM K3SP 3 substitution parameters (1 f. Transitions, 2 f. Transv) 2 substitution parameters (1 f. Transitions, 1 f. Transv) GTR = general time reversible TrN = Tamura-Nei SYM = symmetric HKY = Hasegawa-Kishino-Yano K3ST = Kimura 3-Parameter F81 K2P Equal base frequency 1 substitution parameter JC

16 Bioinformatics I, WS 09-10, D. Huson, December 1, Maximum parsimony In phylogenetic analysis, the Maximum Parsimony problem is to find a phylogenetic tree that explains a given set of aligned sequences using a minimum number of evolutionary events. The maximum parsimony method is one of the most widely used used tree reconstruction methods. Suppose we are given a multiple alignment of sequences A of length L and a phylogenetic tree T. The maximum parsimony method makes two fundamental assumptions: Changes at different positions of the sequences are independent. Changes in different parts of the tree are independent The parsimony score of a tree The difference between two sequences X = (x 1,..., x L ) and Y = (y 1,..., y L ) is simply their nonnormalized Hamming distance d(x, Y ) = {k x k y k }, that is, the number of positions at which they differ. Assume we are given a multiple alignment of sequences A = {A 1, A 2,..., A n } and a corresponding phylogenetic tree T, leaf-labeled by A. If we assign a hypothetical ancestor sequence to every internal node in T, then we can obtain a score for T together with this assignment by summing over all differences d(x, Y ), where X and Y are two sequences labeling two nodes that are joined by an edge in T. Example: AGT AGA CGA CCA 1 AGA CGA score = 3 CGA Definition (Parsimony score of a tree) Suppose we are given a multiple alignment of sequences A of length L and a corresponding phylogenetic tree T leaf-labeled by A. Then the parsimony score for T and A is defined as: P S(T, A) = min λ {u,v} E d(λ(u), λ(v)), where the minimum is taken over all possible labelings λ of the nodes of T by sequences of length L that extend the labeling of the leaves by A.. Small Parsimony problem: The Small Parsimony problem is to compute the parsimony score for a given tree T.

17 82 Bioinformatics I, WS 09-10, D. Huson, December 1, The Fitch algorithm In 1971 Walter Fitch 2 published a dynamic programming algorithm that solves the Small Parsimony problem efficiently. Input: A phylogenetic tree T with n leaves. The leaves are labeled by the sequences of an MSA with n sequences over an alphabet Σ. Since maximum parsimony assumes indepedence of each column, one can compute the parsimony score column by column and the final result is given by the sum of the column results. The Fitch algorithm runs in two passes: The forward pass first computes the parsimony score and the backward pass then computes a labeling of the internal nodes The Fitch algorithm: Forward pass As the algorithms proceeds column-by-column, we can assume that each leaf v is labeled by a single character-state c(v). We will assign a set S(v) of states and a number l(v) to each node v as follows: For each leaf v: S(v) = {c(v)}, l(v) = 0 For each inner node v with children u, w: { S(u) S(w) if S(u) S(w) S(v) = S(u) S(w) if S(u) S(w) = l(v) = l(u) + l(w) l(v) = l(u) + l(w) + 1 if S(u) S(w) if S(u) S(w) = To compute S(v) and l(v) we traverse the tree in postorder or bottom-up. The number of changes required is given by the value of l at the root. The forward pass algorithm requires O(n) steps for each column and thus O(nL) in total The Fitch algorithm: Example Example: we use the Fitch algorithm to compute the parsimony score for the following labeled tree: G A G C C A Fitch algorithm: Backward pass The forward pass algorithm computes the parsimony score. An optimal labeling of the internal nodes is obtained via traceback: starting at the root node r, we label r using any character in S(r). Then, 2 Fitch, W. (1971) Toward defining the course of evolution: minimum change of specified tree topology. Syst. Zoology 20:

18 Bioinformatics I, WS 09-10, D. Huson, December 1, for each child w, we use the same letter, if it is contained in S(w), otherwise we use any letter in S(w) as label for w. We then visit the children von w etc. {C,G} {G} {A,G} {G} {G} {C} {A} {C} {C} {A,C} {A} G Again, the algorithm requires O(nL) steps, in total. A G C C A The Fitch algorithm: Backward pass Algorithm (Backward pass) Input: A phylogenetic tree T and the set S(v) of each node v in T Output: The fully labeled tree T do for all nodes v, starting at the root node, going to the leaves if v is the root node then Set c(v) = t with t S(v) else Let w be the parent node of v if c(w) S(v) then Set c(v) = c(w) else Set c(v) = t with t S(v) end Sankoff s algorithm In Fitch s algorithm each mutation is equally weighted independent of the nature of the mutation. For nucleotides and especially for amino acids the assumption of equal substitution costs for all possible character changes is inappropriate. In 1975, David Sankoff 3 proposed an alternative to Fitch s algorithm in which substitutions can be weighted. Input: a phylogenetic tree T = (V, E) with n leaves a character-state c(v) for each leaf v a cost matrix C that specifies the cost C(α, β) of changing state α into state β Forward pass: In a post-order traversal, for each node v and each state α Σ we will compute S α (v) as the minimum cost of the subtree rooted at v, given c(v) = α. Initialization: For each leaf node v and state α set S α (v) = { 0, c(v) = α, else. 3 Sankoff D. (1975) Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42

19 84 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Main recursion: For each internal node v with children u and w recursively compute S β (u) and S β (w) for all states β. Then set S α (v) = min(s β (u) + C(α, β)) + min(s β (w) + C(α, β)). β β Result: The minimal cost of the whole tree is given by where ρ is the root of the tree. Example: min S α (ρ), α Backward pass: Based on the values of S v (α) from the forward pass we will now choose an optimal character state α for each inner node. As in Fitch s algorithm we start at the root and visit all nodes in a pre-order traversal. For the root ρ set: c(ρ) = arg min S α (ρ) α For every internal node w ρ with parent node v we set: Example: c(w) = arg min(s α (w) + C(α, c(v))). α

20 Bioinformatics I, WS 09-10, D. Huson, December 1, The Large Parsimony problem Definition (Parsimony score of an alignment) Given a multiple alignment A = {a 1,..., a n }, its parsimony score is defined as P S(A) = min{p S(T, A) T is a phylogenetic tree on A}. Definition (Large Parsimony problem) The Large Parsimony problem is to compute P S(A). Theorem (Parsimony is hard) The Large Parsimony problem is NP-hard Maximum likelihood estimation Let A = (a 1,..., a n ) be a multiple sequence alignment of length m on X. The basic idea of the maximum likelihood estimation method is to compute a phylogenetic tree T, together with edge lengths ω, that maximizes the likelihood P(A T, ω, M) that the multiple sequence alignment A was generated on the phylogenetic tree T with edge lengths ω under the given model of sequence evolution M. Such a model M specifies how to choose the initial sequence at the root of the tree T and how sequences evolve along edges of the tree. For example, the Jukes-Cantor model specifies that at any position i of the root sequence, all four nucleotides appear with the same probability of 1 4. The mutation rate is u. Moreover, the length ω(e) of any edge e is interpreted as a duration and the probability that an observable change occurs along an edge of length t is given by the probability-of-change formula: P(change t) = 3 ( 1 e 4 ut) 3. 4 In practice, the models employed are more general than the simple Jukes-Cantor model, usually allowing unequal nucleotide frequencies and different transition probabilities between different nucleotides. The models are generally time reversible, which implies that the optimal score attainable is independent of the location of the root. A main advantage of maximum likelihood estimation is that it provides a systematic framework for explicitly incorporating assumptions and knowledge about the process that generated the given data. Although evolutionary models are only a rough estimate of real-world biological evolution, in practice, maximum likelihood methods have proved to be quite robust to violations of the assumptions made in the models Maximum likelihood estimation is hard As with maximum parsimony, the main draw-back of this approach is that finding an optimal phylogenetic tree is NP-hard. In fact, the situation is in some sense worse: Theorem (Maximum likelihood estimation is hard) For a given phylogenetic tree T, the problem of defining edge lengths ω so as to maximize the likelihood P (A T, ω, M) is NP-hard.

21 86 Bioinformatics I, WS 09-10, D. Huson, December 1, Solving maximum likelihood Assume we are given the following DNA alignment A as input: 1 2 i m a 1 A C... G C G... A a 2 A C... G C C... G a 3 A C... T A C... G a 4 A C... T G G... A There are three possible unrooted phylogenetic trees on a set of four taxa: a1 a3 a1 a2 a1 a2 a2 a4 a3 To simplify the calculation of the maximum likelihood score for such a phylogenetic tree T we root it by choosing one of its internal nodes to be the root ρ: (1) (2) (3) (4) C C A G (1) (3) (5) (5) (2) (4) We will discuss how to compute the maximum likelihood for the first tree of the three unrooted trees. The other two trees are processed similarly. Conceptually, we obtain the likelihood L(i) that a specific rooted phylogenetic tree T generated A i, the i-th character of A, by summing over all possible labellings α of the internal nodes by states. For a bifurcating unrooted phylogenetic tree on n taxa there are n 2 internal nodes, and so there are k (n 2) possible labellings to be considered, where k is the number of possible character states. For our example, we need to consider 16 different possibilities: C C A G + P L(i) = P A i A A (6) a4 a4 A i P A i P A i C C A G C A C C A G G C C C A G T T (6) a3 In general, for a given rooted phylogenetic tree T and model M, the likelihood of T given a single character A i is defined as L(i) = P(A i T, ω, M) = α P(A i, α T, ω, M) = ( α P(α(ρ)) ) e={v,w} P e(α(v), α(w)), where the summation is over all possible labellings α of the internal nodes, P(α(ρ)) is the probability of generating the state α(ρ) at the root and P e (α(v), α(w)) is the probability of observing the two states α(v) and α(w) at the two ends of an edge e = {v, w}.

22 Bioinformatics I, WS 09-10, D. Huson, December 1, The total likelihood that T generated the multiple sequence alignment A is obtained as the product of the likelihoods for all positions of A: L(T ) = L(1) L(2) L(m) = m L(i). i=1 The computation of L(i) is actually more complicated than we have described so far, because not only do we have to consider all possible rooted phylogenetic trees on n taxa, but we must also consider all possible ways of defining edge lengths ω on each such tree. Assume that we have a rooted phylogenetic tree T with edge lengths ω, we will now discuss in more detail how to calculate P (A i, α T, ω, M) = P(α(ρ)) P e (α(v), α(w)) e={v,w} for the case of the Jukes-Cantor model with mutation rate u. Under this model, the probability of observing any particular nucleotide at the root node is 1 4. The probability of observing two different states α(v) and α(w) at opposite ends of an edge e of length t = ω(e) is given by P (α(v) α(w) T ) = 3 4 (1 e 4 3 ut ), whereas the probability of observing the same state at opposite nodes of e equals 1 P (α(v) = α(w) T ) = 1 4 (1 + 3e 4 3 ut ). For example, the value of is given by P A i C C A G A A P (A)P e1 (A, C)P e2 (A, C)P e3 (A, A)P e4 (A, A)P e5 (A, G) = (1 e 4 3 ut 1 )(1 e 4 3 ut 2 )(1 + 3e 4 3 ut 3 )(1 + 3e 4 3 ut 4 )(1 e 4 3 ut 5 ), where e 1,..., e 5 are the five edges in T and t i = ω(e i ) for i = 1,..., Felsenstein s algorithm One can efficiently compute the likelihood for a given bifurcating, rooted phylogenetic tree T with edge lengths ω using Felsenstein s algorithm (Felsenstein, 1981). In this approach, for each node v in T and every possible character state a, in a post-order traversal of T, one recursively computes the conditional likelihood L v (a) for the subtree T v rooted at v, under the condition that the state of v is a, that is, we have α(v) = a. More precisely, for any internal node v with children w 1 and w 2, the main recursion is ( ) ( ) L v (a) = P {v,w1 }(a, b)l w1 (b) P {v,w2 }(a, b)l w2 (b), b b

23 88 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 where summation is over all possible states b and, as above, P {v,w} (a, b) is the probability of observing states a and b at the two ends of the edge {v, w} under the given model M. To initialize the computation, for each leaf v and every symbol a, we set L v (a) = 1, if the node v is labeled by the character state a, and L v (a) = 0, else Summary As mentioned above, we must consider all possible choices of edge lengths ω. Determining an optimal choice of ω is in general a computationally hard problem. One possible heuristic is to optimize the length of each edge in turn, while keeping all other edge lengths fixed. In summary, for a given multiple sequence alignment A on X, in maximum likelihood estimation, a result is obtained by searching through all possible phylogenetic trees T and for each such tree T, computing the maximum likelihood score. The method returns the tree that attains the best score Quartet puzzling Quartet puzzling 4 is a heuristic for approximating the maximum likelihood (MLE) tree. It is based on the idea of computing the MLE trees for all quartets of taxa and then using many rounds of combination steps in which one attempts to fit a random set of quartet trees together so as to obtain a tree on the whole dataset. Assume we are given a multiple alignment of sequences A = {a 1,..., a n }. heuristic proceeds in three steps: The quartet puzzling 1. For each quartet q of four distinct taxa a i, a j, a k, a l A compute the maximum likelihood tree T q. This step is called maximum likelihood step. 2. Choose a random order a i1, a i2,..., a in of the taxa and then use it to guide a combining or puzzle step that produces a tree on the whole data set from quartet trees. 3. Repeat step (2) many times to obtain trees T 1, T 2,..., T R. Report the majority consensus tree T Computing all quartet trees Given four taxa {a 1, a 2, a 3, a 4 }, there are three possible bifurcating topologies: a1 a3 a1 a2 a1 a2 a2 a4 a3 a4 a4 a3 {{a 1, a 2 }, {a 3, a 4 }} {{a 1, a 3 }, {a 2, a 4 }} {{a 1, a 4 }, {a 2, a 3 }} Now, let T be any phylogenetic tree that contains leaves labeled a 1, a 2, a 3, a 4, We say that T induces the split or tree topology {{a 1, a 2 }, {a 3, a 4 }}, if there exists an edge in T that separates the nodes labeled a 1 and a 2 from the nodes a 3 and a 4. We will sometimes abbreviate {{a 1, a 2 }, {a 3, a 4 }} by a 1, a 2 a 3, a 4. Assume we are given this tree T : 4 Korbinian Strimmer and Arndt von Haeseler (1996) Quartet-puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13,

24 Bioinformatics I, WS 09-10, D. Huson, December 1, a1 a5 a2 a3 a4 This tree induces the following five quartet tree topologies: a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a5 a3 a5 a2 a4 a3 a5 It is easy to reconstruct the original tree from its set of quartets. However, in the presence of false quartet topologies (as inevitably produced by any phylogenetic reconstruction method), the reconstruction of the original tree can become difficult. In quartet puzzling, quartets are greedily combined to produce a tree for the whole data set. In an attempt to avoid the usual problems of a greedy approach, the combining step is run many times using different random orderings of the taxa The puzzle step The puzzle step is initialized with the maximum likelihood tree for {a 1, a 2, a 3, a 4 }, e.g.: a a3 a2 0 0 a4 Each edge e is labeled with a number of conflicts c(e) that is originally set to 0. In the puzzle step, the task is to decide how to insert the next taxon a i+1 into the tree T i already constructed for taxa {a 1,..., a i }. We proceed as follows: For each edge e in T i we consider what would happen if we were to attach taxon a i+1 into the middle of e by a new edge: We consider all quartets consisting of the taxon a i+1 and three taxa from {a 1, a 2,..., a i }. For each such quartet q we compare the precomputed maximum likelihood tree T q with the restriction of T i to q, denoted by T i q. If these two trees differ, then we increment c(e) by one. Finally, we attach e to an edge f in T i that has a minimum number of conflicts c(f), thereby obtaining T i+1. Example: We illustrate the puzzle step using the five taxon tree shown above. For each of the five quartets we compute the respective ML tree. Assume that these are indeed a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a5 a3 a5 a2 a4 a3 a5 We start with the maximum likelihood tree on {a 1, a 2, a 3, a 4 }. The puzzle step for a 5 proceeds as

25 90 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 follows: Initial tree: a1 a a3 a4 a 1, a 2 a 4, a 5 a1 a a3 a4 a 1, a 3 a 4, a 5 a 1, a 2 a 3, a 5 a1 a a2 2 0 a4 a1 a a2 a4 a 2, a 3 a 4, a 5 a1 a a3 a4 Hence, the puzzle step suggests that taxon a 5 should be attached in the edge adjacent to the leaf labeled a 4 to obtain the tree a1 a5 a2 a3 a4 In geneal, this procedure is repeated for different orders of taxa many times (say, 1000 for a large dataset). Finally a consensus of all computed trees is derived as the final result The quartet puzzling algorithm Algorithm (Quartet puzzling) Input: Aligned sequences A = {a 1,..., a n }, number R of trees Output: Phylogenetic trees T 1, T 2,..., T R on A Initialization: Compute an MLE tree t q for each quartet q A for r = 1 to R do Randomly permute the order of the sequences in A Set T 4 equal to the precomputed MLE tree on {a 1, a 2, a 3, a 4 } for i = 5 to n do Set c(e) = 0 for all edges in T i 1 for each quartet q {a 1,..., a i } with a i q do Let x, y z, a i be the MLE tree topology on q Let p denote the path from x to y in T i 1 Increment c(e) by 1 for each edge e that is contained in p or that is separated from z by p Choose any edge f for which c(f) is minimal Obtain T i from T i 1 by attaching a new leaf with label a i to f Output T n end

26 Bioinformatics I, WS 09-10, D. Huson, December 1, Bayesian methods Bayesian inference of phylogenetic trees is a method whose goal is to obtain a quantity called the posterior probability of a phylogenetic tree. Generally speaking, the posterior probability of a result is the conditional probability of the result being observed, computed after seeing a given input dataset. The prior probability of a result is the marginal probability of the result being observed, set before the input dataset is examined. In other words, the prior probability distribution of phylogenetic trees on X reflects our belief of the distribution of all possible phylogenetic trees on X. This distribution is determined in advance, without knowledge of the actual sequence data to be examined. Let A be a multiple sequence alignment and T a phylogenetic tree with edge lengths ω, on X. Also, we will assume that some fixed model of evolution M is given. The posterior probability of T can be obtained from the prior probability of T with the help of the likelihood P (A T ) = P (A T, ω, M) using Bayes Theorem: Theorem (Bayes Theorem) P (T A) = P (A T ) P (T ). P (A) The posterior probability P (T A) can be interpreted as the probability that the tree T is correct, given the data. In the simplest case, all trees on X are considered to be equally probable and then the prior distribution is called a flat prior. In the case of a flat prior, the posterior probability P (T A) is proportional to the likelihood P (A T ) (computed under the model M) and the maximum likelihood tree also has maximum posterior probability. The maximum posterior probability over all trees is usually very low, if we think of tree space as a discrete sample space. In Bayesian tree inference the goal is not to determine a single phylogentic tree of maximum posterior probability, but rather to compute a sample of phylogenetic trees according to the posterior probability distribution of trees. Such a sample of trees is then processed further, for example, to generate a single consensus tree (as described later) and then to label the clusters in such a tree by their posterior probabilities. The expression for the posterior probability given by Bayes Theorem looks quite simple and we can efficiently compute P (A T ) using Felsensteins algorithm, as described in in the context of maximum likelihood. However, it is usually impossible to solve the complete expression analytically, as computation of the normalizing factor P (A) usually involves summing over all possible phylogenetic tree topologies T and integrating over all possible edge lengths ω and model parameters µ (of the given model M): where P (T A) = P (A T ) = P (A T )P (T ) P (A) ω µ = P (A T )P (T ) T P (A T )P (T ), P (A T, ω, µ)p (ω, µ)dµdω. To avoid this problem, Bayesian inference employs the Markov chain Monte Carlo (MCMC) approach, which is based on the idea of sampling from the posterior probability distribution using a suitably constructed chain of results.

27 92 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 This is a general approach that is used in a wide range of applications. In the context of phylogenetic tree reconstruction, the MCMC approach constructs a chain of phylogenetic trees. At each step, a modification of the current phylogenetic tree is proposed and then a probabilistic decision is made whether to accept the newly proposed phylogenetic tree or whether to keep the current one Metropolis-Hastings algorithm Algorithm (Metropolis-Hastings) Let T i be the phylogenetic tree at position i of the MCMC chain and let T be a proposed modification of T i. Using what is called the Metropolis-Hastings algorithm, the decision whether to accept T or to keep T i is based on the ratio of the posterior probabilities of the two trees, which is given by R = P (T A) P (T i A) = P (A T )P (T )P (A) P (A T i )P (T i )P (A) = P (A T )P (T ) P (A T i )P (T i ). Note that the normalizing factor P (A), which is so difficult to compute, cancels out and this expression is much easier to calculate than the actual posterior probability of either T i or T. Based on R, we accept the proposed phylogenetic tree T with probability min(1, R). In other words, we always accept T, if it has a better posterior probability than T i, and otherwise we draw a random number p uniformly from the interval [0, 1] and accept the tree T if p < R, and reject T, otherwise. Acceptance means that we set T i+1 = T, whereas rejection implies T i+1 = T i. The aim is to produce an MCMC chain M of bifurcating phylogenetic trees that has the property that the stationary distribution of the trees (including the parameters ω and µ) in the chain equals the posterior probability distribution of the trees (again, including ω and µ). One can show that, by construction, the decision rule described above implies that the number of times that a phylogenetic tree T occurs in M is a valid approximation of the posterior probability of that tree Practical issues For this approach to work well in practice, a number of issues must be taken into account. First, the mechanism for modifying the current phylogenetic tree must be designed with care. On the one hand, the suggested changes should be quite small so as to avoid a drastic drop in posterior probability. On the other hand, the modifications should be large enough so as to ensure that M is able to explore a sufficient portion of parameter space in a reasonable amount of time. Second, as the MCMC chain M starts at a random location in parameter space, usually of very low posterior probability, the chain must be allowed to burn in, that is, it must be given time to reach stationarity, a state in which the chain is actually sampling from the posterior probability. In practice, the first 10,000 elements of M, say, are excluded from further analysis. Third, trees that are close to each other in the MCMC chain M are highly correlated by construction, as they differ only by small modifications. To avoid this problem of autocorrelation, the final set of output trees is obtained by sparsely sampling M, retaining only every 1,000-th tree, say.

28 Bioinformatics I, WS 09-10, D. Huson, December 1, The local algorithm What types of modifications are used in practice? Branch-swapping methods such as NNI (nearest neighbor interchange), SPR (subtree prune and regraft) and TBR (tree bisection and reattachment) are used in the context of maximum parsimony and maximum likelihood methods to search through the space of phylogenetic tree topologies. For the purposes of constructing an MCMC chain we need to couple such an approach with a method for proposing modifications of edge lengths and parameters of the evolutionary model. We will now describe one approach that is used in practice called the local algorithm, which is a modification of the NNI method. The algorithm starts by choosing an internal edge (v, w) of the current phylogenetic tree T i at random. As we assume that T i is bifurcating, both of the nodes v and w are each connected to two further nodes, which we will randomly label a and b for v, and c and d for w, respectively. Let x, y and z denote the distance from a to v, w and d, respectively, as obtained by adding the edge lengths ω of the edges from a to v, v to w, and w to d, appropriately. Here we show a phylogenetic tree before and after a modification by the local algorithm: 0 x y z 0 y' x' z' a v w d a w v d b c c b The distance z from node a to node d is modified by a randomly chosen factor to a new value of z. In this example, the node w is moved by a random amount along the path from a to d, whereas the relative position of node v on the path from a to d is not changed Details of the local algorithm The local algorithm changes the length of the path from a to d and modifies the positions of v and w in that path. Here are some details: First define the new distance from a to d as z = z e p 0.5, where p is a random number uniformly chosen from the interval (0, 1). The lengths of edges on the path from a to d will be modified so that the distance from a to d will be z. Second, randomly decide with equal odds whether node v or node w is to be moved by a random amount. If we decide to move node v, then define x = q z and y = y z /z. Otherwise, define x = x z /z and y = q z. In both cases, q is a random number uniformly chosen from the interval (0, 1). Third, to reposition the nodes v and w according to the values just chosen, remove the two nodes v and w from the path connecting a to d and then reinsert them into the path at distances x and y from a, respectively. If x < y, then the local algorithm does not result in a change of the current tree topology. In

29 94 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 this case, only the lengths of the edges from a to v, v to w, and w to d change. In more detail, we set ω(a, v) = x, ω(v, w) = y x and ω(w, d) = z y. If x > y, then the modifications result is a new topology in which a and w, w and v, and v and d, respectively, are both connected by an edge. In this case, the edge lengths are ω(a, w) = y, ω(w, v) = x y and ω(v, d) = z x. Simple parameters of the evolutionary model, such as the mutation rate µ in the Jukes-Cantor model, are modified by adding a random number uniformly chosen from an appropriate interval centered at 0. For a set of parameters that is constrained to sum to some specific value, such as 1 in the case of probabilities, the values are randomly modified according to a so-called Dirichlet distribution. A proposal for modifying the tree and a proposal for new values for all parameters of the specified evolutionary model are generated independently and simultaneously, and then accepted or rejected together in in a single application of the ratio-based decision step of the Metropolis-Hasting algorithm Metropolis-coupled Markov chain Monte Carlo It is important that the MCMC chain M is given the opportunity to visit all relevant parts of parameter space. In consequence, the larger the number of taxa is, and the more parameters the evolutionary model M has, the longer the chain must be run. A popular approach for exploring more of parameter space in less time is to use a variant of MCMC called Metropolis-coupled Markov chain Monte Carlo, or (MC 3 ), for short. In this approach k different MCMC chains M 1,..., M k are run in parallel. The first chain M 1 is called the cold chain and this is the one from which inferences will be made. The other k 1 chains are called heated chains. Each of the heated chains aims at sampling from a posterior probability distribution raised to a power of β > 1. Heating a chain has the effect of flattening out the probability landscape and the hope is that a heated chain will be able to move more freely through parameter space. For example, in a strategy called incremental heating, the i-th chain M i aims at sampling from the distribution P (T A) β(i), with 1 β(i) = 1 + (i 1)t, where t is a user-specified heating parameter. After all k chains have moved one step, a swap of states is proposed between two randomly chosen chains M i and M j, using a Metropolis-like algorithm. Let T i denote the current state of chain M i, for all i = 1,..., k. (Here, by current state we mean the total state consisting of the current tree T, the current length parameters ω and the current model parameters µ.) A swap between M i and M j is accepted with probability α = P (T j A) β(i) P (T i A) β(j) P (T i A) β(i) P (T j A) β(j). The main idea is that the heated chains act as scouts exploring the parameter space and that swapping will occasionally help the cold chain to move to a new part of the space which it would otherwise have difficulty reaching because of a deep valley in the original distribution. At the end of the run, the heated chains are discarded and the cold chain is then processed as described above Outlook A major challenge is how do we determine whether an MCMC chain M has been run long enough so that that chain has converged, that is, it provides a good approximation of the posterior distribution of trees. A simple approach is to monitor the likelihood of the phylogenetic trees added to M and to

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: istance-based methods Ultrametric Additive: UPGMA Transformed istance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

Molecular Evolution & Phylogenetics

Molecular Evolution & Phylogenetics Molecular Evolution & Phylogenetics Heuristics based on tree alterations, maximum likelihood, Bayesian methods, statistical confidence measures Jean-Baka Domelevo Entfellner Learning Objectives know basic

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics Bioinformatics 1 Biology, Sequences, Phylogenetics Part 4 Sepp Hochreiter Klausur Mo. 30.01.2011 Zeit: 15:30 17:00 Raum: HS14 Anmeldung Kusss Contents Methods and Bootstrapping of Maximum Methods Methods

More information

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Distance Methods COMP 571 - Spring 2015 Luay Nakhleh, Rice University Outline Evolutionary models and distance corrections Distance-based methods Evolutionary Models and Distance Correction

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution Today s topics Inferring phylogeny Introduction! Distance methods! Parsimony method!"#$%&'(!)* +,-.'/01!23454(6!7!2845*0&4'9#6!:&454(6 ;?@AB=C?DEF Overview of phylogenetic inferences Methodology Methods

More information

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057 Estimating Phylogenies (Evolutionary Trees) II Biol4230 Thurs, March 2, 2017 Bill Pearson wrp@virginia.edu 4-2818 Jordan 6-057 Tree estimation strategies: Parsimony?no model, simply count minimum number

More information

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree) I9 Introduction to Bioinformatics, 0 Phylogenetic ree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & omputing, IUB Evolution theory Speciation Evolution of new organisms is driven by

More information

Phylogeny. November 7, 2017

Phylogeny. November 7, 2017 Phylogeny November 7, 2017 Phylogenetics Phylon = tribe/race, genetikos = relative to birth Phylogenetics: study of evolutionary relationships among organisms, sequences, or anything in between Related

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics - in deriving a phylogeny our goal is simply to reconstruct the historical relationships between a group of taxa. - before we review the

More information

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Phylogenetic Analysis Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center Outline Basic Concepts Tree Construction Methods Distance-based methods

More information

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003 CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1 Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003 Lecturer: Wing-Kin Sung Scribe: Ning K., Shan T., Xiang

More information

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Minimum evolution using ordinary least-squares is less robust than neighbor-joining Minimum evolution using ordinary least-squares is less robust than neighbor-joining Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA email: swillson@iastate.edu November

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

Tree of Life iological Sequence nalysis Chapter http://tolweb.org/tree/ Phylogenetic Prediction ll organisms on Earth have a common ancestor. ll species are related. The relationship is called a phylogeny

More information

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies 1 What is phylogeny? Essay written for the course in Markov Chains 2004 Torbjörn Karfunkel Phylogeny is the evolutionary development

More information

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University Phylogenetics: Parsimony and Likelihood COMP 571 - Spring 2016 Luay Nakhleh, Rice University The Problem Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions

More information

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz Phylogenetic Trees What They Are Why We Do It & How To Do It Presented by Amy Harris Dr Brad Morantz Overview What is a phylogenetic tree Why do we do it How do we do it Methods and programs Parallels

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe? How should we go about modeling this? gorilla GAAGTCCTTGAGAAATAAACTGCACACACTGG orangutan GGACTCCTTGAGAAATAAACTGCACACACTGG Model parameters? Time Substitution rate Can we observe time or subst. rate? What

More information

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics. BIOL 7711 Computational Bioscience Consortium for Comparative Genomics! University of Colorado School of Medicine Phylogenetics BIOL 7711 Computational Bioscience Biochemistry and Molecular Genetics Computational Bioscience Program Consortium

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Lecture Notes: Markov chains

Lecture Notes: Markov chains Computational Genomics and Molecular Biology, Fall 5 Lecture Notes: Markov chains Dannie Durand At the beginning of the semester, we introduced two simple scoring functions for pairwise alignments: a similarity

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Theory of Evolution Charles Darwin

Theory of Evolution Charles Darwin Theory of Evolution Charles arwin 858-59: Origin of Species 5 year voyage of H.M.S. eagle (83-36) Populations have variations. Natural Selection & Survival of the fittest: nature selects best adapted varieties

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary CSCI1950 Z Computa4onal Methods for Biology Lecture 4 Ben Raphael February 2, 2009 hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary Parsimony Probabilis4c Method Input Output Sankoff s & Fitch

More information

Phylogeny: traditional and Bayesian approaches

Phylogeny: traditional and Bayesian approaches Phylogeny: traditional and Bayesian approaches 5-Feb-2014 DEKM book Notes from Dr. B. John Holder and Lewis, Nature Reviews Genetics 4, 275-284, 2003 1 Phylogeny A graph depicting the ancestor-descendent

More information

Molecular Evolution and Phylogenetic Tree Reconstruction

Molecular Evolution and Phylogenetic Tree Reconstruction 1 4 Molecular Evolution and Phylogenetic Tree Reconstruction 3 2 5 1 4 2 3 5 Orthology, Paralogy, Inparalogs, Outparalogs Phylogenetic Trees Nodes: species Edges: time of independent evolution Edge length

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Distance Methods Character Methods

More information

Bayesian Models for Phylogenetic Trees

Bayesian Models for Phylogenetic Trees Bayesian Models for Phylogenetic Trees Clarence Leung* 1 1 McGill Centre for Bioinformatics, McGill University, Montreal, Quebec, Canada ABSTRACT Introduction: Inferring genetic ancestry of different species

More information

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies Inferring Phylogenetic Trees Distance Approaches Representing distances in rooted and unrooted trees The distance approach to phylogenies given: an n n matrix M where M ij is the distance between taxa

More information

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D 7.91 Lecture #5 Database Searching & Molecular Phylogenetics Michael Yaffe B C D B C D (((,B)C)D) Outline Distance Matrix Methods Neighbor-Joining Method and Related Neighbor Methods Maximum Likelihood

More information

Phylogenetics: Parsimony

Phylogenetics: Parsimony 1 Phylogenetics: Parsimony COMP 571 Luay Nakhleh, Rice University he Problem 2 Input: Multiple alignment of a set S of sequences Output: ree leaf-labeled with S Assumptions Characters are mutually independent

More information

A Bayesian Approach to Phylogenetics

A Bayesian Approach to Phylogenetics A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

An Investigation of Phylogenetic Likelihood Methods

An Investigation of Phylogenetic Likelihood Methods An Investigation of Phylogenetic Likelihood Methods Tiffani L. Williams and Bernard M.E. Moret Department of Computer Science University of New Mexico Albuquerque, NM 87131-1386 Email: tlw,moret @cs.unm.edu

More information

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method Phylogeny 1 Plan: Phylogeny is an important subject. We have 2.5 hours. So I will teach all the concepts via one example of a chain letter evolution. The concepts we will discuss include: Evolutionary

More information

The Generalized Neighbor Joining method

The Generalized Neighbor Joining method The Generalized Neighbor Joining method Ruriko Yoshida Dept. of Mathematics Duke University Joint work with Dan Levy and Lior Pachter www.math.duke.edu/ ruriko data mining 1 Challenge We would like to

More information

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony ioinformatics -- lecture 9 Phylogenetic trees istance-based tree building Parsimony (,(,(,))) rees can be represented in "parenthesis notation". Each set of parentheses represents a branch-point (bifurcation),

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Multiple Sequence Alignment. Sequences

Multiple Sequence Alignment. Sequences Multiple Sequence Alignment Sequences > YOR020c mstllksaksivplmdrvlvqrikaqaktasglylpe knveklnqaevvavgpgftdangnkvvpqvkvgdqvl ipqfggstiklgnddevilfrdaeilakiakd > crassa mattvrsvksliplldrvlvqrvkaeaktasgiflpe

More information

TheDisk-Covering MethodforTree Reconstruction

TheDisk-Covering MethodforTree Reconstruction TheDisk-Covering MethodforTree Reconstruction Daniel Huson PACM, Princeton University Bonn, 1998 1 Copyright (c) 2008 Daniel Huson. Permission is granted to copy, distribute and/or modify this document

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees. Describe the relationship between objects, e.g. species or genes Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies The evolutionary relationships between

More information

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004, Tracing the Evolution of Numerical Phylogenetics: History, Philosophy, and Significance Adam W. Ferguson Phylogenetic Systematics 26 January 2009 Inferring Phylogenies Historical endeavor Darwin- 1837

More information

Consistency Index (CI)

Consistency Index (CI) Consistency Index (CI) minimum number of changes divided by the number required on the tree. CI=1 if there is no homoplasy negatively correlated with the number of species sampled Retention Index (RI)

More information

Inferring Molecular Phylogeny

Inferring Molecular Phylogeny Dr. Walter Salzburger he tree of life, ustav Klimt (1907) Inferring Molecular Phylogeny Inferring Molecular Phylogeny 55 Maximum Parsimony (MP): objections long branches I!! B D long branch attraction

More information

Week 5: Distance methods, DNA and protein models

Week 5: Distance methods, DNA and protein models Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 Week 5: Distance methods, DNA and protein models p.1/69 A tree and the expected distances it predicts E A 0.08 0.05 0.06 0.03

More information

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Integrative Biology 200 PRINCIPLES OF PHYLOGENETICS Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft] Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley K.W. Will Parsimony & Likelihood [draft] 1. Hennig and Parsimony: Hennig was not concerned with parsimony

More information

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5. Five Sami Khuri Department of Computer Science San José State University San José, California, USA sami.khuri@sjsu.edu v Distance Methods v Character Methods v Molecular Clock v UPGMA v Maximum Parsimony

More information

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem? Who was Bayes? Bayesian Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison October 6, 2011 The Reverand Thomas Bayes was born in London in 1702. He was the

More information

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A GAGATC 3:G A 6:C T Common Ancestor ACGATC 1:A G 2:C A Substitution = Mutation followed 5:T C by Fixation GAAATT 4:A C 1:G A AAAATT GAAATT GAGCTC ACGACC Chimp Human Gorilla Gibbon AAAATT GAAATT GAGCTC ACGACC

More information

Bayesian Phylogenetics

Bayesian Phylogenetics Bayesian Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison October 6, 2011 Bayesian Phylogenetics 1 / 27 Who was Bayes? The Reverand Thomas Bayes was born

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

arxiv: v1 [q-bio.pe] 1 Jun 2014

arxiv: v1 [q-bio.pe] 1 Jun 2014 THE MOST PARSIMONIOUS TREE FOR RANDOM DATA MAREIKE FISCHER, MICHELLE GALLA, LINA HERBST AND MIKE STEEL arxiv:46.27v [q-bio.pe] Jun 24 Abstract. Applying a method to reconstruct a phylogenetic tree from

More information

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used Molecular Phylogenetics and Evolution 31 (2004) 865 873 MOLECULAR PHYLOGENETICS AND EVOLUTION www.elsevier.com/locate/ympev Efficiencies of maximum likelihood methods of phylogenetic inferences when different

More information

What is Phylogenetics

What is Phylogenetics What is Phylogenetics Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The basic idea is to compare specific characters (features)

More information

A (short) introduction to phylogenetics

A (short) introduction to phylogenetics A (short) introduction to phylogenetics Thibaut Jombart, Marie-Pauline Beugin MRC Centre for Outbreak Analysis and Modelling Imperial College London Genetic data analysis with PR Statistics, Millport Field

More information

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood For: Prof. Partensky Group: Jimin zhu Rama Sharma Sravanthi Polsani Xin Gong Shlomit klopman April. 7. 2003 Table of Contents Introduction...3

More information

Reconstruction of certain phylogenetic networks from their tree-average distances

Reconstruction of certain phylogenetic networks from their tree-average distances Reconstruction of certain phylogenetic networks from their tree-average distances Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu October 10,

More information

Finding the best tree by heuristic search

Finding the best tree by heuristic search Chapter 4 Finding the best tree by heuristic search If we cannot find the best trees by examining all possible trees, we could imagine searching in the space of possible trees. In this chapter we will

More information

Tree-average distances on certain phylogenetic networks have their weights uniquely determined

Tree-average distances on certain phylogenetic networks have their weights uniquely determined Tree-average distances on certain phylogenetic networks have their weights uniquely determined Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu

More information

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches Int. J. Bioinformatics Research and Applications, Vol. x, No. x, xxxx Phylogenies Scores for Exhaustive Maximum Likelihood and s Searches Hyrum D. Carroll, Perry G. Ridge, Mark J. Clement, Quinn O. Snell

More information

Phylogenetics: Likelihood

Phylogenetics: Likelihood 1 Phylogenetics: Likelihood COMP 571 Luay Nakhleh, Rice University The Problem 2 Input: Multiple alignment of a set S of sequences Output: Tree T leaf-labeled with S Assumptions 3 Characters are mutually

More information

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression) Using phylogenetics to estimate species divergence times... More accurately... Basics and basic issues for Bayesian inference of divergence times (plus some digression) "A comparison of the structures

More information

Reconstructing Trees from Subtree Weights

Reconstructing Trees from Subtree Weights Reconstructing Trees from Subtree Weights Lior Pachter David E Speyer October 7, 2003 Abstract The tree-metric theorem provides a necessary and sufficient condition for a dissimilarity matrix to be a tree

More information

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington. Maximum Likelihood This presentation is based almost entirely on Peter G. Fosters - "The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed. http://www.bioinf.org/molsys/data/idiots.pdf

More information

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Evolutionary trees. Describe the relationship between objects, e.g. species or genes Evolutionary trees Bonobo Chimpanzee Human Neanderthal Gorilla Orangutan Describe the relationship between objects, e.g. species or genes Early evolutionary studies Anatomical features were the dominant

More information

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

CSCI1950 Z Computa4onal Methods for Biology Lecture 5 CSCI1950 Z Computa4onal Methods for Biology Lecture 5 Ben Raphael February 6, 2009 hip://cs.brown.edu/courses/csci1950 z/ Alignment vs. Distance Matrix Mouse: ACAGTGACGCCACACACGT Gorilla: CCTGCGACGTAACAAACGC

More information

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science Phylogeny and Evolution Gina Cannarozzi ETH Zurich Institute of Computational Science History Aristotle (384-322 BC) classified animals. He found that dolphins do not belong to the fish but to the mammals.

More information

Concepts and Methods in Molecular Divergence Time Estimation

Concepts and Methods in Molecular Divergence Time Estimation Concepts and Methods in Molecular Divergence Time Estimation 26 November 2012 Prashant P. Sharma American Museum of Natural History Overview 1. Why do we date trees? 2. The molecular clock 3. Local clocks

More information

C.DARWIN ( )

C.DARWIN ( ) C.DARWIN (1809-1882) LAMARCK Each evolutionary lineage has evolved, transforming itself, from a ancestor appeared by spontaneous generation DARWIN All organisms are historically interconnected. Their relationships

More information

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi DNA Phylogeny Signals and Systems in Biology Kushal Shah @ EE, IIT Delhi Phylogenetics Grouping and Division of organisms Keeps changing with time Splitting, hybridization and termination Cladistics :

More information

BINF6201/8201. Molecular phylogenetic methods

BINF6201/8201. Molecular phylogenetic methods BINF60/80 Molecular phylogenetic methods 0-7-06 Phylogenetics Ø According to the evolutionary theory, all life forms on this planet are related to one another by descent. Ø Traditionally, phylogenetics

More information

Examining Phylogenetic Reconstruction Algorithms

Examining Phylogenetic Reconstruction Algorithms Examining Phylogenetic Reconstruction Algorithms Evan Albright ichbinevan@gmail.com Jack Hessel jmhessel@gmail.com Cody Wang mcdy143@gmail.com Nao Hiranuma hiranumn@gmail.com Abstract Understanding the

More information

Phylogeny Tree Algorithms

Phylogeny Tree Algorithms Phylogeny Tree lgorithms Jianlin heng, PhD School of Electrical Engineering and omputer Science University of entral Florida 2006 Free for academic use. opyright @ Jianlin heng & original sources for some

More information

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms A Comparative Analysis of Popular Phylogenetic Reconstruction Algorithms Evan Albright, Jack Hessel, Nao Hiranuma, Cody Wang, and Sherri Goings Department of Computer Science Carleton College MN, 55057

More information

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction

Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Pitfalls of Heterogeneous Processes for Phylogenetic Reconstruction Daniel Štefankovič Eric Vigoda June 30, 2006 Department of Computer Science, University of Rochester, Rochester, NY 14627, and Comenius

More information

Estimating Evolutionary Trees. Phylogenetic Methods

Estimating Evolutionary Trees. Phylogenetic Methods Estimating Evolutionary Trees v if the data are consistent with infinite sites then all methods should yield the same tree v it gets more complicated when there is homoplasy, i.e., parallel or convergent

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4. Models of DNA and protein change. Likelihood methods Lecture 4. Models of DNA and protein change. Likelihood methods Joe Felsenstein Department of Genome Sciences and Department of Biology Lecture 4. Models of DNA and protein change. Likelihood methods p.1/36

More information

Lecture 6 Phylogenetic Inference

Lecture 6 Phylogenetic Inference Lecture 6 Phylogenetic Inference From Darwin s notebook in 1837 Charles Darwin Willi Hennig From The Origin in 1859 Cladistics Phylogenetic inference Willi Hennig, Cladistics 1. Clade, Monophyletic group,

More information

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço jcarrico@fm.ul.pt Charles Darwin (1809-1882) Charles Darwin s tree of life in Notebook B, 1837-1838 Ernst Haeckel (1934-1919)

More information

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Mathematical Statistics Stockholm University Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences Bodil Svennblad Tom Britton Research Report 2007:2 ISSN 650-0377 Postal

More information

Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

More information

A Minimum Spanning Tree Framework for Inferring Phylogenies

A Minimum Spanning Tree Framework for Inferring Phylogenies A Minimum Spanning Tree Framework for Inferring Phylogenies Daniel Giannico Adkins Electrical Engineering and Computer Sciences University of California at Berkeley Technical Report No. UCB/EECS-010-157

More information

Phylogeny of Mixture Models

Phylogeny of Mixture Models Phylogeny of Mixture Models Daniel Štefankovič Department of Computer Science University of Rochester joint work with Eric Vigoda College of Computing Georgia Institute of Technology Outline Introduction

More information

Using algebraic geometry for phylogenetic reconstruction

Using algebraic geometry for phylogenetic reconstruction Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA

More information