66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866

Size: px

Start display at page:

Download "66 Bioinformatics I, WS 09-10, D. Huson, December 1, Evolutionary tree of organisms, Ernst Haeckel, 1866"

Corey Jacobs
5 years ago
Views:

66 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 5 Phylogeny Evolutionary tree of organisms, Ernst Haeckel, 1866 5.1 References J. Felsenstein, Inferring Phylogenies, Sinauer, 2004. C.

1 66 Bioinformatics I, WS 09-10, D. Huson, December 1, Phylogeny Evolutionary tree of organisms, Ernst Haeckel, References J. Felsenstein, Inferring Phylogenies, Sinauer, C. Semple and M. Steel, Phylogenetics, Oxford University Press, R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological sequence analysis, Cambridge, 1998 D.L. Swofford, G.J. Olsen, P.J.Waddell & D.M. Hillis, Phylogenetic Inference, in: D.M. Hillis (ed.), Molecular Systematics, 2 ed., Sunderland Mass., M. Salemi and A-M. Vandamme, The Phylogenetic Handbook Cambridge University Press 5.2 Why phylogenetic analysis? Phylogenetic analysis has many applications, including understanding the emergence and development of diseases:

2 Bioinformatics I, WS 09-10, D. Huson, December 1, Outline of the chapter 1. Enumerating phylogenetic trees 2. Computer representation of trees 3. Distance-based reconstruction methods 4. Parsimony-based reconstruction methods 5. Jukes-Cantor model of DNA sequences evolution 6. Maximum likelihood reconstruction methods 7. Bayesian reconstruction methods 8. Phylogenetic Networks 1, 3, 4 will cover some material already presented in Grundlagen der Bioinformatik 5.4 Phylogenetic analysis The goal of phylogenetic analysis is to determine and describe the evolutionary relationship between a givens set of species. In particular, this involves determining the order of speciation events and their approximate timing. It is generally assumed that speciation is a branching process: a population of organisms becomes separated into two sub-populations. Over time, these evolve into separate species that do not crossbreed. Because of this assumption, a tree is often used to represent a proposed phylogeny for a set of species, showing how the species evolved from a common ancestor.

3 68 Bioinformatics I, WS 09-10, D. Huson, December 1, Early Evolutionary Studies Before the early 1960s: Anatomical/Morphological features were the dominant criteria used to derive evolutionary relationships between species 1958: Francis Crick suggested to use molecular sequences for phylogenetic tree reconstruction. 1962: Emile Zuckerkandl and Linus Pauling built the first phylogenetic tree from aligned amino acid sequences. They also proposed the theory of the molecular clock. 1967: Walter Fitch and Emanuel Margoliash published the first algorithm for phylogenetic tree reconstruction (least-squares algorithm). 1977: Carl Woese uses phylogenies based on SSU rrna sequences to show that are are three Domains of Life: Eukaryotes, Bacteria and Archaea Phylogenetic trees In the following, we will use X = {x 1, x 2,..., x n } to denote a set of taxa, where a taxon x i is simply a representative of a group of individuals defined in some way. Definition (Phylogenetic tree) A phylogenetic tree (on X) is a system T = (V, E, λ) consisting of a connected graph (V, E) without cycles, together with a labeling λ of the leaves by elements of X, such that: 1. every leaf is labeled by exactly one taxon, and 2. every taxon appears exactly once as a label Unrooted trees An unrooted phylogenetic tree is obtained by placing a set of taxa on the leaves of a tree: Pan_panisc Gorilla Homo_sap Mus_mouse Rattus_norv harbor_sel Bos_ta(cow) fin_whale blue whale Gorilla Pan_panisc Homo_sap Mus mouse harbor_sel Rattus norv Bos_ta(cow) Taxa X + tree phylogenetic tree T on X fin_whale blue_whale

Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 69 5.4.4 Unrooted and rooted trees 5.4.5 Enumerating phylogenetic trees Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves.

4 Bioinformatics I, WS 09-10, D. Huson, December 1, Unrooted and rooted trees Enumerating phylogenetic trees Let T be an unrooted phylogenetic tree on n taxa, i.e., with n leaves. We assume that T is bifurcating, i.e. that each internal node has degree 3. Any non-bifurcating tree on n taxa will have less nodes and edges. Lemma (Number of nodes and edges) A bifurcating phylogenetic tree T on n taxa has 2n 2 nodes, 2n 3 edges and n 3 internal edges for all n 3. Proof: by induction. For n = 3 there is exactly one tree: it has 3 outer edges and zero internal edges. It has 3 leaves and one internal node. Assume that the statement holds for n. Any tree T with n + 1 leaves can be obtained from some tree T with n leaves by inserting a new node v into some edge e of T and connecting v to a new leaf w. This increases both the number of nodes and the number of edges by 2. Thus T has 2n = 2(n + 1) 2 nodes and 2n = 2n 1 = 2(n + 1) 3 edges. The number of internal edges is equal to the total number of edges minus the number of edges leading to leaves: (2n 3) n = n 3. Lemma (Number of trees) There are unrooted trees on n leaves, and rooted trees. U(n) = (2n 5)!! := (2n 5) R(n) = (2n 3)!! = U(n) (2n 3) = (2n 3) The number of possible trees grows extremely rapidly for increasing values of n. For example, for n = 10 we have U(n) = and R(n) = Distances in phylogenetic trees Let T = (V, E) be a phylogenetic tree with node set V and edge set E. An edge weight function ω : E R 0 assigns a non-negative number ω(e) to every edge e. These weights (or lengths) often represent

5 70 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 the number of mutations on evolutionary path from one species to another, or a time estimate for evolution of one species into another. If T is a phylogenetic tree with edge weights ω, then the tree distance (or patristic distance) between i and j is given by d T (i, j) = ω(e), e P (i,j) where the sum is taken over all edges e that are contained in the path P (i, j) from i to j. Example: A B 8 9 D C E The tree distance between B and C is d T (B, C) = = Computer representation of a phylogenetic tree Let X be a set of taxa and T a phylogenetic tree on X. To represent a phylogenetic tree in a computer we maintain a set of nodes V and a set of edges E. Each edge e E maintains a reference to its source node s(e) and target node t(e). Each node v V maintains a list of references to all incident edges. Each leaf node v V maintains a reference λ(v) to the taxon that it is labeled by, and, vice versa, ν(x) maps a taxon x to the node that it labels. Each edge e E maintains its length ω(e) Nested structure A rooted phylogenetic tree T = (V, E, λ) is a nested structure: Consider any node v V. Let T v denote the subtree rooted at v. Let v 1, v 2,..., v k denote the children of v. Then T v is obtainable from its subtrees T v1, T v2,..., T vk by connecting v to each of their roots. Such a nested structure can be written using nested brackets:

6 Bioinformatics I, WS 09-10, D. Huson, December 1, v9 taxon1 Phylogenetic tree v2 v1 v3 v4 v5 v6 v7 v8 taxon3 v10 taxon2 Description: (((taxon 1, taxon 2 ), taxon 3 ), (taxon 4, taxon 5, taxon 6 )) taxon4 taxon5 taxon6 nested description v 1 (v 2, v 3 ) ((v 4, v 5 ), (v 6, v 7, v 8 )) (((v 9, v 10 ), v 5 ), (v 6, v 7, v 8 )) Newick format Phylogenetic trees are usually read and printed using the so-called Newick format. The Newick Standard for representing trees in computer-readable form makes use of the correspondence between trees and nested parentheses, noticed in 1857 by the famous English mathematician Arthur Cayley. The tree ends with a semicolon. Interior nodes are represented by a pair of matched parentheses. Between them are representations of the nodes that are immediately descended from that node, separated by commas. Leaves are represented by their names. Trees can be multifurcating at any level. Branch lengths can be incorporated into a tree by putting a real number after a node and preceded by a colon. This represents the length of the branch immediately below that node. Examples: Unrooted tree Rooted tree with edge lengths a e b d c Newick string: ((a,b),c,(d,e)); Newick string: (((A:0.5,B:0.5):1.5,C:2.0):1.0,(D:1.0,E:1.0):2.0); In the Newick format, each pair of matching brackets corresponds to an internal node and each taxon label corresponds to an external node. To add edge lengths to the format, specify the length len of the edge along which given node we visited by adding :len behind the closing bracket, in the case of an internal node, and behind the label, in the case of a leaf. For example, consider the tree (a,b,(c,d)). If all edges have the same length 1, then this tree is written as (a:1,b:1,(c:1,d:1):1) Algorithm for Tree to Newick This algorithm recursively prints a tree in Newick format. Initialisation with e = null and v = ρ, if the tree is rooted, or v set to an arbitrary internal node, else. Algorithm (tonewick(e, v)) Input: Phylogenetic tree T = (V, E) on X, with labeling λ : V X

7 72 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Output: Newick description of tree begin if v is a leaf then print λ(v) else // v is an internal node print ( for each edge f e incident to v do If this is not the first pass of the loop, print, Let w v be the other node incident to f call tonewick(f, w) print ) print ; end Algorithm for parsing Newick Algorithm parses a tree in Newick format from a (space-free) string s = s 1 s 2... s m. Initialisation: parsenewick(1, m, null). Algorithm (parsenewick(i,j,v)) Input: a substring s i... s j and a root node v Output: a phylogenetic tree T = (V, E, λ) begin while i j do Add a new node w to V if v null then add a new edge {v, w} to E if s i = ( then // hit a subtree Set k to the position of the balancing close-bracket for i call parsenewick(i + 1, k 1, w) // strip brackets and recurse else // hit a leaf Set k = min{k i s k+1 =, or k + 1 = j} Set λ(w) = s i... s k // grab label for leaf end Set i = k + 1 // advance to next, or j if i < j then Check that i + 1 j and s i+1 =, Increment i // advance to next token 5.6 Constructing phylogenetic trees There are four main approaches to constructing phylogenetic trees: 1. Using a distance method, one first computes a distance matrix for a given set of biological data and then one computes a tree that represents these distances as closely as possible. 2. Maximum parsimony takes as input a set of aligned sequences and attempts to find a tree and a labeling of its internal nodes by auxiliary sequences such that the number of mutations along the tree is minimum. 3. Given a probabilistic model of evolution, maximum likelihood approaches aim at finding a phylogenetic tree that maximizes the likelihood of obtaining the given sequences. 4. Bayesian reconstruction methods estimate a posterior probability distribution of trees.

8 Bioinformatics I, WS 09-10, D. Huson, December 1, Distance-based tree reconstruction Assume we are given a set X = {x 1, x 2,..., x n } of taxa. The input to a distance method is a distance matrix D : X X R 0 that associates a distance d(x i, x j ) with every pair of taxa x i, x j X. We require that D satisfies at least the first two properties of a metric: 1. symmetry, d(x, y) = d(y, x), 2. the triangle inequality, d(x, y) + d(y, z) (dx, z) and 3. identity, x y d(x, y) > 0. Distance-Based Phylogeny Problem: Reconstruct an evolutionary tree on n leaves from an n n distance matrix Input: A distance matrix D = (d ij ) on X. Output: A phylogenetic tree T (rooted or unrooted) with edge lengths on X such that the distances in T approximate D in some prescribed way. The most commonly distance-based methods used are 1. UPGMA 2. Neighbor joining 3. Minimum evolution and related methods 5.8 The neighbor-joining algorithm The most widely used distance method is neighbor joining (NJ), originally introduced by Saitou and Nei (1987) 1 and modified by Studier and Keppler (1988). Given a distance matrix D, neighbor joining produces an unrooted phylogenetic tree T with edge lengths. It is more widely applicable than UPGMA, as it does not assume a molecular clock. Starting from a star tree topology the algorithm proceeds to build a tree based on D by repeatedly pairing neighboring taxa: c a d c a d b g b g e f e f 1 Saitou, N., Nei, Y. (1987) SIAM J. on Comp. 10:

9 74 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Algorithm (Neighbor joining) Input: Distance matrix D on X Output: Unrooted phylogenetic tree T Initialization: Define L to be the set of leaf nodes, one for each taxon x X Set T = L while L > 2 do Compute the neighbor-joining matrix N: For all i, j set N ij = d ij (r i + r j ), where r i = 1 L 2 k L d ik Pick a pair i, j L for which N ij is minimum Create a new node k in T Update D: For all m L set d km = 1 2 (d im + d jm d ij ) Connect k to i and to j using two new edges of lengths d ik = 1 2 (d ij + r i r j ) and d jk = d ij d ik, respectively. Remove i and j from L and add k to L. if L = {i, j}, then create a new edge between i and j of length d ij. Let i and j be two neighboring leaves that have the same parent node, k. Remove i, j from the list of nodes L and add k to the current list of nodes. How do we update its distance to any given leaf m? Consider: i i j m j k m We must have: thus which implies d im = d ik + d km, d jm = d jk + d km and d ij = d ik + d jk, d im + d jm = d ik + d km + d jk + d km = d ij + 2d km, d km = 1 2 (d im + d jm d ij ). Why do we set the length of the new edges to d ik = 1 2 (d ij + r i r j ) and d jk = d ij d ik? By definition, r i = 1 L 2 k L d ik equals the average distance q i = 1 L 2 other nodes m i, j, plus d ij L 2 : i j dij k q i m 1 average distance q j m 5 m m 2 4 m 3 k L,k i,j d ik from i to all

10 Bioinformatics I, WS 09-10, D. Huson, December 1, As we see from this figure: 2d ik = d ij + q i q j = d ij + q i q j Example Taxa A B C D A B C - 11 D - d ij L 2 Initialisation: L = X = {A, B, C, D} Iteration: 1. Compute N: first compute column sums r i : d ij L 2 = d ij + r i r j. r A = = 27, r B = = 31, r C = = 27, r D = = 37 N = = T axa A B C D A B 8 (ra + rb)/ C 7 (ra + rc)/2 9 (rb + rc)/2 11 D 12 (ra + rd)/2 14 (rb + rd)/2 11 (rc + rd)/2 T axa A B C D A B C D One minimum is N(A, B) = 21. Define new internal node k to join A and B and compute edge lengths of k to A and of k to B: d(a, k) = 1 2 (d(a, B) + r A r B ) = 1 (8 + 27/2 31/2) = 3 2 d(b, k) = d(a, B) d(a, k) = 8 3 = 5 Compute new D : where T axa k C D k d(k, C) d(k, D) C 11 D d(k, C) = 1 (d(a, C) + d(b, C) d(a, B)) = 4 2 d(k, D) = 1 (d(a, D) + d(b, D) d(a, B)) = 9 2 Second iteration: Compute N: we have L = {AB, C, D}, L = 3, L 2 = 1 r k = = 13, r C = = 15, r D = = 20 N = = T axa AB C D AB 4 9 C 4 ( )/1 11 D 9 ( )/1 11 ( )/1 T axa AB C D AB 4 9 C D 24 24

11 76 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 A minimum is given by N(AB, C). Define a new node k to join AB and C: d(c, k) = 1 2 (d(ab, C) + rc rab) = 1 ( ) = 3 2 d(ab, k) = d(ab, C) d(c, k) = 4 3 = 1 Termination: Only two nodes left, D and ABC. Compute length of edge d(abc, D): d(abc, D) = d(c, D) d(abc, C) = 11 3 = 8 This is the resulting tree: B 5 D A 3 C Additivity and the four-point condition A distance matrix D is called additive, or a tree metric, if it can be represented by a phylogenetic tree. How to tell whether D is additive? Definition (Four-point condition) Assume we are given a set of taxa X. A distance matrix D on X is said to fulfill the four-point condition if for all i, j, k, l X holds. d(i, j) + d(k, l) max{d(i, k) + d(j, l), d(i, l) + d(j, k)} (That is, two of the three expressions d(x i, x j )+d(x k, x l ), d(x i, x k )+d(x j, x l ), and d(x i, x l )+d(x j, x k ) are equal and are not smaller than the third.) We have the following classical result (Buneman, 1971): Theorem (Additivity metrics and the 4PC) A distance matrix D on X is additive, if and only if it fulfills the four-point condition. We will prove one direction of the theorem. Proof : Assume that D is additive. Then there exists a phylogenetic tree T that represents D. Let i, j, k, l be elements of X. If they are all different, then we can assume w.l.o.g. that (i, j) and (k, l) are neighbors in the induced tree T {i,j,k,l}, respectively. It follows that d(i, j) + d(k, l) d(i, k) + d(j, l) = d(i, l) + d(j, k),

12 Bioinformatics I, WS 09-10, D. Huson, December 1, as can be seen from the following figure: and thus D fulfills the four-point condition. Example additive metric: B C D D = A B 3 6 C 5 Check the four-point condition: d(a, B) + d(c, D) = = 12 d(a, C) + d(b, D) = = 12 d(a, D) + d(b, C) = = 8 the four-point condition holds. D = A non-additive metric: B C D A B 4 7 C 5 Check the four-point condition: d(a, B) + d(c, D) = = 12 d(a, C) + d(b, D) = = 14 d(a, D) + d(b, C) = = 10 d(a, C) + d(b, D) max (d(a, B) + d(c, D), d(a, D) + d(b, C)) 4-point condition doesn t hold. D D C C A B A B 5.9 The Jukes-Cantor model of evolution T. Jukes and C. Cantor (1969) introduced a very simple model of DNA sequence evolution: Definition (Jukes-Cantor model) The Jukes-Cantor model of evolution makes the following assumptions: 1. We are given a rooted phylogenetic tree T along which sequences evolve. 2. The sequence length is an input parameter and possible states for each site are A, C, G and T. 3. The possible states for each site are A, C, G and T. 4. For each site of the sequence at the root of T, the state is drawn from a given distribution (typically uniform). 5. The sites evolve identically and independently (i.i.d.) down the edges of the tree from the root at a fixed rate u. 6. With each edge e E we associate a duration t = t(e) and the expected number of mutations per site along e is uτ(e). The probabilities of change to each of the 3 remaining states are equal.

13 78 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 How do we evolve a sequence down an edge e under the Jukes-Cantor model? Let v = (v 1,..., v n ) and w = (w 1,..., w n ) denote the source and target sequences associated with e. We assume that v has already been determined and we want to determine w. Under the Jukes-Cantor model, the evolutionary event C 3 = nucleotide changes to one of the other three bases occurs at a fixed rate of u. Now consider the event C 4 = nucleotide changes to any base (including back to the same base). The rate of occurrences of this event is taken to be 4 3 u. The probability that event C 4 does not occur within time t is: e 4 3 ut (Poisson distribution with k = 0). Given that the event C 4 occurs, the probability of ending in any particular base is 1 4. Let P and Q denote the base pair present at a given site before and after a given time period t. The probability that an event of type C 4 occurs and changes P {A, C, G, T} to base Q within time t is: Prob(Q P, t) = 1 ( 1 e 4 ut) 3. 4 Now, adding the probability that the event C 4 does not occur, we obtain the probability that the state does not change, i.e. P = Q: Prob(Q = P P, t) = e 4 3 ut + 1 ( 1 e 4 ut) 3 = 1 ( 1 + 3e 4 ut) Thus, we obtain a probability-of-change formula for the probability of an observable change occurring at any given site in time t: Prob(change t) = 1 Prob(Q = P P, t) = (1 + 3e 4 ut) 3 ( 1 e 4 ut) 3. = 3 4 We can use this model to generate artificial sequences along a model tree. E.g., set the sequence length to 10 and the mutation rate to u = 0.1. Initially, the root node is assigned a random sequence. Then the sequences are evolved down the tree, at each edge using the probability-of-change formula to decide whether a given base is to change: 1 a ACGTTTCGAG ACGTTTAGAG b 2 1 c * 3 ACCTTGCGGG 1 d 1 e 1 f E.g., the probability of change along the edge marked is 0.75(1 e ) = 0.75(1 e 0.4 ) The Jukes-Cantor distance transformation The Hamming distance between two sequences underestimates their true evolutionary distance (average number of mutations that took place per site over the elapsed time), because it doesn t count back mutations and multiple mutations at the same site.

14 Bioinformatics I, WS 09-10, D. Huson, December 1, Thus, a correction formula is required. For the Jukes-Cantor model, inversion of the probability-ofchange formula leads to the following: Definition (Jukes-Cantor transformation) The Jukes-Cantor distance transformation for a pair of sequences a and b is defined as: formula: JC(a, b) = 3 ( 4 ln 1 4 ) 3 Ham(a, b). (5.1) 5.11 More general transformations The Jukes-Cantor model assumes that all nucleotides occur with the same frequency. This assumption is relaxed under the Felsenstein 81 model introduced by Joe Felsenstein (1981), which has the following distance transformation: F 81(a i, a j ) = B ln(1 Ham(a i, a j )/B), (5.2) where B = 1 (π 2 A + π 2 C + π 2 G + π 2 T) and π Q is the frequency of base Q. The base frequency is obtained from the pair of sequences to be compared, or better, from the complete set of given sequences. Note that this contains the Jukes-Cantor transformation as a special case with equal base frequencies π A = π C = π G = π T = 0.25, thus B = 3 4. Unlike the Jukes-Cantor transformation, this transformation can also be applied to protein sequences, setting B = if all amino acids are generated with the same frequency, and B = 1 20 i=1 π2 i, in the proportional case. Given equal base frequencies, but different proportions P and Q of transitions and transversions between a i and a j, the distance for the Kimura 2 parameter model is computed as: K2P (a i, a j ) = 1 ( ) 2 ln ( ) 1 1 2P Q 4 ln. (5.3) 1 2Q If we drop the assumption of equal base frequencies, then we obtain the Felsenstein 84 transformation: ( F 84(a i, a j ) = 2A ln 1 P ) ( (A B)Q + 2(A B C) ln 1 Q ), (5.4) 2A 2AC 2C where A = π C π T /π Y +π A π G /π R, B = π C π T +π A π G, and C = π R π Y with π Y = π C +π T, π R = π A +π G. Even more general models exist, but we will skip the details. The following figure summarizes the complete overview of the time reversible models:

15 80 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 GTR 3 substitution parameters (2 f. Transitions, 1 f. Transv.) Equal base frequency 2 substitution parametera (1 f. Transitions, 1 f. Transv.) 1 substitution parameter TrN HKY =F84 Equal base frequency SYM K3SP 3 substitution parameters (1 f. Transitions, 2 f. Transv) 2 substitution parameters (1 f. Transitions, 1 f. Transv) GTR = general time reversible TrN = Tamura-Nei SYM = symmetric HKY = Hasegawa-Kishino-Yano K3ST = Kimura 3-Parameter F81 K2P Equal base frequency 1 substitution parameter JC

16 Bioinformatics I, WS 09-10, D. Huson, December 1, Maximum parsimony In phylogenetic analysis, the Maximum Parsimony problem is to find a phylogenetic tree that explains a given set of aligned sequences using a minimum number of evolutionary events. The maximum parsimony method is one of the most widely used used tree reconstruction methods. Suppose we are given a multiple alignment of sequences A of length L and a phylogenetic tree T. The maximum parsimony method makes two fundamental assumptions: Changes at different positions of the sequences are independent. Changes in different parts of the tree are independent The parsimony score of a tree The difference between two sequences X = (x 1,..., x L ) and Y = (y 1,..., y L ) is simply their nonnormalized Hamming distance d(x, Y ) = {k x k y k }, that is, the number of positions at which they differ. Assume we are given a multiple alignment of sequences A = {A 1, A 2,..., A n } and a corresponding phylogenetic tree T, leaf-labeled by A. If we assign a hypothetical ancestor sequence to every internal node in T, then we can obtain a score for T together with this assignment by summing over all differences d(x, Y ), where X and Y are two sequences labeling two nodes that are joined by an edge in T. Example: AGT AGA CGA CCA 1 AGA CGA score = 3 CGA Definition (Parsimony score of a tree) Suppose we are given a multiple alignment of sequences A of length L and a corresponding phylogenetic tree T leaf-labeled by A. Then the parsimony score for T and A is defined as: P S(T, A) = min λ {u,v} E d(λ(u), λ(v)), where the minimum is taken over all possible labelings λ of the nodes of T by sequences of length L that extend the labeling of the leaves by A.. Small Parsimony problem: The Small Parsimony problem is to compute the parsimony score for a given tree T.

17 82 Bioinformatics I, WS 09-10, D. Huson, December 1, The Fitch algorithm In 1971 Walter Fitch 2 published a dynamic programming algorithm that solves the Small Parsimony problem efficiently. Input: A phylogenetic tree T with n leaves. The leaves are labeled by the sequences of an MSA with n sequences over an alphabet Σ. Since maximum parsimony assumes indepedence of each column, one can compute the parsimony score column by column and the final result is given by the sum of the column results. The Fitch algorithm runs in two passes: The forward pass first computes the parsimony score and the backward pass then computes a labeling of the internal nodes The Fitch algorithm: Forward pass As the algorithms proceeds column-by-column, we can assume that each leaf v is labeled by a single character-state c(v). We will assign a set S(v) of states and a number l(v) to each node v as follows: For each leaf v: S(v) = {c(v)}, l(v) = 0 For each inner node v with children u, w: { S(u) S(w) if S(u) S(w) S(v) = S(u) S(w) if S(u) S(w) = l(v) = l(u) + l(w) l(v) = l(u) + l(w) + 1 if S(u) S(w) if S(u) S(w) = To compute S(v) and l(v) we traverse the tree in postorder or bottom-up. The number of changes required is given by the value of l at the root. The forward pass algorithm requires O(n) steps for each column and thus O(nL) in total The Fitch algorithm: Example Example: we use the Fitch algorithm to compute the parsimony score for the following labeled tree: G A G C C A Fitch algorithm: Backward pass The forward pass algorithm computes the parsimony score. An optimal labeling of the internal nodes is obtained via traceback: starting at the root node r, we label r using any character in S(r). Then, 2 Fitch, W. (1971) Toward defining the course of evolution: minimum change of specified tree topology. Syst. Zoology 20:

18 Bioinformatics I, WS 09-10, D. Huson, December 1, for each child w, we use the same letter, if it is contained in S(w), otherwise we use any letter in S(w) as label for w. We then visit the children von w etc. {C,G} {G} {A,G} {G} {G} {C} {A} {C} {C} {A,C} {A} G Again, the algorithm requires O(nL) steps, in total. A G C C A The Fitch algorithm: Backward pass Algorithm (Backward pass) Input: A phylogenetic tree T and the set S(v) of each node v in T Output: The fully labeled tree T do for all nodes v, starting at the root node, going to the leaves if v is the root node then Set c(v) = t with t S(v) else Let w be the parent node of v if c(w) S(v) then Set c(v) = c(w) else Set c(v) = t with t S(v) end Sankoff s algorithm In Fitch s algorithm each mutation is equally weighted independent of the nature of the mutation. For nucleotides and especially for amino acids the assumption of equal substitution costs for all possible character changes is inappropriate. In 1975, David Sankoff 3 proposed an alternative to Fitch s algorithm in which substitutions can be weighted. Input: a phylogenetic tree T = (V, E) with n leaves a character-state c(v) for each leaf v a cost matrix C that specifies the cost C(α, β) of changing state α into state β Forward pass: In a post-order traversal, for each node v and each state α Σ we will compute S α (v) as the minimum cost of the subtree rooted at v, given c(v) = α. Initialization: For each leaf node v and state α set S α (v) = { 0, c(v) = α, else. 3 Sankoff D. (1975) Minimal mutation trees of sequences. SIAM J. Appl. Math. 28:35-42

84 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Main recursion: For each internal node v with children u and w recursively compute S β (u) and S β (w) for all states β.

19 84 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 Main recursion: For each internal node v with children u and w recursively compute S β (u) and S β (w) for all states β. Then set S α (v) = min(s β (u) + C(α, β)) + min(s β (w) + C(α, β)). β β Result: The minimal cost of the whole tree is given by where ρ is the root of the tree. Example: min S α (ρ), α Backward pass: Based on the values of S v (α) from the forward pass we will now choose an optimal character state α for each inner node. As in Fitch s algorithm we start at the root and visit all nodes in a pre-order traversal. For the root ρ set: c(ρ) = arg min S α (ρ) α For every internal node w ρ with parent node v we set: Example: c(w) = arg min(s α (w) + C(α, c(v))). α

20 Bioinformatics I, WS 09-10, D. Huson, December 1, The Large Parsimony problem Definition (Parsimony score of an alignment) Given a multiple alignment A = {a 1,..., a n }, its parsimony score is defined as P S(A) = min{p S(T, A) T is a phylogenetic tree on A}. Definition (Large Parsimony problem) The Large Parsimony problem is to compute P S(A). Theorem (Parsimony is hard) The Large Parsimony problem is NP-hard Maximum likelihood estimation Let A = (a 1,..., a n ) be a multiple sequence alignment of length m on X. The basic idea of the maximum likelihood estimation method is to compute a phylogenetic tree T, together with edge lengths ω, that maximizes the likelihood P(A T, ω, M) that the multiple sequence alignment A was generated on the phylogenetic tree T with edge lengths ω under the given model of sequence evolution M. Such a model M specifies how to choose the initial sequence at the root of the tree T and how sequences evolve along edges of the tree. For example, the Jukes-Cantor model specifies that at any position i of the root sequence, all four nucleotides appear with the same probability of 1 4. The mutation rate is u. Moreover, the length ω(e) of any edge e is interpreted as a duration and the probability that an observable change occurs along an edge of length t is given by the probability-of-change formula: P(change t) = 3 ( 1 e 4 ut) 3. 4 In practice, the models employed are more general than the simple Jukes-Cantor model, usually allowing unequal nucleotide frequencies and different transition probabilities between different nucleotides. The models are generally time reversible, which implies that the optimal score attainable is independent of the location of the root. A main advantage of maximum likelihood estimation is that it provides a systematic framework for explicitly incorporating assumptions and knowledge about the process that generated the given data. Although evolutionary models are only a rough estimate of real-world biological evolution, in practice, maximum likelihood methods have proved to be quite robust to violations of the assumptions made in the models Maximum likelihood estimation is hard As with maximum parsimony, the main draw-back of this approach is that finding an optimal phylogenetic tree is NP-hard. In fact, the situation is in some sense worse: Theorem (Maximum likelihood estimation is hard) For a given phylogenetic tree T, the problem of defining edge lengths ω so as to maximize the likelihood P (A T, ω, M) is NP-hard.

21 86 Bioinformatics I, WS 09-10, D. Huson, December 1, Solving maximum likelihood Assume we are given the following DNA alignment A as input: 1 2 i m a 1 A C... G C G... A a 2 A C... G C C... G a 3 A C... T A C... G a 4 A C... T G G... A There are three possible unrooted phylogenetic trees on a set of four taxa: a1 a3 a1 a2 a1 a2 a2 a4 a3 To simplify the calculation of the maximum likelihood score for such a phylogenetic tree T we root it by choosing one of its internal nodes to be the root ρ: (1) (2) (3) (4) C C A G (1) (3) (5) (5) (2) (4) We will discuss how to compute the maximum likelihood for the first tree of the three unrooted trees. The other two trees are processed similarly. Conceptually, we obtain the likelihood L(i) that a specific rooted phylogenetic tree T generated A i, the i-th character of A, by summing over all possible labellings α of the internal nodes by states. For a bifurcating unrooted phylogenetic tree on n taxa there are n 2 internal nodes, and so there are k (n 2) possible labellings to be considered, where k is the number of possible character states. For our example, we need to consider 16 different possibilities: C C A G + P L(i) = P A i A A (6) a4 a4 A i P A i P A i C C A G C A C C A G G C C C A G T T (6) a3 In general, for a given rooted phylogenetic tree T and model M, the likelihood of T given a single character A i is defined as L(i) = P(A i T, ω, M) = α P(A i, α T, ω, M) = ( α P(α(ρ)) ) e={v,w} P e(α(v), α(w)), where the summation is over all possible labellings α of the internal nodes, P(α(ρ)) is the probability of generating the state α(ρ) at the root and P e (α(v), α(w)) is the probability of observing the two states α(v) and α(w) at the two ends of an edge e = {v, w}.

22 Bioinformatics I, WS 09-10, D. Huson, December 1, The total likelihood that T generated the multiple sequence alignment A is obtained as the product of the likelihoods for all positions of A: L(T ) = L(1) L(2) L(m) = m L(i). i=1 The computation of L(i) is actually more complicated than we have described so far, because not only do we have to consider all possible rooted phylogenetic trees on n taxa, but we must also consider all possible ways of defining edge lengths ω on each such tree. Assume that we have a rooted phylogenetic tree T with edge lengths ω, we will now discuss in more detail how to calculate P (A i, α T, ω, M) = P(α(ρ)) P e (α(v), α(w)) e={v,w} for the case of the Jukes-Cantor model with mutation rate u. Under this model, the probability of observing any particular nucleotide at the root node is 1 4. The probability of observing two different states α(v) and α(w) at opposite ends of an edge e of length t = ω(e) is given by P (α(v) α(w) T ) = 3 4 (1 e 4 3 ut ), whereas the probability of observing the same state at opposite nodes of e equals 1 P (α(v) = α(w) T ) = 1 4 (1 + 3e 4 3 ut ). For example, the value of is given by P A i C C A G A A P (A)P e1 (A, C)P e2 (A, C)P e3 (A, A)P e4 (A, A)P e5 (A, G) = (1 e 4 3 ut 1 )(1 e 4 3 ut 2 )(1 + 3e 4 3 ut 3 )(1 + 3e 4 3 ut 4 )(1 e 4 3 ut 5 ), where e 1,..., e 5 are the five edges in T and t i = ω(e i ) for i = 1,..., Felsenstein s algorithm One can efficiently compute the likelihood for a given bifurcating, rooted phylogenetic tree T with edge lengths ω using Felsenstein s algorithm (Felsenstein, 1981). In this approach, for each node v in T and every possible character state a, in a post-order traversal of T, one recursively computes the conditional likelihood L v (a) for the subtree T v rooted at v, under the condition that the state of v is a, that is, we have α(v) = a. More precisely, for any internal node v with children w 1 and w 2, the main recursion is ( ) ( ) L v (a) = P {v,w1 }(a, b)l w1 (b) P {v,w2 }(a, b)l w2 (b), b b

23 88 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 where summation is over all possible states b and, as above, P {v,w} (a, b) is the probability of observing states a and b at the two ends of the edge {v, w} under the given model M. To initialize the computation, for each leaf v and every symbol a, we set L v (a) = 1, if the node v is labeled by the character state a, and L v (a) = 0, else Summary As mentioned above, we must consider all possible choices of edge lengths ω. Determining an optimal choice of ω is in general a computationally hard problem. One possible heuristic is to optimize the length of each edge in turn, while keeping all other edge lengths fixed. In summary, for a given multiple sequence alignment A on X, in maximum likelihood estimation, a result is obtained by searching through all possible phylogenetic trees T and for each such tree T, computing the maximum likelihood score. The method returns the tree that attains the best score Quartet puzzling Quartet puzzling 4 is a heuristic for approximating the maximum likelihood (MLE) tree. It is based on the idea of computing the MLE trees for all quartets of taxa and then using many rounds of combination steps in which one attempts to fit a random set of quartet trees together so as to obtain a tree on the whole dataset. Assume we are given a multiple alignment of sequences A = {a 1,..., a n }. heuristic proceeds in three steps: The quartet puzzling 1. For each quartet q of four distinct taxa a i, a j, a k, a l A compute the maximum likelihood tree T q. This step is called maximum likelihood step. 2. Choose a random order a i1, a i2,..., a in of the taxa and then use it to guide a combining or puzzle step that produces a tree on the whole data set from quartet trees. 3. Repeat step (2) many times to obtain trees T 1, T 2,..., T R. Report the majority consensus tree T Computing all quartet trees Given four taxa {a 1, a 2, a 3, a 4 }, there are three possible bifurcating topologies: a1 a3 a1 a2 a1 a2 a2 a4 a3 a4 a4 a3 {{a 1, a 2 }, {a 3, a 4 }} {{a 1, a 3 }, {a 2, a 4 }} {{a 1, a 4 }, {a 2, a 3 }} Now, let T be any phylogenetic tree that contains leaves labeled a 1, a 2, a 3, a 4, We say that T induces the split or tree topology {{a 1, a 2 }, {a 3, a 4 }}, if there exists an edge in T that separates the nodes labeled a 1 and a 2 from the nodes a 3 and a 4. We will sometimes abbreviate {{a 1, a 2 }, {a 3, a 4 }} by a 1, a 2 a 3, a 4. Assume we are given this tree T : 4 Korbinian Strimmer and Arndt von Haeseler (1996) Quartet-puzzling: a quartet maximum-likelihood method for reconstructing tree topologies. Mol Biol Evol 13,

24 Bioinformatics I, WS 09-10, D. Huson, December 1, a1 a5 a2 a3 a4 This tree induces the following five quartet tree topologies: a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a5 a3 a5 a2 a4 a3 a5 It is easy to reconstruct the original tree from its set of quartets. However, in the presence of false quartet topologies (as inevitably produced by any phylogenetic reconstruction method), the reconstruction of the original tree can become difficult. In quartet puzzling, quartets are greedily combined to produce a tree for the whole data set. In an attempt to avoid the usual problems of a greedy approach, the combining step is run many times using different random orderings of the taxa The puzzle step The puzzle step is initialized with the maximum likelihood tree for {a 1, a 2, a 3, a 4 }, e.g.: a a3 a2 0 0 a4 Each edge e is labeled with a number of conflicts c(e) that is originally set to 0. In the puzzle step, the task is to decide how to insert the next taxon a i+1 into the tree T i already constructed for taxa {a 1,..., a i }. We proceed as follows: For each edge e in T i we consider what would happen if we were to attach taxon a i+1 into the middle of e by a new edge: We consider all quartets consisting of the taxon a i+1 and three taxa from {a 1, a 2,..., a i }. For each such quartet q we compare the precomputed maximum likelihood tree T q with the restriction of T i to q, denoted by T i q. If these two trees differ, then we increment c(e) by one. Finally, we attach e to an edge f in T i that has a minimum number of conflicts c(f), thereby obtaining T i+1. Example: We illustrate the puzzle step using the five taxon tree shown above. For each of the five quartets we compute the respective ML tree. Assume that these are indeed a1 a3 a1 a3 a1 a4 a1 a4 a2 a4 a2 a5 a2 a5 a3 a5 a2 a4 a3 a5 We start with the maximum likelihood tree on {a 1, a 2, a 3, a 4 }. The puzzle step for a 5 proceeds as

25 90 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 follows: Initial tree: a1 a a3 a4 a 1, a 2 a 4, a 5 a1 a a3 a4 a 1, a 3 a 4, a 5 a 1, a 2 a 3, a 5 a1 a a2 2 0 a4 a1 a a2 a4 a 2, a 3 a 4, a 5 a1 a a3 a4 Hence, the puzzle step suggests that taxon a 5 should be attached in the edge adjacent to the leaf labeled a 4 to obtain the tree a1 a5 a2 a3 a4 In geneal, this procedure is repeated for different orders of taxa many times (say, 1000 for a large dataset). Finally a consensus of all computed trees is derived as the final result The quartet puzzling algorithm Algorithm (Quartet puzzling) Input: Aligned sequences A = {a 1,..., a n }, number R of trees Output: Phylogenetic trees T 1, T 2,..., T R on A Initialization: Compute an MLE tree t q for each quartet q A for r = 1 to R do Randomly permute the order of the sequences in A Set T 4 equal to the precomputed MLE tree on {a 1, a 2, a 3, a 4 } for i = 5 to n do Set c(e) = 0 for all edges in T i 1 for each quartet q {a 1,..., a i } with a i q do Let x, y z, a i be the MLE tree topology on q Let p denote the path from x to y in T i 1 Increment c(e) by 1 for each edge e that is contained in p or that is separated from z by p Choose any edge f for which c(f) is minimal Obtain T i from T i 1 by attaching a new leaf with label a i to f Output T n end

26 Bioinformatics I, WS 09-10, D. Huson, December 1, Bayesian methods Bayesian inference of phylogenetic trees is a method whose goal is to obtain a quantity called the posterior probability of a phylogenetic tree. Generally speaking, the posterior probability of a result is the conditional probability of the result being observed, computed after seeing a given input dataset. The prior probability of a result is the marginal probability of the result being observed, set before the input dataset is examined. In other words, the prior probability distribution of phylogenetic trees on X reflects our belief of the distribution of all possible phylogenetic trees on X. This distribution is determined in advance, without knowledge of the actual sequence data to be examined. Let A be a multiple sequence alignment and T a phylogenetic tree with edge lengths ω, on X. Also, we will assume that some fixed model of evolution M is given. The posterior probability of T can be obtained from the prior probability of T with the help of the likelihood P (A T ) = P (A T, ω, M) using Bayes Theorem: Theorem (Bayes Theorem) P (T A) = P (A T ) P (T ). P (A) The posterior probability P (T A) can be interpreted as the probability that the tree T is correct, given the data. In the simplest case, all trees on X are considered to be equally probable and then the prior distribution is called a flat prior. In the case of a flat prior, the posterior probability P (T A) is proportional to the likelihood P (A T ) (computed under the model M) and the maximum likelihood tree also has maximum posterior probability. The maximum posterior probability over all trees is usually very low, if we think of tree space as a discrete sample space. In Bayesian tree inference the goal is not to determine a single phylogentic tree of maximum posterior probability, but rather to compute a sample of phylogenetic trees according to the posterior probability distribution of trees. Such a sample of trees is then processed further, for example, to generate a single consensus tree (as described later) and then to label the clusters in such a tree by their posterior probabilities. The expression for the posterior probability given by Bayes Theorem looks quite simple and we can efficiently compute P (A T ) using Felsensteins algorithm, as described in in the context of maximum likelihood. However, it is usually impossible to solve the complete expression analytically, as computation of the normalizing factor P (A) usually involves summing over all possible phylogenetic tree topologies T and integrating over all possible edge lengths ω and model parameters µ (of the given model M): where P (T A) = P (A T ) = P (A T )P (T ) P (A) ω µ = P (A T )P (T ) T P (A T )P (T ), P (A T, ω, µ)p (ω, µ)dµdω. To avoid this problem, Bayesian inference employs the Markov chain Monte Carlo (MCMC) approach, which is based on the idea of sampling from the posterior probability distribution using a suitably constructed chain of results.

27 92 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 This is a general approach that is used in a wide range of applications. In the context of phylogenetic tree reconstruction, the MCMC approach constructs a chain of phylogenetic trees. At each step, a modification of the current phylogenetic tree is proposed and then a probabilistic decision is made whether to accept the newly proposed phylogenetic tree or whether to keep the current one Metropolis-Hastings algorithm Algorithm (Metropolis-Hastings) Let T i be the phylogenetic tree at position i of the MCMC chain and let T be a proposed modification of T i. Using what is called the Metropolis-Hastings algorithm, the decision whether to accept T or to keep T i is based on the ratio of the posterior probabilities of the two trees, which is given by R = P (T A) P (T i A) = P (A T )P (T )P (A) P (A T i )P (T i )P (A) = P (A T )P (T ) P (A T i )P (T i ). Note that the normalizing factor P (A), which is so difficult to compute, cancels out and this expression is much easier to calculate than the actual posterior probability of either T i or T. Based on R, we accept the proposed phylogenetic tree T with probability min(1, R). In other words, we always accept T, if it has a better posterior probability than T i, and otherwise we draw a random number p uniformly from the interval [0, 1] and accept the tree T if p < R, and reject T, otherwise. Acceptance means that we set T i+1 = T, whereas rejection implies T i+1 = T i. The aim is to produce an MCMC chain M of bifurcating phylogenetic trees that has the property that the stationary distribution of the trees (including the parameters ω and µ) in the chain equals the posterior probability distribution of the trees (again, including ω and µ). One can show that, by construction, the decision rule described above implies that the number of times that a phylogenetic tree T occurs in M is a valid approximation of the posterior probability of that tree Practical issues For this approach to work well in practice, a number of issues must be taken into account. First, the mechanism for modifying the current phylogenetic tree must be designed with care. On the one hand, the suggested changes should be quite small so as to avoid a drastic drop in posterior probability. On the other hand, the modifications should be large enough so as to ensure that M is able to explore a sufficient portion of parameter space in a reasonable amount of time. Second, as the MCMC chain M starts at a random location in parameter space, usually of very low posterior probability, the chain must be allowed to burn in, that is, it must be given time to reach stationarity, a state in which the chain is actually sampling from the posterior probability. In practice, the first 10,000 elements of M, say, are excluded from further analysis. Third, trees that are close to each other in the MCMC chain M are highly correlated by construction, as they differ only by small modifications. To avoid this problem of autocorrelation, the final set of output trees is obtained by sparsely sampling M, retaining only every 1,000-th tree, say.

28 Bioinformatics I, WS 09-10, D. Huson, December 1, The local algorithm What types of modifications are used in practice? Branch-swapping methods such as NNI (nearest neighbor interchange), SPR (subtree prune and regraft) and TBR (tree bisection and reattachment) are used in the context of maximum parsimony and maximum likelihood methods to search through the space of phylogenetic tree topologies. For the purposes of constructing an MCMC chain we need to couple such an approach with a method for proposing modifications of edge lengths and parameters of the evolutionary model. We will now describe one approach that is used in practice called the local algorithm, which is a modification of the NNI method. The algorithm starts by choosing an internal edge (v, w) of the current phylogenetic tree T i at random. As we assume that T i is bifurcating, both of the nodes v and w are each connected to two further nodes, which we will randomly label a and b for v, and c and d for w, respectively. Let x, y and z denote the distance from a to v, w and d, respectively, as obtained by adding the edge lengths ω of the edges from a to v, v to w, and w to d, appropriately. Here we show a phylogenetic tree before and after a modification by the local algorithm: 0 x y z 0 y' x' z' a v w d a w v d b c c b The distance z from node a to node d is modified by a randomly chosen factor to a new value of z. In this example, the node w is moved by a random amount along the path from a to d, whereas the relative position of node v on the path from a to d is not changed Details of the local algorithm The local algorithm changes the length of the path from a to d and modifies the positions of v and w in that path. Here are some details: First define the new distance from a to d as z = z e p 0.5, where p is a random number uniformly chosen from the interval (0, 1). The lengths of edges on the path from a to d will be modified so that the distance from a to d will be z. Second, randomly decide with equal odds whether node v or node w is to be moved by a random amount. If we decide to move node v, then define x = q z and y = y z /z. Otherwise, define x = x z /z and y = q z. In both cases, q is a random number uniformly chosen from the interval (0, 1). Third, to reposition the nodes v and w according to the values just chosen, remove the two nodes v and w from the path connecting a to d and then reinsert them into the path at distances x and y from a, respectively. If x < y, then the local algorithm does not result in a change of the current tree topology. In

29 94 Bioinformatics I, WS 09-10, D. Huson, December 1, 2009 this case, only the lengths of the edges from a to v, v to w, and w to d change. In more detail, we set ω(a, v) = x, ω(v, w) = y x and ω(w, d) = z y. If x > y, then the modifications result is a new topology in which a and w, w and v, and v and d, respectively, are both connected by an edge. In this case, the edge lengths are ω(a, w) = y, ω(w, v) = x y and ω(v, d) = z x. Simple parameters of the evolutionary model, such as the mutation rate µ in the Jukes-Cantor model, are modified by adding a random number uniformly chosen from an appropriate interval centered at 0. For a set of parameters that is constrained to sum to some specific value, such as 1 in the case of probabilities, the values are randomly modified according to a so-called Dirichlet distribution. A proposal for modifying the tree and a proposal for new values for all parameters of the specified evolutionary model are generated independently and simultaneously, and then accepted or rejected together in in a single application of the ratio-based decision step of the Metropolis-Hasting algorithm Metropolis-coupled Markov chain Monte Carlo It is important that the MCMC chain M is given the opportunity to visit all relevant parts of parameter space. In consequence, the larger the number of taxa is, and the more parameters the evolutionary model M has, the longer the chain must be run. A popular approach for exploring more of parameter space in less time is to use a variant of MCMC called Metropolis-coupled Markov chain Monte Carlo, or (MC 3 ), for short. In this approach k different MCMC chains M 1,..., M k are run in parallel. The first chain M 1 is called the cold chain and this is the one from which inferences will be made. The other k 1 chains are called heated chains. Each of the heated chains aims at sampling from a posterior probability distribution raised to a power of β > 1. Heating a chain has the effect of flattening out the probability landscape and the hope is that a heated chain will be able to move more freely through parameter space. For example, in a strategy called incremental heating, the i-th chain M i aims at sampling from the distribution P (T A) β(i), with 1 β(i) = 1 + (i 1)t, where t is a user-specified heating parameter. After all k chains have moved one step, a swap of states is proposed between two randomly chosen chains M i and M j, using a Metropolis-like algorithm. Let T i denote the current state of chain M i, for all i = 1,..., k. (Here, by current state we mean the total state consisting of the current tree T, the current length parameters ω and the current model parameters µ.) A swap between M i and M j is accepted with probability α = P (T j A) β(i) P (T i A) β(j) P (T i A) β(i) P (T j A) β(j). The main idea is that the heated chains act as scouts exploring the parameter space and that swapping will occasionally help the cold chain to move to a new part of the space which it would otherwise have difficulty reaching because of a deep valley in the original distribution. At the end of the run, the heated chains are discarded and the cold chain is then processed as described above Outlook A major challenge is how do we determine whether an MCMC chain M has been run long enough so that that chain has converged, that is, it provides a good approximation of the posterior distribution of trees. A simple approach is to monitor the likelihood of the phylogenetic trees added to M and to

Phylogenetic Tree Reconstruction

I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven