Bioinformatics 1 -- lecture 9. Phylogenetic trees. Distance-based tree building. Parsimony.

2 Trees can be represented in "parenthesis notation". Each set of parentheses represents a branch point (bifurcation); the comma separates the left and right lineages. For example, (,(,(,))) describes a tree with four taxa. Parenthesis notation can contain sequence labels too, e.g. (A,(B,(C,D))).
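
To make the notation concrete, here is a minimal sketch of a parser that turns a strictly bifurcating parenthesis string into nested Python tuples. The function name `parse_tree` and the labels A to D are illustrative assumptions; branch lengths are not handled.

```python
def parse_tree(text):
    """Parse parenthesis notation into nested tuples; leaves are label strings.

    Assumes every pair of parentheses holds exactly two lineages.
    Unlabeled leaves come back as empty strings.
    """
    pos = 0

    def expect(char):
        nonlocal pos
        if pos >= len(text) or text[pos] != char:
            raise ValueError(f"expected {char!r} at position {pos}")
        pos += 1

    def node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            expect("(")
            left = node()    # lineage before the comma
            expect(",")
            right = node()   # lineage after the comma
            expect(")")
            return (left, right)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]  # a leaf label (possibly empty)

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return tree


labeled = parse_tree("(A,(B,(C,D)))")
unlabeled = parse_tree("(,(,(,)))")
print(labeled)
```

Each nested tuple corresponds to one branch point, so the depth of nesting mirrors the depth of the tree.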

3 Evolutionary time
What branch lengths mean in each kind of tree:
Cladogram: no meaning
Phylogram: genetic change
Ultrametric tree: time
Parenthesis notation can have both labels and distances: (:5,(:,(:,:6):):3)

4 Distance metrics
METRIC DISTANCES between any two or three taxa (a, b, and c) have the following properties:
Property 1: $d(a, b) \ge 0$ (Non-negativity)
Property 2: $d(a, b) = d(b, a)$ (Symmetry)
Property 3: $d(a, b) = 0$ if and only if $a = b$ (Distinctness)
Property 4: $d(a, c) \le d(a, b) + d(b, c)$ (Triangle inequality)
[Figure: triangle with vertices a, b, and c and side lengths 9, 6, and 5, illustrating the triangle inequality]
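
The four properties can be checked mechanically. The sketch below is one way to do it, assuming distances are stored in a dictionary keyed by unordered pairs (so symmetry holds by construction). The helper names `dist` and `is_metric`, and the assignment of the side lengths 9, 6, and 5 to particular pairs, are assumptions for illustration.

```python
from itertools import permutations


def dist(d, x, y):
    """Look up a pairwise distance; a taxon is at distance 0 from itself."""
    return 0 if x == y else d[frozenset((x, y))]


def is_metric(d, taxa):
    """Check non-negativity, distinctness, and the triangle inequality."""
    for a, b in permutations(taxa, 2):
        # Distinct taxa must be a strictly positive distance apart.
        if dist(d, a, b) <= 0:
            return False
        # Going a -> b -> c can never be shorter than going a -> c directly.
        for c in taxa:
            if dist(d, a, c) > dist(d, a, b) + dist(d, b, c):
                return False
    return True


taxa = ["a", "b", "c"]
triangle = {frozenset(("a", "b")): 9,
            frozenset(("a", "c")): 6,
            frozenset(("b", "c")): 5}
too_long = dict(triangle)
too_long[frozenset(("a", "b"))] = 12  # 12 > 6 + 5 breaks Property 4

print(is_metric(triangle, taxa), is_metric(too_long, taxa))
```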

5 Distance metrics
ULTRAMETRIC DISTANCES must satisfy the previous four conditions, plus:
Property 5: The distances from any branch point to the taxa in the clade defined by that branch point are equal.
[Figure: ultrametric tree with taxa a, b, and c]
If distances are ultrametric, then the sequences are evolving in a perfectly clock-like manner, so any two sequences always have the same distance to their common ancestor.
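
One way to test Property 5 directly on a distance matrix is the three-point condition: for every triple of taxa, the two largest of the three pairwise distances must be equal. This is an equivalent formulation that is not stated on the slide; the function name and example values below are assumptions.

```python
from itertools import combinations


def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: in each triple the two largest distances tie."""
    for a, b, c in combinations(taxa, 3):
        small, middle, large = sorted([d[frozenset((a, b))],
                                       d[frozenset((a, c))],
                                       d[frozenset((b, c))]])
        if abs(large - middle) > tol:
            return False
    return True


taxa = ["a", "b", "c"]
# a and b are close relatives; c is equally far from both (clock-like).
clock_like = {frozenset(("a", "b")): 2,
              frozenset(("a", "c")): 6,
              frozenset(("b", "c")): 6}
# A valid metric (9 <= 6 + 5) that is nevertheless not ultrametric.
not_clock_like = {frozenset(("a", "b")): 9,
                  frozenset(("a", "c")): 6,
                  frozenset(("b", "c")): 5}

print(is_ultrametric(clock_like, taxa), is_ultrametric(not_clock_like, taxa))
```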

6 Distance metrics: Additivity
Property 6. Example: if (a, b) are nearest neighbors, then
$d(a, b) + d(c, d) \le \max\left[\,d(a, c) + d(b, d),\ d(a, d) + d(b, c)\,\right]$
For distances to fit into an evolutionary tree, they must be additive. Estimated distances often fall short of these criteria, and thus can fail to produce correct evolutionary trees.
[Figure: four-taxon tree with the distances d(a, b) and d(c, d) marked]
A lineage that goes backwards in time violates additivity.
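
Additivity can be tested with the four-point condition: of the three ways to pair up four taxa, the two largest sums of distances must be equal. This is slightly stronger than the inequality written on the slide and implies it. The sketch below assumes a dictionary of pairwise distances; the example values come from a hypothetical tree ((a,b),(c,d)) with tip branches 1, 2, 3, 4 and an internal branch of 5.

```python
from itertools import combinations


def is_additive(d, taxa, tol=1e-9):
    """Four-point condition checked on every quadruple of taxa."""
    def D(x, y):
        return d[frozenset((x, y))]

    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([D(a, b) + D(c, e),
                       D(a, c) + D(b, e),
                       D(a, e) + D(b, c)])
        if abs(sums[2] - sums[1]) > tol:  # two largest sums must tie
            return False
    return True


taxa = ["a", "b", "c", "d"]
tree_like = {frozenset(("a", "b")): 3, frozenset(("c", "d")): 7,
             frozenset(("a", "c")): 9, frozenset(("b", "d")): 11,
             frozenset(("a", "d")): 10, frozenset(("b", "c")): 10}
distorted = dict(tree_like)
distorted[frozenset(("a", "c"))] = 12  # one overestimated distance

print(is_additive(tree_like, taxa), is_additive(distorted, taxa))
```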

7 What's wrong with these distances?

8 What's wrong with this tree?
[Figure: tree with branch labels 2, 6, and 3]

9 Did the Florida Dentist infect his patients with HIV?
Phylogenetic tree of HIV sequences from the DENTIST, his Patients, & Local HIV-infected People:
[Figure: tree whose tips are the DENTIST's sequences, the patients' sequences (including Patients E, F, and G), and local controls (including Local controls 2, 3, 9, and 35)]
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No: two patients in the tree are marked "No".
From Ou et al. (1992) and Page & Holmes (1998)

10 Character-based versus distance-based methods for tree building
Character-based methods: Use the aligned sequences directly during tree inference.
[Table: taxa Species A to Species E, each with its row of aligned characters]
Distance-based methods: Transform the sequence data into pairwise distances, and then use the distance matrix during tree building, ignoring the characters.
[Table: pairwise distance matrix with rows and columns Species A to Species E]

11 Calculating distances
Uncorrected p-distance: count the changes, divide by the length.
[Table: aligned sequences for Species A to Species E and their pairwise distance matrix. Top: uncorrected p-distance. Bottom: Jukes-Cantor distance.]
Jukes-Cantor correction: $K(A,B) = -\frac{3}{4}\ln\left[1 - \frac{4}{3}D(A,B)\right]$, where $D(A,B)$ is the uncorrected p-distance, e.g. $D(A,B) = 4/20$.
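
Both calculations are short enough to sketch in code. The two example sequences below are made up for illustration (the slide's own alignment is not reproduced here); the 4/20 value is the one given on the slide.

```python
import math


def p_distance(seq1, seq2):
    """Uncorrected p-distance: mismatching sites divided by alignment length."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(x != y for x, y in zip(seq1, seq2))
    return mismatches / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance K = -3/4 ln(1 - 4/3 p)."""
    if p >= 0.75:
        raise ValueError("correction is undefined for p >= 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical 10-site alignment with 2 mismatches.
p_example = p_distance("ACGTACGTAC", "ACGTTCGTAA")
# The slide's value: 4 changes in 20 sites.
k_slide = jukes_cantor(4 / 20)
print(p_example, round(k_slide, 4))
```

The corrected distance is always at least as large as the p-distance, because it adds back an estimate of the substitutions hidden by multiple hits at the same site.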

12 Homoplasy
Independent evolution of the same character:
(1) Convergent events (in either related or unrelated entities)
(2) Parallel events (in related entities)
(3) Reversals (in related entities)
[Figure: three small trees, labeled (1), (2), and (3), illustrating each case]
The Jukes-Cantor correction assumes homoplasy occurs at the rate predicted by random mutations.

13 Neighbor joining: a distance-based method
Choose the closest neighbors. Add a node between them. Choose the next closest, and so on.
[Figure: distance matrix for Species A to Species E and the tree built from it]
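
The procedure as stated on the slide can be sketched as repeated closest-pair joining. Two caveats: the slide does not say how to measure the distance from a newly added node to the remaining taxa, so the sketch assumes a simple average of the two joined members; and the full neighbor-joining method of Saitou and Nei picks pairs with a corrected criterion rather than the raw minimum distance. The function name and the distance values are hypothetical.

```python
from itertools import combinations


def join_closest(pairs):
    """Build a nested-tuple tree by repeatedly joining the closest pair.

    `pairs` maps (taxon, taxon) tuples to distances.
    """
    dist = {frozenset(k): v for k, v in pairs.items()}
    nodes = sorted({taxon for pair in pairs for taxon in pair})
    while len(nodes) > 1:
        # Choose the closest neighbors among the current nodes.
        a, b = min(combinations(nodes, 2), key=lambda p: dist[frozenset(p)])
        merged = (a, b)  # add a node between them
        rest = [n for n in nodes if n != a and n != b]
        for other in rest:
            # Assumption: new node's distance is the average of its members'.
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] + dist[frozenset((b, other))]
            ) / 2
        nodes = rest + [merged]
    return nodes[0]


example = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
           ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree = join_closest(example)
print(tree)
```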

14 Neighbor joining: phylogram
Finally, adjust the branch lengths to fit the distances, if possible!
[Figure: distance matrix for Species A to Species E and the resulting phylogram]

15 Fitch-Margoliash algorithm for calculating the branch lengths
1. Find the most closely-related pair of sequences, A and B.
2. Calculate the average distance from A to all other sequences, then from B to all other sequences.
3. Adjust the position of the common ancestor node for A and B so that the difference between the averages is equal to the difference between the A and B branch lengths, while the sum of the branch lengths is still equal to $d(A,B)$:
$d(A) - d(B) = \frac{d(A,C) + d(A,D)}{2} - \frac{d(B,C) + d(B,D)}{2}$
NOTE: the difference between the averages may be greater than $d(A,B)$, making step 3 impossible.
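
Step 3 amounts to solving two equations in two unknowns: the branch lengths sum to the pair's distance, and they differ by the difference of the averages. A minimal sketch follows, with a hypothetical function name and made-up distances for four taxa A, B, C, and D.

```python
def pair_branch_lengths(d_ab, a_to_others, b_to_others):
    """Solve len_a + len_b = d_ab and len_a - len_b = avg_a - avg_b."""
    avg_a = sum(a_to_others) / len(a_to_others)
    avg_b = sum(b_to_others) / len(b_to_others)
    diff = avg_a - avg_b
    if abs(diff) > d_ab:
        # The case flagged in the NOTE: a branch would have negative length.
        raise ValueError("difference of averages exceeds d(A,B)")
    len_a = (d_ab + diff) / 2
    len_b = d_ab - len_a
    return len_a, len_b


# Hypothetical distances: d(A,B)=3, d(A,C)=9, d(A,D)=10, d(B,C)=10, d(B,D)=11
len_a, len_b = pair_branch_lengths(3, [9, 10], [10, 11])
print(len_a, len_b)
```

Here A is on average one unit closer to the other taxa than B is, so A gets the shorter of the two branches.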

16 In class: create a rooted phylogram with 4 taxa
[Four aligned sequences, one per taxon]
$K(A,B) = -\frac{3}{4}\ln\left[1 - \frac{4}{3}\,\mathrm{pdist}(A,B)\right]$
Directions:
1. Make a distance matrix (p-distance, then convert to J-C distance).
2. Use Neighbor-joining to make a tree.
3. Adjust branch lengths using Fitch-Margoliash.
4. Choose the root using the Midpoint method.

17 Which method do I use?
Sequence similarity    Method to use
strong                 distance
weak                   parsimony
very weak              maximum likelihood

18 Maximum parsimony -- it's character-building
Optimality criterion: The most-parsimonious tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
[Figure: a tree for taxa A to E beside their aligned sequences, with one column highlighted]
For this column, and this tree, one mutation event is required.

19 Character-based tree-building
For this other column, the same tree requires two mutation events. A different tree would require only one.
[Figure: the same tree and alignment, with a different column highlighted]

20 Finding the minimum number of mutations
Given a tree and a set of taxa, one letter each:
(1) Choose optional characters for each ancestor.
(2) Select the root character that minimizes the number of mutations by selecting each and propagating it through the tree.
[Figure: two example trees with candidate ancestral characters; one needs a minimum of 2 mutations, the other a minimum of 1 mutation]
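
A standard way to carry out this count is Fitch's algorithm, which the slide does not name; the sketch below uses it as a stand-in for the two steps above. Each ancestor gets the intersection of its children's candidate sets if they overlap, otherwise their union at the cost of one mutation. The tree shape and the two alignment columns are hypothetical.

```python
def fitch(tree, chars):
    """Return (candidate characters, minimum mutations) for a nested-tuple tree.

    `tree` is a taxon label or a (left, right) tuple.
    `chars` maps each taxon label to its one-letter character.
    """
    if isinstance(tree, str):
        return {chars[tree]}, 0
    left_set, left_cost = fitch(tree[0], chars)
    right_set, right_cost = fitch(tree[1], chars)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one character: no new mutation needed.
        return shared, left_cost + right_cost
    # Children disagree: keep both options and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


tree = (("A", "B"), ("C", ("D", "E")))
column_1 = {"A": "G", "B": "G", "C": "T", "D": "T", "E": "T"}
column_2 = {"A": "G", "B": "T", "C": "G", "D": "T", "E": "T"}

_, cost_1 = fitch(tree, column_1)
_, cost_2 = fitch(tree, column_2)
print(cost_1, cost_2)
```

The first column fits this tree with a single mutation, while the second needs two, which is the situation where a different tree could do better.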

21 Ignore non-informative sites
No mismatches --> 0 mutations, all trees.
1 mismatch --> 1 mutation, all trees.
All different --> all trees equivalent.
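
All three cases are covered by the usual definition of a parsimony-informative site: at least two different characters must each appear in at least two taxa. A small sketch, with a hypothetical function name and made-up columns:

```python
from collections import Counter


def is_informative(column):
    """True if at least two characters each occur in two or more taxa."""
    counts = Counter(column)
    repeated = [char for char, n in counts.items() if n >= 2]
    return len(repeated) >= 2


columns = ["GGGGG",   # no mismatches
           "GGGGT",   # one mismatch
           "ACGT",    # all different
           "GGTTT"]   # two characters, each seen at least twice
flags = [is_informative(col) for col in columns]
print(flags)
```

Only the last column can favor one tree over another, so it is the only one worth scoring.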

22 Max Unweighted Parsimony: Trying all trees
[Figure: aligned sequences for taxa A to E, with a TOTALS row]

