Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Similar documents
BINF6201/8201. Molecular phylogenetic methods

Evolutionary Tree Analysis. Overview

Phylogenetic Tree Reconstruction

Algorithms in Bioinformatics


Phylogeny: building the tree of life

Phylogenetic trees 07/10/13

What is Phylogenetics

Theory of Evolution Charles Darwin

Phylogenetic inference

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Constructing Evolutionary/Phylogenetic Trees

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

EVOLUTIONARY DISTANCES

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogeny: traditional and Bayesian approaches

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

UoN, CAS, DBSC BIOL102 lecture notes by: Dr. Mustafa A. Mansi. The Phylogenetic Systematics (Phylogeny and Systematics)

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Phylogeny 9/8/2014. Evolutionary Relationships. Data Supporting Phylogeny. Chapter 26

Multiple Sequence Alignment. Sequences

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Chapter 26: Phylogeny and the Tree of Life Phylogenies Show Evolutionary Relationships

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Chapter 26: Phylogeny and the Tree of Life

Phylogenies & Classifying species (AKA Cladistics & Taxonomy) What are phylogenies & cladograms? How do we read them? How do we estimate them?

8/23/2014. Phylogeny and the Tree of Life

Introduction to Molecular Phylogeny

Phylogenetic Analysis

Phylogenetic Analysis

1 ATGGGTCTC 2 ATGAGTCTC

Theory of Evolution. Charles Darwin

How should we organize the diversity of animal life?

A (short) introduction to phylogenetics

Molecular Evolution and Phylogenetic Tree Reconstruction

Dr. Amira A. AL-Hosary

How to read and make phylogenetic trees Zuzana Starostová

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Phylogenetic Analysis

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

PHYLOGENY & THE TREE OF LIFE

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Lecture 6 Phylogenetic Inference

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

CHAPTER 26 PHYLOGENY AND THE TREE OF LIFE Connecting Classification to Phylogeny

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Biology 211 (2) Week 1 KEY!

Phylogenetics. BIOL 7711 Computational Bioscience

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogeny and systematics. Why are these disciplines important in evolutionary biology and how are they related to each other?

Concepts and Methods in Molecular Divergence Time Estimation

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Phylogeny Tree Algorithms

Chapter 26 Phylogeny and the Tree of Life

Chapter 16: Reconstructing and Using Phylogenies

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Macroevolution Part I: Phylogenies

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

C.DARWIN ( )

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Chapter 19: Taxonomy, Systematics, and Phylogeny

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Name: Class: Date: ID: A

Constructing Evolutionary/Phylogenetic Trees

Bioinformatics course

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

PHYLOGENY AND SYSTEMATICS

Organisatorische Details

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

A Phylogenetic Network Construction due to Constrained Recombination

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

Organizing Life s Diversity

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Classification, Phylogeny yand Evolutionary History

Phylogeny is the evolutionary history of a group of organisms. Based on the idea that organisms are related by evolution

Introduction to Bioinformatics Introduction to Bioinformatics

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Phylogeny and the Tree of Life

Phylogeny. November 7, 2017

Phylogeny and Systematics

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Phylogenetic inference: from sequences to trees

CLASSIFICATION OF LIVING THINGS. Chapter 18

Effects of Gap Open and Gap Extension Penalties

MOLECULAR EVOLUTION AND PHYLOGENETICS SERGEI L KOSAKOVSKY POND CSE/BIMM/BENG 181 MAY 27, 2011

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Consistency Index (CI)

molecular evolution and phylogenetics

Gene Families part 2. Review: Gene Families /727 Lecture 8. Protein family. (Multi)gene family

Seuqence Analysis '17--lecture 10. Trees types of trees Newick notation UPGMA Fitch Margoliash Distance vs Parsimony

Unit 9: Evolution Guided Reading Questions (80 pts total)

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Gel Electrophoresis. 10/28/0310/21/2003 CAP/CGS 5991 Lecture 10Lecture 9 1

Transcription:

Phylogeny and Evolution Gina Cannarozzi ETH Zurich Institute of Computational Science

History Aristotle (384-322 BC) classified animals. He found that dolphins do not belong to the fish but to the mammals. Carolus Linneus (1758) introduced binomial classification Darwin 1859 explained evolution as a process of random mutation and natural selection. Zimmerman in the 1930s and Hennig in the 50 s began to define objective measures for reconstructing evolutionary history based on shared attributes of extant and fossil organisms. They worked on cladistics- the systematic classification of organisms based shared derived properties 1965 Zuckerkandl and Pauling were the first to use molecular sequences as indicators of phylogeny

Introduction Goal: reconstruct the evolutionary history of life Carl Woese proposed the third domain or kingdom of life based on ribosomal RNA in 1990.

Motivation

Topology Unrooted Tree Rooted Tree Root Internal node Leaf node topology - shape of tree, branching order between nodes rotation about a branch does not change the topology

Tree representations L3 L4 L1 L2 L5 L6 A B C D Figure 7: Least Squares Tree of Matrix D ((A,B)(C,D)) = ((B,A)(C,D)) = ((C,D),(B,A)) 5.2.1 Easy Case I: Ultrametric Tree For an ultrametric tree we assume that a) all proteins evolve at the same time (molecular clock), b) all distances can be measured without error and c) all leaves are equidistant from the root. Tree(Tree(Leaf(A,L1+L3,1),L3,Leaf(B,L2+L3,2)), d(a, X) = d(b, X) = d(c, X) 0, Tree(Leaf(D,L6+L4,4),L4,Leaf d(a, Y ) = d(b, Y ) (C,L5+L4,3))) To compute an ultrametric tree, we can apply following algorithm: X Y

Tree Components topology - branching pattern of a tree root- place on the tree from which everything evolves- common ancestor of everything at the leaves external nodes, leaves, taxonomic units internal nodes or hypothetical taxonomic units (HTU) represent speciation or gene duplication events branches or edges - can have a length

Rooting a tree Most phylogenetic methods produce unrooted trees. This is because they detect differences between sequences, but have no means to orient residue changes relatively to time. There are two ways to root an unrooted tree: use an outgroup- include a group of sequences known to be outside the group of interest assume a molecular clock- all lineages have evolved with the same rate from their common ancestor (usually not a good assumption)

Phylogenetic Trees: graphical representation of the evolutionary history of a set of species Monkey Human Chimp Dog Rat Mouse Cow ancestor of mammals Frog ancestor of vertebrates Puffer fish Possum Puffer fish Zebrafish Zebrafish Chicken Chicken Cow Human Chimp Monkey Dog Puffer fish Mouse Puffer fish Rat Vertebrates Possum Frog Vertebrates

Phylogeny, Evolution, and Alignments 789: ;<=>?8@< '())*#+*,-+,-.'/(0-12)*++/+++2334+5.3++,20.!!""""#""#"!#!""!"#"$"%%"!!!!"%!%"#!"$"&!!! '(*,12-1*.6,+.))(3.'1*!!)/+++(63134.).1720. Rice Corn Dog Fly Mosquito alignment implies an evolutionary relationship also represented by Phylogenetic Tree aligns amino acids that diverged from the same residue in (hypothetical) most recent common ancestor darwinian evolution is driven by random mutation and natural selection our model allows for point mutations and insertions/deletions (indels) mutations may be adaptive, neutral or deleterious alignment shows accepted substitutions since divergence proteins evolve under functional constraints - mutations that destroy function do not appear in database via organism death "correct" alignment represents actual events- substitutions, indels impossible to verify -> take alignment with the highest probability that the alignment is correct under our model

String Alignments [Rice, Mosquito] triosephosphate isomerase lengths=55,53 simil=117.9, PAM_dist=111, identity=36.4% NGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCW...!..!.!...!.:....!.:.!...!! NGDKASIADLCKVLTTGPLNAD TEVVVGCPAPYLTLARSQLPDSVCVAAQNCY Similarity Score (Likelihood Based) PAM distance (evolutionary distance) For pairwise string alignments, the dynamic programming algorithm guarantees that the highest scoring alignment is found. Local alignment- find the highest scoring substring Global alignment- find the highest score for aligning the complete strings

NGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCW.:.:!.:!.! :..!.:. : :..! :::!..:!.! NGDKASIADLCKVLTTGPLNAD TEVVVGCPAPYLTLARSQLPDSVCVAAQNCY PAM distance lengths=55,53 simil=117.9, PAM_dist=111, identity=36.4% Evolutionary NGTTDQVDKIVKILNEGQIASTDVVEVVVSPPYVFLPVVKSQLRPEIQVAAQNCW distance (not time) definition:...!..!.!...!.:. a 1 PAM transformation.. is.!.:.!...! an evolutionary! step where 1% of NGDKASIADLCKVLTTGPLNAD TEVVVGCPAPYLTLARSQLPDSVCVAAQNCY the amino acids are expected to mutate Definition: A 1 PAM (Accepted Point Mutations) transformation is an M is a mutation matrix for which each element describes a probability of evolutionary step where 1(expected) of the amino acids mutate. a mutation This transformation can be described by a 20 x 20 mutation matrix M where M ij = Pr x j x i. M = 0.98 0.01... 0.01 0 0.99... 0.002...... 0.001 0... 0.97 20 i=1 f i (1 M ii ) = 0.01 where f i is the naturally occurring frequency of amino acid i Each mutation matrix is associated with some amount of mutation. For historical reasons the unit of mutation is called a PAM (point accepted mutation) unit. The 1 PAM mutation matrix mutates amino acids with

by reasons of common ancestry divided by the probabilities that the pairing would occur by chance. Here is how the elements of the Dayhoff Matrix are calculated from Mutation matrices. The probability of alignment by random chance is just the frequencies of the amino acids in the database. The probability of alignment by reasons of ancestry is calculated by summing over all amino acids the probability that amino acid mutated into the ones at this position in the alignment. (see slide) We sum over all amino acids because we do not know what X was. We have a Dayhoff entry for each pair of amino acids. We align sequences such that this probability is maximized. ie We maximize the probability of having evolved from a common ancestor ( a maximum likelihood alignment) against the null hypothesis of being randomly aligned. Similarity score Our score compares two events- the probability of alignment by reasons of common ancestry divided by the probability of alignement by random chance - -A- - sequence 1 - -X- - ancestor X. - -S- - sequence 2 - -A- - -S- Match by Chance P r{a}p r{s} = fa fs Pr{A and S from Ancestor X}! r{x A}P r{x S} X fx P! =!X fx MAX MSX = X fs MAX MXS 2 = fs MAS 2 = fa MSA where fa is the frequency of A in nature Compare Two Events 2 fa MAS CommonAncestry = DAS = 10log10 Chance fa fs dynamic programming maximizes this score and thus maximizes the probability (maximum likelihood) that the two sequences evolved from a common ancestor against the null hypothesis of having occurred by random chance.

Dayhoff Matrices 1 PAM C 17.2 S -18.5 12.1 T -21.6-12.7 12.0 P -33.2-18.6-19.5 13.4 A -18.1-14.3-17.5-18.8 11.0 G -25.2-18.7-25.3-24.9-18.2 11.3 N -24.1-15.5-17.5-24.0-22.3-19.1 13.4 D -32.1-18.7-20.0-22.7-21.2-20.5-14.0 12.7 E -35.3-19.4-20.8-21.6-18.6-23.7-19.5-12.8 12.3 Q -28.7-18.4-18.9-19.7-19.4-22.8-17.4-18.7-13.2 H -22.1-20.2-19.7-22.8-22.1-24.1-15.3-19.4-19.4 www.biorecipes.com/dayhoff/code.html 250 PAM C 11.5 S 0.1 2.2 T -0.5 1.5 2.5 P -3.1 0.4 0.1 7.6 A 0.5 1.1 0.6 0.3 2.4 G -2.0 0.4-1.1-1.6 0.5 6.6 N -1.8 0.9 0.5-0.9-0.3 0.4 3.8 D -3.2 0.5-0.0-0.7-0.3 0.1 2.2 4.7 E -3.0 0.2-0.1-0.5-0.0-0.8 0.9 2.7 3.6 Q -2.4 0.2 0.0-0.2-0.2-1.0 0.7 0.9 1.7 H -1.3-0.2-0.3-1.1-0.8-1.4 1.2 0.4 0.4

Multiple Sequence alignments Xenopus Gallus Bos Homo Mus Rattus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCGGTCCAAACAGCGTT---GGCTCTCTA ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCAACATGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACCCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCGGGAGGTAGCTCTCAAAACAGCACCAACGTGCAAATG ****** **** ********* * *** * * *** * * * each column is descended from one position in the sequence of the common ancestor can not be built by algorithms which guarantee optimal score reasonable heuristic algorithms for constructing MSAs existclustal, MAlign, T-Coffee

Markovian Model of Evolution mutations occur with probability independent of previous substitutions substitutions occur indepdently at different positions in the polypeptide chain a single substitution matrix represents the probability of amino acid substitution at any position Proteins do not have Markovian Behavior distant residues come together in the 3D fold and influence each other surface amino acids tolerate more variation than interior residues biological function constrains accepted substitutions - active site conservation back mutations are more probable L -> I -> L chemically similar substitutions are more probable nature is too complex to model exactly

things that do not fit in our evolutionary Lateral Gene Transfer Reversals (snakes) model Convergent evolution (flight evolved 5 different times)

Phylogenetic Trees

How to build trees Starting point: molecular sequences (for this discussion) Goal: a phylogenetic tree describing the evolutionary relationships of the taxa

How many trees are there? Number of leaves Number of unrooted trees Number of rooted trees 2 NA 1 3 1 3 4 3 15 5 15 105 6 105 945 10 2027025 34459425 20 2.216e+20 8.201e+21 50 2.838e+74 2.753e+76 n (2n 5)!! (2n 3)!! Unrooted Tree Rooted Tree Conclusion: We can not evaluate every tree topology when searching for the highest scoring tree.

Clustering Algorithms For certain types of trees, clustering algorithms will work well Ultrametric Trees Additive Trees Advantage: very fast Disadvantage: most real trees do not satisfy these conditions.

Ultrametric Trees d(a, Y ) = d(b, Y ) To compute an ultrametric tree, we can apply following algorithm: X Y D = D = D AX BX CX A B C Figure 8: Ultrametric tree Assume all evolution occurs at the same Choose leaf rate pair with (molecular minimum distance clock) Assume all distances are measured without error 6 Assume all leaves are equidistant from the root UPGMA (unweighted pair group method with arithmetic averages) algorithm for tree building will usually work well for these trees (not mathematically guaranteed)

UPGMA Find i and j that have minimum entry D[i,j] in D Create new group (ij) which has nij = ni + nj members connect i and j on the tree to a new node which corresponds to the group (ij). give the two branches connecting i to (ij) and j to (ij) each length Dij/2 compute distances of all nodes k to (ij) - as d[k,ij] = (ni/(ni+nj))*d[k,i] + (nj/(nj+nj))d[k,j] repeat while number of matrix elements is > 1 join d and c a b c d a b c,d a 0 12 24 24 b 0 24 24 a 0 12 24 c 0 8 b 0 24 d 0 c,d 0 join a and b a,b c,d a,b 0 24 c,d 0

Additive Trees assume that Compute pairwise edge lengths distances and construct have tree for no allerror other nodes based on ultrametricity conditions assume that distances in matrix correspond exactly to For non-singleton nodes apply algorithm recursively branch lengths neighbor-joining 5.2.2 Easy Case algorithm II: Additivity is guaranteed to recover the For additivity we expect that a) pairwise distances can be measured without true tree if the distance matrix is an exact reflection of error and b) distances correspond exactly to branch lengths. Unfortunately, this the tree is not the case for real biological trees. d(a, B) = L1 + L2 d(a, C) = L1 + L3 + L4 d(b, C) = L2 + L3 + L4 L3 L4 L1 L2 A B C Figure 9: Additive tree

neighbor joining algorithm Distance Based Methods: Least Squares (LS) does not assume clock-like evolution For each tip, compute. Choose the and for which is the smallest Join iterms and. Compute the branch length from to the new node ( ) and from to the new node ( ) as: Compute the distance between the new node ( as ) and each of the remaining tips Delete tips and from the tables and replace them by the new node ( ) which is now treated as a tip. if more than 2 nodes remain go back to step 1. Otherwise connect the 2 remaining nodes (say and ) by a branch of length.

5 Construct an initial tree Random tree Finding the Optimal Tree Heuristic for specific data types (Neighbor joining or UPGMA) Search for better scoring topologies using 4-, 5-, or 6-optim while evaluating the tree with a given scoring function (parsimony, distance, or likelihood) Continue to optimize under a scoring criterium until the score no longer improves 5.1 General Tree Construction Procedure For any method finding the best tree is NP-hard. Hence the following procedu is performed: 1. Construct an initial tree evaluation step Random tree Heuristic tree (easy cases, Neighbor Joining) 2. Iterative refinement: 4-/5-/6-optim evaluation steps 3. Randomization evaluation steps while improve 6!optim 5!optim while improve Initial Tree Result Tree no improvement while improve 4!optim Figure 6: General Tree Construction Procedure 5.2 Distance Based Methods Input: Distance matrix D: B C D A d(a, B) d(a, C) d(a, D) B d(b, C) d(b, D) C d(c, D)

4-optim There are 3 different topologies with 4 subtrees. A C A C L2 L1 L3 L4 L5 B D B A D B D C Divide the tree into 4 subtrees (A, B, C and D) Compute the quality for all possible topologies Select the best configuration Repeat for different subtrees until there is no improvement

5-optim and 6-optim 4-optim improves the topologies towards the leaves 5- and 6-optime improve towards the interior of the tree 4-optim 4 subtrees 3 topologies 5-optim 5 subtrees 15 topologies 6-optim 6 subtrees 105 topologies

Types of Tree Construction Methods Character based - Parsimony Distance based - least squares Probability based - Maximum Likelihood or Bayesian Distance Parsimony Maximum Likelihood Input pairwise distance matrix character tables (multiple sequence alignment) pairwise dist. matrix multiple sequence alignment Output branch lengths topology topology branch lengths topology

asy Case I: Ultrametric Tree Distance trees Input: Distance matrix D describing the measured distance between all taxa of interest A B C B C D D(A,B) D(A,C) D(A,D) D(B,C) D(B,D) D(C,D) D s come from pairwise sequence alignments L1 L3 L2 L5 L4 L6 d(a,b) = L1 + L2 d(a,c) = L1 + L3 + L4 + L5 d(a,d) = L1 + L3 + L4 + L6 d(b,c) = L2 + L3 + L4 + L5 d(b,d) = L2 + L3 + L4 + L6 d(c,d) = L5 + L6 the Ls are fit A B C D Figure 7: Least Squares Tree of Matrix D

What to minimize Distance Based Methods: Least Squares (LS) where is what we are trying to minimize, is the number of leaves, is a weighting factor, 1 over the Pam variance, ( ), D is the matrix of experimentally determined distances from the pairwise alignments (for example), d is a matrix of distances calculated from the fit tree.. p.1/1

Distance Methods consider pairwise distances as estimates of the branch length separating two species each distance infers the best unrooted tree for that pair of species in effect, we have many estimated 2-species trees and we try to find the best n-species tree implied by them individual distances are not exactly the path lengths in the full n- species tree between any two species we search for the full tree that does the best job of approximating these individual two-species trees search for the branch lengths and topologies that minimize difference between approximated branch lengths and experimental branch lengths for a given topology, it is possible to solve for the branch lengths that minimize Q using standard least squares methods

Character Based Methods What is a character? finite number of states discrete backbone skull opening hip socket grasping warmblooded alligator 1 1 0 0 0 T. rex 1 1 1 0 0 sparrow 1 1 1 0 0 chimp 1 0 0 1 1 human 1 0 0 1 1 cat 1 0 0 0 1

each character fits on one branch of a phylogenetic tree changes in character happen only once species with the same character are all under the same subtree Perfect Phylogeny warmblooded backbone skull opening hip socket grasping alligator 1 1 0 0 0 T. rex 1 1 1 0 0 sparrow 1 1 1 0 0 chimp 1 0 0 1 1 human 1 0 0 1 1 cat 1 0 0 0 1

Parsimony For molecular sequence data, each column of the MSA will be considered a character. Xenopus Gallus Bos Homo Mus Rattus ATGCATGGGCCAACATGACCAGGAGTTGGTGTCGGTCCAAACAGCGTT---GGCTCTCTA ATGCATGGGCCAGCATGACCAGCAGGAGGTAGC---CAAAATAACACCAACATGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACCCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCAGGAGGTAGCACTCAAAACAGCACCAACGTGCAAATG ATGCATCCGCCACCATGACCAGCGGGAGGTAGCTCTCAAAACAGCACCAACGTGCAAATG ****** **** ********* * *** * * *** * * *

Parsimony The parsimony score is the number of changes of state on the evolutionary tree. The most parsimonious tree is that which minimizes the amount of evolutionary change. The topology is given, parsimony is a method for finding the tree with the least amount of state changes. The highest scoring tree minimizes the number of changes. Occam's Razor- William of Occam (1300-1349): Entities should not be multiplied more than necessary- the fewer assumptions an explanation of a phenomenon depends on, the better it is.

Parsimony Algorithm Use labels at leaves to reconstruct the possible labels at internal nodes Compare the labels at each of the two children of each node. If there is an intersection of the two sets of labels, the parent node is labeled with the result of the intersection and there is no penalty If the intersection is empty, then the node is labeled with the union of the two sets of labels and the penalty increases by +1 Continue from the leaves to the root until all nodes have been labeled

Parsimony Number of Changes: 3 {G,T} +1 {T} {R,T} +1 {T} {G,T} +1 A B C D E F Characters R T T T G G

Optimizing under parsimony For a given topology and alignment position, determine what ancestral residues require the least amount of changes. Compute this for each alignment column (character). Add the number of changes for each position together to obtain the parsimony score (length of the tree). Compute this score for many tree topologies and keep the one(s) with the lowest score.

Assigning ancestral states start at the root, if the set contains more than one character, pick one at random Move from the root towards the leaves. If an intersection exists between the chosen state of the parent and the child, choose it. If not, choose another character at random Many trees may exist with the same parsimony score Number of Changes: 3 {G,T} +1 Number of Changes: 3 T {T} T {R,T} +1 {T} {G,T} +1 T T T A B C D E F A B C D E F Characters R T T T G G Characters R T T T G G

Parsimony problems Inconsistency C D C D Backflips C A G A B A B A true tree parsimony tree there is no information on branch length, only change or no change

Maximum Likelihood Maximum Likelihood: general parameter estimation procedure Parameters are estimated from the data D such that the likelihood L of the data given the parameters is maximized parameters - tree topology and branch lengths input data - aligned molecular sequences goal: find the topology and branch lengths that maximize the likelihood of the data Use Dayhoff matrices to obtain the likelihood of a transition for a given period of time (PAM distance).

Maximum Liklihood Y X Z L1 L2 L3 L5 L4 L6 Y Z A B C D L1 L2 L5 Figure 15: Maximum Likelihood Tree For the whole tree T at position i: L6 A B C D Figure 15: Maximum Likelihood Tree L(T ) i = For Pthe r(x whole i ) tree T atpposition r L3 (X i: i Y i )P r L1 (Y i A i )P r L2 (Y i B i ) XL(T ) i = P r(x ) i Y i P r L3 (X i Y i )P r L1 (Y i A i )P r L2 (Y i B i ) X i Y i Z i P r L4 (X i Z i )P r L5 (Z i C i )P r L6 (Z i D i ) Z i P r L4 (X i Z i )P r L5 (Z i C i )P r L6 (Z i D i ) Over all positions and transformed to a log-scale: Over all positions and transformed to a log-scale: log(l(t )) = log(l(t )) = 8 Data selection N log(l(t ) i ) N i=1 log(l(t ) i ) i=1 We carefully have to select the data to build the tree on. We should verify that

Selecting data to Reconstruct Species Trees Sequences must be derived from a common ancestor (Homologous) Orthologs - sequences related by a speciation event Paralogs- sequences related by a gene-duplication event

Tree of Life