Phylogeny
ARB Workshop 4/5.07.2004, CEH Oxford
Frank Oliver Glöckner

Information
  Who are we:
    Dr. Frank Oliver Glöckner
    Dr. Jörg Peplies
    Max Planck Institute for Marine Microbiology, Microbial Genomics Group, Bremen, Germany
  Contact: arb@mpi-bremen.de
  Mailing list: arb_users@yahoogroups.com
  Where can you find additional information:
    www.arb-home.de
    ftp.mpi-bremen.de/molecol_p/arb
    > all files needed to install ARB can be found in the CEH_Oxford folder

Phylogeny: The Backbone of Biology
  Why?
    To track back the origin of organisms
    To unravel evolutionary relationships
    To sort and classify organisms
  How?
    Botany and Zoology: morphology, fossils
    Microbiology: molecular markers

Molecular Markers
  Zuckerkandl and Pauling, 1965: use macromolecules as molecular clocks
    DNA/RNA
    Proteins
  Problem: species vs. gene phylogeny
    Lateral gene transfer
    Genome plasticity/patchwork
    Orthologous/paralogous genes

Universal Tree
  [Figure: universal tree; Doolittle, Science 1999, 284:2124-2128]

Homology
  Definition: two sequences are homologous when they evolved from a common ancestor sequence
  Homology cannot be quantified! Sequences are either homologous or not!
  Orthologous genes: direct common ancestor
  Paralogous genes: originate from a gene duplication
Orthologs/Paralogs
  [Figure: a gene duplication followed by speciation events, giving orthologous and paralogous gene copies in the resulting species]

Phylogenetic Markers
  16S rRNA, 23S rRNA
  Elongation factors: EF-Tu, EF-G
  ATP synthase
  Reg
  Hsp60
  RNA polymerase
  Gyrase
  Housekeeping genes: transcription, translation

rRNA as Phylogenetic Marker
  Advantages
    Functional constancy
    Ubiquitous distribution
    Large size (information content)
    Conserved and highly variable structural elements
    No lateral gene transfer
  Drawbacks
    No continuous sequence change
    Multiple genes/operons
    Different species with identical 16S rRNAs
    One base change needs nearly one million years

Steps in Phylogenetic Analysis
  Sequence determination
  Alignment
  Data analysis
  Phylogenetic reconstruction

Sequence Determination
  Automatic sequencers
    ABI Prism 377 (gel)
    ABI Prism 3100 (16 capillary)
    ABI Prism 3700 (96 capillary)
    MegaBACE 500 (48 capillary)
    MegaBACE 1000 (96 capillary)

The European Database on Ribosomal RNA
  www.psb.ugent.be/rrn/
  Maintenance: Department of Biochemistry, University of Antwerp
  Services
    SSU, LSU sequences, annotations
    Alignments, secondary structures, variability maps
    WWW interface for sequence retrieval
    Software for alignment and tree reconstruction
  Content (aligned sequences)
    Release September 00
    0,85 SSU sequences
    ,400 LSU sequences
RDP-II: Ribosomal Database Project
  http://rdp.cme.msu.edu
  Maintenance: Center for Microbial Ecology, Michigan State University
  Services
    SSU, LSU sequences, annotations
    Alignments
    Phylogenetic trees
    Analysis services via WWW server
  Content (aligned sequences)
    RDP Preview Release from 05/05/2004
    97,8 SSU sequences
    7 LSU sequences

ARB: Software Environment for Sequence Data
  www.arb-home.de
  Maintenance: Department for Microbiology, Technical University Munich
  Services
    SSU, LSU sequences, annotations
    Alignments, phylogenetic trees
    Probe design, probe match
    Software suite ARB
  Content (aligned sequences)
    Prerelease July 04
    59,609 SSU sequences
    698 LSU sequences

Alignment
  Problem: variable regions
  Align the sequences so that homologous bases stand one below the other in a column

ARB Edit
  [Screenshot: ARB sequence editor]

The 16S Secondary Structure
  30S subunit: 21 proteins, 16S rRNA
  50S subunit: 34 proteins, 5S rRNA, 23S rRNA
  Together: 70S ribosome
  [Figure: Escherichia coli 16S rRNA primary and secondary structure]
Secondary Structures
  [Figure: secondary structure information for the same aligned region in Escherichia coli, Mycoplasma hyopneumoniae and Streptococcus oralis]

SSU Secondary Structure
  [Figure: SSU rRNA secondary structure]

Data Analysis: Information Content
                      Size (E. coli)   Information (bits)   Similarity
    16S rRNA          1542 nt          3084                 >67%
    23S rRNA          2904 nt          5808                 >67%
    EF-Tu             394 aa           1706                 >60%
    ATPase β subunit  460 aa           1992                 >6?%

                      Conserved   Variable   Information (variable)
    16S rRNA          568         974        1948
    23S rRNA          615         2289       4578
  [The remaining entries of this table (EF-Tu, ATPase β subunit and the column "Information (real)") are not recoverable from the source]
  Ludwig and Klenk, Bergey's

Information Content
  [Table: number (No) and percentage (%) of alignment positions showing a given number of different characters, for 16S rRNA, 23S rRNA, EF-Tu and ATPase β subunit; the values are not recoverable from the source]

Models of Evolution
  Models of substitution rates between bases
    Transition: A>G, G>A, C>T, T>C
    Transversion: A>C, A>T, G>C, G>T and reverse
    Amino acids: PAM and BLOSUM matrices
  Base frequencies
  Models of among-site substitution rate heterogeneity
    Weighting particular sites according to relative mutation frequencies (position variability)
Models of Evolution
  Jukes-Cantor model
    All substitution types and base frequencies are presumed equal
    Time reversible
  Kimura 2-parameter model
    Transitions are more likely than transversions
    Equal base frequencies
    Time reversible

Substitution Models
  Treeing methods
    Maximum Parsimony: fixed cost matrices
    Distance Matrix and Maximum Likelihood: general model of sequence evolution
  Not addressed
    Lineage-specific substitutions
    Different rates of evolution between lineages

Models of Evolution: General Matrix
  $Q = \begin{pmatrix} -(a_1+a_2+a_3) & a_1 & a_2 & a_3 \\ a_4 & -(a_4+a_5+a_6) & a_5 & a_6 \\ a_7 & a_8 & -(a_7+a_8+a_9) & a_9 \\ a_{10} & a_{11} & a_{12} & -(a_{10}+a_{11}+a_{12}) \end{pmatrix}$
  a = relative rate between the different substitutions x frequency of target base
  Swofford, Book (Hillis), 1996, p. 4

Models of Evolution
  [Figure: rate matrices for General Time Reversible (GTR), Jukes-Cantor and Kimura's 2-parameter model (K2P); Swofford, Book (Hillis), 1996, p. 4 and p. 44]

Treeing Methods: Classification
  "Inferring a phylogeny is really an estimation procedure; we are making a best estimate of an evolutionary history based on incomplete information" (Swofford, 1990)
  Distance-based
    Compute pairwise distances and use them to derive the tree
  Character-based
    Work directly on each character of the data
    Derive trees that optimize the distribution of the actual data pattern for each character
    Maximum Parsimony, Maximum Likelihood
  Algorithm-based
    Generate a tree according to a series of steps (e.g. neighbor joining)
  Criterion-based
    Evaluate alternative trees according to some optimization function
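The difference between the Jukes-Cantor and the Kimura 2-parameter model can be made concrete by writing out their rate matrices. The following is a minimal sketch, not taken from the slides: the function names and the parameters `alpha` (transition rate) and `beta` (transversion rate) are illustrative assumptions, and equal base frequencies are assumed, as both models require.

```python
BASES = "ACGT"
# Transitions are purine<->purine (A, G) and pyrimidine<->pyrimidine (C, T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_rate_matrix(alpha, beta):
    """Return a 4x4 rate matrix (rows/columns in the order A, C, G, T).

    alpha: rate of each transition, beta: rate of each transversion
    (both are hypothetical parameter names chosen for this sketch).
    """
    matrix = []
    for i, source in enumerate(BASES):
        row = []
        for target in BASES:
            if source == target:
                row.append(0.0)  # placeholder, diagonal is set below
            elif (source, target) in TRANSITIONS:
                row.append(alpha)
            else:
                row.append(beta)
        # Each row of a rate matrix sums to zero.
        row[i] = -sum(row)
        matrix.append(row)
    return matrix


def jukes_cantor_rate_matrix(rate):
    """Jukes-Cantor is the special case with all substitution rates equal."""
    return k2p_rate_matrix(rate, rate)


if __name__ == "__main__":
    for row in k2p_rate_matrix(alpha=2.0, beta=1.0):
        print(row)
```

With `alpha` equal to `beta` the matrix collapses to the Jukes-Cantor form, which is why Jukes-Cantor is usually described as the simplest member of this family.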
The Most Common Methods for Tree Reconstruction
  Distance Matrix
    Calculation of distance matrices by binary (pairwise) comparison of the aligned sequences
    UPGMA or Neighbor Joining
  Maximum Parsimony
    Preservation is more likely than change
    Search for topologies that minimize the total tree length, assuming a minimum number of base changes
  Maximum Likelihood
    Searches for the evolutionary model, including the tree itself, that has the highest likelihood of producing the observed data
    Models: transition/transversion; base frequencies; positional variability

Definitions
  [Figure: radial tree and dendrogram with labelled parts: peripheral branch, internal branch, central branch, terminal nodes/tips, links/edges, internal nodes]
  Unrooted tree: the location of the common ancestor is not specified

Distance Matrix: Ultrametric Data
  UPGMA: Unweighted Pair Group Method with Arithmetic Mean
  [Figure: UPGMA tree for ultrametric data]

Distance Matrix: Non-Ultrametric Data
  [Figure: tree for non-ultrametric data]

Distance Matrix: Additive Trees
  Example additive trees: Fitch-Margoliash algorithm
    Calculate the matrix
    Find the most closely related pair of sequences and link it by an internal node
    Link the next related sequence with an internal node
    Calculate branch lengths

  Example (Mount, Book, Bioinformatics 2001, p. 57): three sequences A, B, C joined at one internal node by branches a, b, c

         B    C
    A   22   39
    B        41

    A to B = a + b = 22   (1)
    A to C = a + c = 39   (2)
    B to C = b + c = 41   (3)
    Subtract (3) from (2): a - b = 39 - 41 = -2   (4)
    Add (1) and (4): 2a = 20, a = 10
    From (1) and (2): b = 12, c = 29
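The three-taxon branch length calculation above can be sketched in a few lines. This is only an illustration of the algebra, not the full Fitch-Margoliash procedure; the function name is hypothetical, and the distances 22, 39 and 41 are assumed to be the values of the worked example.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for a, b, c.

    a, b, c are the lengths of the branches leading from the single
    internal node to the taxa A, B and C.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


if __name__ == "__main__":
    a, b, c = three_point_branch_lengths(22, 39, 41)
    print(a, b, c)  # 10.0 12.0 29.0
```

For more than three sequences, the algorithm applies this step repeatedly, treating all remaining sequences as one averaged composite taxon. With non-additive data this can produce the negative branch lengths mentioned for least-squares methods.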
Principle of Neighbor Joining (Saitou and Nei, 1987)
  The fully resolved tree is decomposed from a fully unresolved star tree by successively inserting branches between a pair of closest neighbors and the remaining terminals in the tree
  [Figure: star decomposition of a star tree with eight taxa]

Distance Matrix: Additive Trees
  Finding a tree that fits the matrix
  Find the optimal values for the branching pattern and the branch lengths
  NJ will find the correct tree if the distances are additive
  Problem: non-additive distances caused by superimposed changes
  [Figure: observed distance vs. real distance]

Dealing with Non-Additive Distances
  Instead of using raw dissimilarity, correct distances based on expected numbers of hidden changes
    For some models (JC, K2P, F84) simple distance equations exist
    For others one must use ML
  Outcome: additivity is not restored!!
    > An optimality criterion is needed
    Most widely used: least-squares criterion (e.g. Fitch-Margoliash); can lead to negative branch lengths
    Minimum Evolution (PAUP)

Distance Matrix: Correct for Multiple Changes
  [Figure: correction for multiple changes; Jukes and Cantor, 1969]

Distance Matrix: Pros and Cons
  Very fast
  Only one tree is derived
  Topology and branch lengths are calculated
  Accounts for false identities
  Works with different models of evolution
  Discards the primary character data
  Different sequences can yield the same matrix
  "A distance method would reconstruct the true tree if all genetic divergence events were accurately recorded in the sequence" (Swofford, 1996)

Maximum Parsimony
  MP is an optimality criterion that appeals to the principle: the simplest explanation of the data is the best
  Model of evolution: preservation is more likely than change
  Character-based method
  Evaluates trees
  Selects trees that minimize the total tree length
  Needs a set of outgroup taxa
  Calculations are done from the terminal nodes towards the (arbitrary) root
  Implicit model of evolution, no additional model needed
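The correction for multiple (hidden) changes under the Jukes-Cantor model has a simple closed form, d = -3/4 ln(1 - 4p/3), where p is the observed fraction of differing positions. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and skipping columns that contain a gap character ("-") is a choice made here, not something the slides prescribe.

```python
import math


def p_distance(seq1, seq2):
    """Observed fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Expected substitutions per site, corrected for superimposed changes."""
    if not 0 <= p < 0.75:
        raise ValueError("correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


if __name__ == "__main__":
    p = p_distance("ACGTACGTAC", "ACGTTCGAAC")
    print(p, jukes_cantor_distance(p))
```

The corrected distance is always at least as large as the observed one, and it grows without bound as p approaches 0.75, the value expected for two unrelated random sequences.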
Maximum Parsimony: Evaluation of Trees
  The alignment is checked for informative positions
  To be informative, a site must have the same sequence characters in at least two taxa (e.g. sites 1, 2, 3, 5)
  And it must favor one topology over another (only site 5)
  Only the informative sites are analyzed
  [Figure: the three possible unrooted topologies for four sequences S1 to S4, with the number of mutations each topology requires at the example sites]

Maximum Parsimony: Pros and Cons
  Works directly on the data
  Works fine on data with strong similarity
  Relatively fast
  Does not need a model of evolution
  Calculates only topologies
  Performs weakly on distantly related data
  Prone to false identities (multiple changes): long branch attraction
  Can produce many trees with the same parsimony score

Maximum Likelihood
  ML evaluates a hypothesis about evolutionary history in terms of the probability that a proposed model of the evolutionary process and the hypothesized history would give rise to the observed data
  Character-based method
  Concrete model of evolution needed
  Assumes that nucleotide sites evolve independently
  Likelihood for each site is calculated separately and combined to a total value for a tree
  Looks for the tree with the highest likelihood: $L(T)$ = maximal

Maximum Likelihood
  The likelihood of the full tree is the product of the likelihoods at each site:
    $L(T) = L(1) \times L(2) \times \dots \times L(N) = \prod_{j=1}^{N} L(j)$
  Because the probability of any single observation is an extremely small number, likelihoods are normally handled as logarithms:
    $\ln L(T) = \ln L(1) + \ln L(2) + \dots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$

Maximum Likelihood
  For every internal node all four nucleotides are allowed > 4 x 4 = 16 probabilities
  Each probability is the product of the probability of the base at node (6) and the transition/transversion probabilities
    e.g. probability of the base = 0.25, or its average frequency in the sequence (> depends on model)
    transversion = $10^{-6}$ and transition = $2 \times 10^{-6}$
    Likelihood of this one combination = $0.25 \times 2 \times 10^{-6} \times 10^{-6} = 5 \times 10^{-13}$
  Swofford, Book (Hillis), 1996, p. 4
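The reason for working with logarithms can be shown directly. The sketch below is only illustrative: the function name is hypothetical, the per-site likelihoods are assumed to be given already (computing them requires a tree and a substitution model), and the example values 0.25, 2e-6 and 1e-6 are the base, transition and transversion probabilities assumed for the worked example.

```python
import math


def tree_log_likelihood(site_likelihoods):
    """Sum of the log likelihoods of the individual sites.

    Equivalent to the log of the product of the site likelihoods,
    but safe against floating-point underflow.
    """
    return sum(math.log(value) for value in site_likelihoods)


if __name__ == "__main__":
    # One combination of internal-node bases at a single site:
    # base probability x transition probability x transversion probability.
    one_combination = 0.25 * 2e-6 * 1e-6
    print(one_combination)

    # A naive product over many sites underflows to zero ...
    sites = [one_combination] * 100
    print(math.prod(sites))

    # ... while the sum of logarithms stays well behaved.
    print(tree_log_likelihood(sites))
```

In a full calculation the likelihood of one site is the sum over all 16 combinations of bases at the two internal nodes; only the product over sites is replaced by a sum of logarithms.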
Maximum Likelihood: Pros and Cons
  Works directly on the data
  Performs well also on distantly related data
  Includes models of evolution
  The whole tree is under evaluation: topologies and branch lengths are optimized
  Currently regarded as the best method
  Computationally intense: the number of sequences is limited

MP vs. ML
  [Figure: comparison of MP and ML; Swofford, Book (Hillis), 1996, p. 49]

Searching for Optimal Trees
  Exact algorithms
    Exhaustive search
    Branch-and-bound methods
  Heuristic approaches
    Stepwise addition
    Star decomposition
    Branch swapping

How Many Trees Do We Have to Evaluate?
  Places to add another taxon:
    Two taxa = 1
    Three taxa = 3
    Four taxa = 5
    Five taxa = 7

The 15 Possible Unrooted Trees for 5 Taxa
  [Figure: the 15 unrooted trees for five taxa]

Exhaustive Topologies
  Number of unrooted, bifurcating trees (Swofford, Book (Hillis), 1996, p. 479):

    No of sequences   No of trees
    3                 1
    4                 3
    5                 15
    6                 105
    7                 945
    8                 10,395
    9                 135,135
    10                2,027,025
    50                2.8 x 10^74

    $B(T) = \prod_{i=3}^{T} (2i - 5)$

  The root is just another taxon, so for rooted trees:

    No of sequences   No of trees
    2                 1
    3                 3
    4                 15
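The growth in the number of trees follows directly from the product formula for unrooted, bifurcating trees. A minimal sketch, with hypothetical function names:

```python
def num_unrooted_trees(n_taxa):
    """Number of unrooted, bifurcating trees: product of (2i - 5) for i = 3..n."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for i in range(3, n_taxa + 1):
        count *= 2 * i - 5
    return count


def num_rooted_trees(n_taxa):
    """The root behaves like one additional taxon."""
    return num_unrooted_trees(n_taxa + 1)


if __name__ == "__main__":
    for n in (3, 4, 5, 6, 7, 8, 9, 10, 50):
        print(n, num_unrooted_trees(n))
```

Each factor (2i - 5) is the number of branches, and therefore the number of places, where the i-th taxon can be attached to a tree of i - 1 taxa. This is why exhaustive searches become impractical after only a handful of sequences.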
Exact Algorithms
  Exhaustive (only feasible for a small number of taxa)
    All trees are evaluated
  Branch-and-bound (feasible for a somewhat larger number of taxa)
    Construct a random tree with all sequences and evaluate its value L under the chosen optimality criterion, according to the reconstruction method and model used
    This is the initial upper bound of L
    Start to reconstruct trees from 3 to X taxa by stepwise addition of taxa
    Evaluate each tree: if the score exceeds L there is no need to go further along this path; if the score < L, proceed
    If the score at the end of the path is less than L, take this as the new upper bound

Search Tree for Branch-and-Bound
  [Figure: search tree for branch-and-bound; Swofford, Book (Hillis), 1996, p. 480]

Heuristic Approaches: Global vs. Local Optimum
  Heuristic tree searches generally operate by hill-climbing methods
    Start with an initial tree
    Optimize (rearrange) it under the chosen optimality criterion
    If we find no way for further improvement, stop
  Problem: there is no way of knowing if we reached the global or merely a local optimum
  [Figure: global vs. local optimum]

Heuristics: Stepwise Addition
  Stepwise addition
    Start with three sequences
    Add the next taxon, evaluate the tree, do rearrangements
    Save the one with the best score, add the next taxon
  Addition order
    In the order of the data in the alignment
    Use a distance algorithm to decide the order, e.g. closest taxon addition
    Add the taxon that makes the optimal (e.g. shortest) tree
    Random taxon addition order
Heuristics: Branch Swapping
  Nearest Neighbor Interchange (NNI)
  Subtree pruning and regrafting (SPR)
  Tree bisection and reconnection (TBR)
  Hoping to find a better tree by disturbing (rearranging) the tree to overcome local optima
  Problem: if the tree is on a plateau and the global optimum is several steps away, we might still not reach it

Heuristics: NNI
  [Figure: nearest neighbor interchange; Felsenstein, Book, 2004, p. 9]

Heuristics: SPR
  [Figure: subtree pruning and regrafting; Felsenstein, Book, 2004, p. 4]

Heuristics: TBR
  [Figure: tree bisection and reconnection; Felsenstein, Book, 2004, p. 4]

Confidence Tests: Bootstrapping
  Resampling tree evaluation technique
  New data sets are created from the original data set by sampling columns of characters at random with replacement
  Each site can be sampled again with the same probability as any of the other sites
  Problem: some positions can be over-represented, some sites are missing
  At least 100, better 1,000 trees should be calculated
  Remember: high bootstrap values can make a wrong phylogeny look good!!

Bootstrapping
  [Figure: example alignment of 12 sequences over 50 columns, with conserved columns marked by asterisks and a consensus line]
    HI       attgcagtgtattggggacaaaatggaaatgaagggtctttgcaagatgc
    PSHI     atagctgtttactggggccaaaacggtggagaaggatccttagcagacac
    NIDL     atagtaatatattggggccaaaatgggaatgaaggtagcttagctgacac
    S6608    attgtcatatactggggccaaaatggtgatgaaggaagtcttgctgacac
    USSEQ_   atcgccatctattggggccaaaacggcaacgaaggctctcttgcatccac
    USSEQ_   atcgccatctattggggtcaaaacggcaacgagggctctcttgcatccac
    USSEQ_   atcggcatctattggggccaaaacggcaacgaaggctctcttgcatccac
    VIRE     atttccgtctactggggtcaaaacggtaacgagggctccctggccgacgc
    VURNH    auuuccgucuacuggggucaaaacggcaacgagggcucucuggccgacgc
    HHI      atagccatctattggggccaaaacggaaacgaaggtaacctctctgccac
    VURNHB   auagccaucuacuggggccaaaacggcaacgagggaacgcuuuccgaagc
    NBSIL    attgtagtctattggggccaagatgtaggagaaggtaaattgattgacac
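Creating one bootstrap data set means drawing alignment columns at random, with replacement, until the replicate is as long as the original. The following sketch shows only this resampling step; the function name, the dictionary layout (name to aligned sequence) and the toy sequences are assumptions made for illustration, and the tree reconstruction that would follow for each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    rng: a random.Random instance, passed in so that runs are reproducible.
    """
    n_columns = len(next(iter(alignment.values())))
    if any(len(seq) != n_columns for seq in alignment.values()):
        raise ValueError("all sequences must have the same aligned length")
    # Draw column indices with replacement; the same indices are applied to
    # every sequence so that columns stay intact.
    columns = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    toy = {"seq1": "attgcagtgt", "seq2": "atagctgttt", "seq3": "atcgccatct"}
    rng = random.Random(42)
    for _ in range(3):
        print(bootstrap_alignment(toy, rng))
```

Because columns are drawn with replacement, some columns appear several times in a replicate and others not at all, which is exactly the over-representation and loss of sites noted above.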
Bootstrapping
  [Figure: tree with bootstrap values]

Why Do Trees Differ?
  Information content
    Sequencing errors
    Alignment: homology of characters
    Non-additive data (false identities)
  Different and simplified models of evolution
    Independence of data
    Lineage- and/or position-specific rates of evolution
  Data selection
    Only subsets of organisms and positions
  Calculation heuristics
    Small number of evaluated trees; strong dependence on the order of input data
    Local or global optimum?

Consensus Trees
  [Figure: consensus trees]

Practical Implications: Filters
  Filters: remove or weight down individual alignment columns while treeing
  Keep a balance between data loss and gain of accuracy
  E.g. 50% conservation filter
    Columns in the alignment are only considered for tree reconstruction when at least 50% of the sequences show the same residue
  Position variability
    The position variability for every column is calculated and shown as numbers 1-9 and characters A-Z
    1 means highly variable
    Z means extremely conserved (never seen)

ARB Filter
  [Screenshot: ARB filter]

Practical Implications
  Outgroup
    Choose as many sequences for the outgroup as possible
    They should not be too distantly related to the group of interest
  Data
    Always use the largest dataset available; if necessary, remove sequences after the calculation
    Compare different algorithms
    Reject problematic data
    Never reconstruct trees or filters on partial sequence data
  [Screenshot: ARB Phylo]
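The 50% conservation filter described above can be sketched as a column mask over the alignment. This is not ARB's implementation: the function names are hypothetical, and ignoring gap characters ("-" and ".") when looking for the most frequent residue is an assumption made for this sketch.

```python
from collections import Counter


def conservation_filter(sequences, min_fraction=0.5):
    """Return one boolean per alignment column: True means keep the column.

    A column is kept when the most frequent residue occurs in at least
    min_fraction of all sequences. Gap characters are not counted as
    residues (an assumption of this sketch).
    """
    n_sequences = len(sequences)
    mask = []
    for column in zip(*sequences):
        residues = [r for r in column if r not in "-."]
        if not residues:
            mask.append(False)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        mask.append(most_common_count / n_sequences >= min_fraction)
    return mask


def apply_filter(sequences, mask):
    """Keep only the columns selected by the mask."""
    return ["".join(r for r, keep in zip(seq, mask) if keep) for seq in sequences]


if __name__ == "__main__":
    alignment = ["ACGT", "ACGA", "ATCC", "AGTG"]
    mask = conservation_filter(alignment)
    print(mask)
    print(apply_filter(alignment, mask))
```

Raising `min_fraction` removes more of the variable columns, which reduces noise from false identities but also discards information, the balance between data loss and accuracy mentioned above.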
ARB Internal Architecture: The Concept of ARB
  [Figure: central database and database management, connected by request/update to sequence alignment, phylogenetic reconstructions and the probe functions (Probe_Design, Probe_Match); the PT-Server returns possible probes, matching sequences and the next relative]

PT-Server
  Not delivered with ARB
  Different format of your database for faster performance of sequence search functions within ARB
  It is only used to search the next relative for the automatic aligner and for Probe_Design/Probe_Match
  Creating/updating takes a long time and a lot of memory
  Once it has been created, searching is very fast

Do Not Overdo It