Phylogenetics
Todd Vision
Spring 2008

Outline
- Tree basics
- Sequence alignment
- Inferring a phylogeny
  - Neighbor joining
  - Maximum parsimony
  - Maximum likelihood
- Rooting trees and measuring confidence
- Software and file formats
- Testing hypotheses on a tree

Some applications
- Studying organismal & biogeographic history
  - Systematics (inferring the tree of life)
  - Dating events in the fossil record
  - Testing hypotheses about phenotypic evolution
  - Conservation biology
- Studying gene and protein families (molecular evolution)
  - Studying functional specificity and divergence
  - Identifying selection at the molecular level
  - Understanding host-parasite/pathogen coevolution
  - Identifying horizontal transfer events

Uncultured microbial diversity
Hugenholtz P (2000) Genome Biology 3, 1
Ciccarelli et al. (2006) Science 311, 1283
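Two views of the tree of life
[Figure credit: W. Ford Doolittle]

Unrooted networks vs. rooted trees
[Figure: unrooted network and rooted tree; the rooted tree has a time axis]

Taxa   Unrooted     Rooted
3      1            3
5      15           105
10     2,027,025    34,459,425

Unscaled vs. scaled branches
[Figure: the same tree drawn with unscaled and with scaled branches]

More tree thinking
Which species is oldest?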
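The counts in the table of unrooted and rooted trees can be reproduced with the standard double-factorial formulas for fully bifurcating trees on n labeled taxa: (2n-5)!! unrooted trees and (2n-3)!! rooted trees. The following is a minimal sketch; the function names are illustrative and not part of the lecture.

```python
def odd_double_factorial(m):
    """Product 1 * 3 * 5 * ... * m for odd m (returns 1 when m <= 1)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def n_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees on n_taxa labeled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees on n_taxa labeled leaves."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n_taxa - 3)


for n in (3, 5, 10):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
# 3 1 3
# 5 15 105
# 10 2027025 34459425
```

The rapid growth is why exhaustive search over topologies is only feasible for a handful of taxa.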
Clades and monophyly
- Clade: a monophyletic group
  - Includes the most recent common ancestor (MRCA) of a set of leaves and all of the descendants of that MRCA
  - A subtree on a rooted phylogeny

Tree thinking
Is the frog more closely related to the fish or the human?
Baum et al (2005) Science 310, 979-980.

Tree thinking
Now what do you think?

Which of the four trees depicts a different pattern of relationships than the others?
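The definition of a clade can be checked mechanically: find the MRCA of a set of leaves, then ask whether the leaves under that MRCA are exactly the set. The sketch below assumes a rooted tree written as nested Python tuples with leaf names as strings; the five-taxon tree and its taxon names are a toy example chosen for illustration, not data from the lecture.

```python
def leaves(node):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def mrca_subtree(node, taxa):
    """Smallest subtree of `node` whose leaves include every name in `taxa`."""
    if not isinstance(node, str):
        for child in node:
            if taxa <= leaves(child):
                return mrca_subtree(child, taxa)
    return node


def is_clade(tree, taxa):
    """True if `taxa` is monophyletic: the MRCA has no other descendants."""
    taxa = set(taxa)
    if not taxa <= leaves(tree):
        raise ValueError("some taxa are not in the tree")
    return leaves(mrca_subtree(tree, taxa)) == taxa


# Toy rooted tree (assumed for illustration only).
tree = ("fish", ("frog", ("lizard", ("mouse", "human"))))

print(is_clade(tree, {"mouse", "human"}))   # True
print(is_clade(tree, {"fish", "frog"}))     # False
# The MRCA of frog and human excludes the fish, so the frog shares a more
# recent common ancestor with the human than with the fish.
print(leaves(mrca_subtree(tree, {"frog", "human"})))
```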
Polytomies
- 2 descendants per node
- More than 2 descendants per node (a polytomy)

Incongruence between gene and species trees
- Error
- Lineage sorting
- Gene duplication & gene loss (paralogy)
- Horizontal transfer

Lineage sorting
- A problem between closely related species
- Chimp, human, gorilla

Gene duplication
- Two classes of homologous genes
  - Orthologs diverged through speciation
  - Paralogs diverged through duplication, whether or not they are in the same genome
[Figure: gene tree with a time axis; Species 1, Species 2]
Horizontal transfer
Pereira et al (2000) J Biol Chem. 275(2):1495-501

Alternative explanations for incongruence
Delwiche CF and Palmer JD (1996) Mol Biol Evol: 873-882.

Alignments classified by
- Span
  - Global, encompassing full-length sequences
  - Local, restricted to conserved segments
- Number of sequences
  - Pairwise, involving only two sequences, like BLAST
  - Multiple, involving more than two. Hard!
Trivial vs. difficult alignments
[Figure: example nucleotide alignments, from trivial (nearly identical sequences, no gaps) to difficult (divergent sequences with many gaps)]

Twilight Zone
[Figure: axis of percent amino acid identity, from 100 down to 0, with the twilight zone marked]

Dotplots: phage λ cI vs. P22 c2 repressor
[Figure: three dotplots with window size 1, 11, 25 and stringency 1, 7, 15]

Amino acids versus DNA
- DNA sequences give much worse alignments than amino acid sequences
  - Fewer letters
  - Less realistic scoring matrices
[Figure: two DNA sequences, each labeled ArgLeuCys, aligned with mismatches marked by x]
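The window size and stringency settings on the dotplot slide can be illustrated with a small sketch: a dot is placed at (i, j) when at least `stringency` of the `window` positions starting at i and j match. The function name and the toy sequence are assumptions made for this example; real dotplot programs differ in details such as window centering and scoring.

```python
def dotplot(seq_a, seq_b, window=1, stringency=1):
    """Return the (i, j) start positions where a window of seq_a and a
    window of seq_b share at least `stringency` identical positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.append((i, j))
    return dots


seq = "GATTACA"  # toy sequence, compared against itself

# Window 1, stringency 1: every pair of identical letters gives a dot,
# so the diagonal is surrounded by background noise.
noisy = dotplot(seq, seq, window=1, stringency=1)

# Window 3, stringency 3: only exact 3-letter matches survive.
clean = dotplot(seq, seq, window=3, stringency=3)

print(len(noisy), len(clean))
print(clean)
```

Larger windows with a stringency below the window size, as in the 11/7 and 25/15 settings, trade some sensitivity for a much cleaner diagonal.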
Scoring matrix
[Figure: amino acid scoring matrix]

Affine gap score
- Score depends on length of contiguous gap
  - Gap opening penalty d
  - Gap extension penalty e

$$\gamma(g) = -d - (g - 1)e$$

Defensins
[Figure: alignment of defensins]

[Figure labels: Hydrophobic, Hydrophilic]
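The affine gap score is easy to evaluate directly. The sketch below implements the formula from the slide, a gap of length g scoring -d - (g - 1)e; the penalty values d = 12 and e = 2 are arbitrary choices for illustration, not values from the lecture.

```python
def affine_gap_score(gap_length, gap_open, gap_extend):
    """Score of one contiguous gap: -d - (g - 1) * e."""
    if gap_length < 1:
        raise ValueError("gap length must be at least 1")
    return -gap_open - (gap_length - 1) * gap_extend


d, e = 12, 2  # assumed penalties, for illustration only

one_long_gap = affine_gap_score(4, d, e)        # a single gap of length 4
two_short_gaps = 2 * affine_gap_score(2, d, e)  # two separate gaps of length 2

print(one_long_gap, two_short_gaps)  # -18 -28
```

With the same total number of gapped positions, one long gap is penalized less than two short ones, which reflects the idea that a single insertion or deletion event can span several residues.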
Distance matrix approaches

Raw distances
     A     B     C     D
A    -     5     4     6
B          -     5     5
C                -     2
D                      -

Tree distances
     A     B     C     D
A    -     4.0   4.5   5.5
B          -     4.5   5.5
C                -     3.0
D                      -

[Figure: tree with branch lengths 2, 2, 1.5, 1, 2]

Neighbor joining
- Algorithm
  - Takes a distance matrix as input
  - Starts with a star phylogeny
  - Progressively adds nodes until the tree is fully resolved
- Desirable features
  - Very fast
  - Works well if rate variation is not too great

Neighbor joining
[Figure: star phylogeny and neighbor-joining tree; branch labels 0.53, 0.99, 1.02, 0.80, 0.93, 0.65]

Neighbor joining depends on good estimates of distance
[Figure: observed number of substitutions vs. time, and estimated number of substitutions vs. time]
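One step of neighbor joining can be sketched with the standard Q-matrix criterion: Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of row i, and the pair with the smallest Q is joined. The example uses the raw distance values from the slide; the taxon labels A to D, the zero diagonal, and the function names are assumptions for this illustration, and a complete implementation would repeat the step on a reduced matrix until the tree is resolved.

```python
def nj_q_matrix(dist):
    """Neighbor-joining Q matrix for a symmetric distance matrix."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    q = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
    return q


def pick_pair(q):
    """Pair (i, j), i < j, with the smallest Q value (first one on ties)."""
    n = len(q)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: q[p[0]][p[1]])


def limb_lengths(dist, i, j):
    """Branch lengths from i and j to the new node that joins them."""
    n = len(dist)
    delta = (sum(dist[i]) - sum(dist[j])) / (n - 2)
    return 0.5 * (dist[i][j] + delta), 0.5 * (dist[i][j] - delta)


labels = ["A", "B", "C", "D"]  # assumed labels
raw = [
    [0, 5, 4, 6],
    [5, 0, 5, 5],
    [4, 5, 0, 2],
    [6, 5, 2, 0],
]

q = nj_q_matrix(raw)
i, j = pick_pair(q)
print("join", labels[i], labels[j], "limbs", limb_lengths(raw, i, j))
```

For four taxa the pairs (A, B) and (C, D) tie on Q, as expected, because joining either pair implies the same unrooted topology.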
When are distances misleading?

Maximum parsimony
acgttgccga
acgttactgg
cgtaagatcg
cgtaaaaccg
111112131-

Maximum parsimony
- Advantages
  - Provides explicit mapping of character changes along branches
  - Can be used for non-molecular characters (morphology)
- Disadvantages
  - Nondeterministic: it is a criterion to evaluate a tree, but does not help us locate that tree
  - Non-probabilistic: makes statistical inference difficult
  - Inconsistent: more data can be positively misleading

Homoplasy
- Convergent or parallel character state changes in multiple independent lineages
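Because parsimony is a criterion for scoring a given tree, a scoring routine is the natural thing to sketch. The code below uses Fitch's algorithm, a standard way to count the minimum number of changes per site on a fixed rooted binary tree; it is offered as one possible implementation, not necessarily the one used in the lecture. The four sequences are those on the slide; the tree encodings (leaf indices in nested pairs) and the choice of candidate topologies are assumptions for this example.

```python
def fitch(node, column):
    """Return (possible_states, changes) for one alignment column.

    A node is either a leaf index (int) or a pair (left, right).
    """
    if isinstance(node, int):
        return {column[node]}, 0
    left_states, left_cost = fitch(node[0], column)
    right_states, right_cost = fitch(node[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_cost + right_cost
    # No state in common: one extra change is required at this node.
    return left_states | right_states, left_cost + right_cost + 1


def site_scores(tree, seqs):
    """Minimum number of changes at each site of the alignment."""
    return [fitch(tree, column)[1] for column in zip(*seqs)]


seqs = [
    "acgttgccga",
    "acgttactgg",
    "cgtaagatcg",
    "cgtaaaaccg",
]

# The three possible groupings of four sequences.
candidates = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}
for name, tree in candidates.items():
    print(name, sum(site_scores(tree, seqs)))
```

The tree with the lowest total is the most parsimonious of the three; for larger data sets the hard part is the search over topologies, not the scoring.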
Maximum likelihood
- We get heads with probability p
- Prob of k heads out of n tosses is given by the Binomial Probability

$$L = P(x = k \mid n, p) = \binom{n}{k} p^{k} (1 - p)^{n - k}$$

[Figure: L plotted against p]

To calculate the likelihood of a phylogeny
- The input data is the alignment
  - Each column is independent
- The model includes the topology, branch lengths and substitution matrix
- Consider all possible ancestral states for a given topology
- Choose the tree with branch lengths that maximizes the probability of producing the alignment
[Figure: example tree labeled with nucleotide states g and c]

Maximum Likelihood (ML)
- Advantages
  - Estimates are consistent
    - Given enough data and a correct model, the estimate converges on the correct phylogeny
  - Probabilistic framework
    - One can test relative fit of different models
    - Sometimes the topology itself is a nuisance parameter
- Disadvantages
  - Slow (because we must examine lots of trees, like maximum parsimony)
  - But recent advances make it practical for trees with 100s of leaves
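The coin-toss example can be worked numerically: evaluate the binomial likelihood over a grid of values of p and keep the one that maximizes it. This is a minimal sketch with illustrative function names and made-up data (7 heads in 10 tosses); setting the derivative of log L to zero gives the same answer analytically, p = k/n.

```python
from math import comb


def binomial_likelihood(k, n, p):
    """L = P(x = k | n, p) for k heads in n tosses with heads probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def grid_mle(k, n, steps=1000):
    """Value of p on an evenly spaced grid that maximizes the likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_likelihood(k, n, p))


k, n = 7, 10  # assumed data: 7 heads out of 10 tosses
p_hat = grid_mle(k, n)
print(p_hat, binomial_likelihood(k, n, p_hat))
```

Phylogenetic likelihood follows the same logic with many more parameters: the topology, the branch lengths, and the substitution model take the place of p.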
Locating the root
- Unrooted
- Outgroup
- Midpoint
[Figure: an unrooted tree, then the same tree rooted by outgroup (O) and by midpoint]

Bootstrap confidence values
- How much to trust a given branch or clade?
  - With NJ, parsimony and ML, clades do not come with marginal probabilities
- Bootstrap by resampling the original alignment
  - Sample columns with replacement to build an alignment having the same number of columns
  - Compute a new tree
  - Repeat this many times
  - Count the proportion of resampled trees in which each original branch appears
  - Label the branches on the original tree with these proportions

A phylogeny with bootstrap values
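The bootstrap steps listed above can be sketched as follows. Column resampling is the part specific to the bootstrap; tree inference is passed in as a function. Here a deliberately crude stand-in (`closest_pair`, which reports only the single most similar pair of sequences as a clade) takes the place of NJ, parsimony, or ML. All function names, the example alignment, the replicate count, and the seed are assumptions for illustration.

```python
import random
from itertools import combinations


def resample_columns(seqs, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(seqs[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in picks) for seq in seqs]


def closest_pair(seqs):
    """Toy stand-in for tree inference: the pair with the fewest mismatches,
    returned as a set containing one clade (a frozenset of row indices)."""
    def mismatches(pair):
        a, b = pair
        return sum(x != y for x, y in zip(seqs[a], seqs[b]))
    best = min(combinations(range(len(seqs)), 2), key=mismatches)
    return {frozenset(best)}


def bootstrap_support(seqs, infer_clades, n_reps=200, seed=1):
    """Proportion of replicates in which each original clade reappears."""
    rng = random.Random(seed)
    original = infer_clades(seqs)
    counts = {clade: 0 for clade in original}
    for _ in range(n_reps):
        replicate = infer_clades(resample_columns(seqs, rng))
        for clade in original:
            if clade in replicate:
                counts[clade] += 1
    return {clade: counts[clade] / n_reps for clade in original}


seqs = ["acgttgccga", "acgttactgg", "cgtaagatcg", "cgtaaaaccg"]
for clade, support in bootstrap_support(seqs, closest_pair).items():
    print(sorted(clade), support)
```

Passing the inference method in as a function keeps the resampling logic independent of how trees are built, which mirrors how the bootstrap is applied to any of the three methods.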
Strict consensus
- Consensus may be among bootstrap replicates, or among equally parsimonious trees

Recommended software
- PHYLIP or JalView (NJ)
- PAUP* (Parsimony, interactive, MacOSX)
- RAxML (fast ML)
- MrBayes (Bayesian)
- MEGA (NJ, MP, ML, simple molecular evolution hypothesis testing, visualization)
http://evolution.genetics.washington.edu/phylip/software.html
FASTA format
>gi 18033454 gb L57167.1 F334385_1 Down syndrome cell adhesion molecule DSCAM [Rattus norvegicus]
MWILLSLFQSFNVFSEEPHSSLYFVNSLQEVVFSTSGTLVPPGIPPVTLRWYLTGEEIYVP
GIRHVHPNGTLQIFPFPPSSFSTLIHNTYYTENPSGKIRSQVHIKVLREPYTVRVEQKTMRGNV
VFKIIPSSVEYVTVVSWEKTVSLVSGSRFLITSTGLYIKVQNEGLYNYRITRHRYTGETRQS
NSRLFVSPNSPSILGFHRKMGQRVELPKLGHPEPYRWLKNMPLELSGRFQKTVTGLLI
ENSRPSSGSYVEVSNRYGTKVIGRLYVKQPLKTISPRKVKSSVGSQVSLSSVTGNEQELSWYRN
GEILNPGKNVRITGLNHNLIMHMVKSGGYQFVRKKLSQYVQVVLEGTPKIISFSEKVVSP
>gi 45827726 ref NP_996770.1 Down syndrome cell adhesion molecule isoform CHD2-52 precursor [Homo sapiens]
MWILLSLFQSFNVFSELHSSLYFVNSLQEVVFSTTGTLVPPGIPPVTLRWYLTGEEIYVP
GIRHVHPNGTLQIFPFPPSSFSTLIHNTYYTENPSGKIRSQVHIKVLREPYTVRVEQKTMRGNV
VFKIIPSSVEYITVVSWEKTVSLVSGSRFLITSTGLYIKVQNEGLYNYRITRHRYTGETRQS
NSRLFVSPNSPSILGFHRKMGQRVELPKLGHPEPYRWLKNMPLELSGRFQKTVTGLLI
ENIRPSSGSYVEVSNRYGTKVIGRLYVKQPLKTISPRKVKSSVGSQVSLSSVTGTEQELSWYRN
GEILNPGKNVRITGINHENLIMHMVKSGGYQFVRKKLSQYVQVVLEGTPKIISFSEKVVSP
>gi 20127422 ref NP_001380.2 Down syndrome cell adhesion molecule isoform CHD2-42 precursor [Homo sapiens]
MWILLSLFQSFNVFSELHSSLYFVNSLQEVVFSTTGTLVPPGIPPVTLRWYLTGEEIYVP
GIRHVHPNGTLQIFPFPPSSFSTLIHNTYYTENPSGKIRSQVHIKVLREPYTVRVEQKTMRGNV
VFKIIPSSVEYITVVSWEKTVSLVSGSRFLITSTGLYIKVQNEGLYNYRITRHRYTGETRQS
NSRLFVSPNSPSILGFHRKMGQRVELPKLGHPEPYRWLKNMPLELSGRFQKTVTGLLI
ENIRPSSGSYVEVSNRYGTKVIGRLYVKQPLKTISPRKVKSSVGSQVSLSSVTGTEQELSWYRN
GEILNPGKNVRITGINHENLIMHMVKSGGYQFVRKKLSQYVQVVLEGTPKIISFSEKVVSP

Aligned FASTA Format
>11_1RYP.ent/1-94
-----GYRHITIFSPEGRLYQVEYFKTNQTNINSL
VRGKTVVISQKKVPKLLPT-TVSYIFISRTIGMVV
NGPIPRNLRKEE
>14_1RYP.ent/1-93
------GYRLSIFSPGHIFQVEYLEVKR-GTVG
VKGKNVVLGERRSTLKLQTRITPSKVSKISHVVLSF
SGLNSRILIEKRVEQS
>16_1RYP.ent/1-100
FRNNYGTVTFSPTGRLFQVEYLEIKQGSVTVGLRSN
THVLVLKRNELSSYQKKIIKEHMGLSLGLP
RVLSNYLRQQNYSSLVFNR

Newick format
((A,B),(C,D));

Newick format (with branch lengths)
((A:2.3,B:1.8):3.3,(C:1.4,D:2.2));
[Figure: the corresponding tree with branch lengths 2.3, 3.3, 1.4, 1.8, 2.2]
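To make the Newick grammar concrete, here is a minimal recursive-descent parser sketch. It handles nested parentheses, leaf and internal names, and optional branch lengths, but not quoted names, comments, or whitespace inside the string. The function names and the dictionary layout are choices made for this example, and the leaf labels A to D in the test string are assumed.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with keys
    'name', 'length' (float or None) and 'children' (list)."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                      # consume "("
            children.append(parse_node())
            while peek() == ",":
                pos += 1                  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1                      # consume ")"
        name = read_until(",():")
        length = None
        if peek() == ":":
            pos += 1                      # consume ":"
            length = float(read_until(",()"))
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text at position %d" % pos)
    return root


def root_to_tip(node, so_far=0.0, out=None):
    """Map each leaf name to the summed branch length from the root."""
    if out is None:
        out = {}
    here = so_far + (node["length"] or 0.0)
    if not node["children"]:
        out[node["name"]] = here
    for child in node["children"]:
        root_to_tip(child, here, out)
    return out


tree = parse_newick("((A:2.3,B:1.8):3.3,(C:1.4,D:2.2));")
print(root_to_tip(tree))
```

Missing branch lengths are treated as zero when summing, so the same functions also accept the topology-only form.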
NEXUS
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=89 NCHAR=88;
[!Data from: Laskowski, M., Jr., and W.M. Fitch. 1989. Evolution of
MATRIX
[ 10 20 30 40 50 60 70 80 ]
[........ ]
Struthio_camelus VKYPNTNEEGKEVVLPKILSPIGSGVYSNELNIEYTNVSK??????FT--VYKPVPLYMLSKTSNKNNVVESSGTLRHFGK [86]
Rhea_americana ...L..E..N.V.T...?.?????...--...H...S.E...N...S... [86]
Pterocnemia_pennata ...L..E..N.V...H?EV...--...H...S.E...N...S... [86]
.]
Chauna_chavaria .r...l.t.t...t...rkev..--...t.e...nq...s...n...s... [86]
Anseranas_semipalmata .r...s...l.t...hkev..--..e...t.e...nq...n...s... [86]
BEGIN GENETICCode; StandardNUCLEAR; END;
BEGIN CODONS;
CODESET * UNTITLED = Universal: all ;
END;
BEGIN ASSUMPTIONS;
OPTIONS DEFTYPE=unord PolyTcount=MINSTEPS ;
END;
BEGIN TREES;
TRANSLATE
1 Struthio_camelus,
2 Rhea_americana,
88 Carpococcyx_renauldi,
89 Podargus_strigoides
;
TREE * PAUP_1 = [&R] (1,(((2,3),(4,5)),((((((((((6,7),((30,31),(((32,33),34),(((((((35,57),((((53,67),70) (62,(63,64),(68,69))),(((54,(55,56)),84),74))),(((46,(48,49)),47),71)),((36,59),60),(61,(75,76))),(72,73),77),((44,45),((((50,51),58),52),(65,66)))),(((37,((38,39),40)),41),(42,43)))))),14),15),((((16,20),(18,19)),17),((((21,26),(27,(28,29))), 22),((23,24),25)))),(78,79,80)),87),((81,85),(82,83))),(8,9,(10,((11,(86,88)),13),12))),89)));
END;
BEGIN NOTES;
TEXT TAXON=26 TEXT= G_removed_from_end_of_sequence;
END;

Testing a hypothesis on a tree
- Kishino-Hasegawa or SOWH test
  - The difference in likelihood between two trees is the test statistic; it is compared to a null distribution to assess significance
  - The alternative trees may differ in the presence of an incongruency, for example
- This test is easy to misuse!

Summary
- A good alignment is a prerequisite
- Many methods are available for inferring a tree
  - Maximum likelihood is generally the most accurate, given enough data and an adequate model
- To interpret the tree, it helps to
  - Root it (don't be fooled by the graphic!)
  - Measure confidence in clades (e.g. bootstrap)
    - Branches with low support should be collapsed to polytomies
- Approach the tree with a hypothesis or question in mind