1 Sequence Analysis '17 -- lecture 10
Trees: types of trees, Newick notation, UPGMA, Fitch-Margoliash, Distance vs. Parsimony
2 Phylogenetic trees
What is a phylogenetic tree? A model of evolutionary relationships -- common ancestors and speciation events.
Why build phylogenetic trees?
- To trace the branch order of "taxa" (taxon = a gene, a species, a population, etc.)
- To understand the evolution of traits
- As part of a multiple sequence alignment algorithm
Trees can be "rooted" or "unrooted". [Figure: a tree with a root and a tree with no root]
3 Tree Terminology
[Figure labels: time (rooted trees only), lineages, root, common ancestors (hypothetical), outgroup, taxa]
Taxa are observed species or genes.
4 Inferring evolutionary relationships between the taxa requires rooting the tree.
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: an unrooted tree with the chosen root marked, and the resulting rooted tree]
5 Where the tree is rooted changes its meaning.
Each of these trees is possible by choosing a different root.
[Figure: two rootings of the same unrooted tree. Under one rooting a given pair of taxa branched late; under the other, a pair branched early.]
6 Taxon order doesn't matter. Rotating an ancestor node does not change the relationships. UGENE: swap siblings
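The point about rotating nodes can be checked mechanically. The sketch below is only an illustration: it assumes trees are written as nested Python tuples, and the taxon labels A to D and the helper name `canonical` are hypothetical, not from the slide. Two trees describe the same relationships if they become identical once the children of every node are put in a fixed order.

```python
def canonical(tree):
    """Return a form of the tree that ignores left/right order at every node."""
    if isinstance(tree, str):          # a leaf is just its taxon name
        return tree
    # Canonicalize the children first, then sort them by their printed form.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Hypothetical example trees (labels are assumptions).
t1 = ("A", ("B", ("C", "D")))
t2 = ((("D", "C"), "B"), "A")   # same tree with every node rotated
t3 = ("B", ("A", ("C", "D")))   # a genuinely different branch order

print(canonical(t1) == canonical(t2))  # True
print(canonical(t1) == canonical(t3))  # False
```

Sorting by `repr` is an arbitrary choice; any consistent ordering of the children would do.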
7 Two strategies for rooting a tree: "outgroup" and "midpoint"
1. Choose the midpoint between the two most distant taxa.
2. Choose one taxon as the "outgroup" (it branches first). A good outgroup is not too distant from the rest of the tree.
[Figure labels: cladogram, phylogram]
8 Newick notation: (A,(B,(C,D)))
Trees can be represented in plain-text Newick notation.
Each set of parentheses represents a branch point (split); the comma separates the left and right lineages.
Implies a rooted tree.
(A,(B,(C,D))) = [Figure: the corresponding tree]
Newick notation can contain sequence labels or not, e.g. (,(,(,))).
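To make the parentheses-and-commas rule concrete, here is a minimal sketch of a reader for topology-only Newick strings. It is an illustration under assumptions, not a full Newick parser: it ignores branch lengths and internal node labels, the function name `parse_newick` is hypothetical, and the labels A to D are only example names.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        # Current character, or "" once the end of the string is reached.
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        if peek() == "(":                 # a branch point: "(" child "," child ... ")"
            pos += 1
            children = [node()]
            while peek() == ",":          # the comma separates sibling lineages
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return tuple(children)
        start = pos                       # otherwise a leaf: read its (possibly empty) label
        while peek() not in ("", "(", ")", ","):
            pos += 1
        return text[start:pos].strip()

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected characters after the tree")
    return tree


print(parse_newick("(A,(B,(C,D)));"))   # ('A', ('B', ('C', 'D')))
print(parse_newick("(,(,(,)))"))        # ('', ('', ('', '')))
```

The second call shows that the unlabeled form parses to the same shape, with empty strings standing in for the missing labels.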
9 Here is a Newick tree for 50 taxa
(((((((((((((('Phoca caspica': ,'halichoerus grypus': ): ,'Phoca sibirica': ): ,('phoca largha': ,'Phoca vitulina': ): ): ,'phoca hispida': ): ,('phoca fasciata': ,'phoca groenlandica': ): ): ,'Cystophora cristata': ): ,'Erignathus barbatus': ): ,((((('hydrurga leptonyx': ,'Leptonychotes weddellii': ): ,'lobodon carcinophaga': ): ,'ommatophoca rossii': ): ,('Mirounga angustirostris': ,'mirounga leonina': ): ): ,('Monachus schauinslandi': ,'monachus monachus': ): ): ): ,((((((('Arctocephalus australis': ,'Arctocephalus forsteri': ): ,'Arctocephalus townsendi': ): ,('neophoca cinerea': ,'phocarctos hookeri': ): ): ,('Otaria byronia': ,'Arctocephalus pusillus': ): ): ,('Eumetopias jubatus': ,'zalophus californianus': ): ): ,'Callorhinus ursinus': ): ,'Odobenus rosmarus': ): ): ,((((((('Enhydra lutris': ,'lontra canadensis': ): ,'Mustela vison': ): ,(('martes americana': ,'Martes melampus': ): ,'gulo gulo': ): ): ,'Meles meles': ): ,'Taxidea taxus': ): ,'procyon lotor': ): ,(('Mephitis mephitis': ,'spilogale putorius': ): ,'Ailurus fulgens': ): ): ): ,(((((('Ursus americanus': ,'ursus thibetanus': ): ,'Helarctos malayanus': ): , ('Ursus arctos': ,'ursus maritimus': ): ): ,'Melursus ursinus': ): ,'tremarctos ornatus': ): ,'Ailuropoda melanoleuca': ): ): ,(('Alopex lagopus': ,'vulpes vulpes': ): ,('Canis latrans': ,'Canis lupus': ): ): ): ,((((('Acinonyx jubatus': ,'puma concolor': ): ,'lynx canadensis': ): ,'Felis silvestris': ): ,((('panthera pardus': ,'Uncia uncia': ): ,'panthera tigris': ): ,'Neofelis nebulosa': ): ): ,'Herpestes auropunctatus': ): ): ,'manis tetradactyla': );
10 Did the Florida dentist infect his patients with HIV?
[Figure: phylogenetic tree of HIV sequences from the dentist, his patients, and local HIV-infected people (local controls)]
Yes: The HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.
No: The sequences from the other patients fall outside that clade.
From Ou et al. (1992) and Page & Holmes (1998)
11 Evolutionary time
Tree type           Branch length means
Cladogram           no meaning
Phylogram           genetic change
Ultrametric tree    time
Newick format with distances: (:5,(:1,(:1,:6):1):3)
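One way to see the difference between a phylogram and an ultrametric tree is to add up the branch lengths from the root to each tip: in an ultrametric tree all tips come out at the same distance. The sketch below assumes a simple Newick string with a `:length` after each node; the function names `tip_depths` and `is_ultrametric` and the second example string (with labels A to D) are hypothetical.

```python
def tip_depths(newick):
    """Return (leaf name, root-to-tip distance) pairs for a Newick string with distances."""
    text = newick.strip().rstrip(";")
    pos = 0

    def label_and_length():
        # Read "name:length" up to the next structural character; both parts may be empty.
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        name, _, length = text[start:pos].partition(":")
        return name.strip(), (float(length) if length.strip() else 0.0)

    def node():
        """Return [(leaf, distance from this node's parent)] for the subtree at pos."""
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1
            tips = node()
            while pos < len(text) and text[pos] == ",":
                pos += 1
                tips += node()
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1
            _, length = label_and_length()     # branch above this internal node
            return [(leaf, dist + length) for leaf, dist in tips]
        name, length = label_and_length()
        return [(name, length)]

    return node()


def is_ultrametric(newick, tol=1e-9):
    depths = [d for _, d in tip_depths(newick)]
    return max(depths) - min(depths) <= tol


# The unlabeled string from the slide: tips at 5, 4, 5 and 10, so not ultrametric.
print(tip_depths("(:5,(:1,(:1,:6):1):3)"))
# A hypothetical clock-like tree: every tip is 3 units from the root.
print(is_ultrametric("(A:3,(B:2,(C:1,D:1):1):1);"))
```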
12 Neighbor joining: cladogram
Choose the closest neighbors.
Add a node between them.
Choose the next closest, and so on.
[Figure: distance matrix for Species A to Species E and the resulting cladogram]
13 UPGMA
Unweighted pair group method using averages
Assumes a constant evolutionary clock: all sequences have the same distance to the common ancestor. Produces an ultrametric tree.
[Figure: matrix for Species A to Species E showing raw p-distances and JC-corrected distances]
1) Generate neighbor-joining tree (NJ).
2) For first neighbors, distance to ancestor is dij/2.
3) For next neighbors, distance to ancestor is the average pairwise distance between taxa in the two clades, divided by two.
4) Subtract to get lineage distances.
Distance to common ancestor = average distance between taxa in two clades, divided by two.
Fill in the blanks.
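Steps 2 to 4 can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `upgma`, the taxon labels, and the distance values are all made up, and for simplicity the sketch picks the pair of clusters with the smallest average distance at each step instead of starting from a separate neighbor-joining tree.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA; return (Newick string with branch lengths, root height).

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between taxa a and b
    """
    # Cluster key: tuple of member taxa. Value: (Newick text, height of the cluster's node).
    clusters = {(n,): (n, 0.0) for n in names}

    def avg(c1, c2):
        # Average pairwise distance between the taxa of two clusters.
        return sum(dist[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        height = avg(c1, c2) / 2                  # distance to the common ancestor
        (text1, h1), (text2, h2) = clusters.pop(c1), clusters.pop(c2)
        # Subtract the heights already used below each cluster to get lineage lengths.
        text = f"({text1}:{height - h1:g},{text2}:{height - h2:g})"
        clusters[c1 + c2] = (text, height)

    (text, height), = clusters.values()
    return text + ";", height


# Hypothetical distances, chosen to be clock-like.
raw = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
       ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(pair): value for pair, value in raw.items()}

newick, root_height = upgma(["A", "B", "C", "D"], dist)
print(newick)        # (D:5,(C:3,(A:1,B:1):2):2);
print(root_height)   # 5.0
```

Every tip in the result is 5 units from the root, which is the ultrametric property the slide describes.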
14 Cladogram to phylogram: Fitch-Margoliash algorithm
[Figure: sequence distance matrix for Species A to Species E, and the tree with its lineages; one lineage is labeled 0.59]
Problem statement: Given a tree and a set of sequence distances, derive lineages (branch lengths) such that the tree distances match the sequence distances as closely as possible.
15 Fitch-Margoliash algorithm for calculating the branch lengths
1. Find the most closely related pair of sequences, A and B.
2. Calculate the average distance from A to all other sequences, then from B to all other sequences.
3. Adjust the position of the common ancestor node for A and B so that the difference between the averages is equal to the difference between the A and B branch lengths, while the sum of the branch lengths is still equal to D(A,B). (D = sequence distance, d = lineage distance)
For example, with two other sequences C and D:
$d_A - d_B = \frac{D_{AC} + D_{AD}}{2} - \frac{D_{BC} + D_{BD}}{2}, \qquad d_A + d_B = D_{AB}$
NOTE: the difference between the averages may be greater than D(A,B), making step 3 impossible.
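The two conditions in step 3 (a fixed sum and a fixed difference) determine both branch lengths, so the step reduces to a small calculation. The sketch below is one way to write it; the function name `fm_pair_branches` and the example distances are hypothetical.

```python
def fm_pair_branches(d_ab, dists_a, dists_b):
    """Branch lengths (d_a, d_b) from the common ancestor of the pair A, B.

    d_ab:    sequence distance D(A,B)
    dists_a: sequence distances from A to every other sequence
    dists_b: sequence distances from B to the same sequences, in the same order
    """
    avg_a = sum(dists_a) / len(dists_a)
    avg_b = sum(dists_b) / len(dists_b)
    diff = avg_a - avg_b                  # required value of d_a - d_b
    if abs(diff) > d_ab:
        # The case flagged in the NOTE: one branch length would be negative.
        raise ValueError("difference between averages exceeds D(A,B)")
    # Solve d_a + d_b = d_ab and d_a - d_b = diff.
    return (d_ab + diff) / 2, (d_ab - diff) / 2


# Hypothetical numbers: D(A,B) = 4, A is on average 2 units farther from the rest than B.
print(fm_pair_branches(4, [10, 12], [8, 10]))   # (3.0, 1.0)
```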
16 Exercise 8: create a rooted phylogram with 4 taxa
TTGGTGTGGTG
TTGGTGGGTGG
TGGTGTGTGG
GTGGTTGTGTTG
Directions:
$K(A,B) = -\frac{3}{4}\ln\left[1 - \frac{4}{3}\,\mathrm{pdist}(A,B)\right]$, where pdist = 1 - identity
1. Make a distance matrix (p-distance, then convert to JC-distance).
2. Use neighbor-joining to make a tree.
3. Adjust branch lengths using Fitch-Margoliash.
4. Choose the root using the midpoint method.
5. Write the tree in Newick format.
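Step 1 of the directions can be sketched as follows. This is a minimal illustration: the function names and the two example sequences are hypothetical (they are not the exercise sequences), and the sequences are assumed to be already aligned, of equal length, and free of gaps.

```python
import math


def p_distance(seq1, seq2):
    """Fraction of aligned positions that differ (1 - identity)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(seq1, seq2))
    return mismatches / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance K = -3/4 ln(1 - 4/3 p)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences that differ at 2 of 10 positions.
p = p_distance("ACGTACGTAC", "ACGTACGTTT")
print(p)                  # 0.2
print(jukes_cantor(p))    # slightly larger than p, since hidden substitutions are corrected for
```

Applying both functions to every pair of sequences fills in the distance matrix.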
17 Orthologs/paralogs
Orthologs: homologs originating from a speciation event.
Paralogs: homologs originating from a gene duplication event.
[Figure: 4 species (clam, crab, duck, fish) and 6 sequences; a species tree, a sequence tree, and a reconciled tree marking duplication, speciation, and gene loss events]
In the figure, the clam and fish genes are orthologs; the clam and crab genes are paralogs.
18 How do I know it's a paralog?
If it's a paralog, then at some point in evolutionary history, a species existed with two identical genes in it. One may have been lost since then. (Descendants are still paralogs!)
Paralogs can be from different species.
Paralogous genes have more than the expected sequence divergence:
- Because they are more likely to have different functions
- Because they diverged earlier than the speciation event
Without species information or functional information, it's impossible to tell orthologs from paralogs.
19 Maximum parsimony -- it's character-building
Optimality criterion: The most-parsimonious tree is the one that requires the fewest evolutionary events (e.g., nucleotide substitutions, amino acid replacements) to explain the sequences.
TGGTTTTTTGTG
TGTGTTTTTT
TTTGTGTGGT
TTGGTGTGGTG
TTGGTTTTGTTG
[Figure: one alignment column mapped onto the tree for taxa A to E]
For this column, and this tree, one mutation event is required.
20 Character-based tree-building
For this other column, the same tree requires two mutation events. A different tree would require only one.
TGGTTTTTTGTG
TGTGTTTTTT
TTTGTGTGGT
TTGGTGTGGTG
TTGGTTTTGTTG
[Figure: a second alignment column mapped onto the tree for taxa A to E]
21 Finding the minimum number of mutations
Given a tree and one character per taxon:
(1) Choose the possible characters for each ancestor.
(2) Select the root character that minimizes the number of mutations by trying each candidate and propagating it through the tree.
[Figure: two trees with candidate character sets at the ancestors; one needs a minimum of 2 mutations, the other a minimum of 1 mutation]
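A standard way to carry out this count is Fitch's small-parsimony algorithm, which the slide does not name; it is swapped in here as an assumption. Each ancestor gets the intersection of its children's candidate sets when that is non-empty, and otherwise the union, at the cost of one mutation. The tree shape, the taxon labels A to E, and both example columns are hypothetical.

```python
def fitch(tree, column):
    """Return (candidate characters at this node, minimum mutations below it).

    tree:   a taxon name, or a 2-tuple (left subtree, right subtree)
    column: dict mapping each taxon name to its character in one alignment column
    """
    if isinstance(tree, str):                 # a leaf has exactly one observed character
        return {column[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, column)
    right_set, right_cost = fitch(right, column)
    common = left_set & right_set
    if common:                                # children can agree: no mutation needed here
        return common, left_cost + right_cost
    # Children cannot agree: keep all options open and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


tree = ((("A", "B"), "C"), ("D", "E"))        # hypothetical rooted tree

col1 = {"A": "T", "B": "T", "C": "G", "D": "G", "E": "G"}
col2 = {"A": "T", "B": "G", "C": "T", "D": "G", "E": "G"}

print(fitch(tree, col1)[1])   # 1
print(fitch(tree, col2)[1])   # 2
```

Summing this count over all columns gives the parsimony score of the tree.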
22 Parsimony tips: Ignore non-informative sites
No mismatches --> noninformative! Adds 0 mutations in all trees.
One mismatch --> noninformative! Adds 1 mutation in all trees.
All different --> noninformative! Adds the same number of mutations in all trees. Only possible if the number of sequences <= the alphabet size.
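These rules can be folded into one test. The sketch below uses the usual definition of a parsimony-informative site (at least two different characters that each occur at least twice), which covers the three cases on the slide; the function names and the example alignment are hypothetical.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_columns(alignment):
    """Return the 0-based indices of the parsimony-informative columns."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


# Hypothetical 4-sequence alignment.
alignment = [
    "AAAAC",
    "AAACT",
    "AAGGC",
    "AGGTT",
]
# Column 0: no mismatches. Column 1: one mismatch. Column 3: all different.
print(informative_columns(alignment))   # [2, 4]
```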
23 Find noninformative columns
TGGGTTTTTGTG
TGGTGTGTTTTT
TGTTGTGGT
TGGTGTGGTG
24 Sum the Max Unweighted Parsimony for four 5-taxon trees
TGGTTTTTTGTG
TGGTGTTTTTT
TTTGTGTGGT
TTGGTGTGGTG
TTGGTTTTGTTG
[Worksheet: a blank score table for the four trees, ending in a TOTALS row]
25 Which method do I use?
Sequence similarity    Method to use
strong                 distance
weak                   parsimony
very weak              maximum likelihood
26 Review
What do nodes and lineages in a tree represent?
What are the two strategies for rooting a tree? Why root a tree?
What is maximum parsimony? How do I find the most parsimonious tree?
What is maximum likelihood? How does it relate to maximum parsimony?
What kind of MSA positions are not informative of the tree?
What problem does Fitch-Margoliash solve? How?
What is an ortholog? Paralog? How can I tell them apart?
Unroot and re-root this tree: ((A,B),((C,D),E))