Principles of Phylogeny Reconstruction
How do we reconstruct the tree of life?
Wayne Maddison, Sean Graham

Outline:
- Terminology
- Methods: parsimony, maximum likelihood, bootstrapping
- Problems: homoplasy, hybridization

Phylogeny: Basic Terminology
Phylogenetic tree:
- Tips represent taxa (usually extant)
- Nodes represent hypothesized common ancestors
- Root is the oldest common ancestor on a rooted tree
- Branches represent time or amount of change between nodes, or between nodes and tips (but length is often arbitrary)

Rooted trees typically have one or more outgroups. An outgroup represents a group that diverged before the diversification of the group of interest. Outgroups tell us about the direction of change within the ingroup (the ingroup is the group under study).

Looking at Trees
Rooted trees have a root, and nodes closer to the root represent older divergences than nodes near the tips. The groups on either side of a node (sister taxa) are considered of equal age.
[Figure: a rooted tree and an unrooted tree of the same taxa]
These two trees show the same relationships, but the unrooted tree makes no claims about which of the divergences is oldest. An unrooted tree could potentially be rooted at any of its nodes. Can you draw a rooted tree using one of the roots within the red group?
Note: there is more than one way to depict a set of relationships. Be careful not to overinterpret the orientation of the branches.
Do these phylogenies show the same relationships?
[Figure: two trees. Tip order in the first: Amphibians, Birds, Crocodiles, Snakes, Lizards, Turtles, Mammals. Tip order in the second: Amphibians, Mammals, Turtles, Lizards, Snakes, Crocodiles, Birds.]

Interpreting Branch Lengths
The branch lengths on phylogenetic trees may or may not be proportional to the amount of change along their length.
- Cladogram: branch lengths are not proportional to change, so the tips line up.
- Phylogram: if branch lengths are proportional to change, the tips will not be neatly lined up, and a scale should be included (scale bar in the figure: 1 nucleotide change).

Interpreting Groupings (Figure 14.1)
These terms are used to compare named entities (e.g. fishes, mammals, etc.) to groupings found in phylogenetic trees:
- Monophyletic group or clade
- Paraphyletic group
- Polyphyletic group
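One way to check whether two drawings show the same relationships is to compare the sets of clades they contain, since rotating branches around a node changes the tip order but not the clades. The sketch below assumes rooted trees written as nested tuples; the topology is hypothetical, chosen only so that its two drawings reproduce the two tip orders listed above.

```python
def clades(tree):
    """Return every clade in a rooted tree as a frozenset of tip names.

    A tree is either a tip name (str) or a tuple of subtrees.
    """
    found = set()

    def tips_below(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(tips_below(child) for child in node))
        found.add(tips)  # record the group of tips descended from this node
        return tips

    tips_below(tree)
    return found


def same_relationships(tree_a, tree_b):
    """Two drawings agree if they contain exactly the same clades."""
    return clades(tree_a) == clades(tree_b)


# Hypothetical topology drawn twice, with branches rotated around nodes.
drawing_1 = ("Amphibians",
             ((((("Birds", "Crocodiles"), ("Snakes", "Lizards")), "Turtles")), "Mammals"))
drawing_2 = ("Amphibians",
             ("Mammals", ("Turtles", (("Lizards", "Snakes"), ("Crocodiles", "Birds")))))

# A genuinely different hypothesis: Turtles and Mammals trade places.
different = ("Amphibians",
             ("Turtles", ("Mammals", (("Lizards", "Snakes"), ("Crocodiles", "Birds")))))

print(same_relationships(drawing_1, drawing_2))  # rotated drawings
print(same_relationships(drawing_1, different))  # different topology
```

Comparing clade sets rather than tip order is what makes the check insensitive to how the branches happen to be oriented on the page.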
What is the relationship between taxonomic names and phylogenetic groups?
[Figure: vertebrate tree with tips Birds, Crocodiles, Snakes, Lizards, Turtles, Mammals, Amphibians; the named groups Amniotes and Reptiles are marked]
[Figure: tree with tips Birds, Crocodiles, Snakes, Lizards, Turtles; the characters Amnion and Cold Blooded are marked]
[Figure: tree with tips Turtles, Lizards, Snakes, Crocodiles, Birds, Bats, Rodents, Amphibians; the character Wings is marked]

An example of a polyphyletic group: Amentiferae
[Figure labels: Alder, Walnut, Willow]
All of these trees have highly reduced male flowers clustered into structures called catkins. These specialized structures were previously thought to reflect close relationships among the trees that have them. Therefore, the families of trees with catkins were grouped into the Amentiferae. However, it turns out that catkins are adaptations to wind pollination that reflect common selection, not common history.
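A named group is monophyletic when it consists of one ancestor and all of its descendants, which on a tree means it matches the tips below a single node exactly. The sketch below tests only that property. The vertebrate topology is an assumption made for illustration, and the function does not try to separate paraphyletic from polyphyletic groups, because that depends on whether the common ancestor is counted as a member of the group.

```python
def tip_set(node):
    """All tip names below a node; a node is a tip name or a tuple of subtrees."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(tip_set(child) for child in node))


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of tips descended from some node."""
    group = frozenset(group)
    if tip_set(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical vertebrate tree (assumed topology, for illustration only).
vertebrates = ("Amphibians",
               ("Mammals", ("Turtles", (("Lizards", "Snakes"), ("Crocodiles", "Birds")))))

amniotes = {"Mammals", "Turtles", "Lizards", "Snakes", "Crocodiles", "Birds"}
reptiles = {"Turtles", "Lizards", "Snakes", "Crocodiles"}  # traditional sense, no birds

print(is_monophyletic(vertebrates, amniotes))
print(is_monophyletic(vertebrates, reptiles))
```

On this assumed tree, Reptiles fails the test because the smallest clade containing all reptiles also contains Birds.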
[Figure: Amentiferae example with tips Willows, Walnuts and Oaks, labelled "Evolution of catkins" and "Ancestor with separate flowers"]

Vertebrate Phylogeny
Are these groups monophyletic, paraphyletic or polyphyletic?
- fish?
- tetrapods (= four limbed)?
- amphibians?
- mammals?
- ectotherms (= cold blooded)?

Reconstructing Evolutionary Trees
The development of methods:
I. distance methods (UPGMA, Neighbor joining)
II. parsimony methods
III. maximum likelihood
(IV.) Bayesian inference

I. Distance Methods (phenetics)
Distance methods grew out of the school of numerical taxonomy, which had its heyday in the 1960s. Taxonomists were looking for more rigorous methods of developing classifications and inferring relationships. The idea was to use total information, measuring many characters and producing a summary of what the characters suggest about groupings based on overall similarity. These approaches were also practical when molecular datasets started to get very large, and for a time outpaced computer processing power.
Example 1: morphology
[Figure: taxa plotted on the axes Trait 1 and Trait 2, with the overall distance matrix derived from the plot]

Distance methods with sequence data
[Figure: distance matrix derived from aligned sequences]
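For sequence data, one simple distance is the proportion of aligned sites at which two sequences differ. The sketch below builds a distance matrix that way. The three sequences are invented for illustration, and the slides do not say which distance measure their own matrix uses.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def distance_matrix(alignment):
    """Pairwise distances for every ordered pair of taxa in the alignment."""
    names = sorted(alignment)
    return {(x, y): p_distance(alignment[x], alignment[y]) for x in names for y in names}


# Hypothetical aligned sequences, 10 sites each.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAT",
    "taxon3": "ACGAACCTAT",
}

matrix = distance_matrix(alignment)
for (x, y), d in sorted(matrix.items()):
    if x < y:
        print(x, y, d)
```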
[Figure: taxa are joined step by step into a tree, starting from the distance matrix]
New distance matrix: take averages

Strengths and weaknesses of distance methods
Advantages
- Intuitive, easy to understand
- Works on all sorts of data, alone or in combination
- Fast implementation on large data sets
- Can handle very large data sets easily
Disadvantages
- Must assume that similarity reflects shared evolutionary history (when is this most problematic?)
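The "join the closest pair, then take averages" procedure from the distance method slides can be sketched as average-linkage clustering in the spirit of UPGMA. The taxon names and distances are hypothetical, and the function returns only the joining order as nested tuples, without branch lengths.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by repeatedly joining the closest pair (average linkage).

    names: list of taxon names
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b
    Returns a nested tuple describing the joining order.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of tips it holds
    d = dict(dist)
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # New distance matrix: size-weighted average of the old distances.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    (tree,) = clusters
    return tree


# Hypothetical pairwise distances among four taxa.
pairwise = {
    frozenset(("1", "2")): 2.0,
    frozenset(("1", "3")): 5.0,
    frozenset(("2", "3")): 7.0,
    frozenset(("1", "4")): 10.0,
    frozenset(("2", "4")): 10.0,
    frozenset(("3", "4")): 10.0,
}

print(upgma(["1", "2", "3", "4"], pairwise))
```

Here taxa 1 and 2 are joined first; the averaged distance from that pair to taxon 3 is (5 + 7) / 2 = 6, so taxon 3 joins next and taxon 4 last.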
II. Parsimony Methods (Cladistics)
- Methods originally developed by Willi Hennig (German entomologist), presented in a book published in 1966
- Translated into English in 196[?]; very influential
- Originally important in analysis of small morphological data sets, including those from fossils
- These methods came to the forefront with the application of DNA sequencing technology to systematics (early 1990s)
- In the early days, the methods were tough to implement because of limitations in computer processor speed (still somewhat limiting at times, because data sets keep getting larger)

Applying parsimony
Consider four taxa (1-4) and four characters (A-D). Ancestral state: abcd
[Table: state of each trait A-D (a, b, c, d) in each of taxa 1-4]
[Figure: first candidate tree for taxa 1-4, with unique changes and convergences or reversals marked on the branches: 5 steps]
[Figure: second candidate tree for taxa 1-4, with unique changes and convergences or reversals marked on the branches: 4 steps]
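Counting steps on a candidate tree, as in the parsimony example, can be automated. The sketch below uses Fitch's algorithm for unordered characters on a rooted binary tree, which is an assumption about method since the slides count steps by hand. The character matrix is invented (a prime marks the derived state) and is not the matrix from the slide. Fitch's algorithm also does not use a declared ancestral state; adding an outgroup tip would be one way to supply it.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   a tip name (str) or a 2-tuple of subtrees
    states: dict mapping tip name -> observed state
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        shared = left & right
        if shared:
            return shared
        steps += 1  # the two sides share no state, so one change is needed here
        return left | right

    state_set(tree)
    return steps


def tree_length(tree, characters):
    """Total number of steps over all characters."""
    return sum(fitch_steps(tree, states) for states in characters)


# Hypothetical data: four taxa, four characters.
characters = [
    {"1": "a'", "2": "a'", "3": "a", "4": "a"},
    {"1": "b'", "2": "b'", "3": "b", "4": "b"},
    {"1": "c", "2": "c", "3": "c'", "4": "c'"},
    {"1": "d'", "2": "d", "3": "d'", "4": "d"},
]

tree_x = (("1", "2"), ("3", "4"))
tree_y = (("1", "3"), ("2", "4"))

print(tree_length(tree_x, characters))  # fewer steps: preferred under parsimony
print(tree_length(tree_y, characters))
```

With these invented characters, the first three each change once on tree_x while the fourth needs two changes (a convergence or reversal), giving 5 steps against 7 on tree_y.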
Strengths and weaknesses of parsimony
Strengths
- Straightforward to calculate the length of the tree (number of steps)
- Simulation studies have shown that parsimony algorithms are reliable under a range of conditions
- Conceptually simple; satisfying
Weaknesses
- Cannot easily accommodate complex models of evolutionary change (e.g. in which rates of evolutionary change differ among branches)
- Under certain circumstances, can be positively misleading

Parsimony practice
[Table: characters 1-6 scored for taxa K, L, M and N]
Which unrooted tree is most parsimonious?
[Figure: the three possible unrooted trees for taxa K, L, M and N]
Plot each change on each tree. Positions 1 and 2 are done. Which positions help to determine relationships?

Inferring the direction of evolution
Where did the mutation occur, and what was the change?
[Figure: tree with tips Mouse (outgroup), Orangutan, Gorilla, Human, Bonobo, Chimp]

III. Maximum Likelihood Methods (and Bayesian analysis as currently used)
Maximum likelihood approaches involve using a specific model to determine the probability that a particular base substitution will occur along a particular branch on a tree. In effect, the question being addressed is: what is the probability of the observed data given a particular tree and a particular model of substitution?
The best tree is the one with the highest probability of explaining the observed data, given the model.
Maximum likelihood: a simple model
[Figure: nucleotide substitution diagram showing transitions and transversions]
Probabilities:
- transition: 0.2
- transversion: 0.1
- no change: 0.[?]
TASK: Find the tree with the highest probability
Probability of the tree = P1 x P2 x P3
P1 = (0.[?])(0.1)(0.2)(0.[?])(0.[?])
P2 = (0.[?])(0.1)(0.[?])(0.[?])(0.[?])
P3 = (0.1)(0.2)(0.[?])(0.[?])(0.2)

More complex likelihood models
Likelihood models can be quite complex, and different models assign different probabilities to changes, including:
- Relative probabilities of transitions and transversions
- Variation in mutation rates across sites (e.g. by codon position in protein coding genes) or regions (intron versus exon versus spacers)
- Variation in mutation rates across lineages
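The arithmetic of the simple model is a product: multiply the probabilities of the events on each branch to get the probability for one site, then multiply across sites. The sketch below follows that scheme with the transition (0.2) and transversion (0.1) values from the slide. The no-change probability is passed in explicitly, and the 0.7 used in the example is a placeholder chosen for illustration, not the slide's value. As on the slide, the states at the internal nodes are taken as given; a full likelihood calculation would sum over all possible ancestral states.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def branch_probability(start, end, p_no_change, p_transition=0.2, p_transversion=0.1):
    """Probability of the event along one branch under the simple model."""
    if start == end:
        return p_no_change
    pair = {start, end}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return p_transition  # purine <-> purine or pyrimidine <-> pyrimidine
    return p_transversion    # purine <-> pyrimidine


def site_probability(branches, p_no_change):
    """Multiply the branch probabilities for one site.

    branches: list of (state at start of branch, state at end of branch)
    """
    return math.prod(branch_probability(s, e, p_no_change) for s, e in branches)


def tree_probability(sites, p_no_change):
    """Multiply the site probabilities, i.e. P1 x P2 x P3 ..."""
    return math.prod(site_probability(branches, p_no_change) for branches in sites)


# Hypothetical site on a five-branch tree: one transversion, one transition,
# and three branches with no change.
site_1 = [("A", "A"), ("A", "C"), ("A", "G"), ("A", "A"), ("A", "A")]

P_NO_CHANGE = 0.7  # placeholder value, assumed for this example only
print(site_probability(site_1, P_NO_CHANGE))
```

Repeating the calculation for each candidate tree and keeping the tree with the largest product is the task the slide sets.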
Assessment of Maximum Likelihood (also Bayesian)
Strengths
- Highly flexible (any model can be used)
- Statistically justifiable
- Given enough data (and the right model), will always infer the correct tree (as shown by simulation studies)
Weaknesses
- Impossible to know that the model is correct, and different models may yield different answers
- Computationally intensive (most data sets not fully analyzable)

Characters to use in phylogeny
- Morphology
- DNA sequence
What are the desirable qualities of characters used for phylogeny reconstruction?
1.
2.
3.
4.
How are these qualities met by DNA sequence data?

The problem of homology with DNA: the good, the bad and the ugly
Alignment (= HOMOLOGY assessment) can be very challenging!
[Figure: example alignments of sequences from Taxon 1 and Taxon 2]
The problems of locus choice: Getting the right rate of evolution
Too slow?
- not enough variation
[Figure: Taxon 1, Taxon 2 and Taxon 3 joined in a polytomy]

Example of insufficient evidence: metazoan phylogeny
[Figure: tree of Metazoans and Fungi]

Challenges: sunflower phylogeny
- Recent radiation (200,000 years)
- Many species, much hybridization
- Need more rapidly evolving markers!
[Figure legend: = 15 spp, = 12 spp]

The problems of locus choice: Getting the right rate of evolution
Too fast?
- homoplasy likely
- saturation
- only 4 possible states for DNA
[Figure: Taxon 1, Taxon 2 and Taxon 3 joined in a polytomy]
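Saturation follows from having only four possible states: once a site has changed more than once, later changes hide earlier ones, so the observed difference between two sequences levels off. A rough way to see this is the Jukes-Cantor expectation, sketched below. The choice of that model is an assumption made here (the slides do not name one); it treats all substitutions as equally likely.

```python
import math


def expected_observed_difference(substitutions_per_site):
    """Expected proportion of differing sites between two sequences.

    Jukes-Cantor expectation: p = 3/4 * (1 - exp(-4d/3)), where d is the
    actual number of substitutions per site separating the two sequences.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * substitutions_per_site / 3.0))


for d in (0.1, 0.5, 1.0, 2.0, 5.0):
    p = expected_observed_difference(d)
    print(f"{d:4.1f} substitutions per site -> {p:.3f} observed difference")
```

The observed difference stays below the true number of substitutions and approaches 0.75, the level expected when two sequences are effectively random with respect to each other.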
Saturation: mammalian mitochondrial DNA
[Figure: plot with a reference line] This line is what we would expect if we had an infinite number of bases, so that every mutation could be seen.

Saturation
Imagine changing one nucleotide every hour to a random nucleotide. Split the ancestral population in 2.
[Figure: the two sequences after one hour, four hours, 8 hours, 12 hours and 24 hours(?); red indicates multiple mutations at a site]

Phylogeny case study I: whales
Are whales ungulates (hoofed mammals)? (Figure 4.8)

Forces of evolution and phylogeny reconstruction
How does each force affect the ability to reconstruct phylogeny?
- mutation?
- drift?
- selection?
- non-random mating?
- migration?
Whales: DNA sequence data
[Figure: whale phylogeny from DNA sequence data; Hillis 1999]
How reliable is this tree? Bootstrapping.

How consistent are the data?
- Take the dataset (5 taxa, 10 characters)
- Create a new data set by sampling characters at random, with replacement
[Figure: the original character matrix for the taxa Human, Chimp, Bonobo, Gorilla and Orang, next to a resampled matrix in which some character columns appear more than once and others not at all]

Molecular clocks
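The resampling step of bootstrapping can be sketched in a few lines: draw as many character columns as the data set has, at random and with replacement, and build a new matrix from them. The alignment below is invented (5 taxa, 10 characters, matching the sizes on the slide). In a full analysis a tree would be built from each resampled data set, and support for a group would be the fraction of those trees that contain it; that part is not shown.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample character columns at random, with replacement.

    alignment: dict mapping taxon name -> sequence (all the same length)
    rng:       a random.Random instance, so runs can be made repeatable
    Returns the resampled alignment and the list of sampled column indices.
    """
    n_chars = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_chars) for _ in range(n_chars)]
    resampled = {
        taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()
    }
    return resampled, columns


# Hypothetical data set: 5 taxa, 10 characters.
data = {
    "Human":   "ACGTTGCAAC",
    "Chimp":   "ACGTTGCAAT",
    "Bonobo":  "ACGTTGCTAT",
    "Gorilla": "ACGATGCTAT",
    "Orang":   "TCGATCCTAT",
}

replicate, sampled_columns = bootstrap_alignment(data, random.Random(1))
print(sampled_columns)
for taxon, seq in replicate.items():
    print(taxon, seq)
```

Because whole columns are sampled, every taxon keeps its state for each chosen character; only the weight given to each character changes between replicates.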
Basic idea of molecular clocks
[Figure: chimps and humans differ by 6 substitutions; whales and hippos differ by 60 substitutions and diverged 56 mya]

Challenges for phylogeny: gene flow
[Figure: Sunflower annuals]

Different genes may have different histories!
Wayne Maddison (UBC) has emphasized that genes and species are not expected to always have the same evolutionary history. As such, gene trees and species trees will not always match each other, as shown in this diagram from the computer package MESQUITE (Maddison and Maddison), developed to tackle some of these complexities.
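Returning to the molecular clock numbers given earlier in this section, the underlying arithmetic is a calibration followed by a division. The sketch below assumes that the same gene is compared in both pairs and that substitutions accumulate at one constant rate in all lineages; both assumptions are made here for illustration and are exactly what a real clock analysis would need to test.

```python
def clock_rate(substitutions, divergence_mya):
    """Substitutions per million years, from a pair with a known divergence date."""
    return substitutions / divergence_mya


def estimate_divergence(substitutions, rate):
    """Divergence time in millions of years, assuming the calibrated rate holds."""
    return substitutions / rate


# Calibration: whales and hippos, 60 substitutions, diverged 56 mya.
rate = clock_rate(60, 56)

# Application: chimps and humans, 6 substitutions.
human_chimp_mya = estimate_divergence(6, rate)
print(f"rate = {rate:.3f} substitutions per million years")
print(f"estimated chimp-human divergence = {human_chimp_mya:.1f} mya")
```

Under these assumptions, one tenth of the substitutions implies one tenth of the time, or about 5.6 million years.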
Phylogeny study questions
1) Explain in words the difference between monophyletic, paraphyletic, and polyphyletic groups. Draw a hypothetical phylogeny representing each type. Give an actual example of a commonly recognized paraphyletic taxon in both animals and in plants (use your text for sources).
2) How can a phylogenetic tree be used to determine if a similar character in two taxa is due to homoplasy?
3) Whales are classified as cetaceans, not artiodactyl ungulates. This makes artiodactyls paraphyletic: why? What is the evidence that whales belong in the artiodactyls?
4) Phenetics (distance methods) and cladistics (parsimony) differ in the ways they recognize and use similarities among taxa to form phylogenetic groupings. What types of similarity does each school recognize, and how useful is each type of similarity considered to be for identifying groups?
5) What is bootstrapping in the context of phylogenetic analysis, and why is this procedure performed?
6) Why are maximum likelihood methods increasing in popularity for reconstructing phylogenies? In your answer, include a short description of how this method identifies the best phylogeny.
7) Integrative question: Draw a pair of axes with time since divergence on the x axis and percent of sites that are the same on the y axis. Draw a line that shows the expected pattern for third codon sites in protein coding genes: is your graph linear? Explain why or why not. How and why would the graph of first codon positions differ from this?
8) You are studying a group of species that lives in two very different environments. You build two phylogenies: one is based on a locus that is probably under divergent selection in the two environments, while the other phylogeny is based on a neutral locus. Which phylogeny would be more likely to represent the species history? Why?
9) It has been known for a number of years that Anolis lizards found in similar microhabitats on many separate islands in the Caribbean are very similar to each other (for example, large lizards that feed on the ground, smaller lizards that feed on tree trunks, and very small lizards that feed at the tops of branches). Two different historical explanations have been proposed to explain this pattern: each morph has evolved repeatedly on each island, or each morph has evolved just once, then dispersed. Sketch a phylogeny that would support each hypothesis.
10) Integrative question: the Cameroon lake cichlid phylogeny, showing that the lake species were monophyletic, was based on mitochondrial DNA. Explain why this might not reflect the species history. How could you be more certain about the phylogeny?