Inference of phylogenies, with some thoughts on statistics and geometry p.1/31

Similar documents
Molecular evolution. Joe Felsenstein. GENOME 453, Autumn Molecular evolution p.1/49

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Week 7: Bayesian inference, Testing trees, Bootstraps

Bootstraps and testing trees. Alog-likelihoodcurveanditsconfidenceinterval

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

History of Genetics in Evolution

Phylogenetic Tree Reconstruction

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)


Summary statistics, distributions of sums and means

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING PHYLOGENIES FROM PAIRWISE DISTANCES

EVOLUTIONARY DISTANCES

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetics: Building Phylogenetic Trees

X X (2) X Pr(X = x θ) (3)

Phylogenetic inference

BINF6201/8201. Molecular phylogenetic methods

Statistical nonmolecular phylogenetics: can molecular phylogenies illuminate morphological evolution?

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Constructing Evolutionary/Phylogenetic Trees

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Reconstruire le passé biologique modèles, méthodes, performances, limites

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

1. Can we use the CFN model for morphological traits?

Lecture 4. Models of DNA and protein change. Likelihood methods

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Week 5: Distance methods, DNA and protein models

Evolutionary Models. Evolutionary Models

Algorithms in Bioinformatics

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Dr. Amira A. AL-Hosary

What is Phylogenetics

Constructing Evolutionary/Phylogenetic Trees

Letter to the Editor. Department of Biology, Arizona State University

Phylogeny Tree Algorithms

Phylogenetic analyses. Kirsi Kostamo

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

Molecular Evolution and Phylogenetic Tree Reconstruction

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models

A (short) introduction to phylogenetics

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

BMI/CS 776 Lecture 4. Colin Dewey

How to read and make phylogenetic trees Zuzana Starostová

Using algebraic geometry for phylogenetic reconstruction

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Consistency Index (CI)

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Phylogenetics. BIOL 7711 Computational Bioscience

1 ATGGGTCTC 2 ATGAGTCTC

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Lab 9: Maximum Likelihood and Modeltest

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Michael Yaffe Lecture #5 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Counting phylogenetic invariants in some simple cases. Joseph Felsenstein. Department of Genetics SK-50. University of Washington

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Concepts and Methods in Molecular Divergence Time Estimation

Lecture 11 Friday, October 21, 2011

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

I. Short Answer Questions DO ALL QUESTIONS

Theory of Evolution Charles Darwin

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

TheDisk-Covering MethodforTree Reconstruction

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Phylogeny: building the tree of life

Biology Keystone (PA Core) Quiz Theory of Evolution - (BIO.B ) Theory Of Evolution, (BIO.B ) Scientific Terms

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Algebraic Statistics Tutorial I

How Molecules Evolve. Advantages of Molecular Data for Tree Building. Advantages of Molecular Data for Tree Building

Many of the slides that I ll use have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Molecular Evolution and Comparative Genomics

Elements of Bioinformatics 14F01 TP5 -Phylogenetic analysis

Likelihoods and Phylogenies

arxiv: v1 [q-bio.pe] 4 Sep 2013

Module 13: Molecular Phylogenetics

Phylogenetic inference: from sequences to trees

Markov chain Monte-Carlo to estimate speciation and extinction rates: making use of the forest hidden behind the (phylogenetic) tree

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

Module 13: Molecular Phylogenetics

Phylogeny and Molecular Evolution. Introduction

= 1 = 4 3. Odds ratio justification for maximum likelihood. Likelihoods, Bootstraps and Testing Trees. Prob (H 2 D) Prob (H 1 D) Prob (D H 2 )

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

Chapter 19: Taxonomy, Systematics, and Phylogeny

Consensus Methods. * You are only responsible for the first two

Transcription:

Inference of phylogenies, with some thoughts on statistics and geometry Joe Felsenstein University of Washington Inference of phylogenies, with some thoughts on statistics and geometry p.1/31

Darwin s tree This is the only illustration in The Origin of Species Inference of phylogenies, with some thoughts on statistics and geometry p.2/31

One of Ernst Haeckel s trees (1866) (Bark! on some of his trees, even leaves!) He coined the word phylogeny". Inference of phylogenies, with some thoughts on statistics and geometry p.3/31

George Gaylord Simpson, 1940 s Structure starts to get imprecise. Far from algorithms. Inference of phylogenies, with some thoughts on statistics and geometry p.4/31

lgorithmic methods for inferring phylogenies Starting in the 1960 s, biologists invented a number of methods to infer phylogenies: Parsimony methods. Find that tree that minimizes the number of changes of state needed to evolve the observed data. (Related: compatibility methods) Distance matrix methods. Measure inferred distances between all pairs of species. Find the tree, with branch lengths, that predicts this table best. Maximum likelihood methods. Use a probability model of evolution (also used in distance matrix methods) and find that tree that makes the observed data most probable. (Related: Bayesian inference, which assumes also prior probabilities for the possible trees). Inference of phylogenies, with some thoughts on statistics and geometry p.5/31

Parsimony methods Two ways of looking at the parsimony method: Species are points in a sequence space. Find the Steiner tree for these points in this space. Trees live in a tree space; for each of them we can evaluate how many changes of state are needed to evolve the data set. Inference of phylogenies, with some thoughts on statistics and geometry p.6/31

Parsimony on a simple data set site species 1 2 3 4 5 6 lpha T G G C Beta C T C T C Gamma G G T C Delta G G G T Epsilon C T C G C Here is the most parsimonious tree (drawn as if rooted, though root location doesn t matter for the parsimony score in this case) Gamma 5 4 Delta 6 lpha Beta 5 4 Epsilon 2 1 3 Inference of phylogenies, with some thoughts on statistics and geometry p.7/31

Distance matrix methods 0.08 0.06 E 0.05 C B 0.10 0.05 0.03 0.07 D B C D E B C D E 0 0.23 0.16 0.20 0.17 0.23 0 0.23 0.17 0.24 0.16 0.23 0 0.15 0.11 0.20 0.17 0.15 0 0.21 0.17 0.24 0.11 0.21 0 This table of expected distances is computed from the tree and compared to pairwise distances (branch lengths of a 2-species unrooted tree) computed from the data. Inference of phylogenies, with some thoughts on statistics and geometry p.8/31

Likelihood methods C C G t 1 t 2 y C t 3 t 4 t 5 w t t 6 z 7 x t 8 Likelihood is the probability of the data given the tree, where the tree has branch lengths, and a model of evolution (occurring independently in each site and independently in different lineages) Inference of phylogenies, with some thoughts on statistics and geometry p.9/31

Likelihood computed separately in each site: The likelihood is the product over sites (as they are independent given the tree): L = Prob (D T) = m Prob ( ) D (i) T i=1 For site i, summing over all possible states at interior nodes of the tree: Prob ( ) D (i) T = x y z w Prob (, C, C, C, G, x, y, z, w T) Inference of phylogenies, with some thoughts on statistics and geometry p.10/31

Using independence assumptions to simplify By independence of events in different branches (conditional only on their starting states): Prob (, C, C, C, G, x, y, z, w T) = Prob (x) Prob (y x, t 6 ) Prob ( y, t 1 ) Prob (C y, t 2 ) Prob (z x, t 8 ) Prob (C z, t 3 ) Prob (w z, t 7 ) Prob (C w, t 4 ) Prob (G w, t 5 ) There are, of course, numerical optimization issues in finding the best branch lengths for each topology examined. Inference of phylogenies, with some thoughts on statistics and geometry p.11/31

n example: Turbeville et al., 28s RN, chordates Xenopus?TCCTGGTTGTCCTGCCGTG-CTTG... Sebastol??????????????????????G-CTTG... Latimeri?TCCTGGTTGTCCTGCCGTG-CTTG... Squalus??????????????????????G-CTTG... Myxine Petromyz??CCCTGGTTGTCCTGCCGCCG-CTTG...???CCTGGTTGTCCTGCCGTG-CTTG... Branch Styela???CCTGGTTGTCCTGCCGTGTCTTG...??TCTGGTTGTCCTGCCGTGTGTTG... Herdman Saccogl?TTCTGGTTGTCCTGCCGTGTGTTG...??CCTGGTTGTCCTGCCGTGTCTTG... Ophiophol??CCTGGTTGTCCTGCCGTGTCTTG... Strongyl??CCTGGTTGTCCTGCCGTGTCTTG... Placopec CCCTGGTTGTCCTGCCGTGTCTTG... Limicol Eurypelm?TTCTGGTTGTCCTGCCGTGTCTTG...?TCCTGGTTGTCCTGCCGTGTCTTG... Tenebrio?TCCCTGGTTGTCCTGCCGTGTCTTG... (and so on for 11 more pages) Inference of phylogenies, with some thoughts on statistics and geometry p.12/31

The tree with parsimony (acorn worms) Saccogl Ophiophol (echinoderms) Strongyl Herdman (sea squirts) Styela Branchiostoma (amphioxus) Placopec (molluscs) Limicol Petromyz(lamprey) (insects) Eurypelm Tenebrio Latimeri (coelacanth) Squalus(dogfish) Sebastol(fish) Xenopus (frog) Myxine (hagfish) Inference of phylogenies, with some thoughts on statistics and geometry p.13/31

The tree with the Fitch-Margoliash distance matrix method Ophiophol (echinoderms) Strongyl (acorn worms) Saccogl Herdman (sea squirts) Styela Branch (amphioxus) Myxine (hagfish) Placopec (molluscs) Limicol Eurypelm (insects) Tenebrio Petromyz Xenopus (lamprey) Squalus Sebastol Latimeri (frog) (dogfish) (fish) (coelacanth) Inference of phylogenies, with some thoughts on statistics and geometry p.14/31

The tree with maximum likelihood Herdmanni (sea squirts) (acorn worm) Styela Saccoglossus Petromyzon (lamprey) Ophiopholes (echinoderms) Strongylocentrotus Placopecten Squalus (dogfish) Sebastoles (rockfish) Xenopus (frog) Latimeria (coelacanth) Myxine (hagfish) (molluscs) Branchiotoma Limicol Eurypelm Tenebrio (insects) (amphioxus) Inference of phylogenies, with some thoughts on statistics and geometry p.15/31

The bootstrap shows the uncertainty in our trees Original Data sites sequences Bootstrap sample #1 sequences sites sample same number of sites, with replacement Estimate of the tree Bootstrap sample #2 sites Bootstrap estimate of the tree, #1 sequences sample same number of sites, with replacement (and so on) Bootstrap estimate of the tree, #2 Inference of phylogenies, with some thoughts on statistics and geometry p.16/31

Phylogenies can be seen as graphical models We have observations at a number of characters (wuch as sites in DN molecules). Each site has a graphical model and is independently evolving on the same tree (by assumption). The graphical model looks like this: x x x x x 6 6 6 6 6 x x x x x x x x x x 1 1 1 1 1 4 4 4 4 4 v 1 x x x x x 5 5 5 5 5 v 5 v 6 v 4 x x x x x 3 3 3 3 3 x x x x x 8 8 8 8 8 x x x x x 7 7 7 7 7 v 3 v 8 v 7 x x x x x 9 9 9 9 9 v 9 x x x x x 0 0 0 0 0 Imagine a version of this 500 or so sites deep. Problem: given observations on the x s at the tips, to reconstruct the tree topology and the v s. Inference of phylogenies, with some thoughts on statistics and geometry p.17/31

Invariants (a method with a future?) With DN data and 4 species there are 256 data patterns possible: through TTTT. With the simplest symmetric model (the Jukes-Cantor model) of nucleotide change, the tree has 5 parameters (the branch lengths These 5 parameters define a 5-dimensional space that defines where in the space of expected pattern frequency the trees are. This leaves 250 degrees of freedom to account for: there are 250 equations the expected pattern frequencies must satisfy. Many are simple symmetry requirements such as f = f CCCC a few are phylogenetic invariants that depend on the tree topology. Inference of phylogenies, with some thoughts on statistics and geometry p.18/31

Mathematical approaches to trees that have been useful The analogy between trees and recursion. This was discovered early by many mathematically-oriented biologists. Invariants (J. Cavender and J. Lake). Holds promise for the future but so far not of great practical use. Lots of geometry here? The Hadamard conjugation (M. Hendy / D. Penny). Remarkable results, but so far confined to certain symmetric models of base change. But on the whole, this is a short list. We are waiting for a mathematical result that makes a real impact, rather than restating the obvious or just rigorizing something we already use. Inference of phylogenies, with some thoughts on statistics and geometry p.19/31

Tree space as explained by me a few years ago The space of trees with branch lengths an example: three species with a clock B C trifurcation t 1 t 2 t 1 not possible OK etc. t 2 when we consider all three possible topologies, the space looks like: t 1 2 t1 Inference of phylogenies, with some thoughts on statistics and geometry p.20/31

This graph, arranged by my stepson Ben Schoenberg (www.seriousjuggling.com) appeared on the 1999 T-shirt of the Molecular Evolution Workshop, Marine Biological Laboratory, Woods Inference of phylogenies, with some thoughts on statistics and geometry p.21/31 Hole, Massachusetts. Tree space, for 5 species C D B E D B E C B C E D E B C D B C D E B D C E B C D E B C E D B D C E D B C E C B D E C B D E E B D C E C B D D B E C

dual Ben also noticed that if we take the cliques (3 trees each) of the NNI graph, and make a dual in which the points are cliques, and the lines connect them when they share a tree, we get (ta-da!) The infamous Petersen Graph. Inference of phylogenies, with some thoughts on statistics and geometry p.22/31

shamrock", part of the NNI graph for 6 species B B C F C F D E D E C D B E F B E F C D E F C D B D C F E B E F Each tree of shape ((x,x),(x,x),(x,x)) is surrounded by 6 trees of shape (((x,x),x),(x,(x,x))). There are 15 of these shamrocks, 105 trees in all. C D B Inference of phylogenies, with some thoughts on statistics and geometry p.23/31

More about the 6-species NNI graph Each shamrock connects from each leaf to two others. The 6 shamrocks that each shamrock connects to are the ones that share at least one of its two-species sets. Inference of phylogenies, with some thoughts on statistics and geometry p.24/31

test case: 3 species with clock, Jukes-Cantor model The data set (variant nucleotides lower-cased): 3 64 lpha gtcacgtcgtcgtcgtcgtcgtcgt Beta CGTgtcaCGTgtcagtcaCGTCGTCGT Gamma CGTCGTgtcaCGTCGTgtcagtcagtca B C CGTCGTCGTCGTCGTCGTCGTCGT CGTCGTCGTCGTCGTCGTCGTCGT CGTCGTCGTCGTCGTCGTCGTCGT y x y x+y Inference of phylogenies, with some thoughts on statistics and geometry p.25/31

ln likelihood This leads to these profile likelihoods 192 194 C B x B C 196 x 198 200 0.2 0.1 x B C x 0.0 0.1 0.2 x (Profile likelihoods are max y ln L(x, y) ) Inference of phylogenies, with some thoughts on statistics and geometry p.26/31

The branch score (Kuhner and Felsenstein, 1994) F D 0.1 0.3 0.2 B D B 0.05 0.1 0.2 0.2 0.2 0.2 0.4 0.2 0.15 C F0.1 0.3 0.15 0.1 0.1 0.3 0.1 E 0.2 E 0.2 G G C partitions {D BCEFG} {DF BCEG} {BC DEFG} {DF BCEG} {EG BCDF} { BCDEFG} {B CDEFG} {C BDEFG} {D BCEFG} {E BCDFG} {F BCDEG} {G BCDEF} branch lengths none 0.2 0.4 0.3 0.2 0.2 0.3 none 0.1 none 0.05 0.2 0.2 0.2 0.15 0.15 0.1 0.1 0.1 0.1 0.2 0.1 0.2 0.2 Inference of phylogenies, with some thoughts on statistics and geometry p.27/31

The perfect tree distance Can we make a tree distance with perfect statistical meaning? What would that look like if we did? The (squared?) distance would reflect the difference of log likelihood between the two trees: D 2 = patterns i ( ) n i n ln pi q i where the p i and the q i are the expected frequencies of the possible data patterns in the two trees, and the n i are the numbers of each pattern seen. (This formula is related to the Kullback-Leibler separator but is slightly different because it has the data in it, and is not just a function of the two trees.) I don t see any way to make a distance that magically reflects the data, without using the data. Inference of phylogenies, with some thoughts on statistics and geometry p.28/31

n example of the problem: primates data Lemur Bovine Human Chimp Gorilla Orang Gibbon BarbMacaq Crab E.Mac Rhesus Mac Jpn Macaq Tarsier Squir Monk Mouse Tree from 232 nucleotides of mitochondrial DN data Inference of phylogenies, with some thoughts on statistics and geometry p.29/31

Bootstrap sampling finds these partition frequencies: Bovine Mouse Squir Monk 35 42 72 74 84 80 Chimp Human Gorilla Orang 77 49 100 99 99 Gibbon Rhesus Mac Jpn Macaq Crab E.Mac BarbMacaq Tarsier Lemur (drawing the tree as if rooted) Inference of phylogenies, with some thoughts on statistics and geometry p.30/31

How to get distance-ish methods to use data? modest proposal Use the cloud of bootstrap trees, jackknife trees, or posterior distribution of trees in a Bayesian analysis. The variation in these accurately reflects the uncertainty in the data. Can we somehow use this variation to evaluate which directions in tree space our data allows us to go? Inference of phylogenies, with some thoughts on statistics and geometry p.31/31

References urbeville. J. McC., Schulz, J.R. and R.. Raff. 1994. Deuterostome phylogeny and the sister group of the chordates: evidence from molecules and morphology. Molecular Biology and Evolution Inference of phylogenies, with some thoughts on statistics and geometry p.32/31

How it was done This projection produced as a PDF, not a PowerPoint file, and viewed using the Full Screen mode (in the View menu of dobe crobat Reader): using the prosper style in LaTeX, using Latex to make a.dvi file, using dvips to turn this into a Postscript file, using ps2pdf to mill it into a PDF file, and displaying the slides in dobe crobat Reader. Result: nice slides using freeware. Inference of phylogenies, with some thoughts on statistics and geometry p.33/31