Phylogenetic Inference: Building Trees

Similar documents

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Inferring Molecular Phylogeny

EVOLUTIONARY DISTANCES

Reading for Lecture 13 Release v10

Constructing Evolutionary/Phylogenetic Trees

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Dr. Amira A. AL-Hosary

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Taming the Beast Workshop

Phylogenetics. BIOL 7711 Computational Bioscience

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetic Tree Reconstruction

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Constructing Evolutionary/Phylogenetic Trees

Evolutionary Tree Analysis. Overview

Molecular Evolution and Comparative Genomics

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains

Algorithms in Bioinformatics

Reconstruire le passé biologique modèles, méthodes, performances, limites

Molecular Evolution and Phylogenetic Tree Reconstruction

Evolutionary Models. Evolutionary Models

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Modern Phylogenetics. An Introduction to Phylogenetics. Phylogenetics and Systematics. Phylogenetic Tree of Whales

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Phylogenetics: Building Phylogenetic Trees

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Evolutionary Analysis of Viral Genomes

Lecture Notes: Markov chains

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Letter to the Editor. Department of Biology, Arizona State University

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Lecture 4. Models of DNA and protein change. Likelihood methods

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Mutation models I: basic nucleotide sequence mutation models

The Phylogenetic Handbook

What is Phylogenetics

Phylogeny Tree Algorithms

Chapter 3: Phylogenetics

Using phylogenetics to estimate species divergence times... Basics and basic issues for Bayesian inference of divergence times (plus some digression)

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Lecture 3: Markov chains.

Week 5: Distance methods, DNA and protein models

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Understanding relationship between homologous sequences

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Non-Parametric Bayesian Population Dynamics Inference

Modeling Noise in Genetic Sequences

BMI/CS 776 Lecture 4. Colin Dewey

Maximum Likelihood Estimation on Large Phylogenies and Analysis of Adaptive Evolution in Human Influenza Virus A

Computational Genomics

What Is Conservation?

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetics and Darwin. An Introduction to Phylogenetics. Tree of Life. Darwin s Trees

PROTEIN PHYLOGENETIC INFERENCE USING MAXIMUM LIKELIHOOD WITH A GENETIC ALGORITHM

A (short) introduction to phylogenetics

Sequence modelling. Marco Saerens (UCL) Slides references

Consistency Index (CI)

Phylogeny. Properties of Trees. Properties of Trees. Trees represent the order of branching only. Phylogeny: Taxon: a unit of classification

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Multiple Alignment. Slides revised and adapted to Bioinformática IST Ana Teresa Freitas

7. Tests for selection

The Causes and Consequences of Variation in. Evolutionary Processes Acting on DNA Sequences

Estimating Evolutionary Trees. Phylogenetic Methods

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Phylogeny and Evolution. Gina Cannarozzi ETH Zurich Institute of Computational Science

Phylogeny. November 7, 2017

Phylogenetic trees 07/10/13

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Phylogeny: building the tree of life

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Phylogenetic Inference using RevBayes

An Introduction to Bayesian Inference of Phylogeny

Molecular Phylogenetics (part 1 of 2) Computational Biology Course João André Carriço

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Applications of hidden Markov models for comparative gene structure prediction

Chapter 16: Reconstructing and Using Phylogenies

Phylogenetic inference

Phylogeny Jan 5, 2016

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Page 1. Evolutionary Trees. Why build evolutionary tree? Outline

7.36/7.91 recitation CB Lecture #4

Theory of Evolution Charles Darwin

Transcription:

Phylogenetic Inference: Building rees Philippe Lemey and Marc. Suchard Rega Institute Department of Microbiology and Immunology K.U. Leuven, Belgium, and Departments of Biomathematics and Human Genetics David Geffen School of Medicine at UL Department of Biostatistics UL School of Public Health SISMID p.1 Intra-Host Viral Evolution Retroviruses (and HBV) exist as a quasi-species within infected patients: Shared substitutions may be insufficient to resolve intra-host phylogenies Improve resolution using joint model: 1195 env sequences from 9 HIV+ patients [taken from Rambaut et al. (2004)] Indel rates substitution rates Opportunity to detect intra-host recombination SISMID p.2

Reconstructing the ree of Life: re Humans Just Big Slime Molds? Branches lade Support.50.74.75.99 >.99 Scale.10 hermotoga hloroflexus hermus Escheria Streptomyces Pyrococcus Euryarchaeota Nanoarchaeum rchaeoglobus Ferroplasma Methanosarcina Halobacterium Methanothermobacter Methanococcus eropyrum Sulfolobus Pyrobaculum Eubacteria Eocytes Giardia Eukaryotes Entamoeba Saccharum Euglena rypanasoma Dictyostelium Saccharomyces Homo B Uncertain D ertain hermotoga ILVV D GPMPQREHVLLRQVEVPYMIVFINKDM VD DPELIDL hloroflex ILVVS PD GPMPQREHILLRQVQVPIVVFLNKVDM MD DPELLEL Escheria ILVV D GPMPQREHILLGRQVGVPYIIVFLNKDM VD DEELLEL Pyrococcus VLVV D GVMPQKEHFLRLGIKHIIVINKMDM VNYNQKRFEE Halobacterium VLVV DD GVPQREHVFLSRLGIDELIVVNKMDV VDYDESKYNE Methanococus VLVVNV DDKSGIQPQREHVFLIRLGVRQLVVNKMD VNFSEDYNE eropyrum ILVVSRKGEFEGMSE G QREHLLLRMGIEQIIVVNKMDPDVNYDQKRYEF Sulfol obus ILVVSKKGEYEGMSE G QREHIILSKMGINQVIVINKMDLDPYDEKRFKE Giardia ILVVGQGEFEGISKD G QREHLNLGIKMIIVNKMDDGQVKYSKERYDE Euglena VLVIDSGGFEGISKD G QREHLLYLGVKQMIVNKFDDKVKYSQRYEE Dictyostelium VLVISPGEFEGIKN G QREHLLYLGVKQMIVINKMDEKSNYSQRYDE Homo VLIVGVGEFEGISKN G QREHLLYLGVKQLIVGVNKMDSEPPYSQKRYEE hermotoga hloroflex Escheria Pyrococcus Halobacterium Methanococus eropyrum Sulfol obus Giardia Euglena Dictyostelium Homo GIDKDEVERGQVL PGSIKPHKRFKQIYVLKKEEGGRHPFKGYKPQFYIRDV GIERDVERGQVL PGSIKPHKKFEQVYVLKKEEGGRHPFFSGYRPQFYIRDV GIKREEIERGQVL KPGIKPHKFESEVYILSKDEGGRHPFFKGYRPQFYFRDV GVSKNDIKRGDVGHN PPVVRKDFKQIIVLNH PIVGYSPVLHHQV GIGKDDIRRGDVGPDD PPSV DFQQVVVMQH PSVIGYPVFHHQV GVGKKDIKRGDVLGHN PPV DFQIVVLQH PSVLDGYPVFHHQI GVSKSDIKRGDVGHLDK PPV EEFERIFVIWH PSIVGYPVIHVHSV GVEKKDVKRGDVGSVQN PPV DEFQVIVIWH PVGVGYPVLHVHSI GLVKDLKKGYVVGDVNDPPVG KSFQVIVMNH PKKIQPGYPVIDHHI NVSVKDIRRGYVSNKNDPKE DFQVIILNH PGQIGNGYPVLDHHI NVSVKEIKRGMVGDSKNDPPQE EKFVQVIVLNH PGQIHGYSPVLDHHI NVSVKDVRRGNVGDSKNDPPME GFQVIILNH PGQISGYPVLDHHI MP ree Partial u Plot ontentious issue among paleobiologists: Do rchaea (Euryarchaeota/Eocytes) form one or two domains? Weekly World News calls humans slime molds. SISMID p.3 he hicken or the (Small) Genome: Which ame First? Reptiles Mammals Birds log Genome Size 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Evolutionary History and Genome Sizes of Reptiles, Dinosaurs, Birds and Mammals Issue: Bird genomes are markedly smaller than those from other vertebrates. Question: Did small genomes precede flight or co-evolve? SISMID p.4

Maximum Parsimony (MP) Most often used best, not even statistically consistent, but fast, fast, fast...if youknowthetree Key: Findtreewithminimal#of suspected substitutions (internal states are not observed, 0/1 model process) ounting minimum # of substitutions is easy Enumerating (searching through) all possible trees is hard Human GG himp GG Mouse Fly G G Site:1 2 3 4 5 6 7 8 9 10 long Molecular Sequence Sites are independent Pre Order raversal 5 6 7 1 2 3 4 {,G} X {,} G X Post Order raversal SISMID p.5 Maximum Parsimony (MP) littlehistory: nthony Edwards/Luca avalli-sforza (1963,1964) Both students of R. Fisher Introduced both parsimony and likelihood methods (for continuous quantities, e.g. gene frequencies) in one paper amin and Sokal (1965) provide first program for molecular sequences Fitch and Margoliash (1967) provide efficient algorithm Pre Order raversal 7 6 5 1 2 3 4 {,} {,G} X X G Post Order raversal SISMID p.6

Maximum Parsimony lgorithm procedure Fitch and Margoliash (1967) lgorithm cost 0 {Initialization} pointer k 2N 1 {at the root node} o obtain the set R k of possible states at node k {Recursion} if k is leaf then R k observed character for taxon k else ompute R i,r j for daughters i, j of k if R i R j 0then R k R i R j 7 else R k R i R 6 j +1 5 end if end if minimum cost is {ermination} Pre Order raversal 1 2 3 4 {,} {,G} X X G Post Order raversal SISMID p.7 Searching for the MP ree omplexity: Find MP score is NP-complete 5 Branches B 3 Branches 5 Branches B B B D B D 7 Branches Find MP tree is NP-hard B B D 5 Branches Recall that # of N-taxon rooted trees is 3 5 2N 3 ttack exponential-order space Branch-and-Bound: Monotonic order: min PS 2 min PS 3... Bound if min PS k > best n-taxon PS found so far. SISMID p.8

Neighbor-Joining (Saitou and Nei, 1987) omputational algorithm: alignment single tree Site (data) Pairwise distances = 0 = 1 = 1 1 2 3 2 3 4 10 12 15 30 20 25 Join (closest) neighbors 12* 1 2 Recompute distances (using 12*,3,4) dvantages: veryfast,greatfor1000sofsequences Disadvantages: nosite-to-siteratevariation,nonatural ways to compare trees/measure data support SISMID p.9 Neighbor-Joining aveat: Pairs i, j with min d ij are not necessarily nearest neighbors. E.g., d B =3<d =5 1 B 1 1 4 4 Solution: Subtractofftheaveragedistancestoallother leaves via 1 D ij = d ij (r i + r j ), r i = d ik, L 2 k L where L is the current set of leaves. Proof in Studier and Keppler (1988). omputational: O(N 3 ) D SISMID p.10

Likelihood-based Methods (Felsenstein, 1973) Statistical technique: assumes an unknown tree and a stochastic model for character change along the tree Site (data) state G ontinuous ime Markov hain time Unknown tree/internal node states dvantages: site-to-siterate/treevariationiseasy,can formulate probability statements Disadvantages: must search tree-space slow Foundation of Bayesian Phylogenetics SISMID p.11 raditional Phylogenetic Reconstruction Human himp Mouse Fly Reconstruction Example Site: 1 Human himp G G G G G G 2 3 4 5 6 7 8 9 10 long Molecular Sequence Mouse Fly Substitution: singleresidue replaces another Insertion/deletion: residues are inserted or deleted Statistical Model ssume: Homologous sites are iid and site patterns (e.g. dotted box) XY...Z Multinomial(p XY...Z ) where p XY...Z is determined by an unknown tree τ, branch lengths t and continuous-time Markov chain model (for residue substitution) given by infinitesimal rate matrix Q P(X Y in time t) =e tq SISMID p.12

M(Q) =ɛ Normal(µ, σ 2 ) of Phylogenetics ontinuous in elapsed time t, discrete in starting/ending state! Memory-less process in which the probability that state b replaces state a during (t, t + s) =sq ab + o(s) state G time Infinitesimal generator matrix Q has off-diagonal entries q ab and row sums =0 hink: ExponentialwaitingtimewithrateR a = b q ab until chain leaves a. hen the new state b is independently chosen with probabilities q ab /R a SISMID p.13 From Infinitesimal to Finite ime Let p ab (t) =the finite-time probability of the chain moving from state a at time 0 to state b at time t, thenmatrix P (t) ={p ab (t)} satisfies with solution d P (t) =P (t)q where P (0) = I dt P (t) =e tq = I + tq + 1 2 (tq)2 + = k=0 1 k! (tq)k as d dt etq = Qe tq = e tq Q for t real SISMID p.14

Example: wo-state Model onsider purines (R) pyrimidines (Y). Kolmogorov forward equation: R α Y p RY (t + s) =p RR (t)αs + p RY (t)(1 s)+o(s) yielding d dt p RY(t) =αp RR (t) p RY (t) Q = ( α α ) Solutions of P (t) = e tq have the form with eigenvalues 0 and (α + ) c + de (α+)t SISMID p.15 Standard Ms for Phylogenetics Jukes and antor (J69), π a = 1 4, κ 1 = κ 2 =1 Kimura (K80), π a = 1 4, κ 1 = κ 2 Hasegawa, Kishino and Yano (HKY85), κ 1 = κ 2 (most common) amura and Nei (N93), right κ 1 κ 2 General ime Reversible (GR) Note identifiability concern in e tq.ommonsolutionistofix 1d.f.suchthat q aa π a = 1 a Scaling: t =1 1 expected substitution per site π π G π G π SISMID p.16

Explicit Parameterization of N93 Nucleotides mutate according to a Markovian process Pr(X Y in time t) =e tq Nuc where Q Nuc is a 4x4 infinitesimal rate matrix and t is a branch length. π π π G κ 1 0 G κ 1 π G π π κ 1 π π π Q Nuc = B @ π π G κ 2 π κ 2 π π G κ 2 π π 1, where κ 1, κ 2 are transition:transversion rate ratios and π is the stationary distribution of {,G,,}. controls the overall rate and can vary from site-to-site. SISMID p.17 Site-to-Site Rate Variation Variation occurs quite naturally and is also an important inference short range: codon phase (slow-slow-fast) long range: enzymatic active sites, protein folding, immunological pressures/selection ssume: infinitesimalratesforsitek are r k t q ab.variouspriorson r k with E(r k )=1.ImplicitlyBayesian Yang (1994) discretized Gamma distribution SISMID p.18

General ime Reversible M Let Q = RD π where R is symmetric and D π is a diagonal matrix composed of the stationary distribution π. Detailed balance π a q ab = π b q ba.balance+ irreducibility reversible Note Q is similar to R, asd 1/2 QD 1/2 = R Hence, Q must have real eigenvalues and real eigenvectors he properties speed up computation of the finite-time transition matrix P (t) =e tq SISMID p.19 alculating the Probability of a Single Site Pattern Y i Given the tree and unobserved internal node states, the probability is the product of the finite time mutation probabilities over all branches: t1 t 2 t Y t 5 4 X t 3 G L(Y i ) p G = Pr(Y,t 1 )Pr(X G,t 2 ) X Y Pr(X,t 3 )Pr(Y,t 4 )Pr(X Y,t 5 )π X (1) Number of sumants grow rapidly in N sum-product/peeling algorithm to distribute sums across the product SISMID p.20

Pruning lgorithm Felsenstein (1981) Let P (L k a) =likelihoodofleavesbelownodek given k is in state a. hen,recursively compute P (L k a) given P (L i b) and P (L j c) for daughters i, j of k: Set pointer k 2N 1 {the root, initialization} Node k ompute P (L k a) a as follows: {recursion} if k is a leaf node then if a is observed then P(L P (L k a) =1 i b) else P (L k a) =0 P(L end if j c) else ompute P (L i a) and P (L j a) a for daughters i, j of k {post-order traversal} P (L k a) = P P b c Pr(a b, t i)p (L i b) Pr(a c, t j )P (L j c) end if L(Y i ) P a P (L 2n 1 a)π a {termination} a b a c SISMID p.21 ML ree or MP ree? Reporting uncertainty on tree estimates: he Bootstrap Most common ssumes evolutionary events are reproducible. If I went back out to the field and recollected exchangeable data... Bayesian inference Returns the probability of a tree given the observed data and model Requires MM (e.g., MrBayes or BES) dvantages Does not rely on asymptotics (hypothesis testing) Naturally incorporates uncertainty in all parameters (including discrete quantities: trees, site-classifications, etc.) rguably faster algorithms Disadvantages Must specify (justifiable) prior distributions SISMID p.22