Phylogenetic Assumptions

Substitution Models and the Phylogenetic Assumptions

Vivek Jayaswal, Lars S. Jermiin

COMMONWEALTH OF AUSTRALIA
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to you by or on behalf of the University of Sydney pursuant to Part VB of the Copyright Act 1968 (the Act). The material in this communication may be subject to copyright under the Act. Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. Do not remove this notice.

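Why do we need substitution models?

Site        1  2  3  4  5  ...  N
Human       A  T  C  G  A  ...  C
Chimp       A  G  C  A  A  ...  C
Gorilla     ...
Orangutan   ...

[Figure: rooted tree with root state j, internal nodes I1 (state u) and I2 (state v), and the leaves Gorilla, Orangutan, Chimp and Human carrying the observed states $O_1$ to $O_4$; the edges from the root are labelled $P(u \mid j)$ and $P(v \mid j)$.]

The likelihood of site i is

$$L_i = \sum_{j=1}^{4} f_j \sum_{u=1}^{4} \sum_{v=1}^{4} P(u \mid j)\, P(v \mid j)\, P(O_1 \mid u)\, P(O_2 \mid u)\, P(O_3 \mid v)\, P(O_4 \mid v)$$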
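The triple sum in the site likelihood can be evaluated directly for a four-leaf tree. The sketch below is illustrative only: it assumes Jukes-Cantor transition probabilities on every edge, and the names `jc_transition` and `site_likelihood`, the rate 0.1 and the edge length 0.5 are hypothetical choices, not values from the slides.

```python
import math
from itertools import product

NUCS = "ACGT"

def jc_transition(alpha, t):
    """Jukes-Cantor transition probabilities P(t) for rate alpha (assumed model)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(obs, f, P_root_I1, P_root_I2, P_leaf):
    """Likelihood of one site: sum over root state j and internal states u, v.

    obs    -- observed letters (O1, O2, O3, O4); O1, O2 hang below I1, O3, O4 below I2
    f      -- root frequencies f_j
    P_leaf -- list of four transition matrices, one per leaf edge
    """
    o = [NUCS.index(x) for x in obs]
    total = 0.0
    for j in range(4):
        for u in range(4):
            for v in range(4):
                total += (f[j] * P_root_I1[j][u] * P_root_I2[j][v]
                          * P_leaf[0][u][o[0]] * P_leaf[1][u][o[1]]
                          * P_leaf[2][v][o[2]] * P_leaf[3][v][o[3]])
    return total

f = [0.25] * 4
P = jc_transition(alpha=0.1, t=0.5)          # same hypothetical edge everywhere
L = site_likelihood("AGAA", f, P, P, [P, P, P, P])

# Sanity check: the likelihoods of all 4^4 site patterns sum to one.
total = sum(site_likelihood(p, f, P, P, [P, P, P, P])
            for p in product(NUCS, repeat=4))
```

Every factor in the sum is a transition probability, which is why a substitution model is needed before the likelihood can be computed at all.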

Topics
- how corrections for multiple substitutions are made
- some of the phylogenetic assumptions
- how we may evaluate phylogenetic assumptions
- an example involving bacterial DNA

Substitutions at a Single Site

[Figure: six scenarios for a single site in two diverging lineages: single substitution, multiple substitution, coincidental substitution, parallel substitution, convergent substitution and back substitution.]

Note: Every substitution overwrites the evidence of a past state, which leads to an erosion of the historical signal.

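Modeling Nucleotide Substitutions

Consider the evolution at a given site in terms of conditional rates-of-change from nucleotide i to nucleotide j (rows and columns in the order A, C, G, T):

$$R = \begin{pmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{pmatrix}$$

Note: Here $\alpha_{ij}$ is the conditional rate-of-change from nucleotide i to nucleotide j in the rate matrix R, and R is the most general Markov model for DNA.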
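As a minimal sketch of this construction, the rate matrix can be assembled from its twelve off-diagonal rates, with each diagonal entry set to minus the sum of the other entries in its row. The function name `rate_matrix` and the numerical rates are hypothetical, chosen only for illustration.

```python
NUCS = "ACGT"

def rate_matrix(rates):
    """Build a 4x4 rate matrix from off-diagonal rates.

    rates maps (i, j) nucleotide pairs to the conditional rate alpha_ij.
    Each diagonal entry is minus the sum of the off-diagonal entries in its row,
    so every row sums to zero.
    """
    R = [[0.0] * 4 for _ in range(4)]
    for a, i in enumerate(NUCS):
        for b, j in enumerate(NUCS):
            if i != j:
                R[a][b] = float(rates[(i, j)])
        R[a][a] = -sum(R[a])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return R

# Hypothetical rates: twelve distinct values, one per ordered pair of nucleotides.
example_rates = {(i, j): 0.1 * (1 + NUCS.index(i)) + 0.01 * NUCS.index(j)
                 for i in NUCS for j in NUCS if i != j}
R = rate_matrix(example_rates)
```

With twelve free off-diagonal rates and no symmetry imposed, this corresponds to the general model rather than to any of its simplified forms.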

Modelling a Site in one Sequence

Consider a Markov process, X, that results in nucleotide i (in the ancestor, at t = 0) being converted to nucleotide j over time t:

$$P_{ij}(t) = P[X(t) = j \mid X(0) = i]$$

If the rates of change are constant, then $P(t) = e^{Rt}$. In matrix notation, this is

$$P(t) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{(Rt)^k}{k!}$$
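The power series can be evaluated numerically by truncating it after a fixed number of terms. The sketch below does this in plain Python and checks the result against the known closed form for the Jukes-Cantor model, $P_{ii}(t) = 1/4 + (3/4)e^{-4\alpha t}$. The function names, the truncation at 60 terms and the values of alpha and t are assumptions made for the example; production code would normally use a library matrix exponential.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(R, t, terms=60):
    """Approximate P(t) = exp(Rt) by summing the first `terms` series terms."""
    n = len(R)
    Rt = [[R[i][j] * t for j in range(n)] for i in range(n)]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in P]                                        # (Rt)^0 / 0!
    for k in range(1, terms):
        term = mat_mul(term, Rt)
        term = [[x / k for x in row] for row in term]                   # (Rt)^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

alpha, t = 0.1, 2.0  # hypothetical rate and time
R_jc = [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]
P = transition_matrix(R_jc, t)
p_same_closed_form = 0.25 + 0.75 * math.exp(-4 * alpha * t)
```

Truncation is adequate here because the entries of Rt are small; for large rates or long times the series converges slowly and a scaling-and-squaring routine is the safer choice.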

Modelling a Site in two Sequences

Consider two Markov processes, X and Y, that start from a common ancestor at t = 0 and run for time t:

$$F_{ij}(t) = P[X(t) = i, Y(t) = j \mid X(0) = Y(0)]$$

In matrix notation, this is

$$F(t) = P_X(t)^{T}\, F(0)\, P_Y(t)$$
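A small numerical sketch of the joint distribution $F(t) = P_X(t)^T F(0) P_Y(t)$ follows. It assumes that F(0) is diagonal (the two sequences are identical at t = 0) with uniform ancestral frequencies, and that both lineages follow a Jukes-Cantor process; the helper names, the rate and the edge lengths are hypothetical.

```python
import math

def jc_transition(alpha, t):
    """Jukes-Cantor transition probabilities (assumed model for both lineages)."""
    e = math.exp(-4.0 * alpha * t)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def joint_distribution(P_x, F0, P_y):
    """F(t) = P_X(t)^T F(0) P_Y(t)."""
    return mat_mul(mat_mul(transpose(P_x), F0), P_y)

# At t = 0 both sequences carry the ancestral state, so F(0) is diagonal.
F0 = [[0.25 if i == j else 0.0 for j in range(4)] for i in range(4)]
F_t = joint_distribution(jc_transition(0.1, 1.0), F0, jc_transition(0.1, 2.0))
total = sum(sum(row) for row in F_t)
```

The entries of F(t) sum to one, and multiplying them by the number of sites gives the expected counts of each nucleotide pair in a pairwise alignment.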

Take-home Message #1
The substitution model, R, is an integral part of the transition function; it is used to estimate the probability of the present states, given R and t.

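Rate matrix revisited

$$R = \begin{pmatrix}
-\sum_{j \neq A} r_{Aj} & r_{AC} & r_{AG} & r_{AT} \\
r_{CA} & -\sum_{j \neq C} r_{Cj} & r_{CG} & r_{CT} \\
r_{GA} & r_{GC} & -\sum_{j \neq G} r_{Gj} & r_{GT} \\
r_{TA} & r_{TC} & r_{TG} & -\sum_{j \neq T} r_{Tj}
\end{pmatrix}$$

Typically, simplified forms of this rate matrix are used. These matrices belong to the GTR family of models and can be represented as

$$R = \begin{pmatrix}
\cdot & r_{AC} & r_{AG} & r_{AT} \\
r_{AC} & \cdot & r_{CG} & r_{CT} \\
r_{AG} & r_{CG} & \cdot & r_{GT} \\
r_{AT} & r_{CT} & r_{GT} & \cdot
\end{pmatrix}
\begin{pmatrix}
\pi_A & 0 & 0 & 0 \\
0 & \pi_C & 0 & 0 \\
0 & 0 & \pi_G & 0 \\
0 & 0 & 0 & \pi_T
\end{pmatrix}$$

where the diagonal entries of R are set so that each row sums to zero.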

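Commonly-used Markov Models

Jukes & Cantor (1969)

$$R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}$$

Assumptions:
1. One rate ($\alpha$)
2. Uniform nucleotide content

Kimura (1980)

$$R = \begin{pmatrix}
-(2\beta + \alpha) & \beta & \alpha & \beta \\
\beta & -(2\beta + \alpha) & \beta & \alpha \\
\alpha & \beta & -(2\beta + \alpha) & \beta \\
\beta & \alpha & \beta & -(2\beta + \alpha)
\end{pmatrix}$$

Assumptions:
1. Two rates ($\alpha$ and $\beta$)
2. Uniform nucleotide content

Hasegawa, Kishino & Yano (1985)

$$R = \begin{pmatrix}
-(\beta\pi_Y + \alpha\pi_G) & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & -(\beta\pi_R + \alpha\pi_T) & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & -(\beta\pi_Y + \alpha\pi_A) & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -(\beta\pi_R + \alpha\pi_C)
\end{pmatrix}$$

where $\pi_R = \pi_A + \pi_G$ and $\pi_Y = \pi_C + \pi_T$.

Assumptions:
1. Two rates ($\alpha$ and $\beta$)
2. Non-uniform nucleotide content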
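The three models are nested, which a short sketch can make concrete: the Hasegawa, Kishino & Yano matrix reduces to the Kimura matrix when the frequency factors are all equal, and the Kimura matrix reduces to the Jukes & Cantor matrix when $\alpha = \beta$. The function names `hky85`, `k80` and `jc69` are labels chosen for this example, and setting every frequency factor to 1 is an assumption made only to reproduce the matrices as written above (a different scaling convention would use 1/4).

```python
NUCS = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85(alpha, beta, pi):
    """Rate alpha for transitions, beta for transversions, each scaled by pi of the target."""
    R = [[0.0] * 4 for _ in range(4)]
    for a, i in enumerate(NUCS):
        for b, j in enumerate(NUCS):
            if i != j:
                rate = alpha if (i, j) in TRANSITIONS else beta
                R[a][b] = rate * pi[j]
        R[a][a] = -sum(R[a])  # rows sum to zero
    return R

def k80(alpha, beta):
    """Kimura's two-rate model: frequency factors set to 1 to match the matrix above."""
    return hky85(alpha, beta, {n: 1.0 for n in NUCS})

def jc69(alpha):
    """Jukes-Cantor: one rate for every substitution."""
    return k80(alpha, alpha)

pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}   # hypothetical frequencies
R_hky = hky85(alpha=2.0, beta=1.0, pi=pi)
R_k80 = k80(alpha=2.0, beta=1.0)
R_jc = jc69(alpha=0.5)
```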

The Phylogenetic Assumptions

Given an alignment of nucleotides, phylogenetic methods commonly assume that the sites have evolved under
- stationary and reversible conditions
- homogeneous conditions
- independent and identical conditions

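Phylogenetic Assumptions

Consider the following scenario: an ancestral sequence (0) gives rise to two descendant sequences (1 and 2), each evolving along its own edge.

- $F_0 = [f_A\ f_C\ f_G\ f_T]$: nucleotide frequencies in the ancestor
- $F_1 = [f_A\ f_C\ f_G\ f_T]$ and $F_2 = [f_A\ f_C\ f_G\ f_T]$: nucleotide frequencies in sequences 1 and 2
- $R_1$ and $R_2$: rate matrices for the edges leading to sequences 1 and 2
- $\Pi_1 = [\pi_A\ \pi_C\ \pi_G\ \pi_T]$ and $\Pi_2 = [\pi_A\ \pi_C\ \pi_G\ \pi_T]$: the distributions associated with $R_1$ and $R_2$

Each rate matrix has the general form (k = 1, 2)

$$R_k = \begin{pmatrix}
-\sum_{j \neq A} \alpha_{Aj} & \alpha_{AC} & \alpha_{AG} & \alpha_{AT} \\
\alpha_{CA} & -\sum_{j \neq C} \alpha_{Cj} & \alpha_{CG} & \alpha_{CT} \\
\alpha_{GA} & \alpha_{GC} & -\sum_{j \neq G} \alpha_{Gj} & \alpha_{GT} \\
\alpha_{TA} & \alpha_{TC} & \alpha_{TG} & -\sum_{j \neq T} \alpha_{Tj}
\end{pmatrix}$$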

The Stationary Condition

[Diagram: ancestor 0 with $F_0 = [f_A\ f_C\ f_G\ f_T]$; descendants 1 and 2 with $\Pi_1 = [\pi_A\ \pi_C\ \pi_G\ \pi_T]$ and $\Pi_2 = [\pi_A\ \pi_C\ \pi_G\ \pi_T]$.]

Note: The stationary condition is met if $F_0 = \Pi_1 = \Pi_2$, implying that the marginal distributions of $R_1$ and $R_2$ are the same, even though $R_1$ and $R_2$ may differ!

The Reversible Condition

[Diagram: ancestor 0; descendants 1 and 2 with $(R_1, \Pi_1)$ and $(R_2, \Pi_2)$.]

Notes:
- If the process is stationary, then $\Pi_1 = \Pi_2$.
- Moreover, if $\pi_i R_{ij} = \pi_j R_{ji}$ for all i and j, then the process is reversible.
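Both conditions can be checked numerically for a given rate matrix. The sketch below treats $\Pi$ as the stationary distribution of R, meaning $\Pi R = 0$ (a standard property, stated here as the working assumption), and tests the balance condition $\pi_i R_{ij} = \pi_j R_{ji}$ pair by pair. The function names and the two example matrices are hypothetical.

```python
def is_stationary(pi, R, tol=1e-9):
    """True if pi R = 0, i.e. pi is unchanged by the process with rate matrix R."""
    n = len(R)
    return all(abs(sum(pi[i] * R[i][j] for i in range(n))) < tol for j in range(n))

def is_reversible(pi, R, tol=1e-9):
    """True if pi_i R_ij = pi_j R_ji for every pair of states."""
    n = len(R)
    return all(abs(pi[i] * R[i][j] - pi[j] * R[j][i]) < tol
               for i in range(n) for j in range(i + 1, n))

uniform = [0.25] * 4

# Jukes-Cantor matrix: stationary and reversible under uniform frequencies.
R_jc = [[-0.3 if i == j else 0.1 for j in range(4)] for i in range(4)]

# Cyclic process A -> C -> G -> T -> A only: uniform frequencies are stationary,
# but the balance condition fails, so the process is not reversible.
R_cycle = [[0.0] * 4 for _ in range(4)]
for i in range(4):
    R_cycle[i][(i + 1) % 4] = 1.0
    R_cycle[i][i] = -1.0
```

The cyclic example shows that stationarity does not imply reversibility, which is why the two conditions are listed separately.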

The Homogeneous Condition

[Diagram: ancestor 0; descendants 1 and 2 with $R_1$ and $R_2$.]

Notes:
- The homogeneous condition is met for the Markov processes, $R_1$ and $R_2$, if $R_1 = R_2$.
- If the homogeneous condition is met, then $\Pi_1 = \Pi_2$; however, non-stationary, and therefore non-reversible, conditions may still prevail (i.e., if $F_0 \neq \Pi_1 = \Pi_2$).

IID condition

[Figure: two sequences, X and Y, diverging from a common ancestor at time t = 0 and evolving under the rate matrices $R_X$ and $R_Y$.]

Seq_1  ACGTGTCCATGATTA...
Seq_2  ACCTGCCCAAGATAA...

Notes: For computational reasons, it is convenient to assume that
- sites in the sequence have evolved independently
- sites in the sequence have evolved under the same model ($R_X = R_Y$)

Rate Heterogeneity Across Sites RNA-coding Genes

Take-home Message #2
Phylogenetic analyses require the users to make certain assumptions about the data before these are investigated in detail.

Questions
- Are these phylogenetic assumptions realistic?
- How can we assess whether the phylogenetic assumptions are met by the data?

Answer
Inspecting the data before or after inferring the phylogeny increases the chance of finding out what might have taken place during the evolution.

Considering the IID Condition
- Visual inspection of the alignment might show whether some regions evolved faster.
- Assume rate-heterogeneity across some sites, and then use phylogenetic methods that account for this.

Hierarchical Likelihood-Ratio Test

Consider the following decision tree.

[Figure: decision tree of yes/no questions, asked in order: Uniform nucleotide content? One conditional rate? Two or six conditional rates?]

Source: Posada & Crandall (1998). Bioinformatics 14, 817-818.

Take-home Message #3
The likelihood-ratio test can be used to identify a suitable substitution model for a given data set.
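One step of such a test can be sketched as follows. The statistic is twice the difference in maximised log-likelihood between the richer and the simpler model, compared with a chi-square distribution whose degrees of freedom equal the number of extra free parameters. The log-likelihood values below are invented for illustration, and `chi2_sf` is a hand-written helper for integer degrees of freedom, not a library function.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-y) * sum_{i < df/2} y^i / i!
        term = math.exp(-y)
        q = term
        for i in range(1, df // 2):
            term *= y / i
            q += term
        return q
    # Odd df: start from df = 1 and apply Q(a + 1, y) = Q(a, y) + y^a e^(-y) / Gamma(a + 1)
    q = math.erfc(math.sqrt(y))
    a = 0.5
    for _ in range(df // 2):
        q += math.exp(-y + a * math.log(y) - math.lgamma(a + 1.0))
        a += 1.0
    return q

def likelihood_ratio_test(lnL_simple, lnL_rich, extra_params):
    """Return the LRT statistic and its chi-square p-value."""
    stat = 2.0 * (lnL_rich - lnL_simple)
    return stat, chi2_sf(stat, extra_params)

# Hypothetical values: a one-rate model against a two-rate model (one extra parameter).
stat, p_value = likelihood_ratio_test(lnL_simple=-1234.5, lnL_rich=-1228.2, extra_params=1)
```

A small p-value favours the richer model at that step of the hierarchy; the comparison is only valid when the simpler model is a special case of the richer one.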

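Testing assumptions prior to modelling

- n F(t): expected divergence matrix
- N(t): observed divergence matrix

Examples:

$$N(0) = \begin{pmatrix}
294 & 0 & 0 & 0 \\
0 & 372 & 0 & 0 \\
0 & 0 & 829 & 0 \\
0 & 0 & 0 & 655
\end{pmatrix}
\qquad
N(t) = \begin{pmatrix}
244 & 31 & 8 & 11 \\
28 & 321 & 14 & 9 \\
11 & 13 & 801 & 4 \\
14 & 10 & 3 & 628
\end{pmatrix}$$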
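An observed divergence matrix is obtained by counting, site by site, which nucleotide pair occurs in a pairwise alignment. The sketch below is a minimal version that assumes the two sequences are already aligned and skips any column containing a character other than A, C, G or T. The input strings are the visible prefixes of the two example sequences shown on the IID slide, used here only as toy data.

```python
NUCS = "ACGT"

def divergence_matrix(seq1, seq2):
    """Count nucleotide pairs: entry [i][j] is the number of sites with i in seq1 and j in seq2."""
    index = {n: k for k, n in enumerate(NUCS)}
    N = [[0] * 4 for _ in range(4)]
    for a, b in zip(seq1, seq2):
        if a in index and b in index:      # skip gaps and ambiguity codes
            N[index[a]][index[b]] += 1
    return N

N = divergence_matrix("ACGTGTCCATGATTA", "ACCTGCCCAAGATAA")
n_sites = sum(sum(row) for row in N)
n_identical = sum(N[i][i] for i in range(4))
```

Off-diagonal counts record the differences between the sequences, and the row and column sums give the nucleotide composition of the first and second sequence respectively.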

Matched-pairs Tests of Symmetry

Seq 1  AGACTAGGTCTTGTATAGACTAATGTTCACAGTTTTTTAACTTTGTCAATGGA...
Seq 2  AGACGAGGTCGTGTATGGCCTCGTGAGCACGGGTTGTTCACTCCGCCAACGGT...

Divergence matrix (rows: Seq 1; columns: Seq 2):

        A    C    G    T    Σ
A       5    4    7    1   17
C       0    7    2    0    9
G       0    1    5    0    6
T       1    4    5    8   18
Σ       6   16   19    9

Test of Symmetry

$$X^2_{\text{Bowker}} = \sum_{i<j} \frac{(x_{ij} - x_{ji})^2}{x_{ij} + x_{ji}}$$

Test of Marginal Symmetry

$$X^2_{\text{Stuart}} = D V^{-1} D^{T}, \qquad D_i = x_{i\cdot} - x_{\cdot i}, \qquad V = \text{covariance matrix of } D$$

Note: These test statistics are asymptotically $\chi^2$-distributed on $\nu$ degrees of freedom.

Sources: Bowker (1948). JASA 43, 572-574; Stuart (1955). Biometrika 42, 412-416.
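Bowker's statistic can be computed directly from the 4 x 4 table of counts. The sketch below applies it to the table on this slide, taking rows as sequence 1 and columns as sequence 2 (an assumption about the layout; the statistic itself does not depend on it). Pairs with $x_{ij} + x_{ji} = 0$ are skipped and do not count towards the degrees of freedom, which is a common convention adopted here rather than something stated on the slide.

```python
def bowker_test(table):
    """Bowker's matched-pairs test of symmetry.

    Returns the statistic and the degrees of freedom (number of informative pairs).
    """
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:                                   # skip uninformative pairs
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df

# Counts from the slide, order A, C, G, T.
counts = [
    [5, 4, 7, 1],
    [0, 7, 2, 0],
    [0, 1, 5, 0],
    [1, 4, 5, 8],
]
stat, df = bowker_test(counts)
```

For this table the six pairs contribute 4, 7, 0, 1/3, 4 and 5, so the statistic is about 20.33 on 6 degrees of freedom. That exceeds 12.59, the 5% critical value of the chi-square distribution with 6 degrees of freedom, so symmetry would be rejected at that level.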

Bacterial 16S rDNA Sequences

Ribosomal RNA from five bacteria was compared using the matched-pairs test of homogeneity.

Probabilities:

              Thermotoga     Bacillus       Deinococcus    Thermus
Aquifex       1.32 × 10^-01  1.79 × 10^-11  3.45 × 10^-10  5.09 × 10^-01
Thermotoga                   2.64 × 10^-12  6.69 × 10^-12  4.15 × 10^-01
Bacillus                                    9.95 × 10^-01  3.64 × 10^-09
Deinococcus                                                5.99 × 10^-11

Note: It is highly unlikely that these data have evolved under homogeneous conditions, implying that it would be unwise to use a time-reversible Markov model.

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.

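Phylogeny of Bacterial Ribosomal RNA

[Figure: two phylogenies of the bacterial ribosomal RNA data, one inferred with the GTR Markov model and one with the general Markov model.]

Source: Ababneh et al. (2006). Bioinformatics 22, 1225-1231.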

Take-home Message #4
It is important to consider the substitution models carefully when using them in phylogenetic studies.

Suggested Literature
- RDM Page, EC Holmes (1998). Molecular Evolution. Chapter 5 (Sections 5.2, 5.3). Important reading.
- W-H Li (1997). Molecular Evolution. Chapter 3 (pp. 59-78) contains descriptions that are better than those in Page & Holmes (1998). Important reading.
- D Posada, KA Crandall (1998). MODELTEST: testing the model of DNA substitution. Bioinformatics 14, 817-818. Useful reading.
- LS Jermiin et al. (2008). Phylogenetic model evaluation. Pp. 331-363 in Bioinformatics - Volume I: Data, Sequence Analysis and Evolution (Ed. Keith J), Humana Press, Totowa, NJ. Important reading.