Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Similar documents
Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture 4. Models of DNA and protein change. Likelihood methods

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Probabilistic modeling and molecular phylogeny

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Inferring Molecular Phylogeny

Phylogenetics: Building Phylogenetic Trees

PHYLOGENY ESTIMATION AND HYPOTHESIS TESTING USING MAXIMUM LIKELIHOOD

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Lab 9: Maximum Likelihood and Modeltest

Week 5: Distance methods, DNA and protein models

BMI/CS 776 Lecture 4. Colin Dewey

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Likelihood

Lecture 4. Models of DNA and protein change. Likelihood methods

Consensus Methods. * You are only responsible for the first two

Molecular Evolution, course # Final Exam, May 3, 2006

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Phylogenetic Assumptions

Consensus methods. Strict consensus methods

Constructing Evolutionary/Phylogenetic Trees

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Constructing Evolutionary/Phylogenetic Trees

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2016 University of California, Berkeley. Parsimony & Likelihood [draft]

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

C.DARWIN ( )

Special features of phangorn (Version 2.3.1)

Lecture Notes: Markov chains

Dr. Amira A. AL-Hosary

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Natural selection on the molecular level

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

Evolutionary Genomics: Phylogenetics

Phylogenetic Inference and Hypothesis Testing. Catherine Lai (92720) BSc(Hons) Department of Mathematics and Statistics University of Melbourne

Evolutionary Analysis of Viral Genomes

Concepts and Methods in Molecular Divergence Time Estimation

Taming the Beast Workshop

Phylogenetic Inference using RevBayes

Phylogenetics: Parsimony and Likelihood. COMP Spring 2016 Luay Nakhleh, Rice University

Phylogenetics. Applications of phylogenetics. Unrooted networks vs. rooted trees. Outline

Reconstruire le passé biologique modèles, méthodes, performances, limites

Introduction to MEGA

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetics. BIOL 7711 Computational Bioscience

Chapter 7: Models of discrete character evolution

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

EVOLUTIONARY DISTANCES

MOLECULAR SYSTEMATICS: A SYNTHESIS OF THE COMMON METHODS AND THE STATE OF KNOWLEDGE

1. Can we use the CFN model for morphological traits?

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Discrete & continuous characters: The threshold model

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200B Spring 2011 University of California, Berkeley

Maximum Likelihood in Phylogenetics

Integrative Biology 200 "PRINCIPLES OF PHYLOGENETICS" Spring 2018 University of California, Berkeley

An Introduction to Bayesian Inference of Phylogeny

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

Bayesian Phylogenetics

The Idiot s Guide to the Zen of Likelihood in a Nutshell in Seven Days for Dummies, Unleashed

T R K V CCU CG A AAA GUC T R K V CCU CGG AAA GUC. T Q K V CCU C AG AAA GUC (Amino-acid


Phylogenetic Tree Reconstruction

Phylogenetic Inference using RevBayes

Mutation models I: basic nucleotide sequence mutation models

Inferring Complex DNA Substitution Processes on Phylogenies Using Uniformization and Data Augmentation

Letter to the Editor. Department of Biology, Arizona State University

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Molecular Evolution and Comparative Genomics

What Is Conservation?

The Phylo- HMM approach to problems in comparative genomics, with examples.

Molecular Evolution and Phylogenetic Tree Reconstruction

Maximum Likelihood in Phylogenetics

"PRINCIPLES OF PHYLOGENETICS: ECOLOGY AND EVOLUTION" Integrative Biology 200 Spring 2018 University of California, Berkeley

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Using models of nucleotide evolution to build phylogenetic trees

Phylogenetics and Darwin. An Introduction to Phylogenetics. Tree of Life. Darwin s Trees

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Efficiencies of maximum likelihood methods of phylogenetic inferences when different substitution models are used

Lecture 6 Phylogenetic Inference

7. Tests for selection

KaKs Calculator: Calculating Ka and Ks Through Model Selection and Model Averaging

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Bayesian Models for Phylogenetic Trees

A Bayesian Approach to Phylogenetics

Homoplasy. Selection of models of molecular evolution. Evolutionary correction. Saturation

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains

Are Guinea Pigs Rodents? The Importance of Adequate Models in Molecular Phylogenetics

Inferring Speciation Times under an Episodic Molecular Clock

The Phylogenetic Handbook

Biology 559R: Introduction to Phylogenetic Comparative Methods Topics for this week (Jan 27 & 29):

Mixture Models in Phylogenetic Inference. Mark Pagel and Andrew Meade Reading University.

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Molecular Evolution & Phylogenetics Traits, phylogenies, evolutionary models and divergence time between sequences

Transcription:

Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018

Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

It [maximum likelihood] involves finding that evolutionary tree which yields the highest probability of evolving the observed data (Felsenstein 1981).

Maximum Likelihood Likelihood = P(D M) Probability Data Model

Maximum Likelihood Likelihood = P(D M) Probability Alignment Topology (Ψ), Branch lengths (v), Model Parameters (θ)

Toss a coin 10 times: H T H H T T H H H H Probability of heads is p Probability of tails is 1-p What is the probability of our data given our model of a two-sided coin? H T H H T T H H H H L = p * (1-p) * p * p *(1-p) * (1-p) * p * p * p * p L = p 7 * (1-p) 3

p = seq(from = 0, to = 1, by = 0.01) L = p^7 * (1 p)^3

Our new approach 1. Build a model of character evolution 2. Use that model to calculate the likelihood of a particular phylogeny 3. Find the tree/ parameter values with the highest likelihood (maximum likelihood)

Our new approach: Step 1 Build a model of character evolution What is a model? Parameters vs. parameter values A T C G A???? T???? C???? G????

Our new approach: Step 1 Build a model of character evolution What is a model? Parameters vs. parameter values A T C G A - a a a T a - a a C a a - a G a a a -

Our new approach: Step 1 Build a model of character evolution What is a model? Parameters vs. parameter values A T C G A - 0.4 0.4 0.4 T 0.4-0.4 0.4 C 0.4 0.4-0.4 G 0.4 0.4 0.4 -

Our new approach: Step 1 Build a model of character evolution What is a model? Parameters vs. parameter values Modified from Huelsenbeck 2017

Q = Jukes Cantor (JC69) A G C T A - μ/3 μ/3 μ/3 G μ/3 - μ/3 μ/3 C μ/3 μ/3 - μ/3 T μ/3 μ/3 μ/3 - Kimura (K80) A G C T A - K 1 1 G K - 1 1 C 1 1 - K T 1 1 K - * μ A <-> G: Transition (purine purine) C <-> T: Transition (pyrimadine pyrimadine) Felsenstein (F81) A G C T A - π G π C π T G π A - π C π T C π A π G - π T T π A π G π C - Hasegawa, Kishino, Yano (HKY85) A G C T A - kπ G π C π T * μ G kπ A - π C π T * μ C π A π G - kπ T T π A π G kπ C - Base frequencies vary

General Time Reversible (GTR): A G C T A - απ G βπ C γπ T G απ A - δπ C επ T * μ C βπ A δπ G - ηπ T T γπ A επ G ηπ C - Rate parameters: α = A <-> G β = A <-> C γ = A <-> T δ = C <-> G ε = T <-> G η = T <-> C Stationary base frequencies: π A = frequency of A π G = frequency of G π C = frequency of A π T = frequency of T

General Time Reversible (GTR): Markov models are memoryless A G C T A - απ G βπ C γπ T G απ A - δπ C επ T * μ C βπ A δπ G - ηπ T T γπ A επ G ηπ C - Rate parameters: α = A <-> G β = A <-> C γ = A <-> T δ = C <-> G ε = T <-> G η = T <-> C Stationary base frequencies: π A = frequency of A π G = frequency of G π C = frequency of A π T = frequency of T

https://www.summitpost.org/giant-senecio/762750 https://www.summitpost.org/frailejon-espeletia-pycnophylla/565302

Our new approach: Step 2 It [maximum likelihood] involves finding that evolutionary tree which yields the highest probability of evolving the observed data (Felsenstein 1981). GAATC GTATC AAATC GAATC GTATC AAATC GAATC GTATC AAATC Let s start with one site in the alignment

Our new approach: Step 2 P ij (t) π s0 Felsenstein 1981

Our new approach: Step 2 0 0 1 0 0 0 1 0 1 0 0 0.1.2.8.2.4.2.5.2 Probability of node for all possible base pairs Felsenstein s pruning algorithm Figured modified from Huelsenbeck 2017

Where do we get P ij (t) and π s0? Q = A G C T A -3λ λ λ λ G λ -3λ λ λ C λ λ -3λ λ T λ λ λ -3λ A G C T P(t) = e Qt = A p 0 (t) p 1 (t) p 1 (t) p 1 (t) G p 1 (t) p 0 (t) p 1 (t) p 1 (t) C p 1 (t) p 1 (t) p 0 (t) p 1 (t) with, p 0 (t) = ¼ + ¾e -4λt p 1 (t) = ¼ - ¼e -4λt T p 1 (t) p 1 (t) p 1 (t) p 0 (t)

Pick a tree & parameter values Use Felsenstein s Pruning Algorithm & your model of evolution to calculate the likelihood of a tree for a single site in an alignment Assume all sites in alignment are independent & multiply likelihoods of all sites to calculate the likelihood of the tree given the alignment Vary parameter values for that topology to find the maximum likelihood Pick a new tree, repeat! Overview:

This process is often untenable computationally Number of rooted trees increases A LOT with increasing number of taxa multiplying small number -> smaller numbers so we use log likelihood

http://www.audubon.org/magazine/march-april-2009/beetle-mania

Main differences: Branch length matter!!!!! Estimation of parameters Model underlies the process

Main differences: Branch length matter!!!!! ATTCG ATTCG Time What happened along the branches? ATTCG ATTCG

Main differences: Branch length matter!!!!! ATTCG ATTCG Time C -> G What happened along the branches? ATTCG G -> C ATTCG

Main differences: Branch length matter!!!!! Estimation of parameters Model underlies the process

Main differences: Branch lengths matter!!!!! Estimation of parameters How does this compare to weighting in parsimony?

Fancy Things Error? Rate heterogeneity Model testing Additional models (codon, amino acid, etc.)

Get idea of error through bootstrapping But what error is this measuring?

Rate Heterogeneity: Partitions gene1 gene2 gene3 gene4 gene5 A G C T A -3λ λ λ λ G λ -3λ λ λ C λ λ -3λ λ T λ λ λ -3λ

Rate Heterogeneity: Partitions A G C T A G C T A -3λ λ λ λ A -3λ λ λ λ G λ -3λ λ λ G λ -3λ λ λ C λ λ -3λ λ C λ λ -3λ λ T λ λ λ -3λ T λ λ λ -3λ gene1 gene2 gene3 gene4 gene5 A G C T A -3λ λ λ λ G λ -3λ λ λ C λ λ -3λ λ T λ λ λ -3λ A G C T A -3λ λ λ λ G λ -3λ λ λ C λ λ -3λ λ T λ λ λ -3λ A G C T A -3λ λ λ λ G λ -3λ λ λ C λ λ -3λ λ T λ λ λ -3λ Estimate model parameters separately for different genes!

Rate Heterogeneity: Partitions gene1 gene2 gene3 gene4 gene5 Estimate model parameters separately for different codon positions!

Rate Heterogeneity: gamma (Γ)

Model testing Likelihood ratio test for nested models: LR = 2 * (-lnl1 -lnl2) HKY85 -lnl = 1787.08 GTR -lnl = 1784.82 LR = 2(1787.08-1784.82) = 4.53, df = 4 critical value (P = 0.05) = 9.49 (~chi 2 distribution)

Model testing Aikake Information Criterion (AIC) (non nested): Find model with lowest AIC Models need not be nested Can correct for small sample sizes (AICc) BIC is a similar alternative

Additional models Amino Acid Codon non-markov What would a model of morphology look like?