Likelihoods and Phylogenies

Joe Felsenstein
Department of Genome Sciences and Department of Biology
University of Washington, Seattle

Likelihoods and Phylogenies p.1/68

An ideal parsimony method?

Ideally, we'd like to have a parsimony method that

- took into account less parsimonious as well as most parsimonious state reconstructions,
- weighted changes differently if they occur in branches of different length, and
- weighted different kinds of events (e.g. transitions, transversions) differently.

There is such a method. It is maximum likelihood. But...

- it requires a believable model, and
- it is computationally intensive.

Odds ratio justification for maximum likelihood

D: the data
H_1: Hypothesis 1
H_2: Hypothesis 2
| : the symbol for "given"

$$\underbrace{\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)}}_{\text{posterior odds ratio}} = \underbrace{\frac{\mathrm{Prob}(D \mid H_1)}{\mathrm{Prob}(D \mid H_2)}}_{\text{likelihood ratio}} \; \underbrace{\frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}}_{\text{prior odds ratio}}$$

A simple example of Bayes' Theorem

If a space probe finds no Little Green Men on Mars, when it would have a 1/3 chance of missing them if they were there, the likelihoods of that observation are 1/3 under "yes" (they are there) and 1 under "no" (they are not).

With prior odds of 4 : 1 in favor of "yes":

$$\underbrace{\frac{4}{1}}_{\text{priors}} \times \underbrace{\frac{1/3}{1}}_{\text{likelihoods}} = \underbrace{\frac{4}{3}}_{\text{posteriors}}$$

With prior odds of 1 : 4 (that is, 4 : 1 in favor of "no"):

$$\underbrace{\frac{1}{4}}_{\text{priors}} \times \underbrace{\frac{1/3}{1}}_{\text{likelihoods}} = \underbrace{\frac{1}{12}}_{\text{posteriors}}$$

The likelihood ratio term ultimately dominates

If we see one Little Green Man, the likelihood calculation does the right thing:

$$\frac{2/3}{0} \times \frac{1}{4} = \infty$$

(put this way, this is OK but not mathematically kosher)

If after n missions we keep seeing none, the likelihood ratio term is

$$\left(\frac{1}{3}\right)^{n}$$

It dominates the calculation, overwhelming the prior.

Thus even if we don't have a prior we can believe in, we may be interested in knowing which hypothesis the likelihood ratio is recommending...
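The odds arithmetic in the Little Green Men example can be checked with exact fractions. This is only a sketch: the function names are hypothetical, and the two priors (4 : 1 and 1 : 4 for "yes") are the ones used in the example.

```python
from fractions import Fraction

def posterior_odds(prior_odds, lik_h1, lik_h2):
    """Posterior odds of H1 against H2 = likelihood ratio * prior odds."""
    return Fraction(lik_h1) / Fraction(lik_h2) * Fraction(prior_odds)

# The probe sees nothing: Prob(no sighting | present) = 1/3,
# Prob(no sighting | absent) = 1.
miss = Fraction(1, 3)
believer = posterior_odds(Fraction(4, 1), miss, 1)   # prior odds 4:1 for "yes"
skeptic = posterior_odds(Fraction(1, 4), miss, 1)    # prior odds 1:4 for "yes"
print(believer, skeptic)                             # 4/3 1/12

def odds_after_missions(prior_odds, n):
    """Posterior odds after n independent missions that all see nothing."""
    return posterior_odds(prior_odds, miss ** n, 1)

print(odds_after_missions(Fraction(4, 1), 10))       # the prior is soon overwhelmed
```

Whatever prior odds are supplied, the factor (1/3)^n shrinks the posterior odds geometrically, which is the sense in which the likelihood ratio term dominates.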

Likelihood in simple coin-tossing

Tossing a coin n times, with probability p of heads, the probability of the outcome HHTHTTTHTTH (5 heads and 6 tails) is

$$L = p\,p\,(1-p)\,p\,(1-p)(1-p)(1-p)\,p\,(1-p)(1-p)\,p$$

which is

$$L = p^{5}(1-p)^{6}$$

Plotting L against p to find its maximum:

[Figure: Likelihood plotted against p from 0.0 to 1.0, with its maximum marked at p = 0.454.]

Differentiating to find the maximum:

Differentiating the expression for L with respect to p and equating the derivative to 0, the value of p that is at the peak is found (not surprisingly) to be p = 5/11:

$$\frac{\partial L}{\partial p} = \left(\frac{5}{p} - \frac{6}{1-p}\right) p^{5}(1-p)^{6} = 0$$

$$5 - 11p = 0$$

$$\hat{p} = \frac{5}{11}$$
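As a numerical cross-check of the calculus, the log-likelihood for 5 heads and 6 tails can be evaluated on a grid and its peak compared with 5/11. This is a minimal sketch; the grid spacing and the function name are arbitrary choices.

```python
import math

def log_likelihood(p, heads=5, tails=6):
    """ln L = heads*ln(p) + tails*ln(1 - p) for a given set of tosses."""
    return heads * math.log(p) + tails * math.log(1.0 - p)

# Evaluate ln L on a fine grid of p values strictly inside (0, 1).
grid = [k / 10000 for k in range(1, 10000)]
p_grid = max(grid, key=log_likelihood)

p_hat = 5 / 11          # closed-form maximum from setting the derivative to zero
print(p_grid, p_hat)
```

Working with ln L rather than L does not move the maximum, because the logarithm is monotone.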

You already know many likelihood estimators

Many commonly used estimators in statistics are actually MLEs. For example:

- the empirical average as the estimate of the mean of a sample from a normal distribution
- the correlation coefficient
- the slope of a regression of Y on X
- the observed fraction of heads as the estimate of p in tossing coins

A likelihood curve

[Figure: likelihood curve in one parameter. Axes: length of a branch in the tree, and Ln (Likelihood).]

Its maximum likelihood estimate

[Figure: the same curve, with the maximum likelihood estimate (MLE) marked at its peak.]

The (approximate, asymptotic) confidence interval

[Figure: the same curve with the maximum likelihood estimate and the confidence interval derived from it. The 95% confidence interval consists of the branch lengths whose Ln (Likelihood) is below the peak by no more than 1/2 the value of a chi-square with 1 d.f. that is significant at 95%.]
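The same construction can be tried on a one-parameter problem that needs no tree. The sketch below reuses the coin-tossing likelihood (5 heads, 6 tails) instead of a branch length, which is a substitution made purely for illustration; the constant 3.841459 is the tabulated 95% point of a chi-square with 1 d.f., and the helper names are hypothetical.

```python
import math

CHI2_1DF_95 = 3.841459   # tabulated 95% critical value, chi-square with 1 d.f.

def log_lik(p, heads=5, tails=6):
    return heads * math.log(p) + tails * math.log(1.0 - p)

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid          # root lies in the upper half
        else:
            hi = mid          # root lies in the lower half
    return 0.5 * (lo + hi)

p_hat = 5 / 11
cutoff = log_lik(p_hat) - 0.5 * CHI2_1DF_95

def drop(p):
    """Positive inside the interval, negative outside it."""
    return log_lik(p) - cutoff

lower = bisect(drop, 1e-9, p_hat)
upper = bisect(drop, p_hat, 1.0 - 1e-9)
print(f"MLE {p_hat:.3f}, approximate 95% interval ({lower:.3f}, {upper:.3f})")
```

The interval is not symmetric about the estimate, because the log-likelihood curve is not symmetric; this is one way it differs from an interval built from a standard error.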

Contours of a likelihood surface in two dimensions

[Figure: contour plot of a likelihood surface. Axes: length of branch 1 and length of branch 2.]

Where the maximum likelihood estimate is

[Figure: the same contours, with the MLE marked at the peak.]

Likelihood-based confidence set for two variables

[Figure: the same contours; the shaded area is the joint confidence interval.] The height of this contour is less than at the peak by an amount equal to 1/2 the chi-square value with two degrees of freedom which is significant at the 95% level.

Likelihood-based confidence interval for one variable

[Figure: the same contours; the shaded area is the confidence interval.] The height of this contour is less than at the peak by an amount equal to 1/2 the chi-square value with one degree of freedom which is significant at the 95% level.

Scale-invariance of ML estimates

In the case of a tree with one branch, whose length can be expressed either by the (pseudo-)time t or the probability of base change p, the value of p which achieves the highest likelihood corresponds exactly to the value of t which achieves the highest likelihood, so it doesn't matter which scale we work on as long as one can be translated into the other.

[Figure: two panels of ln L. Left: ln L against t from 0 to 5, with its peak at $\hat{t} = 0.383112$ (corresponding to $\hat{p} = 0.3$). Right: ln L against p from 0.0 to 0.8, with its peak at $\hat{p} = 0.3$ (corresponding to $\hat{t} = 0.383112$).]

Calculating the likelihood of a tree

If we have molecular sequences on a tree, the likelihood is the product over sites of the probabilities of the data $D^{[i]}$ for each site (if those evolve independently):

$$L = \mathrm{Prob}(D \mid T) = \prod_{i=1}^{\text{sites}} \mathrm{Prob}(D^{[i]} \mid T)$$

With log-likelihoods, the product becomes a sum:

$$\ln L = \ln \mathrm{Prob}(D \mid T) = \sum_{i=1}^{\text{sites}} \ln \mathrm{Prob}(D^{[i]} \mid T)$$

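Calculating the likelihood for site i on a tree

[Figure: a tree with root w, interior nodes x, y and z, five tips, and branches $t_1$ to $t_8$. The $t_i$ are "branch lengths" (rate × time).]

Sum over all possible states (bases) at the interior nodes. Writing $s_1, \dots, s_5$ for the bases observed at the tips at the ends of branches $t_1, \dots, t_5$:

$$L^{(i)} = \sum_x \sum_y \sum_z \sum_w \mathrm{Prob}(w)\, \mathrm{Prob}(x \mid w, t_7)\, \mathrm{Prob}(s_1 \mid x, t_1)\, \mathrm{Prob}(s_2 \mid x, t_2)\, \mathrm{Prob}(z \mid w, t_8)\, \mathrm{Prob}(s_3 \mid z, t_3)\, \mathrm{Prob}(y \mid z, t_6)\, \mathrm{Prob}(s_4 \mid y, t_4)\, \mathrm{Prob}(s_5 \mid y, t_5)$$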

Calculating the likelihood for site i on a tree

We use the conditional likelihoods:

$$L^{(i)}_j(s)$$

These give the probability of everything at site i at or above node j on the tree, given that node j is in state s. Thus each one assumes something (s) that we don't know; in practice we compute these for all states s.

At the tips we can define these quantities directly (with the states ordered A, C, G, T): if the observed state is (say) C, the vector of L's is (0, 1, 0, 0). If we observe an ambiguity, say R (purine), they are (1, 0, 1, 0).

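The "pruning" algorithm:

[Figure: node l with its two descendant nodes j and k, reached by branches of length $v_j$ and $v_k$.]

$$L^{(i)}_l(s) = \left[\sum_{s_j} \mathrm{Prob}(s_j \mid s, v_j)\, L^{(i)}_j(s_j)\right] \left[\sum_{s_k} \mathrm{Prob}(s_k \mid s, v_k)\, L^{(i)}_k(s_k)\right]$$

(Felsenstein, 1973; 1981).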

and at the bottom of the tree:

$$L^{(i)}_0 = \sum_s \pi_s\, L^{(i)}_0(s)$$

(Felsenstein, 1973, 1981)

and having gotten the likelihoods for each site:

$$L = \prod_{i=1}^{\text{sites}} L^{(i)}_0$$
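The recursion can be sketched in a few lines. Everything specific here is an assumption made for illustration: the Jukes-Cantor model with branch lengths in expected substitutions per site, equal base frequencies, a nested-tuple tree representation, and hypothetical tip bases and branch lengths on a five-tip tree shaped like the one in the site-likelihood slide. The brute-force function performs the full sum over interior states so that the two answers can be compared.

```python
import math
from itertools import product

BASES = "ACGT"
AMBIGUITY = {"R": "AG", "Y": "CT", "N": "ACGT"}   # a few IUPAC codes

def jc_prob(a, b, v):
    """Jukes-Cantor Prob(b | a, v), v in expected substitutions per site."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def tip_vector(symbol):
    """Conditional likelihoods at a tip: 1 for each compatible base, else 0."""
    allowed = AMBIGUITY.get(symbol, symbol)
    return [1.0 if b in allowed else 0.0 for b in BASES]

def conditional(node):
    """L_node(s) for every state s. A node is a tip symbol (str) or a pair
    ((left_subtree, v_left), (right_subtree, v_right))."""
    if isinstance(node, str):
        return tip_vector(node)
    result = [1.0] * 4
    for child, v in node:
        child_l = conditional(child)
        for i, s in enumerate(BASES):
            result[i] *= sum(jc_prob(s, sc, v) * child_l[k]
                             for k, sc in enumerate(BASES))
    return result

def site_likelihood(root, pi=(0.25, 0.25, 0.25, 0.25)):
    """Weight the root's conditional likelihoods by the base frequencies."""
    return sum(p * l for p, l in zip(pi, conditional(root)))

def example_tree(s, t):
    """Root w -> (x, z); x -> tips 1, 2; z -> tip 3 and y; y -> tips 4, 5."""
    x = ((s[0], t[1]), (s[1], t[2]))
    y = ((s[3], t[4]), (s[4], t[5]))
    z = ((s[2], t[3]), (y, t[6]))
    return ((x, t[7]), (z, t[8]))

def brute_force(s, t):
    """Direct sum over the states of w, x, y, z (4**4 terms)."""
    total = 0.0
    for w, x, y, z in product(BASES, repeat=4):
        total += (0.25 * jc_prob(w, x, t[7]) * jc_prob(x, s[0], t[1])
                  * jc_prob(x, s[1], t[2]) * jc_prob(w, z, t[8])
                  * jc_prob(z, s[2], t[3]) * jc_prob(z, y, t[6])
                  * jc_prob(y, s[3], t[4]) * jc_prob(y, s[4], t[5]))
    return total

tips = "ACCCG"                                   # hypothetical observed bases
t = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.1, 5: 0.3, 6: 0.05, 7: 0.2, 8: 0.1}
print(site_likelihood(example_tree(tips, t)), brute_force(tips, t))
```

The brute-force sum grows as 4 to the power of the number of interior nodes, while the pruning recursion does a fixed amount of work per node, which is why it is the version used in practice.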

The tree is effectively unrooted

[Figure: "before" and "after" views of the region around nodes 6 and 8 in the tree, joined by a branch of length $t_6$, when a new root (node 0) is placed in that branch. The subtrees are shown as shaded triangles.]

It is possible to show that if the base substitution model is reversible (as most of them are), these two trees have exactly the same likelihood. So we are only inferring the unrooted tree.

Finding the best tree

- The pruning algorithm helps us calculate likelihood quickly.
- It turns out that for unrooted trees (and reversible models of change), the likelihood is the same no matter where you root the tree.
- If we root in the middle of a branch, we can prune down to both ends of the branch and then get the likelihood of the tree really quickly for any particular length of the branch.
- This means we can quickly maximize likelihood when varying one branch length in a given topology (holding all the other branch lengths fixed).
- If we go around the tree doing this, the branch lengths quickly reach their optima (for that topology).
- For searching among topologies, the problems are the usual ones, but the pruning enables local rearrangements to be evaluated more quickly.
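The one-branch-at-a-time step can be illustrated on the smallest possible case, a "tree" of two sequences joined by a single branch. The sketch assumes the Jukes-Cantor model with the branch length t in expected substitutions per site, hypothetical data (20 differing sites out of 100), and a golden-section search standing in for whatever one-dimensional optimizer a real program uses. For this case a closed-form maximum exists, which gives something to compare against.

```python
import math

def log_lik(t, n_sites, n_diff):
    """JC ln-likelihood of two aligned sequences separated by branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)    # pi * Prob(same base at both ends)
    p_diff = 0.25 * (0.25 - 0.25 * e)    # pi * Prob(one specific different base)
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def golden_section_max(f, lo, hi, tol=1e-9):
    """Maximise a unimodal function f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if f(c) > f(d):                  # maximum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                            # maximum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)

n_sites, n_diff = 100, 20                # hypothetical data
t_hat = golden_section_max(lambda t: log_lik(t, n_sites, n_diff), 1e-6, 5.0)
closed_form = -0.75 * math.log(1.0 - (4.0 / 3.0) * n_diff / n_sites)
print(t_hat, closed_form)
```

On a full tree the function being maximized would be the tree's log-likelihood as a function of one branch length, computed from the conditional likelihoods at the two ends of that branch, with the same search repeated branch by branch.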

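What does "tree space" (with branch lengths) look like?

An example: three species with a clock.

[Figure: a three-species tree (A, B, C) with times $t_1$ and $t_2$, and the ($t_1$, $t_2$) plane with regions labelled "OK", "not possible" and "trifurcation".]

When we consider all three possible topologies, the space looks like:

[Figure: the combined space for the three topologies, with axes $t_1$ and $t_2$.]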

For one tree topology

The space of trees varying all 2n − 3 branch lengths, each a nonnegative number, defines an "orthant" (open corner) of a (2n − 3)-dimensional real space:

[Figure: an unrooted tree with tips A to F and branch lengths $v_1$ to $v_9$, beside the corner of a room whose floor and walls stand for the faces of the orthant, with one axis labelled $v_9$.]

Through the looking-glass

Shrinking one of the n − 3 interior branches to 0, we arrive at a trifurcation:

[Figure: the tree with tips A to F and branch lengths $v_1$ to $v_9$; the tree with branch $v_9$ shrunk to 0; and the two other tree topologies that can be reached by growing a new interior branch $v_9$ from that point.]

Here, as we pass "through the looking glass", we also touch the space for two other tree topologies, and we could decide to enter either.

The graph of all trees of 5 species

The space of all these orthants, one for each topology, connecting ones that share faces (looking glasses):

[Figure: the 15 unrooted tree topologies for species A, B, C, D and E, with lines joining topologies that share a face.]

The Schoenberg graph (all 15 trees of size 5 connected by NNIs)

Models of DNA change

(1) Jukes-Cantor model (1969): every base changes to each other base at the same rate a. For any base X other than T,

$$P(X \mid T, t) = \frac{1}{4}\left(1 - e^{-4at}\right)$$

(2) Kimura 2-parameter model (1980): transitions occur at rate a and transversions at rate b (like the Jukes-Cantor model but allows for differences in rates of transitions and transversions).

(3) Felsenstein (1984) model (like the Kimura model but allows for inequality of base frequencies).

(4) Hasegawa, Kishino, and Yano (1985) model (like the Felsenstein model but differs in detail).
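For the first two models the transition probabilities have simple closed forms, sketched below. The parameterization is an assumption: a is the transition rate and b the rate of each of the two possible transversions in the Kimura model, and a is the rate of change to each other base in the Jukes-Cantor model. Function names are hypothetical.

```python
import math

BASES = "ACGT"
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for a change within purines (A, G) or within pyrimidines (C, T)."""
    return x != y and ((x in PURINES) == (y in PURINES))

def k2p_prob(x, y, t, a, b):
    """Kimura 2-parameter Prob(y | x, t)."""
    e1 = math.exp(-4.0 * b * t)
    e2 = math.exp(-2.0 * (a + b) * t)
    if x == y:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if is_transition(x, y):
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1              # each of the two transversions

def jc_prob(x, y, t, a):
    """Jukes-Cantor Prob(y | x, t), with rate a to each other base."""
    e = math.exp(-4.0 * a * t)
    return 0.25 + 0.75 * e if x == y else 0.25 * (1.0 - e)

# With a = b the Kimura model should collapse to Jukes-Cantor.
for y in BASES:
    print(y, k2p_prob("T", y, 0.5, 0.3, 0.3), jc_prob("T", y, 0.5, 0.3))
```

Setting a larger than b makes transitions accumulate faster than transversions while leaving the equilibrium base frequencies equal, which is the restriction the Felsenstein (1984) and Hasegawa-Kishino-Yano models relax.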

A data example: mitochondrial D-loop sequences

[Figure: aligned mitochondrial D-loop sequences for Bovine, Mouse, Gibbon, Orang, Gorilla, Chimp and Human.]

which gives the ML tree

[Figure: the maximum likelihood tree, with tips Chimp, Human, Orang, Gorilla, Gibbon, Mouse and Bovine. Branch lengths shown on the tree: 0.153, 0.304, 0.075, 0.172, 0.121, 0.049, 0.336, 0.106, 0.486, 0.792, 0.902.]

ln L = -1405.6083

Maximum likelihood tree for the Hasegawa 232-site mitochondrial D-loop data set, with Ts/Tn set to 2, analyzed with maximum likelihood (Dnaml)

A pioneer of protein evolution

[Photograph: Margaret Dayhoff, pioneer of protein databases, protein evolution models, and gene families, about 1966.]

Models with amino acids

[Figure: a square array with the 20 amino acids (A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V) labelling both the rows and the columns.]

- Dayhoff PAM model
- Jones-Taylor-Thornton model
- specific models for secondary structure contexts or membrane proteins
- models adapted from Henikoff BLOSUM scoring

Dayhoff's PAM001 matrix

          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W
        ala  arg  asn  asp  cys  gln  glu  gly  his  ile  leu  lys  met  phe  pro  ser  thr  trp
A ala  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0
R arg     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8
N asn     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1
D asp     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0
C cys     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0
Q gln     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0
E glu    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0
G gly    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0
H his     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1
I ile     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0
L leu     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4
K lys     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0
M met     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0
F phe     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3
P pro    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0
S ser    28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5
T thr    22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0
W trp     0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0   99
Y tyr     1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2
V val    13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0

(The Y and V columns are cut off at the right edge of the slide, and the trp/trp entry is truncated to "99".)

Codon models

(more later from Joe Bielawski)

(Muse & Gaut, MBE 1994; Goldman & Yang, MBE 1994)

[Figure: part of the genetic code laid out as a table of codons, labelled with the amino acids they encode (phe, leu, ser, ile, met, val) and with stop codons.]

Considerations for a protein model

Making a model for protein evolution (a not-very-practical approach):

- Use a good model of DNA evolution.
- Use the appropriate genetic code.
- When an amino acid changes, accept it with a probability that declines as the amino acids become more different.
- Fit this to empirical information on protein evolution.
- Take into account variation of rate from site to site.
- Take into account correlation of rates in adjacent sites.
- How about protein structure? Secondary structure? 3D structure?

(The first four steps are the codon model of Goldman and Yang, 1994 and Muse and Gaut, 1994, both in Molecular Biology and Evolution. The next two are the rate variation machinery of Yang, 1995, 1996 and Felsenstein and Churchill, 1996.)

Covarion models? (Fitch and Markowitz, 1970)

[Figure: sequences placed along the branches of a tree.]

Which sites are available for substitutions changes as one moves along the tree.

How to calculate likelihood with rate variation

Easy! Since branch lengths always come into transition probability formulas as the product r × t of rate and time, we can just multiply the lengths of the branches by the appropriate rate factor to calculate the likelihood for a site. (Branch lengths are usually scaled relative to a rate of 1.)
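A small sketch of that rescaling, for the simplest case of two sequences joined by one branch under the Jukes-Cantor model (branch length in expected substitutions per site at rate 1). The rate categories and their probabilities are illustrative values, and the averaging function assumes the rates at different sites are independent, which is a simpler model than the Hidden Markov chain of rates. All names are hypothetical.

```python
import math

def jc_prob(x, y, v):
    """Jukes-Cantor Prob(y | x, v), v in expected substitutions per site."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def pair_site_likelihood(x, y, t, rate=1.0):
    """Likelihood of bases x and y at one site of two sequences separated by
    branch length t, when that site evolves at the given relative rate."""
    return 0.25 * jc_prob(x, y, rate * t)    # the rate only rescales the branch

def averaged_site_likelihood(x, y, t, rates, probs):
    """Site likelihood averaged over rate categories."""
    return sum(p * pair_site_likelihood(x, y, t, r)
               for r, p in zip(rates, probs))

rates = [0.0, 1.0, 3.0]                      # illustrative rate categories
probs = [0.2, 0.4, 0.4]                      # and their probabilities
print(averaged_site_likelihood("A", "G", 0.1, rates, probs))
print(averaged_site_likelihood("A", "A", 0.1, rates, probs))
```

A category with rate 0 contributes nothing to a site where the two sequences differ, so invariant sites fall out of the same formula with no special handling.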

Rate variation among sites

[Figure: sites 1, 2, 3, ..., 8, ... along the sequence, each evolving on the same phylogeny. Below them runs a hidden Markov chain that assigns each site one of the rates of evolution 10.0, 2.0, 0.3, ...]

Array of likelihoods for possible rates

[Figure: the same layout, with one likelihood for each possible rate at each site.]

Hidden Markov Model of rate variation among sites

[Figure: the same layout, with the hidden Markov chain moving among the rates from site to site.]

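Hidden Markov Models sum up over all paths

$$\mathrm{Prob}(\text{Data} \mid \text{tree}) = \sum_{\text{paths}} \mathrm{Prob}(\text{Data} \mid \text{tree}, \text{path})\; \mathrm{Prob}(\text{path})$$

[Figure: the array of rates by sites, with one path and another path marked.]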

The rate combination contributing the most:

This can be found by a dynamic programming algorithm. We can leave behind pointers that allow us to backtrack.

(Of course, this one might account for only 0.001 of the likelihood.)

The Forwards Algorithm

The Forwards Algorithm, well known in the Hidden Markov Model literature, updates, from last to first site (yes, I know, that's the wrong direction!), the quantity

$$\mathrm{Prob}(D^{[i]} \mid T, r_j)$$

where $D^{[i]}$ here is the data from site i + 1 to the end of the sequence. This is just like the conditional likelihood on the tree, since it is conditioned on our knowing that the rate at site i is $r_j$. We don't know that, but we compute it for all $r_j$ and take a weighted average at the front of the array of rates. The logic is the same as when adding up likelihoods over a tree.

The Forwards Algorithm

If we can calculate the contribution to the likelihood from all paths passing through one rate at a particular site...

re-uses information by dynamic programming

... then we can use it to calculate the same things for the previous site

like this

This algorithm, the Forwards algorithm, was invented in communications applications of Hidden Markov models.

The pruning algorithm is just like the Forwards Algorithm

[Figure: on the tree, the conditional likelihoods for the different bases at an ancestor are computed from those at species 1 and species 2; along the sequence, the quantities for the different rates at site 20 are computed from those at site 21, which are computed from those at site 22.]

Forwards-Backwards algorithm (marginal probabilities)

Going backwards (using the Forwards Algorithm), leaving information behind, then forwards (using the Backwards Algorithm), we can calculate the total probability of the data over all paths that have a particular rate at site i.

By combining the two into the Forwards-Backwards algorithm, we can calculate the contribution of one rate at a given site to the overall likelihood.

The Gamma distribution, used for rates

[Figure: three Gamma density curves, frequency against rate from 0 to 2: α = 0.25 (cv = 2), α = 1 (cv = 1), and α = 11.1111 (cv = 0.3).]

Approximating the Gamma distribution

Integrating over all possible rates is hard. But we can approximate the Gamma distribution by the rates and probabilities in a Hidden Markov Model. Here are the rates and probabilities we might use to approximate a Gamma with a CV of 1/2:

State in HMM   Rate of change   Probability
1              0.234            0.056
2              0.552            0.299
3              0.995            0.400
4              1.576            0.200
5              2.311            0.042
6              3.230            0.0036
7              4.378            0.00011
8              5.849            0.000001
9              7.875            0.0000001
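A quick check that these nine rates and probabilities behave as intended: their mean should be close to 1 and their coefficient of variation close to 1/2. The sketch renormalizes the probabilities because the rounded values do not sum to exactly 1; the shape parameter is recovered from the standard Gamma relation CV = 1/sqrt(alpha).

```python
import math

rates = [0.234, 0.552, 0.995, 1.576, 2.311, 3.230, 4.378, 5.849, 7.875]
probs = [0.056, 0.299, 0.400, 0.200, 0.042, 0.0036, 0.00011, 0.000001, 0.0000001]

total = sum(probs)                        # close to 1; renormalise for rounding
weights = [p / total for p in probs]
mean = sum(w * r for w, r in zip(weights, rates))
var = sum(w * (r - mean) ** 2 for w, r in zip(weights, rates))
cv = math.sqrt(var) / mean
alpha = 1.0 / cv ** 2                     # Gamma shape implied by the CV
print(f"total {total:.4f}, mean {mean:.3f}, CV {cv:.3f}, implied alpha {alpha:.2f}")
```

A mean rate of 1 matters because branch lengths are scaled relative to a rate of 1, so the discrete categories change the spread of rates without changing the overall scale of the tree.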

A simple Hidden Markov Model

Assume that rate i has probability $p_i$. Start with one drawn from that distribution. At each site:

- with probability 1 − λ, keep the rate the same;
- with probability λ, choose a new one from this distribution.
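This model, together with the last-to-first recursion over sites, is enough to sum the likelihood over every path of rates. The sketch below assumes the per-site, per-rate likelihoods have already been computed on the tree; the numbers used for them, the rate probabilities and λ = 1/3 are all hypothetical, as are the function names. A brute-force enumeration of every path is included only to confirm that the recursion gives the same total.

```python
from itertools import product

def transition_matrix(probs, lam):
    """trans[j][k] = Prob(rate k at the next site | rate j at this site): keep
    the rate with probability 1 - lam, otherwise redraw it from probs."""
    n = len(probs)
    return [[(1.0 - lam) * (j == k) + lam * probs[k] for k in range(n)]
            for j in range(n)]

def hmm_likelihood(site_lik, probs, lam):
    """Sum over all rate paths. site_lik[i][j] = Prob(data at site i | tree,
    rate j). The recursion runs from the last site back to the first."""
    trans = transition_matrix(probs, lam)
    n = len(probs)
    cond = list(site_lik[-1])     # data from this site onward, given rate j here
    for row in reversed(site_lik[:-1]):
        cond = [row[j] * sum(trans[j][k] * cond[k] for k in range(n))
                for j in range(n)]
    return sum(p * c for p, c in zip(probs, cond))

def brute_force(site_lik, probs, lam):
    """Same quantity by explicit enumeration of every path (exponential cost)."""
    trans = transition_matrix(probs, lam)
    total = 0.0
    for path in product(range(len(probs)), repeat=len(site_lik)):
        term = probs[path[0]] * site_lik[0][path[0]]
        for i in range(1, len(path)):
            term *= trans[path[i - 1]][path[i]] * site_lik[i][path[i]]
        total += term
    return total

probs = [0.2, 0.4, 0.4]                       # hypothetical rate probabilities
site_lik = [[0.020, 0.010, 0.004],            # hypothetical likelihoods, one row
            [0.001, 0.006, 0.009],            # per site, one column per rate
            [0.015, 0.012, 0.003],
            [0.002, 0.008, 0.010]]
print(hmm_likelihood(site_lik, probs, 1 / 3), brute_force(site_lik, probs, 1 / 3))
```

The recursion costs on the order of (number of sites) × (number of rates) squared, whereas the enumeration grows as (number of rates) to the power of the number of sites.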

A numerical example. Cytochrome B

We analyze 31 cytochrome B sequences, aligned by Naoko Takezaki, using the Proml protein maximum likelihood program. Assume a Hidden Markov Model with 3 states, with the rates below, and expected block length 3.

category   rate   probability
1          0.0    0.2
2          1.0    0.4
3          3.0    0.4

We get a reasonable, but not perfect, tree, with the best rate combination inferred to be as shown with the alignment.

Phylogeny for Takezaki cytochrome B

[Figure: the inferred tree, with tips whalebm, whalebp, borang, sorang, hseal, gseal, gibbon, gorilla2, bovine, gorilla1, cchimp, cat, rhinocer, pchimp, platypus, dhorse, horse, african, caucasian, wallaroo, rat, opossum, mouse, chicken, seaurchin2, xenopus, loach, carp, seaurchin1, trout, lamprey.]

Rates inferred from ytochrome B 1333333311 3222322313 3321113222 2133111111 1331133123 1122111 african M-----TPMRK INPLMKLINH SFIDLPTPSN ISWWNFSL LLILQIT TLFL caucasian......r........t...... cchimp...t................. pchimp...t..........t......... gorilla1... T........T......... gorilla2... T........T......... borang... T....L.........I.TI... sorang...st.. T....L.........I...... gibbon...l.. T....L.....M......I... bovine...ni.. SH...IV.N.....S.....I...L... whalebm...ni.. TH...I..D.....S.....L...V..L... whalebp...ni.. TH...IV.D.V.....S.....L...M..L... dhorse...ni.. SH..I.I........S.....I...L... horse...ni.. SH..I.I........S.....I...L... rhinocer...ni.. SH..V.I........S.....I...L... cat...ni.. SH..I.I...........V..T...L... gseal...ni.. TH...I..N........I...L... hseal...ni.. TH...I..N........I...L... mouse...n... TH..F.I........S.....V..MV..I... rat...ni.. SH..F.I........S.....V..MV..L... platypus...nnl.. TH..I.IV.......S.....L...I..L... wallaroo...nl.. SH..I.IV...........I..L... opossum...ni.. TH...I..D........V...I..L... chicken...pni.. SH..L.M..N.L.......V..MT..L...L... xenopus...pni.. SH..I.I..N.....SL.....V...I... carp...-sl.. TH..I.I.D LV........L...T..L... loach...-sl.. TH..I.I.D LV.....V.....L...T..L... trout...-nl.. TH..L.I.D LV.....V.....L..T..L... lamprey.shqpsii.. TH..LS..S MLV...S......SL...I...I... seaurchin1 -...L.L.. EH.IFRIL.S T.V...L... L.I.....L...T..L... seaurchin2 -...L.. EH.IFRIL.S T.V...L... L.M.....L...I.LI Likelihoods and Phylogenies..I... p.63/68

Rates inferred from Cytochrome B

            2223311112 2222222222 2222232112 2222222223 1222221112 3333111
african     PDSTFSSI HITRDVNY WIIRYLHN SMFFILFL HIRLYYS FLYSETW
caucasian   ..................
cchimp      ..................l...
pchimp      ............l....v......l...
gorilla1    .......t...........hq...
gorilla2    .......t...........hq...
borang      ...t.......m..h......l.......thl...
sorang      .......m..h..........thl...
gibbon      ...v...............l...
bovine      S.TT...V T......M......YM.V... YTFL...
whalebm     ..tm...v T....V......Y.M... HFR...
whalebp     ..tt...v T.........Y.M... YFR...
dhorse      S.TT...V T.........I.V... YTFL...
horse       S.TT...V T.........I.V... YTFL...
rhinocer    ..tt...v T....M......I.V... YTFL...
cat         S.TM...V T.........YM.V...M... YTF...
gseal       S.TT...V T.........YM.V... YTFT...
hseal       S.TT...V T.........YM.V... YTFT...
mouse       S.TM...V T....L...M.......V... YTFM...
rat         S.TM...V T....L...Q.......V... YTFL...
platypus    S.T...V....L...M.....L..M.I..... YTQT...
wallaroo    S.TL...V....L..N......M....V...I... Y..K...
opossum     S.TL...V....L..NI......M....V...I... Y..K...
chicken     .t.l...v..t.n.q...l..n.....f...i..... Y..K...
xenopus     .t.m...v...f... LL..N... L.F...IY.......K...
carp        S.I...V T....L..NV.....F...IYM... Y..K...
loach       S.I...V....L..NI.....F...Y.... Y..K...
trout       S.I...V...S...L..NI.....F...IYM... Y..K...
lamprey     NTEL...V M...N..LM.N......IY...I... Y..K...
seaurchin1  .i.l... S....LL.NV.....L...MY... SNKI...
seaurchin2  .inl...v S....LL.NV.....L...MY...L TNKI...
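The row of digits above each alignment block is a single best assignment of rate categories to sites. One standard way to find such an assignment in an HMM is the Viterbi algorithm. The sketch below assumes the per-site likelihoods given each category have already been computed on the tree. The function name `viterbi`, the transition rule (the same 1 - 1/L assumption for block length 3), and all site likelihoods are hypothetical illustrations, not Proml output.

```python
import math


def viterbi(site_liks, probs, trans):
    """Most probable sequence of hidden rate categories (reported as 1..n).

    site_liks[s][k]: likelihood of the data at site s given category k
    probs[k]:        probability of category k at the first site
    trans[i][j]:     Prob(category j at next site | category i at this site)
    """
    def log(x):
        return math.log(x) if x > 0.0 else float("-inf")

    n = len(probs)
    score = [log(probs[k]) + log(site_liks[0][k]) for k in range(n)]
    back = []  # back[s-1][j] = best predecessor of category j at site s
    for liks in site_liks[1:]:
        prev, score, pointers = score, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log(trans[i][j]))
            pointers.append(best)
            score.append(prev[best] + log(trans[best][j]) + log(liks[j]))
        back.append(pointers)
    # Trace back from the best final category.
    path = [max(range(n), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [k + 1 for k in path]


probs = [0.2, 0.4, 0.4]            # category probabilities from the slide
lam = 1.0 - 1.0 / 3.0              # assumed from expected block length 3
trans = [[(lam if i == j else 0.0) + (1.0 - lam) * probs[j] for j in range(3)]
         for i in range(3)]
# Hypothetical per-site likelihoods; a variable site has likelihood 0
# under the rate-0.0 category, so sites 3 and 4 cannot be category 1.
site_liks = [[0.9, 0.3, 0.05], [0.8, 0.4, 0.1], [0.0, 0.2, 0.6],
             [0.0, 0.1, 0.7], [0.5, 0.4, 0.1]]
path = viterbi(site_liks, probs, trans)
print(path)
```

For these made-up numbers the path is 1, 1, 3, 3, 3: the last site stays in category 3 even though its own likelihood favors category 1, because switching categories is penalized by the transition probabilities.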

PhyloHMMs: used in the UCSC Genome Browser

The conservation scores calculated in the UCSC Genome Browser use PhyloHMMs, which are just these HMM methods.
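As a rough illustration of how a conservation score can come out of such an HMM, the sketch below computes, for each site, the posterior probability of a "conserved" hidden state using the forward-backward algorithm. This is our reading of the idea rather than the Genome Browser's implementation. The two-state setup, the function name `posterior_state_probs`, and every number are hypothetical, and the per-site likelihoods are assumed to have been computed on the tree under each state.

```python
def posterior_state_probs(site_liks, start, trans):
    """Forward-backward: Prob(state k at site s | all data), for every site."""
    n_sites, n = len(site_liks), len(start)

    # Forward pass, rescaled at every site to avoid numerical underflow.
    fwd = []
    for s in range(n_sites):
        if s == 0:
            raw = [start[k] * site_liks[0][k] for k in range(n)]
        else:
            raw = [
                sum(fwd[-1][i] * trans[i][j] for i in range(n)) * site_liks[s][j]
                for j in range(n)
            ]
        total = sum(raw)
        fwd.append([x / total for x in raw])

    # Backward pass, also rescaled.
    bwd = [[1.0] * n for _ in range(n_sites)]
    for s in range(n_sites - 2, -1, -1):
        raw = [
            sum(trans[i][j] * site_liks[s + 1][j] * bwd[s + 1][j] for j in range(n))
            for i in range(n)
        ]
        total = sum(raw)
        bwd[s] = [x / total for x in raw]

    # Combine the two passes and normalize at each site.
    post = []
    for f, b in zip(fwd, bwd):
        raw = [fi * bi for fi, bi in zip(f, b)]
        total = sum(raw)
        post.append([x / total for x in raw])
    return post


# Hypothetical two-state model: state 0 = conserved, state 1 = non-conserved.
start = [0.3, 0.7]
trans = [[0.90, 0.10], [0.05, 0.95]]
# Hypothetical likelihoods of six alignment columns under each state.
site_liks = [[0.05, 0.02], [0.06, 0.01], [0.05, 0.01],
             [0.001, 0.02], [0.002, 0.03], [0.001, 0.02]]

post = posterior_state_probs(site_liks, start, trans)
for s, p in enumerate(post, start=1):
    print("site %d: P(conserved) = %.3f" % (s, p[0]))
```

Unlike the single best path, these posterior probabilities sum over all possible state sequences, so each site receives a graded score between 0 and 1.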

References

Edwards, A. W. F. and L. L. Cavalli-Sforza. 1964. Reconstruction of evolutionary trees. pp. 67-76 in Phenetic and Phylogenetic Classification, ed. V. H. Heywood and J. McNeill. Systematics Association Publication No. 6. Systematics Association, London. [Parsimony and likelihood for phylogenies from gene frequencies]
Neyman, J. 1971. Molecular studies of evolution: a source of novel statistical problems. pp. 1-27 in Statistical Decision Theory and Related Topics, ed. S. S. Gupta and J. Yackel. Academic Press, New York. [First paper on likelihood for molecular sequences. Neyman was a famous statistician.]
Jukes, T. H. and C. Cantor. 1969. Evolution of protein molecules. pp. 21-132 in Mammalian Protein Metabolism, ed. H. N. Munro. Academic Press, New York. [The Jukes-Cantor model, in one formula and a couple of sentences]
Hasegawa, M., H. Kishino, and T. Yano. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution 22: 160-174. [HKY model]
Kimura, M. 1980. A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution 16: 111-120. [K2P model]

(more references)

Felsenstein, J. 1973. Maximum-likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters. Systematic Zoology 22: 240-249. [The pruning algorithm, parsimony not same as likelihood]
Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution 17: 368-376. [Making likelihood usable for molecular sequences]
Churchill, G. A. 1989. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology 51: 79-94. [First paper to use HMMs in molecular biology]
Yang, Z. 1993. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution 10: 1396-1401. [Use of gamma distribution of rate variation in ML phylogenies]
Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. Journal of Molecular Evolution 39: 306-314. [Approximating gamma distribution in ML phylogenies by an HMM]

(and more references)

Yang, Z. 1995. A space-time process model for the evolution of DNA sequences. Genetics 139: 993-1005. [Allowing for autocorrelated rates along the molecule using an HMM for ML phylogenies]
Felsenstein, J. and G. A. Churchill. 1996. A Hidden Markov Model approach to variation among sites in rate of evolution. Molecular Biology and Evolution 13: 93-104. [HMM approach to evolutionary rate variation]
Siepel, A. and D. Haussler. 2004. Combining phylogenetic and hidden Markov models in biosequence analysis. Journal of Computational Biology 11: 413-428. [Using PhyloHMMs to infer conserved sequences in comparative genomics]
Thorne, J. L., N. Goldman, and D. T. Jones. 1996. Combining protein evolution and secondary structure. Molecular Biology and Evolution 13: 666-673. [HMM for secondary structure of proteins, with phylogenies]
Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Massachusetts. [Book you and all your friends must rush out and buy]
Semple, C. and M. Steel. 2003. Phylogenetics. Oxford University Press, Oxford. [Introduction for mathematicians]
Yang, Z. 2006. Computational Molecular Evolution. Oxford University Press, Oxford. [Well-thought-out, concentrates on likelihood and Bayesian methods for sequences]