Week 5: Distance methods, DNA and protein models

Similar documents
Lecture 4. Models of DNA and protein change. Likelihood methods

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Phylogeny: traditional and Bayesian approaches

Phylogenetic trees 07/10/13

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

EVOLUTIONARY DISTANCES

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Consistency Index (CI)

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Phylogenetic Tree Reconstruction

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

AN ALTERNATING LEAST SQUARES APPROACH TO INFERRING PHYLOGENIES FROM PAIRWISE DISTANCES

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetic inference

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut


Letter to the Editor. Department of Biology, Arizona State University

Constructing Evolutionary/Phylogenetic Trees

Evolutionary Tree Analysis. Overview

Lecture 4. Models of DNA and protein change. Likelihood methods

Mutation models I: basic nucleotide sequence mutation models

Dr. Amira A. AL-Hosary

Week 7: Bayesian inference, Testing trees, Bootstraps

Molecular Evolution and Phylogenetic Tree Reconstruction

Lab 9: Maximum Likelihood and Modeltest

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Math 239: Discrete Mathematics for the Life Sciences Spring Lecture 14 March 11. Scribe/ Editor: Maria Angelica Cueto/ C.E.

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Inferring phylogeny. Today s topics. Milestones of molecular evolution studies Contributions to molecular evolution

Phylogenetics. Andreas Bernauer, March 28, Expected number of substitutions using matrix algebra 2

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Theory of Evolution Charles Darwin

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

CSCI1950 Z Computa4onal Methods for Biology Lecture 5

Building Phylogenetic Trees UPGMA & NJ

Reading for Lecture 13 Release v10

BMI/CS 776 Lecture 4. Colin Dewey

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

BINF6201/8201. Molecular phylogenetic methods

Phylogene)cs. IMBB 2016 BecA- ILRI Hub, Nairobi May 9 20, Joyce Nzioki

Constructing Evolutionary/Phylogenetic Trees

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Markov Chains. Sarah Filippi Department of Statistics TA: Luke Kelly

Algorithms in Bioinformatics

Phylogeny Estimation and Hypothesis Testing using Maximum Likelihood

Minimum evolution using ordinary least-squares is less robust than neighbor-joining

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

X X (2) X Pr(X = x θ) (3)

Phylogenetics. BIOL 7711 Computational Bioscience

CSCI1950 Z Computa4onal Methods for Biology Lecture 4. Ben Raphael February 2, hhp://cs.brown.edu/courses/csci1950 z/ Algorithm Summary

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Reconstruire le passé biologique modèles, méthodes, performances, limites

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

Evolutionary Trees. Evolutionary tree. To describe the evolutionary relationship among species A 3 A 2 A 4. R.C.T. Lee and Chin Lung Lu

CS5263 Bioinformatics. Guest Lecture Part II Phylogenetics

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Bioinformatics 1 -- lecture 9. Phylogenetic trees Distance-based tree building Parsimony

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

The least-squares approach to phylogenetics was first suggested

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Phylogeny: building the tree of life

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Lecture Notes: Markov chains

Molecular Evolution, course # Final Exam, May 3, 2006

Inferring Phylogenetic Trees. Distance Approaches. Representing distances. in rooted and unrooted trees. The distance approach to phylogenies

CS5238 Combinatorial methods in bioinformatics 2003/2004 Semester 1. Lecture 8: Phylogenetic Tree Reconstruction: Distance Based - October 10, 2003

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

A (short) introduction to phylogenetics

Weighted Neighbor Joining: A Likelihood-Based Approach to Distance-Based Phylogeny Reconstruction

C3020 Molecular Evolution. Exercises #3: Phylogenetics

Michael Yaffe Lecture #4 (((A,B)C)D) Database Searching & Molecular Phylogenetics A B C D B C D

Phylogenetic inference: from sequences to trees

Understanding relationship between homologous sequences

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Thanks to Paul Lewis and Joe Felsenstein for the use of slides

Theory of Evolution. Charles Darwin

Evolutionary trees. Describe the relationship between objects, e.g. species or genes

Workshop III: Evolutionary Genomics

The Generalized Neighbor Joining method

Ensembles and incomplete information

Multiple Sequence Alignment. Sequences

Module 13: Molecular Phylogenetics

Theoretical Basis of Likelihood Methods in Molecular Phylogenetic Inference

7. Tests for selection

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

2t t dt.. So the distance is (t2 +6) 3/2

Phylogeny. November 7, 2017

Transcription:

Week 5: Distance methods, DNA and protein models Genome 570 February, 2016 Week 5: Distance methods, DNA and protein models p.1/69

A tree and the expected distances it predicts E A 0.08 0.05 0.06 0.03 0.07 0.05 C B 0.10 D A B C D E A B C D E 0 0.23 0.16 0.20 0.17 0.23 0 0.23 0.17 0.24 0.16 0.23 0 0.15 0.11 0.20 0.17 0.15 0 0.21 0.17 0.24 0.11 0.21 0 The predicted distances are the sums of branch lengths between those two species. Week 5: Distance methods, DNA and protein models p.2/69

A tree and a set of two-species trees E A 0.08 0.05 0.06 0.03 0.07 0.05 C B 0.10 D E A C D B The two-species trees correspond to the pairwise distances observed between the pairs of species. The tree also predicts two-species trees, because clipping off all other species you are left simply with the path between that pair of species. Week 5: Distance methods, DNA and protein models p.3/69

A tree with branch lengths A v 1 v 2 B E v 5 v 6 v 7 v 3 v 4 C D Distance matrix methods always infer trees that have branch lengths, and they assume models of change of the characters (sequences). Week 5: Distance methods, DNA and protein models p.4/69

Distance matrix methods observed distances calculated from the sequence data A B C D E F A B C D E F 0 10 9 12 16 9 10 0 10 6 9 9 9 10 0 10 15 2 12 6 10 0 613 16 9 15 6 015 9 9 2 13 15 0 Find the tree which comes closest to predicting the observed pairwise distances compare A C Each possible tree (with branch lengths) predicts pairwise distances 5 0.5 2.5 1.5 F D B 4 2 3 2 3 E A B C D E F A B C D E F 0 11 8 14 15 9 11 0 9 7 810 8 14 15 9 9 0 13 14 2 7 13 0 513 8 14 5 014 10 2 13 14 0 Week 5: Distance methods, DNA and protein models p.5/69

The math of least squares trees Q = n w ij (D ij d ij ) 2 i=1 j i d ij = k x ij,k v k Q = dq dv k = 2 ( n w ij D ij j i k i=1 ( n w ij x ij,k D ij j i k i=1 x ij,k v k ) 2. x ij,k v k ) = 0. Equations to solve to infer branch lengths by least squares on a given tree topology, which specifies the x ij,k. Week 5: Distance methods, DNA and protein models p.6/69

Solving for least squares branch lengths D AB + D AC + D AD + D AE = 4v 1 + v 2 + v 3 + v 4 + v 5 + 2v 6 + 2v 7 D AB + D BC + D BD + D BE = v 1 + 4v 2 + v 3 + v 4 + v 5 + 2v 6 + 3v 7 D AC + D BC + D CD + D CE = v 1 + v 2 + 4v 3 + v 4 + v 5 + 3v 6 + 2v 7 D AD + D BD + D CD + D DE = v 1 + v 2 + v 3 + 5v 4 + v 5 + 2v 6 + 3v 7 D AE + D BE + D CE + D DE = v 1 + v 2 + v 3 + v 4 + 4v 5 + 3v 6 + 2v 7 D AC + D AE + D BC +D BE + D CD + D DE = 2v 1 + 2v 2 + 3v 3 + 2v 4 + v 5 + 6v 6 + 4v 7 D AB + D AD + D BC +D BE + D CD + D DE = 2v 1 + 3v 2 + 3v 4 + 2v 5 + 4v 6 + 6v 7 Week 5: Distance methods, DNA and protein models p.7/69

A vector of all distances, stacked d = D AB D AC D AD D AE D BC D BD D BE D CD D CE D DE They are in lexicographic (dictionary) order. Week 5: Distance methods, DNA and protein models p.8/69

The design matrix and least squares equations X = So solution of equations is 1 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 1 1 1 X T D = ( X T X ) v. v = (X T X) 1 X T D Week 5: Distance methods, DNA and protein models p.9/69

A diagonal matrix of weights W = w AB 0 0 0 0 0 0 0 0 0 0 w AC 0 0 0 0 0 0 0 0 0 0 w AD 0 0 0 0 0 0 0 0 0 0 w AE 0 0 0 0 0 0 0 0 0 0 w BC 0 0 0 0 0 0 0 0 0 0 w BD 0 0 0 0 0 0 0 0 0 0 w BE 0 0 0 0 0 0 0 0 0 0 w CD 0 0 0 0 0 0 0 0 0 0 w CE 0 0 0 0 0 0 0 0 0 0 w DE, Week 5: Distance methods, DNA and protein models p.10/69

Weighted least squares equations X T WD = ( X T WX ) v, v = ( X T WX ) 1 X T WD. These matrix equations solve for weighted least squares estimates of the branch lengths on a given tree. The tree topology is specified by the design matrix X and the weights are the elements of the diagonal matrix W. Week 5: Distance methods, DNA and protein models p.11/69

A statistical justification for least squares SSQ = n i=1 j i (D ij E (D ij )) 2. Var(D ij ) This least squares method... is what we would get by standard statistical least squares approachesif the distances were normally distributed, independently varying, and had expectation and variance as shown... but they actually aren t independent in almost all cases (such as molecular sequences), but...... it can be shown that the estimate of the tree will be a consistent estimate in the case of non-independence, just not as efficient Week 5: Distance methods, DNA and protein models p.12/69

The Jukes-Cantor model A u/3 G u/3 u/3 u/3 u/3 C u/3 T The simplest and most symmetrical of models of DNA evolution. In a small interval of time dt the probability of change at a site is u dt, and it is equally likely to go to each of the other three bases. Week 5: Distance methods, DNA and protein models p.13/69

A simple derivation of the probabilities of net change Imagine a (fictional) kind of event that could change the base to one of the four possible bases (including the same base) with equal probability (instead of changing to one of the other three). Week 5: Distance methods, DNA and protein models p.14/69

A simple derivation of the probabilities of net change Imagine a (fictional) kind of event that could change the base to one of the four possible bases (including the same base) with equal probability (instead of changing to one of the other three). If you set the probability of change in that model to 4 3u dt, it would be indistinguishable from the actual Jukes-Cantor model. Week 5: Distance methods, DNA and protein models p.14/69

A simple derivation of the probabilities of net change Imagine a (fictional) kind of event that could change the base to one of the four possible bases (including the same base) with equal probability (instead of changing to one of the other three). If you set the probability of change in that model to 4 3u dt, it would be indistinguishable from the actual Jukes-Cantor model. If any nonzero number of these fictional events occur on a branch, the probability of ending up with (say) base C is 1 4. Week 5: Distance methods, DNA and protein models p.14/69

A simple derivation of the probabilities of net change Imagine a (fictional) kind of event that could change the base to one of the four possible bases (including the same base) with equal probability (instead of changing to one of the other three). If you set the probability of change in that model to 4 3u dt, it would be indistinguishable from the actual Jukes-Cantor model. If any nonzero number of these fictional events occur on a branch, the probability of ending up with (say) base C is 1 4. The number of these events that occur in a branch has Poisson distribution with expected number 4 3u t, so the probability of no event is e 4 3 ut Week 5: Distance methods, DNA and protein models p.14/69

The Jukes-Cantor model Probability of no event: Probability of some event: e 4 3 ut 1 e 4 3 ut Probability of changing to C given start at A, have rate u, time t: Prob (C A, u, t) = 1 (1 e 4ut) 3 4 fraction of sites different: f D = 3 4 ) (1 e 4 3 ut. Solving, the distance as function of the fraction of sites different is D = ût = 3 4 ln (1 4 3 f D ) Week 5: Distance methods, DNA and protein models p.15/69

differences per site Fraction of sites different versus branch length 1 0.75 0.49 4 3 4 (1 e 3 r t ) 0 0 0.7945 branch length... as predicted by the Jukes-Cantor model. Week 5: Distance methods, DNA and protein models p.16/69

Branch length versus fraction of sites different 3 4 4 ( (diff)) ln 1 3 branch length 0.7945 0 0 0.49 0.75 differences per site 1 Week 5: Distance methods, DNA and protein models p.17/69

If you don t correct for multiple changes B 0.00 0.20 0.20 B 0.0206 0.155 0.155 A The true (unrooted) tree C A What we estimate C This happening because the long path between A and C is shortened more, proportionally, than the shorter paths between A and B and between C and B. The only way to do this is to put in a spurious branch leading to B. In effect, there is a war between the long paths and the short paths. With properly corrected distances, each path approaches its true length when we look at a very large number of sites. Week 5: Distance methods, DNA and protein models p.18/69

Least squares methods These are just differently weighted least squares methods, as mentioned previously. Fitch and Margoliash (1967) used a weight of 1/D 2 ij, so the quantity to be minimized is (D ij d ij ) 2 Q = ij D 2 ij Cavalli-Sforza and Edwards (1967) used the unweighted least squares method: Q = (D ij d ij ) 2 ij These amount to different assumptions about the how the size of a distance will affect its variance. Week 5: Distance methods, DNA and protein models p.19/69

The Minimum Evolution method Kidd and Sgaramella-Zonta (1971) and (independently) Rzhetsky and Nei (1992ff.) came up with this method: Search through tree space as usual For each tree estimate branch lengths by least squares, not allowing negative branch lengths Then actually evaluate the treenot by the sum of squares, but by the total sum of branch lengths. This does fairly well, in spite of the mixture of two optimization criteria. Note that it isnot directly related to parsimony, in spite of its name. Week 5: Distance methods, DNA and protein models p.20/69

The UPGMA algorithm (ij) The UPGMA algorithm 1. Assign each species weight w i 2. Pick the values of i and j that are for the smallest of the D ij 3. Make a new group (ij) = 1 5. Its weight is w + w i j 6. The distance between (ij) and another (say k) is computed as w D + w D i ik j jk D = (ij),k j i D ik D (ij),k 4. Assign time depth D ij / 2 to the connection between i and j. D jk k w + w i j 7. Delete i and j from the table, put in a row and column for (ij). 8. If there is only one group left, stop. Otherwise go to step 2. Note that the weighting in effect weights each of the original tips equally, so that a group s distance to an outside species or another group is the average of all the distances between species in one and species in the other. This is a natural unweighted choice. Week 5: Distance methods, DNA and protein models p.21/69

Sarich 1969, immunological distances dog bear raccoon weasel seal sea lion cat monkey dog 0 32 48 51 50 48 98 148 bear 32 0 26 34 29 33 84 136 raccoon 48 26 0 42 44 44 92 152 weasel 51 34 42 0 44 38 86 142 seal 50 29 44 44 0 24 89 142 sea lion 48 33 44 38 24 0 90 142 cat 98 84 92 86 89 90 0 148 monkey 148 136 152 142 142 142 148 0 Week 5: Distance methods, DNA and protein models p.22/69

Find smallest element, its rows, columns dog bear raccoon weasel seal sea lion cat monkey dog 0 32 48 51 50 48 98 148 bear 32 0 26 34 29 33 84 136 raccoon 48 26 0 42 44 44 92 152 weasel 51 34 42 0 44 38 86 142 seal 50 29 44 44 0 24 89 142 sea lion 48 33 44 38 24 0 90 142 cat 98 84 92 86 89 90 0 148 monkey 148 136 152 142 142 142 148 0 Week 5: Distance methods, DNA and protein models p.23/69

We do the following averaging dog bear raccoon weasel seal sea lion cat monkey dog 0 32 48 51 (50 + 48)/2 98 148 bear 32 0 26 34 (29 + 33)/2 84 136 raccoon 48 26 0 42 (44 + 44)/2 92 152 weasel 51 34 42 0 (44 + 38)/2 86 142 seal 49 31 44 41 0 24 89.5 142 sea lion 24 0 cat 98 84 92 86 (89 + 90)/2 0 148 monkey 148 136 152 142 (142 + 142)/2 148 0 Week 5: Distance methods, DNA and protein models p.24/69

Clustering seal and sea lion dog bear raccoon seal sea lion weasel cat monkey 12 12 Week 5: Distance methods, DNA and protein models p.25/69

After clustering seal and sea lion dog bear raccoon weasel SS cat monkey dog 0 32 48 51 49 98 148 bear 32 0 26 34 31 84 136 raccoon 48 26 0 42 44 92 152 weasel 51 34 42 0 41 86 142 SS 49 31 44 41 0 89.5 142 cat 98 84 92 86 89.5 0 148 monkey 148 136 152 142 142 148 0 Week 5: Distance methods, DNA and protein models p.26/69

Clustering bear and racoon dog bear raccoon seal sea lion weasel cat monkey 13 13 12 12 Week 5: Distance methods, DNA and protein models p.27/69

After clustering bear and raccoon dog BR weasel SS cat monkey dog 0 40 51 49 98 148 BR 40 0 38 37.5 88 144 weasel 51 38 0 41 86 142 SS 49 37.5 41 0 89.5 142 cat 98 88 86 89.5 0 148 monkey 148 144 142 142 148 0 Week 5: Distance methods, DNA and protein models p.28/69

Clustering bear-raccoon with seal-sealion dog bear raccoon seal sea lion weasel cat monkey 13 13 12 12 5.75 6.75 Week 5: Distance methods, DNA and protein models p.29/69

After clustering those two clusters dog BRSS weasel cat monkey dog 0 44.5 51 98 148 BRSS 44.5 0 39.5 88.75 143 weasel 51 39.5 0 86 142 cat 98 88.75 86 0 148 monkey 148 143 142 148 0 Week 5: Distance methods, DNA and protein models p.30/69

Clustering weasel with BRSS dog bear raccoon seal sea lion weasel cat monkey 13 13 12 12 5.75 6.75 19.75 1 Week 5: Distance methods, DNA and protein models p.31/69

After adding weasel to that cluster dog BRSSW cat monkey dog 0 45.8 98 148 BRSSW 45.8 0 88.2 142.8 cat 98 88.2 0 148 monkey 148 142.8 148 0 Week 5: Distance methods, DNA and protein models p.32/69

Clustering the dog with BRSSW dog bear raccoon seal sea lion weasel cat monkey 22.9 13 13 12 12 5.75 6.75 19.75 1 3.15 Week 5: Distance methods, DNA and protein models p.33/69

After adding dog to it DBRWSS cat monkey DBRWSS 0 89.833 143.66 cat 89.833 0 148 monkey 143.66 148 0 Week 5: Distance methods, DNA and protein models p.34/69

Clustering all the carnivores and pinnipeds dog bear raccoon seal sea lion weasel cat monkey 22.9 13 13 12 12 5.75 6.75 19.75 1 3.15 22.0166 44.9166 Week 5: Distance methods, DNA and protein models p.35/69

Finally, just monkey remaining DBRWSSC monkey DBRWSSC 0 144.2857 monkey 144.2857 0 Week 5: Distance methods, DNA and protein models p.36/69

The UPGMA tree dog bear raccoon seal sea lion weasel cat monkey 22.9 13 13 12 12 5.75 6.75 19.75 1 3.15 22.0166 44.9166 27.22619 72.1428 Week 5: Distance methods, DNA and protein models p.37/69

UPGMA can mislead 13 A True tree B C 4 4 2 2 D 10 True distance matrix A B C D A B C D 0 17 21 27 17 21 27 0 12 18 12 18 0 14 14 0 UPGMA tree B C D A 6 6 8 10.833 2 2.833 Week 5: Distance methods, DNA and protein models p.38/69

Neighbor-Joining For each tip, compute u i = n j:j i D ij/(n 2). Note that the denominator is (deliberately) not the number of items summed. Choose the i and j for which D ij u i u j is smallest. Join items i and j. Compute the branch length from i to the new node (v i ) and from j to the new node (v j ) as v i = 1 2 D ij + 1 2 (u i u j ) v j = 1 2 D ij + 1 2 (u j u i ) (continued on next slide) Week 5: Distance methods, DNA and protein models p.39/69

continued... Compute the distance between the new node (ij) and each of the remaining tips as D (ij),k = (D ik + D jk D ij ) Delete tips i and j from the tables and replace them by the new node, (ij), which is now treated as a tip. If more than two nodes remain, go back to step 1. Otherwise, connect the two remaining nodes (say, l and m ) by a branch of length D lm. / 2 Week 5: Distance methods, DNA and protein models p.40/69

The NJ star decomposition i v i v j j (ij) k Week 5: Distance methods, DNA and protein models p.41/69

The NJ tree on the Sarich data 25.25 dog bear raccoon seal sea lion 12.35 11.65 weasel cat monkey 6.875 19.125 1.75 7.8125 3.4375 1.5625 20.4375 19.5625 47.0833 Same as UPGMA? No. Not clocklike. 100.9166 Week 5: Distance methods, DNA and protein models p.42/69

Unweighted least squares on the Sarich data monkey cat weasel dog raccoon bear sea lion seal Same as Neighbor-Joining tree? Close but not the same. Week 5: Distance methods, DNA and protein models p.43/69

Fitch-Margoliash tree on the Sarich (1969) data monkey cat weasel dog raccoon bear sea lion seal Same as the unweighted least squares tree? No. Week 5: Distance methods, DNA and protein models p.44/69

Minimum evolution tree on the Sarich (1969) data monkey cat weasel raccoon dog bear sea lion seal Same as either least squares tree? As NJ tree? No. Week 5: Distance methods, DNA and protein models p.45/69

Kimura 2-parameter model A α G β β β β C α T The simplest, most symmetrical model that has different rates for transitions than for transversions. Week 5: Distance methods, DNA and protein models p.46/69

Parameters of K2P in terms of Ts/Tn ratio If we let R be the expected ratio of transition changes to transversions, it turns out that α = R R+1 β = ( ) 1 1 2 R+1 Week 5: Distance methods, DNA and protein models p.47/69

Transition [sic] probabilities for K2P Computing the net probabilities of change ( transition in the stochastic process sense rather than as molecular biologists use it), it turns out that ) Prob (transition t) = 1 4 1 2 ( exp 2R+1 R+1 t Prob (transversion t) = 1 2 1 2 exp ( 2 R+1 t ). + 1 4 exp ( 2 R+1 t ) Week 5: Distance methods, DNA and protein models p.48/69

Transition and transversion when R = 10 Differences 0.60 0.50 0.40 0.30 Total differences Transitions 0.20 Transversions 0.10 0.00 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Time (branch length) R = 10 Week 5: Distance methods, DNA and protein models p.49/69

Transition and transversion when R = 2 0.70 Total differences 0.60 Differences 0.50 0.40 0.30 0.20 Transversions Transitions 0.10 0.00 R = 2 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Time (branch length) Week 5: Distance methods, DNA and protein models p.50/69

ML estimates for the K2P model The sufficient statistics for comparison of DNA sequences under the K2P model are simply the observed fractions P and Q of transitions and transversions. Then the maximum likelihood estimates of the branch length t and of R are obtained by finding the values that predict exactly those fractions. That done by solving the equations given three slides ago. t = 1 4 ln [ (1 2Q)(1 2P Q) 2] R = ln(1 2P Q) ln(1 2Q) 1 2 Week 5: Distance methods, DNA and protein models p.51/69

Likelihood for two species under the K2P model L = Prob (data t, R) = ( ) 1 n 4 (1 P Q) n n 1 n 2 P n 1 ( 1 2 Q) n 2 where n 1 is the number of sites differing by transitions, and n 2 is the number of sites differing by transversions. P and Q are the expected fractions of transition and transversion differences, as given by the expressions four screens above. This is a function of the two parameters t and R. The values that maximize it were given above, if we estimate both of them. But if R is given rather than estimated, there is no closed-form equation solving for t, it has to be inferred numerically by finding the value of t that maximizes L. Week 5: Distance methods, DNA and protein models p.52/69

The Tamura/Nei model, F84, and HKY To : A G C T From : A α R π G /π R + βπ G βπ C βπ T G α R π A /π R + βπ A βπ C βπ T C βπ A βπ G α Y π T /π Y + βπ T T βπ A βπ G α Y π C /π Y + βπ C For the F84 model, α R = α Y For the HKY model, α R /α Y = π R /π Y Week 5: Distance methods, DNA and protein models p.53/69

Transition/transversion ratio for the Tamura-Nei model T s = 2 α R π A π G / π R + 2 α Y π C π T / π Y + β ( π A π G + 2π C π T ) T v = 2 β π R π Y To get T s /T v = R and T s + T v = 1, β = 1 2π R π Y (1 + R) α Y = π R π Y R π A π G π C π T (1 + R) (π Y π A π G ρ + π R π C π T ) α R = ρ α Y Week 5: Distance methods, DNA and protein models p.54/69

Using fictional events to mimic the Tamura-Nei model We imagine two types of events: Type I: If the existing base is a purine, draw a replacement from a purine pool with bases in relative proportions π A : π G. This event has rate α R. If the existing base is a pyrimidine, draw a replacement from a pyrimidine pool with bases in relative proportions π C : π T. This event has rate α Y. Type II: No matter what the existing base is, replace it by a base drawn from a pool at the overall equilibrium frequencies: π A : π C : π G : π T. This event has rate β. Week 5: Distance methods, DNA and protein models p.55/69

Transition [sic] probabilities with the Tamura-Nei model If the branch starts with a purine: No events exp( (α R + β)t) Some type I, no type II exp( β t) (1 exp( α R t)) Some type II 1 exp( β t) If the branch starts with a pyrimidine: No events exp( (α Y + β) t) Some type I, no type II exp( β t) (1 exp( α Y t)) Some type II 1 exp( β t) Week 5: Distance methods, DNA and protein models p.56/69

A transition probability So if we want to compute the probability of getting a G given that a branch starts with an A, we add up The probability of no events, times 0 (as you can t get a G from an A with no events) The probability of some type I, no type II times π G /π Y (as the last type I event puts in a G with probability equal to the fraction of G s out of all purines). The probability of some type II times π G (as if there is any type II event, we thereafter have a probability of G equal to its overall expected frequency, and further type I events don t change that). Week 5: Distance methods, DNA and protein models p.57/69

A transition probability So that, for example Prob (G A, t) = exp( β t) (1 exp( α R t)) π G π R + (1 exp( β t)) π G Week 5: Distance methods, DNA and protein models p.58/69

A more compact expression More generally, we can use the Kronecker delta notation δ ij and the Watson-Kronecker notation ε ij to write Prob (j i, t) = exp( (α i + β)t) δ ij +exp( βt) (1 exp( α i t)) ( ) πj ε P ij k ε jkπ k +(1 exp( βt))π j where δ ij is 1 if the two bases i and j are different (0 otherwise), and ε ij is 1 if, of the two bases i and j, one is a purine and one is a pyrimidine (0 otherwise). Week 5: Distance methods, DNA and protein models p.59/69

Reversibility and the GTR model The condition ofreversibility in a stochastic process is written in terms of the equilibrium frequences of the states ( π i and the transition probabilities as π i Prob (j i, t) = π j Prob (i j, t) The general time-reversible model: To : A G C T From : A π G α π C β π T γ G π A α π C δ π T ε C π A β π G δ π T η T π A γ π G ε π C η Week 5: Distance methods, DNA and protein models p.60/69

The GTR model βπ A A βπ C γπ γπ A T απ απ G δπ A G T G δπ επ C επ G C ηπ T ηπ C T It is the most general model possible that still has the changes satisfy the condition of reversibility so that one cannot tell which direction evolution has gone by examining the sequences before and after. Week 5: Distance methods, DNA and protein models p.61/69

Standardizing the rates 2π A π G α + 2 π A π C β + 2 π A π T γ + 2 π G π C δ + 2 π G π T ε + 2 π C π T η = 1 This is done so that the expected rate of change is 1 per unit time. Week 5: Distance methods, DNA and protein models p.62/69

General Time Reversible models inference A data example (simulated under a K2P model, true distance 0.2 transition/transversion ratio = 2 A G C T total A 93 13 3 3 112 G 10 105 3 4 122 C 6 4 113 18 141 T 7 4 21 93 125 total 116 126 140 118 500 The K2P model is a special case of the GTR model (as are all the other models that have been mentioned here). So if all goes well we should infer parameters that come close to specifying a K2P model. Week 5: Distance methods, DNA and protein models p.63/69

Averaging across the diagonal... A G C T total A 93 11.5 4.5 5 114 G 11.5 105 3.5 4 124 C 4.5 3.5 113 19.5 140.5 T 5 4 19.5 93 121.5 total 114 124 140.5 121.5 500 Week 5: Distance methods, DNA and protein models p.64/69

Dividing each column by its sum (column, because P ij is to be the probability of change from j to i ) 0.815789 0.0927419 0.0320285 0.0411523 ˆP = 0.100877 0.846774 0.024911 0.0329218 0.0394737 0.0282258 0.80427 0.160494 0.0438596 0.0322581 0.13879 0.765432 Week 5: Distance methods, DNA and protein models p.65/69

Rate matrix from the matrix logarithm If the rate matrix is A, so that P = e A t ) Ât = log (ˆP = 0.212413 0.110794 0.034160 0.046726 0.120512 0.174005 0.025043 0.035554 0.0421002 0.028375 0.236980 0.205579 0.0498001 0.034837 0.177778 0.287859. Week 5: Distance methods, DNA and protein models p.66/69

Standardizing the rates If we denote by ˆD the diagonal matrix of observed base frequencies, and we require that the rate of (potentially-observable) substitution is 1: trace(âˆd) = 1 We get: ˆt = trace(ât ˆD) = trace ( ) log(ˆp) ˆD and that also gives us an estimate of the rate matrix: ) / ( )  = log (ˆP trace log(ˆp) ˆD Week 5: Distance methods, DNA and protein models p.67/69

The rate estimates  = 0.931124 0.485671 0.149741 0.204826 0.528274 0.762764 0.109776 0.155852 0.184549 0.124383 1.038820 0.901168 0.218302 0.152710 0.779302 1.261850. moderately close to the actual K2P rate matrix used in the simulation which was: A = 1 2/3 1/6 1/6 2/3 1 1/6 1/6 1/6 1/6 1 2/3 1/6 1/6 2/3 1 (but if any of the eigenvalues of log(ˆp) are negative, this doesn t work and the divergence time is estimated to be infinite). Week 5: Distance methods, DNA and protein models p.68/69

The lattice of these models General 12 parameter model (12) General time reversible model (9) Tamura Nei (6) HKY (5) F84 (5) Kimura K2P (2) Jukes Cantor (1) There are a great many other models, but these are the most commonly used. They allow for unequal expected frequencies of bases, and for inequalities of transitions and transversions. Week 5: Distance methods, DNA and protein models p.69/69