Week 6: Protein sequence models, likelihood, hidden Markov models

Genome 570, February 2016

Variation of rates of evolution across sites

The basic way all these models deal with rates of evolution is to use the fact that evolving at a rate $r$ times as fast is exactly equivalent to evolving along a branch that is $r$ times as long, so the transition probabilities with rate $r$ are easy to calculate:

$$P_{ij}(r, t) = P_{ij}(rt)$$

The likelihood for a pair of sequences averages over all rates at each site, using the density function $f(r)$ for the distribution of rates:

$$L(t) = \prod_{i=1}^{\mathrm{sites}} \int_0^\infty f(r)\, \pi_{n_i}\, P_{m_i n_i}(rt)\, dr$$

The Gamma distribution

$$f(x) = \frac{1}{\Gamma(\alpha)\,\beta^\alpha}\, x^{\alpha-1} e^{-x/\beta}, \qquad E[x] = \alpha\beta, \qquad \mathrm{Var}[x] = \alpha\beta^2$$

To get a mean of 1, set $\beta = 1/\alpha$, so that

$$f(r) = \frac{\alpha^\alpha}{\Gamma(\alpha)}\, r^{\alpha-1} e^{-\alpha r}$$

and the squared coefficient of variation is $1/\alpha$.

Gamma distributions

[Plot: density functions of mean-1 Gamma distributions for three values of the shape parameter, over rates 0 to 3: α = 1/4 (CV = 2), α = 1 (CV = 1), and α = 4 (CV = 1/2).]
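The mean-1 parameterization can be checked numerically. The small Python sketch below (not from the slides) uses SciPy's Gamma distribution with shape α and scale 1/α, and recovers the coefficients of variation quoted for the three plotted α values.

```python
# Numeric check of the mean-1 Gamma parameterization (beta = 1/alpha).
# scipy.stats.gamma takes the shape as "a" and beta as "scale".
import numpy as np
from scipy.stats import gamma

for alpha in (0.25, 1.0, 4.0):
    dist = gamma(a=alpha, scale=1.0 / alpha)   # f(r) = alpha^alpha/Gamma(alpha) r^(alpha-1) e^(-alpha r)
    mean, var = dist.stats(moments="mv")
    cv = np.sqrt(var) / mean                   # coefficient of variation = 1/sqrt(alpha)
    print(f"alpha={alpha:4.2f}  mean={float(mean):.3f}  CV={float(cv):.3f}")
```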

Gamma rate variation in the Jukes-Cantor model

For example, for the Jukes-Cantor distance, the expected fraction of sites that differ is

$$D_S = \int_0^\infty f(r)\, \tfrac{3}{4}\left(1 - e^{-\frac{4}{3} r u t}\right) dr$$

leading to the formula for the distance $D$ as a function of $D_S$:

$$D = \tfrac{3}{4}\,\alpha \left[\left(1 - \tfrac{4}{3} D_S\right)^{-1/\alpha} - 1\right]$$
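A short Python sketch of the formula above (the function names and example numbers are illustrative, not from the slides). Note how strong rate variation (small α) inflates the corrected distance relative to the ordinary equal-rates Jukes-Cantor distance, while large α recovers it.

```python
import numpy as np

def jc_gamma_distance(d_s, alpha):
    """Jukes-Cantor distance with Gamma-distributed rates (shape alpha),
    computed from the observed fraction of differing sites d_s."""
    return 0.75 * alpha * ((1.0 - (4.0 / 3.0) * d_s) ** (-1.0 / alpha) - 1.0)

def jc_distance(d_s):
    """Ordinary Jukes-Cantor distance (equal rates at all sites)."""
    return -0.75 * np.log(1.0 - (4.0 / 3.0) * d_s)

print(jc_distance(0.3))               # ~0.383
print(jc_gamma_distance(0.3, 0.5))    # ~0.667: small alpha inflates the distance
print(jc_gamma_distance(0.3, 100.0))  # ~0.384: approaches the equal-rates value
```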

Gamma rate variation in other models

For many other distances, such as those in the Tamura-Nei family, the transition probabilities have the form

$$P_{ij}(t) = A_{ij} + B_{ij}\, e^{-bt} + C_{ij}\, e^{-ct}$$

Integrating term by term, we can use the fact that

$$E_r\!\left[e^{-brt}\right] = \left(1 + \tfrac{1}{\alpha}\, b t\right)^{-\alpha}$$

and simply substitute this for $e^{-bt}$ in the formulas for the transition probabilities.
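The substitution rule can be verified numerically. This sketch (an illustration with assumed parameter values, not lecture code) integrates $e^{-brt}$ against a mean-1 Gamma density and compares the result with the closed form.

```python
# Check that averaging e^(-b r t) over Gamma(alpha, mean 1) rates
# gives (1 + b*t/alpha)^(-alpha).
import numpy as np
from scipy import integrate
from scipy.stats import gamma

alpha, b, t = 2.0, 1.2, 0.8
dist = gamma(a=alpha, scale=1.0 / alpha)

numeric, _ = integrate.quad(lambda r: dist.pdf(r) * np.exp(-b * r * t), 0, np.inf)
closed_form = (1.0 + b * t / alpha) ** (-alpha)
print(numeric, closed_form)   # both ~0.4565
```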

Dayhoff's PAM001 matrix

Columns (original amino acid): A (ala), R (arg), N (asn), D (asp), C (cys), Q (gln), E (glu), G (gly), H (his), I (ile), L (leu), K (lys), M (met), F (phe), P (pro), S (ser), T (thr), W (trp)

A ala: 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0
R arg:    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8
N asn:    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1
D asp:    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0
C cys:    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0
Q gln:    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0
E glu:   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0
G gly:   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0
H his:    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1
I ile:    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0
L leu:    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4
K lys:    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0
M met:    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0
F phe:    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3
P pro:   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0
S ser:   28   11   34    7   11    4    6   16    2    2    1    7    4    3   17 9840   38    5
T thr:   22    2   13    4    1    3    2    2    1   11    2    8    6    1    5   32 9871    0
W trp:    0    2    0    0    0    0    0    0    0    0    0    0    0    1    0    1    0   99
Y tyr:    1    0    3    0    3    0    1    0    4    1    1    0    0   21    0    1    1    2
V val:   13    2    1    1    3    2    2    3    3   57   11    1   17    1    3    2   10    0

The codon model

Goldman & Yang, MBE 1994; Muse & Gaut, MBE 1994

[Table: the standard genetic code arranged by codon position, giving the amino acid encoded by each codon, e.g. UUU/UUC phe, UUA/UUG leu, CUU-CUG leu, AUU-AUA ile, AUG met, GUU-GUG val, UCA ser, UAA and UGA stop.]

Probabilities of change vary depending on whether the amino acid is changing, and to what.

A codon-based model of protein evolution

[Diagram: a matrix of codon-to-codon changes. Rows and columns are codons (AAA, AAC, AAG, AAT, ACA, ACC, ACG, ACT, AGA, AGC, AGG, ...) with their amino acids (lys, asn, lys, asn, thr, thr, thr, thr, arg, ser, arg); the "observation" row marks which codon is observed (0/1).]

In each cell: $P^{(v)}_{ij}\, a_{ij}$, where $P^{(v)}_{ij}$ is the probability of the codon change and $a_{ij}$ is the probability that the change is accepted.

Considerations for a protein model

Making a model for protein evolution (a not-very-practical approach):

Use a good model of DNA evolution.
Use the appropriate genetic code.
When an amino acid changes, accept the change with a probability that declines as the amino acids become more different.
Fit this to empirical information on protein evolution.
Take into account variation of rate from site to site.
Take into account correlation of rates at adjacent sites.
How about protein structure? Secondary structure? 3D structure?

(The first four steps are the codon model of Goldman and Yang, 1994, and Muse and Gaut, 1994, both in Molecular Biology and Evolution. The next two are the rate-variation machinery of Yang, 1995, 1996, and Felsenstein and Churchill, 1996.)

Likelihood ratios: the odds-ratio form of Bayes' theorem

$$\underbrace{\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)}}_{\text{posterior odds ratio}} \;=\; \underbrace{\frac{\mathrm{Prob}(D \mid H_1)}{\mathrm{Prob}(D \mid H_2)}}_{\text{likelihood ratio}} \;\times\; \underbrace{\frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}}_{\text{prior odds ratio}}$$

Given prior odds, we can use the data to compute the posterior odds.

With many sites the likelihood wins out

Independence of the evolution at the different sites implies that

$$\mathrm{Prob}(D \mid H_i) = \mathrm{Prob}\bigl(D^{(1)} \mid H_i\bigr)\, \mathrm{Prob}\bigl(D^{(2)} \mid H_i\bigr) \cdots \mathrm{Prob}\bigl(D^{(n)} \mid H_i\bigr)$$

so we can rewrite the odds-ratio formula as

$$\frac{\mathrm{Prob}(H_1 \mid D)}{\mathrm{Prob}(H_2 \mid D)} = \left(\prod_{i=1}^{n} \frac{\mathrm{Prob}\bigl(D^{(i)} \mid H_1\bigr)}{\mathrm{Prob}\bigl(D^{(i)} \mid H_2\bigr)}\right) \frac{\mathrm{Prob}(H_1)}{\mathrm{Prob}(H_2)}$$

This implies that as $n$ gets large, the likelihood-ratio part dominates.

An example: coin tossing

If we toss a coin 11 times and get HHTTHTHHTTT, the likelihood is

$$L = \mathrm{Prob}(D \mid p) = p\,p\,(1-p)(1-p)\,p\,(1-p)\,p\,p\,(1-p)(1-p)(1-p) = p^5 (1-p)^6$$

To find the maximum likelihood estimate of $p$, differentiate

$$\frac{dL}{dp} = 5p^4(1-p)^6 - 6p^5(1-p)^5$$

and set the derivative equal to zero:

$$\frac{dL}{dp} = p^4(1-p)^5\,\bigl(5(1-p) - 6p\bigr) = 0$$

which gives $\hat{p} = 5/11$.

Likelihood as a function of p for the 11 coin tosses

[Plot: the likelihood $p^5(1-p)^6$ plotted against $p$ from 0 to 1, with its maximum at $p = 5/11 \approx 0.454$.]

Maximizing the likelihood

Maximizing the likelihood is the same as maximizing the log-likelihood, because the logarithm is an increasing function of its argument. So

$$\ln L = 5 \ln p + 6 \ln(1-p)$$

and

$$\frac{d(\ln L)}{dp} = \frac{5}{p} - \frac{6}{1-p} = 0$$

so that, again, $\hat{p} = 5/11$.
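As a quick numerical cross-check (not part of the lecture), maximizing the log-likelihood with SciPy recovers the analytic answer $\hat{p} = 5/11$.

```python
# Maximize the coin-toss log-likelihood 5*ln(p) + 6*ln(1-p) numerically
# and compare with the analytic maximum at p = 5/11.
import numpy as np
from scipy.optimize import minimize_scalar

heads, tails = 5, 6
negloglik = lambda p: -(heads * np.log(p) + tails * np.log(1.0 - p))
result = minimize_scalar(negloglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, heads / (heads + tails))   # both ~0.4545
```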

An example with one site

[Tree diagram: five tips with observed bases A, C, C, C, G, interior nodes y, z, w and root x, connected by branches of lengths $t_1, \ldots, t_8$.]

We want to compute the probability of these data at the tips of the tree, given the tree and the branch lengths. Each site has an independent outcome, all on the same tree.

Likelihood sums over states of interior nodes

The likelihood is the product over sites (as they are independent given the tree):

$$L = \mathrm{Prob}(D \mid T) = \prod_{i=1}^{m} \mathrm{Prob}\bigl(D^{(i)} \mid T\bigr)$$

For site $i$, summing over all possible states at the interior nodes of the tree:

$$\mathrm{Prob}\bigl(D^{(i)} \mid T\bigr) = \sum_x \sum_y \sum_z \sum_w \mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T)$$

... the products over events in the branches

By independence of events in different branches (conditional only on their starting states):

$$\begin{aligned}
\mathrm{Prob}(A, C, C, C, G, x, y, z, w \mid T) \;=\;& \mathrm{Prob}(x)\, \mathrm{Prob}(y \mid x, t_6)\, \mathrm{Prob}(A \mid y, t_1)\, \mathrm{Prob}(C \mid y, t_2) \\
& \times \mathrm{Prob}(z \mid x, t_8)\, \mathrm{Prob}(C \mid z, t_3)\, \mathrm{Prob}(w \mid z, t_7)\, \mathrm{Prob}(C \mid w, t_4)\, \mathrm{Prob}(G \mid w, t_5)
\end{aligned}$$

The result looks hard to compute

Summing over all $x$, $y$, $z$, and $w$ (each taking the values A, G, C, T):

$$\mathrm{Prob}\bigl(D^{(i)} \mid T\bigr) = \sum_x \sum_y \sum_z \sum_w \mathrm{Prob}(x)\, \mathrm{Prob}(y \mid x, t_6)\, \mathrm{Prob}(A \mid y, t_1)\, \mathrm{Prob}(C \mid y, t_2)\, \mathrm{Prob}(z \mid x, t_8)\, \mathrm{Prob}(C \mid z, t_3)\, \mathrm{Prob}(w \mid z, t_7)\, \mathrm{Prob}(C \mid w, t_4)\, \mathrm{Prob}(G \mid w, t_5)$$

This could be hard to do on a larger tree. For example, with 20 species there are 19 interior nodes, and thus the number of combinations of states at the interior nodes is $4^{19} = 274{,}877{,}906{,}944$.

... but there's a trick ...

We can move the summations in as far as possible:

$$\mathrm{Prob}\bigl(D^{(i)} \mid T\bigr) = \sum_x \mathrm{Prob}(x) \left(\sum_y \mathrm{Prob}(y \mid x, t_6)\, \mathrm{Prob}(A \mid y, t_1)\, \mathrm{Prob}(C \mid y, t_2)\right) \left(\sum_z \mathrm{Prob}(z \mid x, t_8)\, \mathrm{Prob}(C \mid z, t_3) \left(\sum_w \mathrm{Prob}(w \mid z, t_7)\, \mathrm{Prob}(C \mid w, t_4)\, \mathrm{Prob}(G \mid w, t_5)\right)\right)$$

The pattern of parentheses parallels the structure of the tree: (A, C)(C, (C, G)).

Conditional likelihoods in this calculation

Working from the innermost parentheses outwards is the same as working down the tree. We can define a quantity $L^{(i)}_j(s)$, the conditional likelihood at site $i$ of everything at or above point $j$ in the tree, given that point $j$ has state $s$.

One such term is

$$L_7(w) = \mathrm{Prob}(C \mid w, t_4)\, \mathrm{Prob}(G \mid w, t_5)$$

Another is the term that includes it:

$$L_8(z) = \mathrm{Prob}(C \mid z, t_3) \left(\sum_w \mathrm{Prob}(w \mid z, t_7)\, \mathrm{Prob}(C \mid w, t_4)\, \mathrm{Prob}(G \mid w, t_5)\right)$$

The pruning algorithm

This follows the recursion down the tree:

$$L^{(i)}_k(s) = \left(\sum_x \mathrm{Prob}(x \mid s, t_\ell)\, L^{(i)}_\ell(x)\right) \left(\sum_y \mathrm{Prob}(y \mid s, t_m)\, L^{(i)}_m(y)\right)$$

At a tip the quantity is easy to see (if the tip is in state A):

$$\bigl(L^{(i)}(A),\, L^{(i)}(C),\, L^{(i)}(G),\, L^{(i)}(T)\bigr) = (1, 0, 0, 0)$$

At the bottom of the tree we take a weighted sum over all states, weighted by their prior probabilities:

$$L^{(i)} = \sum_x \pi_x\, L^{(i)}_0(x)$$

We can do this because, if evolution has gone on for a long time before the root of the tree, the probabilities of the bases there are just the equilibrium probabilities under the DNA model (or whatever model we assume).
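To make the recursion concrete, here is a minimal Python sketch of the pruning calculation for one site on the example tree. The Jukes-Cantor transition probabilities, the branch lengths, and the helper names (jc_prob, tip_vector, node_vector) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

BASES = "ACGT"

def jc_prob(t):
    """Jukes-Cantor transition probability matrix P(t) (an assumed model here)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def tip_vector(base):
    """Conditional likelihoods at a tip: 1 for the observed base, 0 otherwise."""
    return np.array([1.0 if b == base else 0.0 for b in BASES])

def node_vector(children):
    """Pruning step: product over child branches of sum_x Prob(x | s, t) * L_child(x)."""
    vec = np.ones(4)
    for child_vec, t in children:
        vec *= jc_prob(t) @ child_vec
    return vec

# The slides' example topology: root x with subtrees y = (A, C) and z = (C, w = (C, G)).
# Branch lengths t1..t8 below are illustrative values, not taken from the slides.
y = node_vector([(tip_vector("A"), 0.10), (tip_vector("C"), 0.20)])   # t1, t2
w = node_vector([(tip_vector("C"), 0.10), (tip_vector("G"), 0.10)])   # t4, t5
z = node_vector([(tip_vector("C"), 0.15), (w, 0.20)])                 # t3, t7
x = node_vector([(y, 0.05), (z, 0.30)])                               # t6, t8

site_likelihood = float(np.sum(0.25 * x))   # weight by equilibrium frequencies (1/4 each)
print(site_likelihood)
```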

Handling ambiguity and error

If a base is unknown, use (1, 1, 1, 1). If it is known only to be a purine, use (1, 0, 1, 0).

Note: do not use something like (0.5, 0, 0.5, 0). The entry is not the probability of being that base, but the probability of the observation given that base.

If there is sequencing error, use something like $(1-\varepsilon,\ \varepsilon/3,\ \varepsilon/3,\ \varepsilon/3)$ (assuming an error is equally likely to be each of the three other bases).
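A small sketch of the corresponding tip vectors (order A, C, G, T); the function name and the error rate are illustrative, not from the slides.

```python
# Tip conditional-likelihood vectors for ambiguous or error-prone observations.
import numpy as np

def tip_with_error(base, eps=0.01):
    """Observed base with sequencing-error rate eps, errors equally likely
    to be each of the other three bases."""
    vec = np.full(4, eps / 3.0)
    vec["ACGT".index(base)] = 1.0 - eps
    return vec

unknown = np.array([1.0, 1.0, 1.0, 1.0])   # completely unknown base
purine  = np.array([1.0, 0.0, 1.0, 0.0])   # known only to be A or G
print(tip_with_error("A"))                  # [0.99, 0.0033..., 0.0033..., 0.0033...]
```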

Rerooting the tree

[Diagram: "before" and "after" trees showing the root (node 0, at point x) moved along the branch connecting nodes y (6) and z (8), with branch lengths $t_6$ and $t_8$.]

Unrootedness

$$L^{(i)} = \sum_x \sum_y \sum_z \mathrm{Prob}(x)\, \mathrm{Prob}(y \mid x, t_6)\, \mathrm{Prob}(z \mid x, t_8) \cdots$$

Reversibility of the Markov process guarantees that $\mathrm{Prob}(x)\, \mathrm{Prob}(y \mid x, t_6) = \mathrm{Prob}(y)\, \mathrm{Prob}(x \mid y, t_6)$. Substituting that in:

$$L^{(i)} = \sum_y \sum_x \sum_z \mathrm{Prob}(y)\, \mathrm{Prob}(x \mid y, t_6)\, \mathrm{Prob}(z \mid x, t_8) \cdots$$

... which means that (if the model of change is reversible) the likelihood does not depend on where the root is placed in the unrooted tree.

To compute the likelihood when the root is in one branch

[Diagram: prune in the left subtree.]

First we get (for the site) the conditional likelihoods for the left subtree ...

To compute the likelihood when the root is in one branch

[Diagram: prune in the right subtree.]

... then we get the conditional likelihoods for the right subtree ...

To compute the likelihood when the root is in one branch

[Diagram: prune down to the bottom branch, of length 0.236.]

... and finally we get the conditional likelihoods at the root, and from those the likelihoods for the site. (Note that it does not matter where in that branch we place the root, as long as the branch lengths on either side of the root sum to 0.236.)

To do it with a different branch length ...

[Diagram: prune down to the bottom branch, now of length 0.351.]

... we can just redo the last step, now with the new branch length. The other conditional likelihoods were already calculated and have not changed, so we can re-use them. This really speeds up calculating how the likelihood depends on the length of one branch: just put the root there!

A data example: 7-species primate mtDNA

[Maximum likelihood tree for Human, Chimp, Gorilla, Orang, Gibbon, Mouse, and Bovine, with branch lengths 0.153, 0.304, 0.075, 0.172, 0.121, 0.049, 0.336, 0.106, 0.486, 0.792, and 0.902; ln L = -1405.6083.]

These are noncoding or synonymous sites from the D-loop region (and some adjacent coding regions) of mitochondrial DNA, selected by Masami Hasegawa from sequences done by S. Hayashi and coworkers in 1985.

A simulation showing consistency of the ML tree

[Plot: log-likelihood per site for two four-species tree topologies, one of them the true tree, as the number of sites grows from 100 to 100,000.]

As more and more sites are added, the true tree topology is favored. Since the vertical scale here is log(L) per site, the difference in total log-likelihoods becomes very great.

Rate variation among sites

Evolution is independent at each site once the site's rate has been specified:

$$\mathrm{Prob}(D \mid T, r_1, r_2, \ldots, r_p) = \prod_{i=1}^{p} \mathrm{Prob}\bigl(D^{(i)} \mid T, r_i\bigr)$$

Coping with uncertainty about rates

Using a Gamma distribution to assign rates to sites independently:

$$L = \prod_{i=1}^{m} \int_0^\infty f(r; \alpha)\, L^{(i)}(r)\, dr$$

Unfortunately this is hard to compute on a tree with more than a few species. Yang (1994a) approximated it by a discrete histogram of rates:

$$L^{(i)} = \int_0^\infty f(r; \alpha)\, L^{(i)}(r)\, dr \;\approx\; \sum_{k=1}^{K} w_k\, L^{(i)}(r_k)$$

Felsenstein (J. Mol. Evol., 2001) has suggested using Gauss-Laguerre quadrature to choose the rates $r_k$ and the weights $w_k$.
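Below is a sketch of one common way to build such a discrete approximation: K equal-probability categories, each represented by the mean rate within its bin (this is one of the variants discussed by Yang, not the Gauss-Laguerre scheme). The function name is an assumption.

```python
# Discrete approximation to Gamma rate variation: K equal-probability
# categories with weights 1/K, each represented by its within-bin mean rate.
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, k):
    dist = gamma(a=alpha, scale=1.0 / alpha)              # mean-1 Gamma
    cuts = dist.ppf(np.linspace(0.0, 1.0, k + 1))         # bin boundaries
    # Mean rate within each bin, using the identity
    # integral_0^c x f_alpha(x) dx = F_{alpha+1}(c) for a mean-1 Gamma.
    upper = gamma(a=alpha + 1.0, scale=1.0 / alpha).cdf(cuts)
    rates = (upper[1:] - upper[:-1]) * k                  # E[r | bin], weight 1/k each
    return rates, np.full(k, 1.0 / k)

rates, weights = discrete_gamma_rates(alpha=0.5, k=4)
print(rates, weights, np.dot(rates, weights))             # mean rate stays 1
```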

Hidden Markov Models

These are the most widely used models that allow rate variation to be correlated along the sequence. We assume:

There are a finite number of rates, m. Rate i is r_i.
There are probabilities p_i of a site having rate i.
A process not visible to us ("hidden") assigns rates to sites. It is a Markov process working along the sequence. For example, it might have transition probability Prob(j | i) of changing to rate j at the next site, given that it is at rate i at this site.

The probability of our seeing some data is obtained by summing over all possible combinations of rates, weighting each appropriately by its probability of occurrence.

Likelihood with an HMM

Suppose that we have a way of calculating, for each possible rate at each site, the probability of the data at that site ($i$) given that rate ($r_j$). This is

$$\mathrm{Prob}\bigl(D^{(i)} \mid r_j\bigr)$$

This can be done because the probabilities of change as a function of the rate $r$ and time $t$ are (in almost all models) just functions of their product $rt$, so a site that has twice the rate is just like a site whose branches are twice as long.

To get the overall probability of all the data, we sum over all possible paths through the array of sites × rates, weighting each combination of rates by its probability:

Hidden Markov Model of rate variation

[Diagram: a phylogeny at each of sites 1-8 of an alignment, with a hidden Markov process running along the sequence and assigning each site one of the rates of evolution 10.0, 2.0, 0.3, ...]

Hidden Markov Models

If there are a number of hidden rate states, with state $i$ having rate $r_i$:

$$\mathrm{Prob}(D \mid T) = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_p} \mathrm{Prob}\bigl(r_{i_1}, r_{i_2}, \ldots, r_{i_p}\bigr)\, \mathrm{Prob}\bigl(D \mid T, r_{i_1}, r_{i_2}, \ldots, r_{i_p}\bigr)$$

Evolution is independent at each site once the site's rate has been specified:

$$\mathrm{Prob}(D \mid T, r_1, r_2, \ldots, r_p) = \prod_{i=1}^{p} \mathrm{Prob}\bigl(D^{(i)} \mid T, r_i\bigr)$$

Seems impossible ...

To compute the likelihood we sum over all ways rate states could be assigned to sites:

$$L = \mathrm{Prob}(D \mid T) = \sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_p=1}^{m} \mathrm{Prob}\bigl(r_{i_1}, r_{i_2}, \ldots, r_{i_p}\bigr)\, \mathrm{Prob}\bigl(D^{(1)} \mid r_{i_1}\bigr)\, \mathrm{Prob}\bigl(D^{(2)} \mid r_{i_2}\bigr) \cdots \mathrm{Prob}\bigl(D^{(p)} \mid r_{i_p}\bigr)$$

Problem: the number of rate combinations is very large. With 100 sites and 3 rates at each, it is $3^{100} \approx 5 \times 10^{47}$. This makes the summation impractical.

Factorization and the algorithm

Fortunately, the terms can be reordered:

$$L = \mathrm{Prob}(D \mid T) = \sum_{i_1=1}^{m} \sum_{i_2=1}^{m} \cdots \sum_{i_p=1}^{m} \mathrm{Prob}(i_1)\, \mathrm{Prob}\bigl(D^{(1)} \mid r_{i_1}\bigr)\, \mathrm{Prob}(i_2 \mid i_1)\, \mathrm{Prob}\bigl(D^{(2)} \mid r_{i_2}\bigr)\, \mathrm{Prob}(i_3 \mid i_2)\, \mathrm{Prob}\bigl(D^{(3)} \mid r_{i_3}\bigr) \cdots \mathrm{Prob}(i_p \mid i_{p-1})\, \mathrm{Prob}\bigl(D^{(p)} \mid r_{i_p}\bigr)$$

Using Horner's rule

... and each summation can be moved as far rightwards as it can go:

$$L = \sum_{i_1=1}^{m} \mathrm{Prob}(i_1)\, \mathrm{Prob}\bigl(D^{(1)} \mid r_{i_1}\bigr) \sum_{i_2=1}^{m} \mathrm{Prob}(i_2 \mid i_1)\, \mathrm{Prob}\bigl(D^{(2)} \mid r_{i_2}\bigr) \sum_{i_3=1}^{m} \mathrm{Prob}(i_3 \mid i_2)\, \mathrm{Prob}\bigl(D^{(3)} \mid r_{i_3}\bigr) \cdots \sum_{i_p=1}^{m} \mathrm{Prob}(i_p \mid i_{p-1})\, \mathrm{Prob}\bigl(D^{(p)} \mid r_{i_p}\bigr)$$

Recursive calculation of HMM likelihoods

The summations can be evaluated innermost-outwards. The same summations appear in multiple terms, so we need evaluate each only once. A huge saving results. The result is this algorithm.

Define $P_i(j)$ as the probability of everything at or to the right of site $i$, given that site $i$ has the $j$-th rate. For the last site we can see immediately that, for each possible rate category $i_p$,

$$P_p(i_p) = \mathrm{Prob}\bigl(D^{(p)} \mid r_{i_p}\bigr)$$

(as "at or to the right of" simply means "at" for that site).

Recursive calculation

More generally, for site $\ell < p$ and its rate categories $i_\ell$:

$$P_\ell(i_\ell) = \mathrm{Prob}\bigl(D^{(\ell)} \mid r_{i_\ell}\bigr) \sum_{i_{\ell+1}=1}^{m} \mathrm{Prob}(i_{\ell+1} \mid i_\ell)\, P_{\ell+1}(i_{\ell+1})$$

We can compute the $P$'s recursively using this, starting with the last site and moving leftwards down the sequence. Finally we have $P_1(i_1)$ for all $m$ states. These are simply weighted by the equilibrium probabilities of the Markov chain of rate categories:

$$L = \mathrm{Prob}(D \mid T) = \sum_{i_1=1}^{m} \mathrm{Prob}(i_1)\, P_1(i_1)$$

An entirely similar calculation can be done from left to right, remembering that the transition probabilities $\mathrm{Prob}(i_k \mid i_{k+1})$ would be different in that case.
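A compact Python sketch of this recursion (not lecture code). The per-site, per-category likelihoods, the category transition matrix, and the prior are placeholder values; in practice the per-site likelihoods would come from the pruning algorithm.

```python
# P[l, j] = probability of the data at sites l..p-1 (0-based), given that
# site l is in rate category j.
import numpy as np

rng = np.random.default_rng(0)
p, m = 100, 3                                    # sites, rate categories
site_lik = rng.uniform(0.01, 0.2, size=(p, m))   # placeholder Prob(D^(l) | r_j)
trans = np.array([[0.90, 0.05, 0.05],
                  [0.05, 0.90, 0.05],
                  [0.05, 0.05, 0.90]])           # Prob(category at l+1 | category at l)
prior = np.full(m, 1.0 / m)                      # stationary distribution of this trans

P = np.empty((p, m))
P[-1] = site_lik[-1]                             # last site: P_p(i_p) = Prob(D^(p) | r_ip)
for l in range(p - 2, -1, -1):                   # move leftwards down the sequence
    P[l] = site_lik[l] * (trans @ P[l + 1])      # sum over categories at site l+1

log_likelihood = np.log(prior @ P[0])            # weight site 1 by the prior
print(log_likelihood)
```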

All paths through the array

[Diagram: the sites × rate-categories array, showing the conditional probabilities of everything at or to the right of each site, given the state at that site.]

Starting and finishing the calculation

At the end, at site $m$:

$$\mathrm{Prob}\bigl(D^{[m]} \mid T, r_{i_m}\bigr) = \mathrm{Prob}\bigl(D^{(m)} \mid T, r_{i_m}\bigr)$$

and once we get to site 1, we need only use the prior probabilities of the rates $r_i$ to get a weighted sum:

$$\mathrm{Prob}(D \mid T) = \sum_{i_1} \pi_{i_1}\, \mathrm{Prob}\bigl(D^{[1]} \mid T, r_{i_1}\bigr)$$

The pruning algorithm is just like ...

[Diagram: the pruning algorithm on a tree, combining the different possible bases at species 1 and species 2 into conditional likelihoods at their ancestor.]

... the forward algorithm

[Diagram: the forward algorithm along the sequence, combining the different possible rates at sites 20, 21, and 22.]

Paths from one state in one site

Paths from another state in that site

Paths from a third state

We can also use the backward algorithm

You can also do it the other way.

Using both, we can find the likelihood contributed by a state

The "forward-backward" algorithm allows us to get the probability of everything given one site's state.

A particular Markov process on rates

There are many possible Markov processes that could be used for the rates in the HMM. I have used:

$$\mathrm{Prob}(r_i \mid r_j) = (1 - \lambda)\, \delta_{ij} + \lambda\, \pi_i$$
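A small sketch of this transition matrix (the λ value is illustrative; the π vector is taken to match the category probabilities of the next slide's example). Its stationary distribution is π, and λ controls how quickly the rate category changes along the sequence.

```python
# Rate-category transition matrix Prob(r_i | r_j) = (1 - lambda)*delta_ij + lambda*pi_i.
import numpy as np

def rate_transition_matrix(pi, lam):
    pi = np.asarray(pi, dtype=float)
    m = len(pi)
    return (1.0 - lam) * np.eye(m) + lam * np.tile(pi, (m, 1))

pi = np.array([0.2, 0.4, 0.4])                  # category probabilities (next slide's example)
T = rate_transition_matrix(pi, lam=1.0 / 3.0)   # lambda value is illustrative
print(T)
print(pi @ T)                                   # equals pi: stationary-distribution check
```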

A numerical example: Cytochrome B

We analyze 31 cytochrome B sequences, aligned by Naoko Takezaki, using the Proml protein maximum likelihood program. Assume a Hidden Markov Model with 3 states, expected block length 3, and rates:

category   rate   probability
    1       0.0       0.2
    2       1.0       0.4
    3       3.0       0.4

We get a reasonable, but not perfect, tree, with the best rate combination inferred to be as shown on the following slides.

Phylogeny for Takezaki cytochrome B

[Tree: the maximum likelihood phylogeny for the 31 sequences: whalebm, whalebp, borang, sorang, hseal, gseal, gibbon, gorilla2, bovine, gorilla1, cchimp, cat, rhinocer, pchimp, platypus, dhorse, horse, african, caucasian, wallaroo, rat, opossum, mouse, chicken, seaurchin2, xenopus, loach, carp, seaurchin1, trout, lamprey.]

Rates inferred from Cytochrome B

[Alignment: the first block of the 31-sequence cytochrome B alignment (african, caucasian, cchimp, pchimp, gorilla1, gorilla2, borang, sorang, gibbon, bovine, whalebm, whalebp, dhorse, horse, rhinocer, cat, gseal, hseal, mouse, rat, platypus, wallaroo, opossum, chicken, xenopus, carp, loach, trout, lamprey, seaurchin1, seaurchin2), with the inferred rate category (1, 2, or 3) printed above each site.]

Rates inferred from Cytochrome B

[Alignment: the second block of the cytochrome B alignment, again with the inferred rate category (1, 2, or 3) printed above each site.]

PhyloHMMs: used in the UCSC Genome Browser

The conservation scores calculated in the Genome Browser use PhyloHMMs, which are just these HMM methods.