Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models

Similar documents
Week 6: Protein sequence models, likelihood, hidden Markov models

Likelihoods and Phylogenies

= 1 = 4 3. Odds ratio justification for maximum likelihood. Likelihoods, Bootstraps and Testing Trees. Prob (H 2 D) Prob (H 1 D) Prob (D H 2 )

Lecture 4. Models of DNA and protein change. Likelihood methods

Plan: Evolutionary trees, characters. Perfect phylogeny Methods: NJ, parsimony, max likelihood, Quartet method

Molecular Evolution and Phylogenetic Tree Reconstruction

Evolutionary Models. Evolutionary Models

Lecture 27. Phylogeny methods, part 7 (Bootstraps, etc.) p.1/30

Lecture 4. Models of DNA and protein change. Likelihood methods

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

STA 414/2104: Machine Learning

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Week 8: Testing trees, Bootstraps, jackknifes, gene frequencies

Week 7: Bayesian inference, Testing trees, Bootstraps

Hidden Markov Models. Ivan Gesteira Costa Filho IZKF Research Group Bioinformatics RWTH Aachen Adapted from:

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Week 5: Distance methods, DNA and protein models

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

L23: hidden Markov models

Bayesian Machine Learning - Lecture 7

An Introduction to Bayesian Inference of Phylogeny

Hidden Markov Models

Inferring Speciation Times under an Episodic Molecular Clock

Inferring Molecular Phylogeny

STA 4273H: Statistical Machine Learning

Today s Lecture: HMMs

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Inference of phylogenies, with some thoughts on statistics and geometry p.1/31

HIDDEN MARKOV MODELS

Molecular Evolution and Comparative Genomics

EVOLUTIONARY DISTANCES

Molecular Evolution & Phylogenetics

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

Plan for today. ! Part 1: (Hidden) Markov models. ! Part 2: String matching and read mapping

order is number of previous outputs

An Introduction to Bioinformatics Algorithms Hidden Markov Models

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

An Introduction to Bioinformatics Algorithms Hidden Markov Models

3 : Representation of Undirected GM

Statistical Sequence Recognition and Training: An Introduction to HMMs

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

Computational Genomics

Lecture 9. Intro to Hidden Markov Models (finish up)

CS711008Z Algorithm Design and Analysis

Hidden Markov Models

Taming the Beast Workshop

Discrete & continuous characters: The threshold model

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG

Multiple Sequence Alignment using Profile HMM

Lecture 15. Probabilistic Models on Graph

Phylogenetic inference

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Molecular evolution. Joe Felsenstein. GENOME 453, Autumn Molecular evolution p.1/49

Estimating Phylogenies (Evolutionary Trees) II. Biol4230 Thurs, March 2, 2017 Bill Pearson Jordan 6-057

Bayesian Models for Phylogenetic Trees

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

BINF6201/8201. Molecular phylogenetic methods

Hidden Markov Models. By Parisa Abedi. Slides courtesy: Eric Xing

Alignment Algorithms. Alignment Algorithms

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Lecture 3: Markov chains.

Bioinformatics 2 - Lecture 4

Phylogenetic Tree Reconstruction

Molecular Evolution, course # Final Exam, May 3, 2006

Lecture 8: December 17, 2003

Molecular phylogeny How to infer phylogenetic trees using molecular sequences

Lecture Notes: Markov chains

1 Probabilities. 1.1 Basics 1 PROBABILITIES

Hidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010

Stephen Scott.

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Using algebraic geometry for phylogenetic reconstruction

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Brief Introduction of Machine Learning Techniques for Content Analysis

Reconstruire le passé biologique modèles, méthodes, performances, limites

9 Forward-backward algorithm, sum-product on factor graphs

Hidden Markov models in population genetics and evolutionary biology

Substitution = Mutation followed. by Fixation. Common Ancestor ACGATC 1:A G 2:C A GAGATC 3:G A 6:C T 5:T C 4:A C GAAATT 1:G A

Statistical Methods for NLP

Introduction to Machine Learning CMU-10701

Phylogenetics. BIOL 7711 Computational Bioscience

Approximate Inference

Hidden Markov Models (I)

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Hidden Markov Models

Phylogeny: traditional and Bayesian approaches

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

1 Probabilities. 1.1 Basics 1 PROBABILITIES

Hidden Markov Models 1

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Hidden Markov Models. x 1 x 2 x 3 x N

ECE521 Lecture 19 HMM cont. Inference in HMM

Supplementary information

Transcription:

Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models Genome 570 February, 2012 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.1/63

Restriction sites and microsatellite loci Note owing to time pressure, we are going to skip the restriction sites and microsatellites material in the lectures. The slides are still included here. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.2/63

Restriction sites GGGGAATCCCAGTCAGGTTCCGTTAGTGAATCCGTTACCCGTAAT GGGCAATCCCAGTCAGGTTCCGTTAGTGAATCCGTTACCCGTAAT GGGCATTCCCAGTCAGGTTCCGTTAGTGAA CCCGTTACCCGTAAT Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.3/63

Number of sequences in a restriction site location N k = ( ) r 3 k. k k number 0 1 1 18 2 135 3 540 4 1215 5 1458 6 729 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.4/63

Transition probabilities for restriction sites, aggregated P kl = P kl = b m=a ( r k ) l k+m min[k,r l] m=max[0,k l] p l k+m (1 p) r k (l k+m) ( k m) ( p 3 ) m ( ) 1 p k m 3 ( r k ) l k+m p l k+m (1 p) r l+m ( k m)( p 3 ) m ( ) 1 p k m 3 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.5/63

Presence/absence of restriction sites π 0 = ( 1 4 ) r P ++ = π 0 (1 p) r P ++ = P + = P + = ( ) ] r r 1 [1 + 3e 4 3 t 4 4 ( ) ( r 1 1 4 [ 1 + 3e 4 3 t 4 ] r ) P = 1 ( ) ( r 1 2 4 [ 1 + 3e 4 3 t 4 ] r ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.6/63

Conditioning on the site not being entirely absent Prob (+ + not ) = P ++ / (P ++ + P + + P + ) Prob ( + not ) = P + / (P ++ + P + + P + ) Prob (+ not ) = P + / (P ++ + P + + P + ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.7/63

Likelihoods for two sequences with restriction sites L = ( P ++ ) n++ ( ) n+ +n P + + P ++ + P + + P + P ++ + P + + P + is of the form L = x n ++ ( ) n+ +n 1 + (1 x) 2 where x = P ++ P ++ + P + + P + Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.8/63

Maximizing the likelihood The maximum likelihood is when x = n ++ n ++ + n + + n + using the equations for the P s and for x ( 1 + 3e 4 3 t 4 ) r = which can be solved for t to give the distance D = ˆt = 3 4 ln 4 3 [ n ++ n ++ + 1 2 (n + + n + ) n ++ n ++ + 1 2 (n + + n + ) ] 1/r 1 3 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.9/63

Probability ancestor is + Are shared restriction sites parallelisms? 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 restriction site (6 bases) 0.0 0.5 1.0 1.5 2.0 t + t + + two state 0 1 model t Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.10/63

A one-step microsatellite model Computing the probability that there are a net of i steps by having i + k steps up and k steps down, Prob (there are i + 2k steps) = e µt (µt) i+2k /(i + 2k)! Prob (i + k of these are steps upwards) = ( ) ( i+2k 1 i+k ( 1 ) k i+k 2) 2 = ( ) ( i+2k 1 ) i+2k i+k 2... we put these together to and sum over k to get the result... Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.11/63

Transition probabilities in the microsatellite model Transition probability of i steps net change in branch of length µt is then p i (µt) = k=0 = e µt e µt (µt) ( i+2k i+2k ) ( 1 ) i+2k (i+2k)! i+k 2 k=0 ( µt ) i+2k 1 2 (i+k)!k! Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.12/63

Why the (δµ) 2 distance works approximately 2N generations e t generations approximately 2N generations e Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.13/63

(δµ) 2 distance for microsatellites (δµ) 2 = (µ A µ B ) 2 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.14/63

Transition probabilities in a one-step model µ t = 1 µ t = 5 10 5 0 5 10 change in number of repeats Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.15/63

A Brownian motion approximation Prob (m + k m) min [ 1, ( 1 exp 1 µt 2π 2 k 2 )] µt Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.16/63

Is it accurate? well... Probability 1 0.8 0 0.6 0.4 0.2 1 2 0.01.02.05 0.1 0.2 0.5 1 2 5 Branch Length 3 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.17/63

Likelihood ratios: the odds-ratio form of Bayes theorem Prob (H 1 D) Prob (H 2 D) = Prob (D H 1) Prob (D H 2 ) Prob (H 1 ) Prob (H 2 ) } {{ } } {{ } }{{} posterior likelihood prior odds ratio ratio odds ratio Given prior odds, we can use the data to compute the posterior odds. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.18/63

With many sites the likelihood wins out Independence of the evolution in the different sites implies that: Prob (D H i ) = Prob (D (1) H i ) Prob (D (2) H i )... Prob (D (n) H i ) so we can rewrite the odds-ratio formula as Prob (H 1 D) Prob (H 2 D) = ( n i=1 ) Prob (D (i) H 1 ) Prob (D (i) H 2 ) Prob (H 1 ) Prob (H 2 ) This implies that as n gets large the likelihood-ratio part will dominate. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.19/63

An example coin tossing If we toss a coin 11 times and gethhtththhttt, the likelihood is: L = Prob (D p) = p p (1 p) (1 p) p (1 p) p p (1 p) (1 p) (1 p) = p 5 (1 p) 6 Solving for the maximum likelihood estimate for p by finding the maximum: and equating it to zero and solving: gives p = 5/11 dl dp = 5p4 (1 p) 6 6p 5 (1 p) 5 dl dp = p4 (1 p) 5 (5(1 p) 6p) = 0 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.20/63

Likelihood as function of p for the 11 coin tosses Likelihood 0.0 0.2 0.4 0.6 0.8 1.0 p 0.454 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.21/63

Maximizing the likelihood Maximizing the likelihood is the same as maximizing the log likelihood, because its log increases as the number increases, so: and ln L = 5 lnp + 6 ln(1 p), d(ln L) dp = 5 p 6 (1 p) = 0, so that, again, p = 5/11 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.22/63

An example with one site A C C G t 1 t 2 y C t 3 t 4 t 5 w t t 6 z 7 x t 8 We want to compute the probability of these data at the tips of the tree, given the tree and the branch lengths. Each site has an independent outcome, all on the same tree. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.23/63

Likelihood sums over states of interior nodes The likelihood is the product over sites (as they are independent given the tree): L = Prob (D T) = m Prob ( ) D (i) T i=1 For site i, summing over all possible states at interior nodes of the tree: ( ) Prob D (i) T = Prob (A, C, C, C, G, x, y, z, w T) x y z w Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.24/63

... the products over events in the branches By independence of events in different branches (conditional only on their starting states): Prob (A, C, C, C, G, x, y, z, w T) = Prob (x) Prob (y x, t 6 ) Prob (A y, t 1 ) Prob (C y, t 2 ) Prob (z x, t 8 ) Prob (C z, t 3 ) Prob (w z, t 7 ) Prob (C w, t 4 ) Prob (G w, t 5 ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.25/63

The result looks hard to compute Summing over all x, y, z, and w (each taking values A, G, C, T): Prob ( D (i) T ) = x y z w Prob (x) Prob (y x, t 6 ) Prob (A y, t 1 ) Prob (C y, t 2 ) Prob (z x, t 8 ) Prob (C z, t 3 ) Prob (w z, t 7 ) Prob (C w, t 4 ) Prob (G w, t 5 ) This could be hard to do on a larger tree. For example, with 20 species there are 19 interior nodes and thus the number of outcomes for interior nodes is 4 19 = 274, 877, 906, 944. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.26/63

... but there s a trick... We can move summations in as far as possible: Prob ( D (i) T ) = x Prob (x) ( y ) Prob (y x, t 6 ) Prob (A y, t 1 ) Prob (C y, t 2 ) ( w ( Prob (z x, t 8 ) Prob (C z, t 3 ) z )) Prob (w z, t 7 ) Prob (C w, t 4 ) Prob (G w, t 5 ) The pattern of parentheses parallels the structure of the tree: (A, C)(C, (C, G)) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.27/63

Conditional likelihoods in this calculation Working from innermost parentheses outwards is the same as working down the tree. We can define a quantity L (i) j (s) the conditional likelihood at site i of everything at or above point j in the tree, given that point j have state s One such is the term: L 7 (w) = Prob (C w, t 4 ) Prob (G w, t 5 ) Another is the term including that: ( ) L 8 (z) = Prob (C z, t 3 ) Prob (w z, t 7 ) Prob (C w, t 4 ) Prob (G w, t 5 ) w Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.28/63

The pruning algorithm This follows the recursion down the tree: L (i) k (s) = P x Prob (x s,t l ) L (i) l (x) «P y Prob (y s,t m )L (i) m (y)! At a tip the quantity is easily seen to be like this (if the tip is in state A): L (i) (A), L (i) (C), L (i) (G), L (i) (T) = (1,0,0, 0) At the bottom we have a weighted sum over all states, weighted by their prior probabilities: L (i) = X x π x L (i) 0 (x) We can do that because, if evolution has gone on for a long time before the root of the tree, the probabilities of bases there are just the equilibrium probabilities under the DNA model (or whatever model we assume). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.29/63

Handling ambiguity and error If a base is unknown: use (1, 1, 1, 1). If known only to be a purine: (1, 0, 1, 0) Note donot do something like this: (0.5, 0, 0.5, 0). It isnot the probability ofbeing that base but the probability of the observationgiven that base. If there is sequencing error use something like this: (1 ε, ε/3, ε/3, ε/3) (assuming an error is equally likely to be each of the three other bases). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.30/63

Rerooting the tree before after y 6 8 z 8 z t 6 x 0 t 8 y 6 x 0 t 6 t 8 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.31/63

Unrootedness L (i) = x y z Prob (x) Prob (y x, t 6 ) Prob (z x, t 8 ) Reversibility of the Markov process guarantees that Prob (x) Prob (y x, t 6 ) = Prob (y) Prob (x y, t 6 ). Substituting that in: L (i) = y x y Prob (y) Prob (x y, t 6 ) Prob (z x, t 8 )... which means that (if the model of change is a reversible one) the likelihood does not depend on where the root is placed in the unrooted tree. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.32/63

To compute likelihood when the root is in one branch prune in this subtree First we get (for the site) the conditional likelihoods for the left subtree... Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.33/63

To compute likelihood when the root is in one branch prune in this subtree... then we get the conditional likelihoods for the right subtree Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.34/63

To compute likelihood when the root is in one branch prune to bottom 0.236... and finally we get the conditional likelihoods for the root, and from that the likelhoods for the site. (Note that it does not matter where in that branch we place the root, as long as the sum of the branch lengths on either side of the root is 0.236). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.35/63

To do it with a different branch length... prune to bottom 0.351... we can just do the last step, but now with the new branch length. The other conditional likelihoods were already calculated, have not changed, and so we can re-use them. This really speeds things up for calculating how the likelihood depends on the length of one branch just put the root there! Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.36/63

A data example: 7-species primates mtdna Chimp Human Orang Gorilla 0.153 0.304 0.075 0.172 0.121 0.049 0.336 0.106 Gibbon 0.486 ln L = 1405.6083 0.792 0.902 Mouse Bovine These are noncoding or synonymous sites from the D-loop region (and some adjacent coding regions) of mitochondrial DNA, selected by Masami Hasegawa from sequences done by S. Hayashi and coworkers in 1985. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.37/63

Rate variation among sites Evolution is independent once each site has had its rate specified Prob (D T, r 1, r 2,...,r p ) = p i=1 Prob (D (i) T, r i ). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.38/63

Coping with uncertainty about rates Using a Gamma distribution independently assigning rates to sites: L = m [ i=1 0 ] f(r; α) L (i) (r) dr Unfortunately this is hard to compute on a tree with more than a few species. Yang (1994a) approximated this by a discrete histogram of rates: L (i) = 0 f(r; α) L (i) (r) dr k w k L (i) (r k ) j=1 Felsenstein (J. Mol. Evol., 2001) has suggested using Gauss-Laguerre quadrature to choose the rates r i and the weights w i. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.39/63

Hidden Markov Models These are the most widely used models allowing rate variation to be correlated along the sequence. We assume: There are a finite number of rates, m. Rate i is r i. There are probabilities p i of a site having rate i. A process not visible to us ( hidden") assigns rates to sites. It is a Markov process working along the sequence. For example it might have transition probability Prob (j i) of changing to rate j in the next site, given that it is at rate i in this site. The probability of our seeing some data are to be obtained by summing over all possible combinations of rates, weighting appropriately by their probabilities of occurrence. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.40/63

Likelihood with a[n] HMM Suppose that we have a way of calculating, for each possible rate at each possible site, the probability of the data at that site (i) given that rate (r j ). This is ) Prob (D (i) r j This can be done because the probabilities of change as a function of the rate r and time t are (in almost all models) just functions of their product rt, so a site that has twice the rate is just like a site that has branches twice as long. To get the overall probability of all data, sum over all possible paths through the array of sites rates, weighting each combination of rates by its probability: Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.41/63

Hidden Markov Model of rate variation Sites Phylogeny 1 2 3 4 5 6 7 8 C C C C A A G G A A C T A A G G A G A T A A A A G C C C C C G G G G G A A G G C... Hidden Markov Process Rates of evolution 10.0 2.0 0.3... Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.42/63

Hidden Markov Models If there are a number of hidden rate states, with state i having rate r i Prob (D T) = Prob (r i1, r i2,...r ip ) i 1 i 2 i p Prob (D T, r i1, r i2,...r im ) Evolution is independent once each site has had its rate specified Prob (D T, r 1, r 2,...,r p ) = p i=1 Prob (D (i) T, r i ). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.43/63

Seems impossible... To compute the likelihood we sum over all ways rate states could be assigned to sites: L = Prob (D T) = m m m i 1 =1 i 2 =1 i p =1 Prob ( r i1, r i2,...,r ip ) Prob ( D (1) r i1 ) Prob ( D (2) r i2 )...Prob ( D (n) r ip ) Problem: The number of rate combinations is very large. With 100 sites and 3 rates at each, it is 3 100 5 10 47. This makes the summation impractical. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.44/63

Factorization and the algorithm Fortunately, the terms can be reordered: L = Prob (D T) = m i 1 =1 m i 2 =1... m i p =1 Prob (i 1 ) Prob ( D (1) r i1 ) Prob (i 2 i 1 ) Prob ( D (2) r i2 ) Prob (i 3 i 2 ) Prob ( D (3) r i3 ). Prob (i p i p 1 ) Prob ( D (p) r ip ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.45/63

Using Horner s Rule and the summations can be moved each as far rightwards as it can go: L = m i 1 =1 m i 2 =1 m i 3 =1. Prob (i 1 ) Prob ( ) D (1) r i1 Prob (i 2 i 1 ) Prob ( ) D (2) r i2 Prob (i 3 i 2 ) Prob ( D (3) r i3 ) m i p =1 Prob (i p i p 1 ) Prob ( D (p) r ip ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.46/63

Recursive calculation of HMM likelihoods The summations can be evaluated innermost-outwards. The same summations appear in multiple terms. We can then evaluate them only once. A huge saving results. The result is this algorithm: Define P i (j) as the probability of everything at or to the right of site i, given that site i has the j-th rate. Now we can immediately see for the last site that for each possible rate category i p ) P p (i p ) = Prob (D (p) r ip (as at or to the right of" simply means at" for that site). Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.47/63

Recursive calculation More generally, for site l < p and its rates i l P l (i l ) = Prob (D (l) r il ) m i l =1 Prob (i l+1 i l ) P l+1 (i l+1 ) We can compute the P s recursively using this, starting with the last site and moving leftwards down the sequence. Finally we have the P 1 (i 1 ) for all m states. These are simply weighted by the equilibrium probabilities of the Markov chain of rate categories: L = Prob (D T) = m i 1 =1 Prob (i 1 ) P 1 (i 1 ) An entirely similar calculation can be done from left to right, remembering that the transition probabilities Prob (i k i k+1 ) would be different in that case. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.48/63

All paths through the array array of conditional probabilities of everything at or to the right of that site, given the state at that site Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.49/63

Starting and finishing the calculation At the end, at site m: Prob (D [m] T, r im ) = Prob (D (m) T, r im ) and once we get to site 1, we need only use the prior probabilities of the rates r i to get a weighted sum: Prob (D T) = i 1 π i1 Prob (D [1] T, r i1 ) Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.50/63

The pruning algorithm is just like species 1 species 2 different bases ancestor ancestor Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.51/63

the forward algorithm site 22 different rates site 21 site 20 Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.52/63

Paths from one state in one site Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.53/63

Paths from another state in that site Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.54/63

Paths from a third state Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.55/63

We can also use the backward algorithm you can also do it the other way Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.56/63

Using both we can find likelihood contributed by a state the "forward backward" algorithm allows us to get the probability of everything given one site s state Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.57/63

A particular Markov process on rates There are many possible Markov processes that could be used in the HMM rates problem. I have used: Prob (r i r j ) = (1 λ) δ ij + λ π i Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.58/63

A numerical example. Cytochrome B We analyze 31 cytochrome B sequences, aligned by Naoko Takezaki, using the Proml protein maximum likelihood program. Assume a Hidden Markov Model with 3 states, rates: and expected block length 3. category rate probability 1 0.0 0.2 2 1.0 0.4 3 3.0 0.4 We get a reasonable, but not perfect, tree with the best rate combination inferred to be Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.59/63

Phylogeny for Takezaki cytochrome B whalebm whalebp borang sorang hseal gseal gibbon gorilla2 bovine gorilla1 cchimp cat rhinocer pchimp platypus dhorse horse african caucasian wallaroo rat opossum mouse chicken seaurchin2 xenopus loach carp seaurchin1 trout lamprey Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.60/63

Rates inferred from Cytochrome B 1333333311 3222322313 3321113222 2133111111 1331133123 1122111 african M-----TPMR KINPLMKLIN HSFIDLPTPS NISAWWNFGS LLGACLILQI TTGLFLA caucasian......r........t...... cchimp...t................. pchimp...t..........t......... gorilla1... T...A........T......... gorilla2... T...A........T......... borang... T....L.........I.TI... sorang...st.. T....L.........I...... gibbon...l.. T....L...A.....M......I... bovine...ni.. SH...IV.N A...A.....S.....I...L... whalebm...ni.. TH...I..D A.....S.....L...V..L... whalebp...ni.. TH...IV.D A.V.....S.....L...M..L... dhorse...ni.. SH..I.I......A.....S.....I...L... horse...ni.. SH..I.I........S.....I...L... rhinocer...ni.. SH..V.I........S.....I...L... cat...ni.. SH..I.I......A........V..T...L... gseal...ni.. TH...I..N........I...L... hseal...ni.. TH...I..N........I...L... mouse...n... TH..F.I......A.....S.....V..MV..I... rat...ni.. SH..F.I......A.....S.....V..MV..L... platypus...nnl.. TH..I.IV.......S.....L...I..L... wallaroo...nl.. SH..I.IV.....A.........I..L... opossum...ni.. TH...I..D........V...I..L... chicken...apni.. SH..L.M..N.L...A.......AV..MT..L...L... xenopus...apni.. SH..I.I..N.....SL.....V...A..I... carp...a-sl.. TH..I.IA.D ALV........L...T..L... loach...a-sl.. TH..I.IA.D ALV...A.....V.....L...T..L... trout...a-nl.. TH..L.IA.D ALV...A.....V.....L..AT..L... lamprey.shqpsii.. TH..LS.G.S MLV...S.A.....SL...I...I... seaurchin1 -...LG.L.. EH.IFRIL.S T.V...L... L.I.....L...T..L... seaurchin2 -...AG.L.. EH.IFRIL.S T.V...L... Week 6: Restriction sites, L.M... RAPDs, microsatellites, likelihood,..l...i.li hidden Markov models..i... p.61/63

Rates inferred from Cytochrome B 2223311112 2222222222 2222232112 2222222223 1222221112 3333111 african PDASTAFSSI AHITRDVNYG WIIRYLHANG ASMFFICLFL HIGRGLYYGS FLYSETW caucasian.................. cchimp..................l... pchimp............l....v......l... gorilla1.......t...........hq... gorilla2.......t...........hq... borang...t.......m..h......l.......thl... sorang.......m..h..........thl... gibbon...v...............l... bovine S.TT...V T..C......M......YM.V... YTFL... whalebm..tm...v T..C....V......YA.M... HAFR... whalebp..tt...v T..C.........YA.M... YAFR... dhorse S.TT...V T..C.........I.V... YTFL... horse S.TT...V T..C.........I.V... YTFL... rhinocer..tt...v T..C....M......I.V... YTFL... cat S.TM...V T..C.........YM.V...M... YTF... gseal S.TT...V T..C.........YM.V... YTFT... hseal S.TT...V T..C.........YM.V... YTFT... mouse S.TM...V T..C....L...M.......V... YTFM... rat S.TM...V T..C....L...Q.......V... YTFL... platypus S.T...V...C....L...M.....L..M.I..... YTQT... wallaroo S.TL...V...C....L..N......M....V...I... Y..K... opossum S.TL...V...C....L..NI......M....V...I... Y..K... chicken A.T.L...V..TC.N.Q...L..N.....F...I..... Y..K... xenopus A.T.M...V...CF... LL..N... L.F...IY.......K... carp S.I...V T..C....L..NV.....F...IYM..A... Y..K... loach S.I...V...C....L..NI.....F...Y...A... Y..K... trout S.I...V C..C...S...L..NI.....F...IYM..A... Y..K... lamprey ANTEL...V M..C...N..LM.N......IYA...I... Y..K... seaurchin1 A.I.L...A S..C....LL.NV.....L...MYC...G SNKI... seaurchin2 A.INL...V S..C....LL.NV...C Week 6: Restriction sites,..l...myc RAPDs, microsatellites, likelihood,...l hidden Markov models TNKI... p.62/63

PhyloHMMs: used in the UCSC Genome Browser The conservation scores calculated in the Genome Browser use PhyloHMMs, which is just these HMM methods. Week 6: Restriction sites, RAPDs, microsatellites, likelihood, hidden Markov models p.63/63