Lecture 7: Sequence analysis. Hidden Markov Models


Nicolas Lartillot (Université de Montréal), BIN6009, May 2012

Outline
1. Motivation
2. Examples of Hidden Markov models
3. Hidden Markov models: formal definition; most likely path (Viterbi algorithm); summing over paths; sampling paths; learning HMM parameters and structure
4. Examples
5. Sequence alignment
6. Hidden Markov models and sequence alignment: pairwise alignment; multiple alignment (profile HMMs)
7. Summary

Motivation: models of gene structure
- locally, detecting small signal sequences and signatures
- the global arrangement (order) of elements has to be meaningful

Motivation: promoters and enhancers
[figure: structure of a promoter]
Structure of enhancers:
- small and degenerate binding sites
- clustered, and separated by intervals of variable length

Motivation: position-specific scoring matrix (PSSM)
Experimentally validated binding sites [figure: aligned binding-site sequences]

Binding site, q = (q_ij), i = 1..5, j = A,C,G,T:
position   1    2    3    4    5
A        0.5  0.4  0.1  0.5  0.2
C        0.2  0.2  0.7  0.2  0.2
G        0.1  0.2  0.1  0.2  0.1
T        0.2  0.2  0.1  0.1  0.5

Background, r = (r_j), j = A,C,G,T:
A 0.4   C 0.3   G 0.2   T 0.2
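A minimal sketch of how such a matrix is used for scanning, assuming a log-odds score against the background (the sequence, variable names, and scoring scheme below are illustrative, not from the slides):

```python
import math

# PSSM q[i][base] (5 positions) and background r[base], numbers from the
# tables above; everything else in this sketch is an illustrative assumption.
q = [
    {"A": 0.5, "C": 0.2, "G": 0.1, "T": 0.2},
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.5, "C": 0.2, "G": 0.2, "T": 0.1},
    {"A": 0.2, "C": 0.2, "G": 0.1, "T": 0.5},
]
r = {"A": 0.4, "C": 0.3, "G": 0.2, "T": 0.2}

def log_odds_score(site: str) -> float:
    """Log-odds that a 5-mer is a binding site rather than background."""
    return sum(math.log(q[i][b] / r[b]) for i, b in enumerate(site))

# Slide a window along a toy sequence and report the best-scoring 5-mer.
seq = "TTACATCATAAC"
scores = {seq[i:i + 5]: log_odds_score(seq[i:i + 5]) for i in range(len(seq) - 4)}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```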

Isochores
[Figure: G. Bernardi, Gene 241 (2000), Fig. 6: ideogram of human chromosomes at 850-band resolution (Francke, 1994), showing the H3+ bands in red (from Saccone et al., 1999); black and grey bands are G (Giemsa) bands, white and red bands are R (Reverse) bands.]
Isochores:
- alternating GC-rich and GC-poor regions of several kb
- due to biased gene conversion, dependent on the local recombination rate

Examples of Hidden Markov models: isochores

Emission probabilities:
state P: A 0.40, C 0.06, G 0.09, T 0.45
state R: A 0.25, C 0.25, G 0.30, T 0.20

Transitions: P -> R with probability δ (P -> P with 1 − δ); R -> P with probability ε (R -> R with 1 − ε).

observed sequence (S): ACTGCGA
hidden sequence (H):   PPPRRRP

p(S, H) = π_P e_P(A) (1−δ) e_P(C) (1−δ) e_P(T) δ e_R(G) (1−ε) e_R(C) (1−ε) e_R(G) ε e_P(A)
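The joint probability is just a product of one initial term and alternating transition and emission terms. A small sketch that evaluates it numerically, assuming illustrative values for δ and ε and a uniform initial distribution π (the slide leaves them symbolic):

```python
# Isochore HMM from the slide: states P (GC-poor) and R (GC-rich).
# delta = P->R switch probability, eps = R->P; the numeric values of
# delta, eps and pi are assumptions, the slide keeps them symbolic.
e = {"P": {"A": 0.40, "C": 0.06, "G": 0.09, "T": 0.45},
     "R": {"A": 0.25, "C": 0.25, "G": 0.30, "T": 0.20}}
delta, eps = 0.01, 0.01
q = {("P", "P"): 1 - delta, ("P", "R"): delta,
     ("R", "R"): 1 - eps,   ("R", "P"): eps}
pi = {"P": 0.5, "R": 0.5}  # assumed uniform initial distribution

def joint_prob(obs: str, hidden: str) -> float:
    """p(S, H) = pi(h_1) e_{h_1}(x_1) * prod_t q_{h_t h_{t+1}} e_{h_{t+1}}(x_{t+1})."""
    p = pi[hidden[0]] * e[hidden[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= q[(hidden[t - 1], hidden[t])] * e[hidden[t]][obs[t]]
    return p

print(joint_prob("ACTGCGA", "PPPRRRP"))  # the example from the slide
```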

Isochores (continued): track lengths follow geometric distributions; the mean length of P-tracks is 1/δ, and the mean length of R-tracks is 1/ε.

Isochores with begin and end states

Emission probabilities:
state P: A 0.40, C 0.06, G 0.09, T 0.45
state R: A 0.25, C 0.25, G 0.30, T 0.20

Transition probabilities:
      B     P     R     E
B   0.00  0.49  0.49  0.02
P   0.00  0.98  0.01  0.01
R   0.00  0.01  0.98  0.01
E   0.00  0.00  0.00  0.00

Secondary structure
- 3 hidden states: L (loop), H (helix), E (strand)
- in each state, 20 emission probabilities (one per amino acid)

Secondary structure

Transition probabilities between hidden states, q = (q_kl), k,l = L,H,E:
      L     H     E
L   0.85  0.07  0.08
H   0.05  0.94  0.01
E   0.06  0.01  0.93

Emission probabilities, e = (e_k(x)), k = L,H,E, x = A,C,...,Y:
      A     C   ...   Y
L   0.05  0.07  ...  0.08
H   0.03  0.03  ...  0.12
E   0.10  0.11  ...  0.03

Secondary structure
[figure: protein sequence with its secondary-structure annotation]
p(S, H) = π_L e_L(G) q_LH e_H(L) q_HH e_H(L) q_HH e_H(E) q_HH e_H(L) ...

Enhancers
[state diagram: background states O and E, with switching controlled by δ and ε; chains of states A_1 ... A_4 and B_1 ... B_3 for the binding sites of the two factors]

ATGCAAACATCCCGACACTAGAACCATCAGCAGGACACTAGCA
OOOOEEEAAAAEEEBBBEEBBBEEEEAAAAEEEEOOOOOOOOO

- states O and E emit bases according to background probabilities
- states A_i and B_i emit bases according to the position-specific scoring matrices of transcription factors 1 and 2

Enhancers (same model and example as above)
- δ < ε: tracks of E are shorter than tracks of O
- a way of modelling the clustering of transcription-factor binding sites

Formal definition
A hidden Markov model M is characterized by:
- an alphabet of hidden states s_1, s_2, ..., s_K
- an alphabet of observable symbols v_1, v_2, ..., v_P
- initial probabilities over hidden states: π (vector of dimension K)
- transition probabilities between hidden states: q_kl (K × K matrix)
- emission probabilities: e_k(v_p) (K vectors of dimension P)

A realization consists of 2 parallel series of random variables:
- the hidden state path h = (h_t), t = 1..T
- the observed sequence of emitted symbols x = (x_t), t = 1..T

Probability of the joint (hidden and observed) sequence:
p(x, h | M) = π(h_1) e_{h_1}(x_1) ∏_{t=1}^{T−1} q_{h_t h_{t+1}} e_{h_{t+1}}(x_{t+1})
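Since the model is generative, a realization can be drawn by simulating the two series forward in time. A sketch with a toy two-state, three-symbol model (all numbers are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# A small HMM in the lecture's notation: K hidden states, P symbols,
# initial pi, transitions q (K x K), emissions e (K x P).
pi = np.array([0.5, 0.5])
q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def sample_realization(T: int):
    """Draw a hidden path h and an observed sequence x of length T."""
    h = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    h[0] = rng.choice(len(pi), p=pi)
    x[0] = rng.choice(e.shape[1], p=e[h[0]])
    for t in range(1, T):
        h[t] = rng.choice(q.shape[0], p=q[h[t - 1]])  # transition
        x[t] = rng.choice(e.shape[1], p=e[h[t]])      # emission
    return h, x

h, x = sample_realization(10)
print("hidden:  ", h)
print("observed:", x)
```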

HMM with begin and end states
- by convention, hidden state 0 stands for the begin and end state
- the begin/end state does not emit symbols
- for k = 1..K: q_0k is the probability of starting the chain in state k (q_0k = π_k); q_k0 is the probability of ending the chain after being in state k

Joint probability:
p(x, h | M) = q_{0 h_1} e_{h_1}(x_1) [∏_{t=1}^{T−1} q_{h_t h_{t+1}} e_{h_{t+1}}(x_{t+1})] q_{h_T 0}

The main algorithmic problems of HMMs

Paths: given an observed sequence x,
- find the most likely hidden path
- integrate the probability over all paths
- sample paths according to their probability

Learning HMMs:
- estimating the parameters (transition and emission probabilities)
- choosing the structure (how many hidden states, etc.)

Two types of learning:
- supervised: given annotated examples ((x, h) pairs)
- unsupervised: given non-annotated examples (x only)

Most likely path: Viterbi algorithm
- h = (h_t), t = 1..T: the sequence of hidden states (hidden path)
- x = (x_t), t = 1..T: the observed sequence of emitted symbols
- p(x, h | M): the joint probability

Question: given x = (x_t), t = 1..T, find ĥ = (ĥ_t), t = 1..T such that
ĥ = argmax_h p(x, h | M)
- the maximum is over K^T possible paths
- it cannot be computed by brute-force enumeration
- but an exhaustive search is possible using dynamic programming

Viterbi algorithm: example
Given an HMM (with emission and transition probabilities specified) and a protein sequence, find the most likely secondary-structure annotation.

Viterbi recursion

Definition:
v_k(t): probability of the most probable path ending in state k at time t,
v_k(t) = max_{h_1,...,h_{t−1}} p(h_1, ..., h_{t−1}, h_t = k, x_1, ..., x_t | M)
ptr_k(t): hidden state at time t−1 on this most probable path

Recursion:
v_l(t+1) = max_k v_k(t) q_kl e_l(x_{t+1})

Viterbi recursion
[figure: dynamic-programming grid over the sequence M V K R L ..., one row per hidden state (1, 2, 3), holding v_k(t) and ptr_k(t)]
Definition: v_k(t) is the probability of the most probable path ending in state k at time t; ptr_k(t) is the hidden state at time t−1 on this path.

Viterbi recursion
One candidate for v_1(t+1), coming from state 1: v_1(t) q_11 e_1(L)

Viterbi recursion
Another candidate, coming from state 2: v_2(t) q_21 e_1(L)

Viterbi recursion
And the candidate coming from state 3: v_3(t) q_31 e_1(L)

Viterbi recursion
Take the best candidate: v_1(t+1) = max_{k=1,2,3} v_k(t) q_k1 e_1(x_{t+1})

Viterbi recursion
Record the pointer: ptr_1(t+1) = argmax_{k=1,2,3} v_k(t) q_k1 e_1(x_{t+1})

Viterbi recursion
The same update is applied for every state:
v_l(t+1) = max_k v_k(t) q_kl e_l(x_{t+1})

Viterbi algorithm

initialization: v_0(0) = 1; v_k(0) = 0 for k > 0
recursion, for t = 0 ... T−1:
  v_l(t+1) = e_l(x_{t+1}) max_k v_k(t) q_kl
  ptr_l(t+1) = argmax_k v_k(t) q_kl
termination:
  p(x, ĥ | M) = max_k v_k(T)
  ĥ_T = argmax_k v_k(T)
traceback, for t = T−1 ... 1:
  ĥ_t = ptr_{ĥ_{t+1}}(t+1)
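A sketch of this algorithm on a toy model, working in log space so that long sequences do not underflow (the model numbers are illustrative assumptions):

```python
import numpy as np

# Toy model (illustrative): pi, q, e as in the lecture's notation.
pi = np.array([0.5, 0.5])
q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def viterbi(x):
    T, K = len(x), len(pi)
    logv = np.full((T, K), -np.inf)
    ptr = np.zeros((T, K), dtype=int)
    logv[0] = np.log(pi) + np.log(e[:, x[0]])
    for t in range(1, T):
        # cand[k, l] = log v_k(t-1) + log q_kl
        cand = logv[t - 1][:, None] + np.log(q)
        ptr[t] = cand.argmax(axis=0)
        logv[t] = cand.max(axis=0) + np.log(e[:, x[t]])
    # termination and traceback
    path = np.zeros(T, dtype=int)
    path[-1] = logv[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = ptr[t + 1][path[t + 1]]
    return path, logv[-1].max()

path, logp = viterbi([0, 0, 2, 2, 1, 0])
print(path, logp)  # most likely hidden path and its log joint probability
```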

Complexity of the Viterbi algorithm
- compute a product of probabilities at each time step t = 1..T, for each state k = 1..K at time t and each state l = 1..K at time t+1
- complexity: O(TK²)
- linear in T: very efficient, even for long genomic sequences
- quadratic in K: keep the model simple (a small number of hidden states)

Integrating over all paths. Baum-Welch algorithm
- h = (h_t), t = 1..T: the hidden path
- x = (x_t), t = 1..T: the observed sequence of emitted symbols
- p(x, h | M): the joint probability

Question: given x = (x_t), t = 1..T, compute
p(x | M) = Σ_h p(x, h | M)
- the sum is over K^T possible paths
- again, the exhaustive sum is possible using dynamic programming (the forward and backward algorithms)


Forward algorithm
[figure: dynamic-programming grid over the sequence M V K R L, states 1-3, holding f_k(t)]
Definition: f_k(t) = p(x_1, ..., x_t, h_t = k)

Forward algorithm
Recursion: f_l(t+1) = e_l(x_{t+1}) Σ_k f_k(t) q_kl


Forward algorithm
initialization: f_0(0) = 1; f_k(0) = 0 for k > 0
recursion, for t = 0 ... T−1: f_l(t+1) = e_l(x_{t+1}) Σ_k f_k(t) q_kl
termination: p(x | M) = Σ_k f_k(T)
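A sketch of the forward recursion on the same toy model, with per-step rescaling so that log p(x | M) stays computable for long sequences (numbers are illustrative assumptions):

```python
import numpy as np

# Toy model (illustrative); state 0 here is an ordinary state, not the
# non-emitting begin state of the slides.
pi = np.array([0.5, 0.5])
q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def forward(x):
    T, K = len(x), len(pi)
    f = np.zeros((T, K))
    scale = np.zeros(T)
    f[0] = pi * e[:, x[0]]
    scale[0] = f[0].sum(); f[0] /= scale[0]
    for t in range(1, T):
        f[t] = e[:, x[t]] * (f[t - 1] @ q)   # f_l(t) = e_l(x_t) sum_k f_k(t-1) q_kl
        scale[t] = f[t].sum(); f[t] /= scale[t]  # rescale to avoid underflow
    return f, np.log(scale).sum()            # log p(x | M)

f, logpx = forward([0, 0, 2, 2, 1, 0])
print(logpx)
```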

Backward algorithm
[figure: dynamic-programming grid over the sequence D F Y M A, states 1-3, holding b_k(t)]
Definition: b_k(t) = p(x_{t+1}, ..., x_T | h_t = k)

Backward algorithm
Recursion: b_k(t−1) = Σ_l q_kl e_l(x_t) b_l(t)


Backward algorithm
initialization: b_k(T) = q_k0
recursion, for t = T ... 2: b_k(t−1) = Σ_l q_kl e_l(x_t) b_l(t)
termination: p(x | M) = Σ_l q_0l e_l(x_1) b_l(1)

Marginal posterior probabilities

Question: given x = (x_t), t = 1..T, compute the pointwise marginal probability
p(h_t = k | x, M) = (1 / p(x | M)) Σ_{h : h_t = k} p(x, h | M)

The sum factorizes:
Σ_{h : h_t = k} p(x, h) = p(x_1, ..., x_t, h_t = k) · p(x_{t+1}, ..., x_T | h_t = k) = f_k(t) b_k(t)

so that
p(h_t = k | x) = f_k(t) b_k(t) / p(x)
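A sketch combining the forward and backward recursions to get these marginals, on a short toy example where unscaled probabilities are safe (model numbers are illustrative; no end state, so b_k(T) = 1). The last line already performs the posterior decoding of the next slide:

```python
import numpy as np

# Toy model (illustrative assumptions).
pi = np.array([0.5, 0.5])
q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def forward_backward(x):
    T, K = len(x), len(pi)
    f = np.zeros((T, K))
    b = np.ones((T, K))                           # b_k(T) = 1: no end state
    f[0] = pi * e[:, x[0]]
    for t in range(1, T):
        f[t] = e[:, x[t]] * (f[t - 1] @ q)        # forward recursion
    for t in range(T - 2, -1, -1):
        b[t] = q @ (e[:, x[t + 1]] * b[t + 1])    # backward recursion
    px = f[-1].sum()                              # p(x | M)
    return f * b / px, px                         # posterior marginals, p(x)

post, px = forward_backward([0, 0, 2, 2, 1, 0])
print(np.round(post, 3))        # each row sums to 1
print(post.argmax(axis=1))      # posterior decoding: argmax_k p(h_t = k | x)
```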

Posterior decoding
- for each t = 1..T, find the h_t that maximizes p(h_t | x)
- different from the Viterbi path
- does not even necessarily represent a valid path
- more stable than Viterbi

Sampling paths
Question: given x = (x_t), t = 1..T, sample a path h = (h_t), t = 1..T probabilistically, h ∼ p(h | x, M)
- the Viterbi path is the most likely outcome (by definition)
- but suboptimal paths may matter, in particular if the total probability is spread over many paths
- sampling and averaging gives an idea of the uncertainty about the path

Sampling paths
- could in principle be done by Gibbs sampling
- but a more efficient dynamic-programming method exists

Sampling the first state
By Bayes' theorem:
p(h_1 = k) = q_0k
p(x_1, ..., x_T | h_1 = k) = e_k(x_1) b_k(1)
p_1k = p(h_1 = k | x_1, ..., x_T) ∝ q_0k e_k(x_1) b_k(1)

Compute (p_1k), k = 1..K, and sample h_1 at random: h_1 ∼ (p_1k)_k


Sampling the next state
Suppose we have sampled h_1 ... h_t. Then:
p(h_{t+1} = l | h_t = k) = q_kl
p(x_{t+1}, ..., x_T | h_{t+1} = l) = e_l(x_{t+1}) b_l(t+1)
p_{t+1,l} = p(h_{t+1} = l | x, h_1 ... h_t) ∝ q_kl e_l(x_{t+1}) b_l(t+1), with k = h_t

Compute (p_{t+1,l}), l = 1..K, and sample h_{t+1} at random: h_{t+1} ∼ (p_{t+1,l})_l


... and so on, until the entire path has been sampled.
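A sketch of this sampling scheme: compute the backward variables once, then draw h_1, h_2, ... forward, reweighting by b at each step (toy model; all numbers are illustrative assumptions, and there is no end state, so b_k(T) = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (illustrative). Sampling h ~ p(h | x, M) as on the slides:
# p(h_1 = k | x)       ∝ pi_k e_k(x_1) b_k(1)
# p(h_{t+1}=l | h_t=k) ∝ q_kl e_l(x_{t+1}) b_l(t+1)
pi = np.array([0.5, 0.5])
q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
e = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])

def sample_path(x):
    T, K = len(x), len(pi)
    b = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        b[t] = q @ (e[:, x[t + 1]] * b[t + 1])   # backward recursion
    h = np.empty(T, dtype=int)
    w = pi * e[:, x[0]] * b[0]
    h[0] = rng.choice(K, p=w / w.sum())
    for t in range(T - 1):
        w = q[h[t]] * e[:, x[t + 1]] * b[t + 1]
        h[t + 1] = rng.choice(K, p=w / w.sum())
    return h

x = [0, 0, 2, 2, 1, 0]
samples = np.array([sample_path(x) for _ in range(1000)])
# per-position frequency of state 1 approximates the posterior p(h_t = 1 | x)
print(samples.mean(axis=0))
```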

Learning HMM parameters

Question: given a training database of annotated (x, h) pairs, learn the parameters of the HMM. Just compute:
- N_kl: total number of transitions from k to l in the database
- N_k: total number of transitions from k, N_k = Σ_l N_kl
- M_kp: total number of emissions of symbol p when in state k
- M_k: total number of emissions in state k, M_k = Σ_p M_kp

Maximum-likelihood (ML) estimate:
q_kl = N_kl / N_k,   e_k(v_p) = M_kp / M_k

Learning HMM parameters (same counts as above)

Bayes estimate (add pseudocounts):
q_kl = (N_kl + n_l) / (N_k + n),   e_k(v_p) = (M_kp + m_p) / (M_k + m)
with n = Σ_l n_l and m = Σ_p m_p.
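A sketch of the supervised counts and both estimates, on made-up annotated pairs over the isochore alphabet (the training data are invented, and the pseudocounts are taken uniform for simplicity, so n = n_l · K and m = m_p · P):

```python
from collections import Counter

# Made-up annotated (x, h) training pairs, states P/R, symbols A/C/G/T.
data = [("ACTGCGA", "PPPRRRP"), ("GGCATTA", "RRRPPPP")]
states, symbols = "PR", "ACGT"

N = Counter()  # N[(k, l)] = number of transitions k -> l
M = Counter()  # M[(k, p)] = number of emissions of symbol p in state k
for x, h in data:
    for t in range(len(x)):
        M[(h[t], x[t])] += 1
        if t + 1 < len(x):
            N[(h[t], h[t + 1])] += 1

n, m = 1.0, 1.0  # uniform pseudocounts (Bayes estimate); set to 0 for plain ML
q = {k: {l: (N[(k, l)] + n) / (sum(N[(k, l2)] for l2 in states) + n * len(states))
         for l in states} for k in states}
e = {k: {p: (M[(k, p)] + m) / (sum(M[(k, p2)] for p2 in symbols) + m * len(symbols))
         for p in symbols} for k in states}
print(q["P"])
print(e["R"])
```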

Unsupervised training

Viterbi training (approximate):
- start from rough parameter estimates
- use the Viterbi algorithm to infer paths
- compute N_kl and M_kp on the Viterbi paths
- re-estimate the parameters based on N_kl and M_kp
- iterate

Unsupervised training

Baum-Welch training (exact: EM):
- start from rough parameter estimates
- use the Baum-Welch (forward-backward) algorithm to compute the expectations of N_kl and M_kp over all paths
- re-estimate the parameters based on those expected counts
- iterate
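A compact sketch of the approximate variant (Viterbi training): alternate Viterbi paths with count-based re-estimation. Replacing the path step by forward-backward expected counts would give Baum-Welch. The random initialization and toy data are assumptions, and π is kept fixed at uniform for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)

def viterbi(x, pi, q, e):
    """Most likely path under the current parameters (log space)."""
    logv = np.log(pi) + np.log(e[:, x[0]])
    ptrs = []
    for t in range(1, len(x)):
        cand = logv[:, None] + np.log(q)
        ptrs.append(cand.argmax(axis=0))
        logv = cand.max(axis=0) + np.log(e[:, x[t]])
    path = [int(logv.argmax())]
    for ptr in reversed(ptrs):
        path.append(int(ptr[path[-1]]))
    return path[::-1]

def viterbi_training(seqs, K, P, n_iter=20, pseudo=1.0):
    """E step: Viterbi paths; M step: count-based re-estimation (with pseudocounts)."""
    pi = np.full(K, 1 / K)                     # kept fixed for brevity
    q = rng.dirichlet(np.ones(K), size=K)      # random start
    e = rng.dirichlet(np.ones(P), size=K)
    for _ in range(n_iter):
        N = np.full((K, K), pseudo)            # transition counts
        M = np.full((K, P), pseudo)            # emission counts
        for x in seqs:
            h = viterbi(x, pi, q, e)
            for t in range(len(x)):
                M[h[t], x[t]] += 1
                if t + 1 < len(x):
                    N[h[t], h[t + 1]] += 1
        q = N / N.sum(axis=1, keepdims=True)
        e = M / M.sum(axis=1, keepdims=True)
    return q, e

seqs = [[0, 0, 0, 2, 2, 1, 2, 0, 0], [2, 2, 1, 2, 0, 0, 0, 0, 2]]
q, e = viterbi_training(seqs, K=2, P=3)
print(np.round(q, 2))
print(np.round(e, 2))
```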

Examples: predicting secondary structure
[figure] Martin et al., 2006

Examples: modelling protein architecture
Handcrafted HMMs. [figure] Stultz et al., 1993

Examples: automatic annotation of gene structure
[figure] Burge and Karlin, 1997

Examples: automatic annotation of gene structure
[figure]

Using HMMs for aligning sequences
[Figure 5 from Bradley et al.: the Java GUI allows users to visualize the estimated alignment accuracy under FSA's statistical model. FSA's alignment is colored according to the expected accuracy under FSA's statistical model (top) as well as according to the true accuracy (bottom), given by a comparison between FSA's alignment and the reference structural alignment.]

Pair HMM for pairwise alignment

Three hidden states:
- M: emits an aligned pair of symbols (one from X, one from Y)
- X: emits a symbol from sequence X only (gap in Y)
- Y: emits a symbol from sequence Y only (gap in X)

Transitions: M → M with probability 1 − 2δ; M → X and M → Y with probability δ each (gap opening); X → X and Y → Y with probability ε (gap extension); X → M and Y → M with probability 1 − ε.

sequence X:       HEAG--AW
sequence Y:       P-AGGHA-
hidden sequence:  MXMMYYMX
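A sketch that scores the slide's example alignment (one fixed hidden path) under such a pair HMM. The gap parameters δ and ε play the roles described above; the numeric values, the pair and gap emission models, and the start-in-M convention are crude assumptions for illustration:

```python
import math

# Toy pair HMM: delta = gap open, eps = gap extend (values assumed).
delta, eps = 0.05, 0.3

def e_match(a, b):
    """Pair emission in state M: crude identity-based toy model."""
    return 0.03 if a == b else 0.4 / 380

def e_gap(_):
    """Single-residue emission in states X and Y: uniform over 20 amino acids."""
    return 1 / 20

trans = {("M", "M"): 1 - 2 * delta, ("M", "X"): delta, ("M", "Y"): delta,
         ("X", "X"): eps, ("X", "M"): 1 - eps,
         ("Y", "Y"): eps, ("Y", "M"): 1 - eps}

def alignment_logprob(row_x, row_y, path):
    """Log joint probability of one alignment; start distribution omitted
    (the path is assumed to begin in its first state)."""
    logp, prev = 0.0, None
    for cx, cy, s in zip(row_x, row_y, path):
        if prev is not None:
            logp += math.log(trans[(prev, s)])
        logp += math.log(e_match(cx, cy) if s == "M"
                         else e_gap(cx if s == "X" else cy))
        prev = s
    return logp

print(alignment_logprob("HEAG--AW", "P-AGGHA-", "MXMMYYMX"))
```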

Pair HMM for pairwise alignment, with begin (B) and end (E) states
[state diagram: B → {M, X, Y} → E]

sequence X:       HEAG--AW
sequence Y:       P-AGGHA-
hidden sequence:  MXMMYYMX

Assessing homology between 2 sequences

Model for homologous sequences (M_1): the pair HMM above
HEAG--AW
P-AGGHA-

Model for non-homologous sequences (M_0): the two sequences are emitted independently
HEAGAW------
------PAGGHA

L = p(s_1, s_2 | M_1) / p(s_1, s_2 | M_0)
- L > 1: homologous
- L < 1: not homologous (but needs calibration)

Posterior decoding (FSA)
- for each pair of positions, compute the posterior probability of being aligned (i.e., jointly emitted by a match state)
- put in the same column the positions that have a high match probability
[Figure 5: the Java GUI visualizes the estimated alignment accuracy under FSA's statistical model; estimated accuracies correspond closely to the true accuracies. Sequences are from alignment BBS12030 in the RV12 dataset of BAliBASE 3.]
Bradley et al., PLoS Comput Biol, 2009

Multiple alignment
- multiple HMM: the generalization of the pair HMM to P > 2 sequences is possible; however, the number of hidden states is exponential in P
- pair HMM: merge pairwise alignments obtained with a pair HMM, as in the previous slide (posterior decoding)
- profile HMM

Multiple alignment: profile HMMs
[figure: multiple alignment with secondary-structure annotation]
- gaps are not uniformly distributed along the sequences
- conserved blocks alternate with regions of more variable length
- this corresponds to structured / loop regions of the conformation

Profile HMM
[figure]

Profile HMM, trained on a seed alignment:
FQDN-R
FK-N-R
FK-E-R
FK-EVR
...
012.3.45

Profile HMM (same seed alignment as above), applied to a new sequence:
AEFWQ-RAI
...
0ii1i2d4ii5
(i = insert, d = delete, digits = match states)

Profile HMM: overall procedure

For each protein family:
- make a seed alignment (based on the superposition of 3D structures)
- train a profile HMM on the seed alignment
- homologues should have a higher p(s | M): summing over paths (Baum-Welch / forward machinery), search for homologues in databases
- using Viterbi (or posterior decoding), align homologues with the seed alignment

Profile HMM
- seed alignments are often small
- using a mixture of priors to model match-state emission probabilities (see the lab session) improves the detection of distant homologues (Brown et al., 1993)

Example: the globin family
[figure]

Example: the globin family
[figure] Krogh et al., 1994