Multiple Sequence Alignment using Profile HMM


Multiple Sequence Alignment using Profile HMM. Based on Chapter 5 and Section 6.5 of Biological Sequence Analysis by R. Durbin et al., 1998. Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi, Diana Popovici.

PLAN
1. From profiles to Profile HMMs
2. Setting the parameters of a profile HMM; the optimal (MAP) model construction
3. Basic algorithms for profile HMMs
4. Profile HMM training from unaligned sequences: getting the model and the multiple alignment simultaneously
5. Profile HMM variants for non-global alignments
6. Weighting the training sequences

1 From profiles to Profile HMMs

Problem: Given a multiple alignment (obtained either manually or using one of the methods presented in Ch. 6 and Ch. 7), and the profile associated to a set of marked (X = match) columns, design an HMM that would perform sequence alignments to that profile.

Example (match columns marked X, numbered 1, 2, 3):

            X  X           X
            1  2  .  .  .  3
    bat     A  G  -  -  -  C
    rat     A  -  A  G  -  C
    cat     A  G  -  A  A  -
    gnat    -  -  A  A  A  C
    goat    A  G  -  -  -  C

Profile of the marked columns (relative frequencies):

            1     2     3
    A      4/5    -     -
    C       -     -    4/5
    G       -    3/5    -
    T       -     -     -
    -      1/5   2/5   1/5
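The column frequencies above can be checked mechanically. The following is a small illustrative sketch (not part of the original slides); the alignment dictionary and the column_profile helper are names made up for this example.

```python
from collections import Counter

# Toy alignment from the slide; the marked (match) columns are 1, 2 and 6 (0-based: 0, 1, 5).
alignment = {
    "bat":  "AG---C",
    "rat":  "A-AG-C",
    "cat":  "AG-AA-",
    "gnat": "--AAAC",
    "goat": "AG---C",
}
marked = [0, 1, 5]

def column_profile(seqs, columns):
    """Relative frequency of each symbol (residues and '-') in each marked column."""
    n = len(seqs)
    return [{sym: cnt / n for sym, cnt in Counter(s[c] for s in seqs).items()}
            for c in columns]

for label, freqs in zip((1, 2, 3), column_profile(list(alignment.values()), marked)):
    print(label, freqs)
# 1 {'A': 0.8, '-': 0.2}
# 2 {'G': 0.6, '-': 0.4}
# 3 {'C': 0.8, '-': 0.2}
```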

Building up a solution

At first sight, not taking gaps into account: [figure: a linear chain of match states, Begin -> M_1 -> ... -> M_L -> End].

What about insert residues? [figure: the same chain with insert states I_j added, each with a self-loop, between consecutive match states].

What about gaps?

[Figure: the Begin -> M_j -> End chain with extra transitions that skip over match states.]

A better treatment of gaps: [figure: the chain with silent delete states D_j added above the match states].

Could we put it all together?

[Figure: the full transition structure of a profile HMM, with match states M_j, insert states I_j, and delete states D_j between Begin and End.]

Does it work?

For the toy alignment above (match columns 1, 2, 6):

[Figure: the corresponding profile HMM with Begin = M_0, match states M_1, M_2, M_3, End = M_4, insert states I_0 ... I_3, and delete states D_1 ... D_3, onto which the aligned sequences map.]

Any resemblance to pair HMMs?

[Figure: the pair HMM for global alignment, with states X (emissions q_{x_i}), M (emissions p_{x_i y_j}) and Y (emissions q_{y_j}), Begin and End states, and transition probabilities δ, ε, τ.]

Doesn't seem so...

However, remember...

An example of the state assignments for global alignment using the affine gap model: [figure: the two sequences V L S P A D K and H L A E S K, with each aligned column labelled by the state (M, I_x or I_y) that generates it].

When making the extension to multiple sequence alignment:
- think of generating only one string (instead of a pair of strings);
- use I_x for inserting residues, and I_y to produce a gap;
- use one triplet of states (M, I_x, I_y) for each column in the alignment;
- finally, define (appropriate) edges in the resulting FSA.

Consequence

It shouldn't be difficult to re-write the basic HMM algorithms for profile HMMs! One moment, though...

2 Setting the parameters of a Profile HMM

2.1 Using Maximum Likelihood Estimates (MLE) for transition and emission probabilities

For instance, assuming a given multiple alignment with match states marked (X), the emission probabilities are computed as

    e_{M_j}(a) = c_{ja} / \sum_{a'} c_{ja'}

where c_{ja} is the observed count of residue a in column j of the multiple alignment.

Counts for our example

For the toy alignment (match columns 1, 2 and 6; the unmarked columns 3-5 are assigned to insert state I_2; Begin counts as M_0 and End as M_4):

    match emissions          insert emissions (I_2)
          1   2   3
    A     4   -   -          A  6
    C     -   -   4          C  -
    G     -   3   -          G  1
    T     -   -   -          T  -

    state transitions
              0   1   2   3
    M -> M    4   3   2   4
    M -> D    1   1   0   -
    M -> I    0   0   1   0
    I -> M    0   0   2   0
    I -> D    0   0   1   -
    I -> I    0   0   4   0
    D -> M    -   0   0   1
    D -> D    -   1   0   -
    D -> I    -   0   2   0

What about zero counts, i.e. unseen emissions/transitions?

One solution: use pseudocounts (generalising Laplace's law):

    e_{M_j}(a) = (c_{ja} + A q_a) / (\sum_{a'} c_{ja'} + A)

where A is a weight put on the pseudocounts relative to the real counts c_{ja}, and q_a is the frequency with which residue a appears in a random model. A = 20 works well for protein alignments.

Note: at the intuitive level, using pseudocounts makes a lot of sense:
- e_{M_j}(a) is approximately equal to q_a if very little data is available, i.e. all real counts are very small compared to A;
- when a large amount of data is available, e_{M_j}(a) is essentially equal to the maximum likelihood solution.

For other solutions (e.g. Dirichlet mixtures, substitution matrix mixtures, estimation based on an ancestor), see Durbin et al., 1998, Section 6.5.
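As a concrete illustration of the pseudocount formula, here is a minimal sketch; the function name, the argument layout, and the A value used in the DNA example are mine, not from the slides.

```python
def match_emissions(counts, q, A=20.0):
    """Pseudocount-smoothed emission probabilities e_Mj(a) for one match column.

    counts : dict residue -> observed count c_ja in column j
    q      : dict residue -> background (random-model) frequency q_a
    A      : total pseudocount weight
    """
    total = sum(counts.get(a, 0) for a in q) + A
    return {a: (counts.get(a, 0) + A * q[a]) / total for a in q}

# Column 1 of the toy DNA alignment: four A's observed; uniform background, A = 2.
q_dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(match_emissions({"A": 4}, q_dna, A=2.0))
# {'A': 0.75, 'C': 0.0833..., 'G': 0.0833..., 'T': 0.0833...}
```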

2.2 Setting the L parameter

The process of model construction is a way to decide which columns of the alignment should be assigned to match states and which to insert states. There are 2^L combinations of markings for an alignment of L columns, and hence 2^L different profile HMMs to choose from.
- In a marked column, symbols are assigned to match states and gaps are assigned to delete states.
- In an unmarked column, symbols are assigned to insert states and gaps are ignored.
There are at least three ways to determine the marking:
- manual construction: the user marks alignment columns by hand;
- heuristic construction: e.g. a column might be marked when the proportion of gap symbols in it is below a certain threshold;
- Maximum A Posteriori (MAP) model construction: next slides.

The MAP (maximum a posteriori) model construction

Objective: we search for the model µ that maximises the likelihood of the given data, namely

    argmax_µ P(C | µ)

where C is the set of aligned sequences. Note: the sequences in C are assumed to be statistically independent.

Idea: the MAP model construction algorithm recursively calculates S_j, the log probability of the optimal model for the alignment up to and including column j, assuming that column j is marked. More specifically, S_j is calculated from smaller subalignments ending at a marked column i, by incrementing S_i with the summed log probability of the transitions and emissions for the columns between i and j.

MAP model construction algorithm: notations

- c_{xy}: the observed state transition counts;
- a_{xy}: the transition probabilities, estimated from the c_{xy} in the usual (MLE) fashion:

      a_{xy} = c_{xy} / \sum_{y'} c_{xy'}

- S_j: the log probability of the optimal model for the alignment up to and including column j, assuming that column j is marked;
- T_{ij}: the summed log probability of all the state transitions between marked columns i and j:

      T_{ij} = \sum_{x,y \in \{M,D,I\}} c_{xy} \log a_{xy}

- M_j: the log probability contribution of the match state symbol emissions in column j;
- L_{i,j}: the log probability contribution of the insert state symbol emissions for columns i, ..., j (equal to 0 when the range is empty), so that L_{i+1,j-1} covers the columns strictly between marked columns i and j.

The MAP model construction algorithm

Initialization:
    S_0 = 0, M_{L+1} = 0
Recurrence: for j = 1, ..., L+1:
    S_j = max_{0 <= i < j} [ S_i + T_{ij} + M_j + L_{i+1,j-1} + λ ]
    σ_j = argmax_{0 <= i < j} [ S_i + T_{ij} + M_j + L_{i+1,j-1} + λ ]
Traceback: from j = σ_{L+1}, while j > 0: mark column j as a match column; set j = σ_j.
Complexity: O(L) in memory and O(L^2) in time for an alignment of L columns... with some care in implementation!
Note: λ is a penalty used to favour models with fewer match states. In Bayesian terms, λ is the log of the prior probability of marking each column. It implies a simple but adequate exponentially decreasing prior distribution over model lengths.
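A straightforward sketch of this recurrence, assuming the score tables T, M and L have already been computed; the function and argument names are illustrative, and this is the plain O(L^2)-memory version rather than the careful implementation the note alludes to.

```python
def map_mark_columns(T, M, Lins, lam):
    """MAP model construction: choose which of columns 1..L become match columns.

    T[i][j]    : summed log prob of state transitions between marked columns i and j
    M[j]       : log prob of match emissions in column j (with M[L+1] = 0)
    Lins[i][j] : log prob of insert emissions for columns i..j (0 when j < i)
    lam        : log prior probability of marking a column (the penalty lambda)
    """
    L = len(M) - 2                        # sentinels: column 0 (Begin) and L+1 (End)
    S = [float("-inf")] * (L + 2)
    sigma = [0] * (L + 2)
    S[0] = 0.0
    for j in range(1, L + 2):
        best, best_i = float("-inf"), 0
        for i in range(j):
            cand = S[i] + T[i][j] + M[j] + Lins[i + 1][j - 1] + lam
            if cand > best:
                best, best_i = cand, i
        S[j], sigma[j] = best, best_i
    marked, j = [], sigma[L + 1]          # traceback through the argmax pointers
    while j > 0:
        marked.append(j)
        j = sigma[j]
    return sorted(marked)
```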

3 Basic algorithms for Profile HMMs

Notations:
- v_{M_j}(i): the probability of the best path matching the subsequence x_{1..i} to the (profile) submodel up to column j, ending with x_i being emitted by state M_j;
- v_{I_j}(i): the probability of the best path ending in x_i being emitted by I_j;
- v_{D_j}(i): the probability of the best path ending in D_j (x_i being the last character emitted before D_j);
- V_{M_j}(i), V_{I_j}(i), V_{D_j}(i): the log-odds scores corresponding respectively to v_{M_j}(i), v_{I_j}(i), v_{D_j}(i);
- f_{M_j}(i): the combined probability of all alignments up to x_i that end in state M_j, and similarly f_{I_j}(i), f_{D_j}(i);
- b_{M_j}(i), b_{I_j}(i), b_{D_j}(i): the corresponding backward probabilities.

The Viterbi algorithm for profile HMMs

Initialization: rename the Begin state as M_0 and set v_{M_0}(0) = 1; rename the End state as M_{L+1}.

Recursion:
    v_{M_j}(i) = e_{M_j}(x_i) · max{ v_{M_{j-1}}(i-1) a_{M_{j-1} M_j},  v_{I_{j-1}}(i-1) a_{I_{j-1} M_j},  v_{D_{j-1}}(i-1) a_{D_{j-1} M_j} }
    v_{I_j}(i) = e_{I_j}(x_i) · max{ v_{M_j}(i-1) a_{M_j I_j},  v_{I_j}(i-1) a_{I_j I_j},  v_{D_j}(i-1) a_{D_j I_j} }
    v_{D_j}(i) = max{ v_{M_{j-1}}(i) a_{M_{j-1} D_j},  v_{I_{j-1}}(i) a_{I_{j-1} D_j},  v_{D_{j-1}}(i) a_{D_{j-1} D_j} }

Termination: the final score is v_{M_{L+1}}(n), calculated using the top recursion relation.

The Viterbi algorithm for profile HMMs: log-odds version

Initialization: V_{M_0}(0) = 0 (the Begin state is M_0, and the End state is M_{L+1}).

Recursion:
    V_{M_j}(i) = log( e_{M_j}(x_i) / q_{x_i} ) + max{ V_{M_{j-1}}(i-1) + log a_{M_{j-1} M_j},  V_{I_{j-1}}(i-1) + log a_{I_{j-1} M_j},  V_{D_{j-1}}(i-1) + log a_{D_{j-1} M_j} }
    V_{I_j}(i) = log( e_{I_j}(x_i) / q_{x_i} ) + max{ V_{M_j}(i-1) + log a_{M_j I_j},  V_{I_j}(i-1) + log a_{I_j I_j},  V_{D_j}(i-1) + log a_{D_j I_j} }
    V_{D_j}(i) = max{ V_{M_{j-1}}(i) + log a_{M_{j-1} D_j},  V_{I_{j-1}}(i) + log a_{I_{j-1} D_j},  V_{D_{j-1}}(i) + log a_{D_{j-1} D_j} }

Termination: the final score is V_{M_{L+1}}(n), calculated using the top recursion relation.
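For concreteness, here is a log-space sketch of the Viterbi recursion above (using log probabilities rather than log-odds; for the log-odds version, subtract log q_{x_i} from the emission terms). The array layout and the log_t/log_eM/log_eI structures are assumptions of this sketch, not notation from the slides.

```python
import numpy as np

def profile_viterbi_score(x, log_t, log_eM, log_eI):
    """Best log probability of aligning sequence x to a profile HMM.

    log_t[j]  : dict mapping pairs like ('M','M'), ('M','I'), ('I','D'), ... to the
                log transition probability out of the column-j states (Begin = M_0)
    log_eM    : list of length L+2 (entries 0 and L+1 unused); log_eM[j][a] is the
                log emission probability of residue a at match state j
    log_eI    : list of length L+1; log_eI[j][a] for insert state j (j = 0..L)
    """
    NEG = float("-inf")
    n, L = len(x), len(log_eM) - 2
    vM = np.full((L + 1, n + 1), NEG)
    vI = np.full((L + 1, n + 1), NEG)
    vD = np.full((L + 1, n + 1), NEG)
    vM[0, 0] = 0.0                          # Begin state M_0
    for i in range(n + 1):
        for j in range(L + 1):
            if j >= 1 and i >= 1:           # match state M_j emits x_i
                vM[j, i] = log_eM[j][x[i - 1]] + max(
                    vM[j - 1, i - 1] + log_t[j - 1].get(('M', 'M'), NEG),
                    vI[j - 1, i - 1] + log_t[j - 1].get(('I', 'M'), NEG),
                    vD[j - 1, i - 1] + log_t[j - 1].get(('D', 'M'), NEG))
            if j >= 1:                      # delete state D_j is silent
                vD[j, i] = max(
                    vM[j - 1, i] + log_t[j - 1].get(('M', 'D'), NEG),
                    vI[j - 1, i] + log_t[j - 1].get(('I', 'D'), NEG),
                    vD[j - 1, i] + log_t[j - 1].get(('D', 'D'), NEG))
            if i >= 1:                      # insert state I_j emits x_i
                vI[j, i] = log_eI[j][x[i - 1]] + max(
                    vM[j, i - 1] + log_t[j].get(('M', 'I'), NEG),
                    vI[j, i - 1] + log_t[j].get(('I', 'I'), NEG),
                    vD[j, i - 1] + log_t[j].get(('D', 'I'), NEG))
    # Termination: enter End = M_{L+1} after the whole sequence has been emitted.
    return max(vM[L, n] + log_t[L].get(('M', 'M'), NEG),
               vI[L, n] + log_t[L].get(('I', 'M'), NEG),
               vD[L, n] + log_t[L].get(('D', 'M'), NEG))
```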

The Forward algorithm for profile HMMs

Initialization: f_{M_0}(0) = 1

Recursion:
    f_{M_j}(i) = e_{M_j}(x_i) [ f_{M_{j-1}}(i-1) a_{M_{j-1} M_j} + f_{I_{j-1}}(i-1) a_{I_{j-1} M_j} + f_{D_{j-1}}(i-1) a_{D_{j-1} M_j} ]
    f_{I_j}(i) = e_{I_j}(x_i) [ f_{M_j}(i-1) a_{M_j I_j} + f_{I_j}(i-1) a_{I_j I_j} + f_{D_j}(i-1) a_{D_j I_j} ]
    f_{D_j}(i) = f_{M_{j-1}}(i) a_{M_{j-1} D_j} + f_{I_{j-1}}(i) a_{I_{j-1} D_j} + f_{D_{j-1}}(i) a_{D_{j-1} D_j}

Termination:
    f_{M_{L+1}}(n+1) = f_{M_L}(n) a_{M_L M_{L+1}} + f_{I_L}(n) a_{I_L M_{L+1}} + f_{D_L}(n) a_{D_L M_{L+1}}
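The forward recursion has exactly the same structure as Viterbi, with sums in place of maximisations. Below is a probability-space sketch under the same assumed data layout as the Viterbi sketch above (in practice one would work in log space or with scaling to avoid underflow).

```python
import numpy as np

def profile_forward_prob(x, t, eM, eI):
    """P(x | model) for a profile HMM; t[j], eM[j], eI[j] laid out as in the Viterbi
    sketch, but holding plain probabilities instead of log probabilities."""
    n, L = len(x), len(eM) - 2
    fM = np.zeros((L + 1, n + 1))
    fI = np.zeros((L + 1, n + 1))
    fD = np.zeros((L + 1, n + 1))
    fM[0, 0] = 1.0                           # Begin state M_0
    for i in range(n + 1):
        for j in range(L + 1):
            if j >= 1 and i >= 1:
                fM[j, i] = eM[j][x[i - 1]] * (fM[j - 1, i - 1] * t[j - 1].get(('M', 'M'), 0)
                                              + fI[j - 1, i - 1] * t[j - 1].get(('I', 'M'), 0)
                                              + fD[j - 1, i - 1] * t[j - 1].get(('D', 'M'), 0))
            if j >= 1:
                fD[j, i] = (fM[j - 1, i] * t[j - 1].get(('M', 'D'), 0)
                            + fI[j - 1, i] * t[j - 1].get(('I', 'D'), 0)
                            + fD[j - 1, i] * t[j - 1].get(('D', 'D'), 0))
            if i >= 1:
                fI[j, i] = eI[j][x[i - 1]] * (fM[j, i - 1] * t[j].get(('M', 'I'), 0)
                                              + fI[j, i - 1] * t[j].get(('I', 'I'), 0)
                                              + fD[j, i - 1] * t[j].get(('D', 'I'), 0))
    # Termination: f_{M_{L+1}}(n+1), i.e. the total probability of x under the model.
    return (fM[L, n] * t[L].get(('M', 'M'), 0)
            + fI[L, n] * t[L].get(('I', 'M'), 0)
            + fD[L, n] * t[L].get(('D', 'M'), 0))
```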

The Backward algorithm for profile HMMs

Initialization:
    b_{M_{L+1}}(n+1) = 1;  b_{M_L}(n) = a_{M_L M_{L+1}};  b_{I_L}(n) = a_{I_L M_{L+1}};  b_{D_L}(n) = a_{D_L M_{L+1}}

Recursion:
    b_{M_j}(i) = b_{M_{j+1}}(i+1) a_{M_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{M_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{M_j D_{j+1}}
    b_{I_j}(i) = b_{M_{j+1}}(i+1) a_{I_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{I_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{I_j D_{j+1}}
    b_{D_j}(i) = b_{M_{j+1}}(i+1) a_{D_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) + b_{I_j}(i+1) a_{D_j I_j} e_{I_j}(x_{i+1}) + b_{D_{j+1}}(i) a_{D_j D_{j+1}}

The Baum-Welch (Expectation-Maximization) algorithm for profile HMMs: re-estimation equations

Expected emission counts from sequence x:
    E_{M_j}(a) = (1 / P(x)) \sum_{i : x_i = a} f_{M_j}(i) b_{M_j}(i)
    E_{I_j}(a) = (1 / P(x)) \sum_{i : x_i = a} f_{I_j}(i) b_{I_j}(i)

Expected transition counts from sequence x:
    A_{X_j M_{j+1}} = (1 / P(x)) \sum_i f_{X_j}(i) a_{X_j M_{j+1}} e_{M_{j+1}}(x_{i+1}) b_{M_{j+1}}(i+1)
    A_{X_j I_j}     = (1 / P(x)) \sum_i f_{X_j}(i) a_{X_j I_j} e_{I_j}(x_{i+1}) b_{I_j}(i+1)
    A_{X_j D_{j+1}} = (1 / P(x)) \sum_i f_{X_j}(i) a_{X_j D_{j+1}} b_{D_{j+1}}(i)

where X_j is one of M_j, I_j, and D_j.
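Given the forward and backward matrices and P(x) for one training sequence, the expected counts are simple accumulations. Here is a sketch for the match-emission counts and for the transitions into match states; the other count types follow the same pattern. The [j, i] matrix indexing and the argument names are assumptions of this sketch.

```python
def expected_match_emissions(x, fM, bM, Px, alphabet):
    """E_{M_j}(a) = (1/P(x)) * sum over positions i with x_i = a of f_{M_j}(i) b_{M_j}(i)."""
    L, n = fM.shape[0] - 1, len(x)
    E = {j: {a: 0.0 for a in alphabet} for j in range(1, L + 1)}
    for j in range(1, L + 1):
        for i in range(1, n + 1):
            E[j][x[i - 1]] += fM[j, i] * bM[j, i] / Px
    return E

def expected_transitions_into_match(x, fM, fI, fD, bM, t, eM, Px):
    """A_{X_j M_{j+1}} for X in {M, I, D}: expected usage of each transition into M_{j+1}."""
    L, n = fM.shape[0] - 1, len(x)
    A = {}
    for j in range(L):                    # transitions out of column j into M_{j+1}
        for name, fX in (("M", fM), ("I", fI), ("D", fD)):
            A[(name, j)] = sum(fX[j, i] * t[j].get((name, 'M'), 0)
                               * eM[j + 1][x[i]] * bM[j + 1, i + 1]   # x[i] is x_{i+1}
                               for i in range(n)) / Px
    return A
```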

Avoiding local maxima

The Baum-Welch algorithm is guaranteed to find a local maximum on the probability surface, but there is no guarantee that this local optimum is anywhere near the global optimum. A more involved approach is to use some form of stochastic search algorithm that bumps Baum-Welch off local maxima: noise injection during Baum-Welch re-estimation, simulated annealing Viterbi approximation of EM, Gibbs sampling. For details, see Durbin et al., Section 6.5, pages 154-158.

4 Getting the model and the multiple alignment simultaneously: profile HMM training from unaligned sequences

Initialization: choose the length of the profile HMM (i.e., the number of match states) and initialize the transition and emission parameters. A commonly used rule is to set the profile HMM's length to the average length of the training sequences.

Training: estimate the model using the Baum-Welch algorithm or its Viterbi alternative. Start Baum-Welch from multiple different points to see whether it always converges to approximately the same optimum. If necessary, use a heuristic method for avoiding local optima (see the previous slide).

Multiple alignment: align all sequences to the final model using the Viterbi algorithm and build a multiple alignment.
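A high-level driver for this workflow might look as follows. This is only a sketch of the procedure described above; init_model, baum_welch, viterbi_align and build_msa are hypothetical callables standing in for the machinery of the previous slides, not functions from any particular library, so they are passed in as arguments.

```python
import random
import statistics

def train_profile_hmm(sequences, init_model, baum_welch, viterbi_align, build_msa,
                      n_restarts=5, seed=0):
    """Train a profile HMM from unaligned sequences and return (model, alignment)."""
    L = round(statistics.mean(len(s) for s in sequences))   # model length = average seq length
    rng = random.Random(seed)
    best_model, best_logL = None, float("-inf")
    for _ in range(n_restarts):                 # several starting points, keep the best model
        model = init_model(L, rng)              # hypothetical: randomised initial parameters
        model, logL = baum_welch(model, sequences)   # hypothetical: EM (or Viterbi) training
        if logL > best_logL:
            best_model, best_logL = model, logL
    paths = [viterbi_align(best_model, s) for s in sequences]   # hypothetical: Viterbi paths
    return best_model, build_msa(best_model, paths, sequences)  # hypothetical: build the MSA
```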

Further comments on the profile HMM's initial parameters:

- One possibility: guess a multiple alignment of some or all of the given sequences.
- A further possibility: derive the model's initial parameters from the Dirichlet prior over parameters (see Ch. 11).
- Alternatively: use frequencies derived from the prior to initialise the model's parameters, then use this model to generate a small number of random sequences, and finally use the resulting counts as data to estimate an initial model.

Note: the model should be encouraged to use sensible transitions; for instance, transitions into match states should be large compared to other transition probabilities.

Model surgery

After training a model we can analyse the alignment it produces:
- From the counts estimated by the forward-backward procedure we can see how much a certain transition is used by the training sequences.
- The usage of a match state is the sum of the counts for all letters emitted in that state.
- If a certain match state is used by fewer than half of the training sequences, the corresponding module (the triplet of match, insert, and delete states) should be deleted.
- Similarly, if more than half (or some other predefined fraction) of the sequences use the transitions into a certain insert state, it should be expanded into some number of new modules (usually the average number of insertions).
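The usage test itself is a one-liner once the expected counts are available. A tiny sketch of the heuristic above, with made-up argument names:

```python
def surgery_decisions(match_usage, insert_usage, n_seqs, frac=0.5):
    """Which modules to delete and which insert states to expand.

    match_usage[j]  : expected number of sequences emitting a letter in match state j
    insert_usage[j] : expected number of sequences using transitions into insert state j
    """
    delete = [j for j, u in match_usage.items() if u < frac * n_seqs]
    expand = [j for j, u in insert_usage.items() if u > frac * n_seqs]
    return delete, expand

# Example with 10 training sequences:
print(surgery_decisions({1: 9.1, 2: 3.2, 3: 8.7}, {0: 0.4, 1: 6.5, 2: 1.0}, n_seqs=10))
# ([2], [1])  -> delete module 2, expand insert state 1
```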

5 Profile HMMs for non-global alignments

Local multiple alignment: [figure: the profile HMM with two flanking (diamond) insert-like states added, one after Begin and one before End; the flanking states have self-loops, and the left flanking state has transitions into the match states, so unaligned residues before and after the local match can be absorbed].

The looping probabilities of the flanking (blue diamond) states should be close to 1; let's set them to 1 - η. Transition probabilities from the left flanking state to the match states:
- one option: all equal to η/L;
- another option: η/2 for the transition into the first match state, and η/(2(L-1)) for the other positions.

For the rare case when the first residue might be missing: [figure: a variant of the previous Begin ... End transition structure covering this case].

A profile HMM for repeat matches to subsections of the profile model: [figure: the corresponding Begin ... End transition structure, with a loop back that allows several passes through the model].

6 Weighting the training sequences

When the (training) sequences in the given multiple alignment are not statistically independent, one may use some simple weighting schemes derived from a phylogenetic tree.

Example of a phylogenetic tree: [figure: leaves 1-4 with edge lengths t_1 = 2, t_2 = 2, t_3 = 5, t_4 = 8; leaves 1 and 2 join at node 5, which connects to node 6 by an edge of length t_5 = 3; leaf 3 is also a child of node 6, which connects to the root (node 7) by an edge of length t_6 = 3; leaf 4 hangs directly off the root].

6.1 A weighting scheme using Kirchhoff's laws [Thompson et al., 1994]

Treat the tree as an electrical network: set the resistance of each edge equal to its length (time), and let I_n and V_n be, respectively, the current flowing out through leaf n and the voltage at node n. [Figure: the example tree drawn as a circuit, with leaf currents I_1, ..., I_4, current I_1 + I_2 on the edge above node 5, current I_1 + I_2 + I_3 on the edge above node 6, and node voltages V_5, V_6, V_7.]

Equations:
    V_5 = 2 I_1 = 2 I_2
    V_6 = 2 I_1 + 3 (I_1 + I_2) = 5 I_3
    V_7 = 8 I_4 = 5 I_3 + 3 (I_1 + I_2 + I_3)

Result: I_1 : I_2 : I_3 : I_4 = 20 : 20 : 32 : 47
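The ratio can be checked by solving the three node equations numerically, fixing I_1 = 1 to set the overall scale. This check is mine, not part of the slides.

```python
import numpy as np

# Rows: the V5, V6 and V7 equations from the slide, plus the normalisation I1 = 1.
A = np.array([[1.0, -1.0,  0.0,  0.0],   # 2*I1 = 2*I2
              [5.0,  3.0, -5.0,  0.0],   # 2*I1 + 3*(I1 + I2) = 5*I3
              [3.0,  3.0,  8.0, -8.0],   # 5*I3 + 3*(I1 + I2 + I3) = 8*I4
              [1.0,  0.0,  0.0,  0.0]])  # I1 = 1
b = np.array([0.0, 0.0, 0.0, 1.0])
I = np.linalg.solve(A, b)
print(np.round(I / I[0] * 20, 6))        # [20. 20. 32. 47.]
```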

6.2 Another simple algorithm [Gerstein et al., 1994]

Initially the weights are set to the edge lengths of the leaves: w_1 = w_2 = 2, w_3 = 5, w_4 = 8. Then, working up the tree, the length t_n of the edge above each internal node n is distributed among the leaves below n in proportion to their current weights:

    Δw_i = t_n · w_i / \sum_{leaves k below n} w_k

So, at node 5: w_1 = w_2 = 2 + 3/2 = 3.5.
At node 6: w_1 = w_2 = 3.5 + 3 · 3.5/12 = 4.375, and w_3 = 5 + 3 · 5/12 = 6.25.
Result: w_1 : w_2 : w_3 : w_4 = 35 : 35 : 50 : 64
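The same numbers can be reproduced by propagating the edge lengths up the example tree; a small verification sketch (the encoding of the tree is mine).

```python
def gerstein_weights():
    w = {1: 2.0, 2: 2.0, 3: 5.0, 4: 8.0}        # initial weights = leaf edge lengths
    # Internal nodes from the leaves upward: (edge length above the node, leaves below it).
    internal_nodes = [(3.0, [1, 2]),            # node 5: t5 = 3
                      (3.0, [1, 2, 3])]         # node 6: t6 = 3
    for t_n, leaves in internal_nodes:
        total = sum(w[k] for k in leaves)
        for k in leaves:
            w[k] += t_n * w[k] / total          # delta_w_i = t_n * w_i / sum_k w_k
    return w

print(gerstein_weights())   # {1: 4.375, 2: 4.375, 3: 6.25, 4: 8.0}, i.e. 35 : 35 : 50 : 64
```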

For details on other, more involved weighting schemes:
- root weights from Gaussian parameters,
- Voronoi weights,
- maximum discrimination weights,
- maximum entropy weights,
see Durbin et al., Section 5.8, pages 126-132.

Examples (including the use of weights in the computation of the parameters of the profile HMM): problems 5.6 and 5.1 in [Borodovsky, Ekisheva, 2006].