Stephen Scott.


1 / 21
sscott@cse.unl.edu

2 / 21 Introduction
A profile HMM is designed to model (profile) a multiple alignment of a protein family (e.g., Durbin Fig. 5.1). It gives a probabilistic model of the proteins in the family, useful both for searching databases for more homologues and for aligning strings to the family.

3 / 21 Outline
- Structure of a profile HMM: ungapped regions; insert and delete states; non-global alignments
- Building a model: determining states (match, insert, delete); estimating probabilities; pseudocounts
- Searching and aligning with HMMs: Viterbi; Forward

4 / 21 Structure of a Profile HMM: Match States
Start with a trivial HMM M (not really hidden at this point): a chain of match states B -> M_1 -> M_2 -> ... -> M_L -> E, with every transition probability equal to 1. Each match state has its own set of emission probabilities, so we can compute the probability of a new sequence x being part of this family:

  P(x | M) = \prod_{i=1}^{L} e_i(x_i)

We can, as usual, convert probabilities to a log-odds score.
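To make the ungapped scoring concrete, here is a minimal Python sketch; the data structures (per-position emission dicts and a background distribution for the log-odds denominator) are my assumptions, not part of the slides:

```python
import math

def log_odds_score(x, emissions, background):
    """Log-odds score of x under an ungapped profile model:
    sum_i log( e_i(x_i) / q_{x_i} ).

    x          : sequence string; len(x) == number of match states
    emissions  : list of dicts, emissions[i][a] = e_i(a)
    background : dict, background[a] = q_a (null-model frequency)
    """
    assert len(x) == len(emissions)
    return sum(math.log(emissions[i][a] / background[a])
               for i, a in enumerate(x))
```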

5 / 21 Structure of a Profile HMM (2): Insertion States
But this assumes ungapped alignments! To handle gaps, we must consider insertions and deletions. An insertion is a part of x that doesn't match anything in the multiple alignment; it is handled by insert states. (Figure: insert state I_i, with a self-loop, attached at match state M_i.)

6 / 21 Structure of a Profile HMM (3): Deletion States
A deletion is a part of the multiple alignment not matched by any residue in x; it is handled by silent delete states. (Figure: delete state D_i, which bypasses match state M_i without emitting anything.)

7 / 21 General Profile HMM Structure
(Figure: the full profile HMM from begin state B to end state E, with match, insert, and delete states at each position.)

8 / 21 Handling Non-Global Alignments
The original profile HMM models the entire sequence. To generate the non-local residues, add flanking model states (or "free insertion modules"). (Figure: the general structure from the previous slide with flanking states outside B and E.)

9 / 21 Building a Model: Determining States
Given a multiple alignment, how do we build an HMM? The general structure is defined, but how many match states should it have?

... V G A - - H A G E Y ...
... V - - - - N V D E V ...
... V E A - - D V A G H ...
... V K G - - - - - - D ...
... V Y S - - T Y E T S ...
... F N A - - N I P K H ...
... I A G A D N G A G V ...

10 / 21 Building a Model (2): Determining States
Heuristic: if more than half of the characters in a column are non-gaps, include a match state for that column. In the alignment above, columns 1-3 and 6-10 become match columns, while columns 4-5 (gaps in all but the last sequence) become insert columns; a code sketch follows.
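The heuristic translates directly into code. A small sketch, where the helper name and the list-of-strings alignment representation are my own:

```python
def match_columns(alignment):
    """Return (0-based) indices of columns where more than half the
    characters are non-gaps: the match-state heuristic above.

    alignment: list of equal-length aligned strings, '-' for gaps.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    return [j for j in range(n_cols)
            if sum(seq[j] != '-' for seq in alignment) > n_seqs / 2]

aln = ["VGA--HAGEY",
       "V----NVDEV",
       "VEA--DVAGH",
       "VKG------D",
       "VYS--TYETS",
       "FNA--NIPKH",
       "IAGADNGAGV"]
print(match_columns(aln))
# [0, 1, 2, 5, 6, 7, 8, 9]: columns 4-5 (0-based 3-4) become insert columns
```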

11 / 21 Building a Model (3): Determining States (cont'd)
Now, find the parameters. The multiple alignment plus the HMM structure determine a state sequence for each aligned sequence (the figure labels residues with states such as M_1, D_3, I_3):
- Non-gap in a match column -> match state
- Gap in a match column -> delete state
- Non-gap in an insert column -> insert state
- Gap in an insert column -> ignore
(Same alignment as above; Durbin Fig. 5.4, p. 109.)

12 / 21 Building a Model (4): Estimating Probabilities
Count the number of transitions and emissions and compute:

  a_{kl} = A_{kl} / \sum_{l'} A_{kl'}        e_k(b) = E_k(b) / \sum_{b'} E_k(b')

We still need to beware of counts that are 0.
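Both estimates are the same normalization of a count table. A minimal sketch (function and example names are assumptions), which also shows why zero counts are a problem:

```python
def normalize(counts):
    """Maximum-likelihood estimate from counts, e.g.
    a_kl = A_kl / sum_l' A_kl'  or  e_k(b) = E_k(b) / sum_b' E_k(b').

    counts: dict mapping outcome -> observed count
    """
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

# e.g. hypothetical transitions observed out of match state M_3:
print(normalize({"M4": 6, "D4": 1, "I3": 0}))
# {'M4': 0.857..., 'D4': 0.142..., 'I3': 0.0}
# An unseen transition gets probability exactly 0, which is why
# pseudocounts (next slide) are needed.
```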

13 / 21 Weighted Pseudocounts
Let c_{ja} = observed count of residue a in position j of the multiple alignment. Then

  e_{M_j}(a) = (c_{ja} + A q_a) / (\sum_{a'} c_{ja'} + A)

where q_a is the background probability of a and A is the weight placed on the pseudocounts (sometimes A ≈ 20 is used). The background probabilities are also called a prior distribution.
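A direct transcription of this formula into Python; the dict-based representation is an assumption:

```python
def pseudocount_emissions(column_counts, background, A=20.0):
    """Weighted-pseudocount emission estimates for one match state:
    e_Mj(a) = (c_ja + A * q_a) / (sum_a' c_ja' + A).

    column_counts : dict residue -> observed count c_ja in column j
    background    : dict residue -> background probability q_a
    A             : total weight placed on the pseudocounts
    """
    denom = sum(column_counts.values()) + A
    return {a: (column_counts.get(a, 0.0) + A * q) / denom
            for a, q in background.items()}
```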

14 / 21 Dirichlet Mixtures
These can be thought of as a mixture of pseudocounts. The mixture has different components, each representing a different context of a protein sequence. E.g., in parts of a sequence folded near the protein's surface, more weight (higher q_a) can be given to hydrophilic residues, while in other regions we may want to give more weight to hydrophobic residues. We will find a different mixture for each position of the alignment, based on the distribution of residues in that column.

15 / 21 Dirichlet Mixtures (2)
Each component k consists of a vector of pseudocounts α^k (so α^k_a corresponds to A q_a) and a mixture coefficient (m_k, for now) that is the probability that component k is selected; i.e., pseudocount model k is the correct one with probability m_k. We'll set the mixture coefficients for each column based on which vectors best fit the residues in that column. E.g., the first column of the alignment shown earlier is dominated by V, so any vector α^k that favors V will get a higher m_k.

16 / 21 Dirichlet Mixtures (3)
Let c_j be the vector of counts in column j. Then

  e_{M_j}(a) = \sum_k P(k | c_j) \frac{c_{ja} + \alpha^k_a}{\sum_{a'} (c_{ja'} + \alpha^k_{a'})}

The P(k | c_j) are the posterior mixture coefficients, which are easily computed [Sjölander et al. 1996], yielding

  e_{M_j}(a) = \frac{X_a}{\sum_{a'} X_{a'}}, where
  X_a = \sum_k m_{k0} \exp\big( \ln B(\alpha^k + c_j) - \ln B(\alpha^k) \big) \frac{c_{ja} + \alpha^k_a}{\sum_{a'} (c_{ja'} + \alpha^k_{a'})}

and

  \ln B(x) = \sum_i \ln \Gamma(x_i) - \ln \Gamma\Big( \sum_i x_i \Big)

17 / 21 Dirichlet Mixtures (4)
Γ is the gamma function, and ln Γ is computed via lgamma and related functions in C. m_{k0} is the prior probability of component k (the mixture coefficient m_k introduced earlier).
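In Python, math.lgamma plays the role of C's lgamma. Here is a sketch of the whole slide-16 computation; the function names and dict-based representations are my assumptions:

```python
import math

def ln_beta(alpha):
    """ln B(x) = sum_i ln Gamma(x_i) - ln Gamma(sum_i x_i)."""
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def dirichlet_mixture_emissions(counts, components):
    """Emission estimates for one column under a Dirichlet mixture.

    counts     : dict residue -> count c_ja; must list every residue
                 (zero counts allowed)
    components : list of (m_k0, alpha_k) pairs, where m_k0 is the prior
                 mixture coefficient and alpha_k is a dict of positive
                 pseudocounts per residue
    """
    residues = list(counts)
    c = [counts[a] for a in residues]
    X = {a: 0.0 for a in residues}
    for m_k0, alpha_k in components:
        alpha = [alpha_k[a] for a in residues]
        # unnormalized posterior weight of component k, via ln B
        w = m_k0 * math.exp(ln_beta([ai + ci for ai, ci in zip(alpha, c)])
                            - ln_beta(alpha))
        denom = sum(c) + sum(alpha)
        for a in residues:
            X[a] += w * (counts[a] + alpha_k[a]) / denom
    total = sum(X.values())  # normalize: e_Mj(a) = X_a / sum_a' X_a'
    return {a: x / total for a, x in X.items()}
```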

18 / 21 Searching for Homologues
Score a candidate match x using log-odds:
- P(x, π | M) is the probability that x came from model M via the most likely path π; find it with Viterbi.
- P(x | M) is the probability that x came from model M, summed over all possible paths; find it with the forward algorithm.

  score(x) = \log( P(x | M) / P(x | \phi) )

Here φ is a null model, often the distribution of amino acids in the training set, or the AA distribution over each individual column. If x matches M much better than φ, then the score is large and positive.

19 / 21 Viterbi Equations
V^M_j(i) = log-odds score of the best path matching x_1...x_i to the model, with x_i emitted by M_j (similarly define V^I_j(i) and V^D_j(i)). B is M_0 with V^M_0(0) = 0, and E is M_{L+1} (V^M_{L+1} gives the final score).

  V^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_{j-1}(i-1) + \log a_{M_{j-1} M_j} \\ V^I_{j-1}(i-1) + \log a_{I_{j-1} M_j} \\ V^D_{j-1}(i-1) + \log a_{D_{j-1} M_j} \end{cases}

  V^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \max \begin{cases} V^M_j(i-1) + \log a_{M_j I_j} \\ V^I_j(i-1) + \log a_{I_j I_j} \\ V^D_j(i-1) + \log a_{D_j I_j} \end{cases}

  V^D_j(i) = \max \begin{cases} V^M_{j-1}(i) + \log a_{M_{j-1} D_j} \\ V^I_{j-1}(i) + \log a_{I_{j-1} D_j} \\ V^D_{j-1}(i) + \log a_{D_{j-1} D_j} \end{cases}
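These recurrences translate almost line-for-line into dynamic programming. Below is a simplified Python sketch, not the slides' implementation: it ignores the flanking/I_0 boundary states, and all names and data structures are my assumptions:

```python
import math

NEG_INF = float("-inf")

def profile_viterbi(x, L, log_a, log_e_M, log_e_I):
    """Viterbi log-odds score for a profile HMM (simplified sketch).

    x       : query sequence (string)
    L       : number of match states
    log_a   : dict (from_state, to_state) -> log transition probability;
              states are tuples like ('M', j), ('I', j), ('D', j),
              with B = ('M', 0) and E = ('M', L + 1)
    log_e_M : log_e_M[j][a] = log(e_Mj(a) / q_a) for j = 1..L
    log_e_I : log_e_I[j][a] = log(e_Ij(a) / q_a) (often 0 if e_Ij = q)
    Returns V^M_{L+1}(n), the best log-odds score of aligning all of x.
    Boundary simplification: no insertions before M_1 (I_0 omitted).
    """
    n = len(x)
    V = {s: [[NEG_INF] * (n + 1) for _ in range(L + 1)] for s in "MID"}
    V["M"][0][0] = 0.0  # begin state B = M_0

    def best(j_prev, i, to_state):
        # max over predecessor types M, I, D, as in the recurrences
        return max(V[s][j_prev][i]
                   + log_a.get(((s, j_prev), to_state), NEG_INF)
                   for s in "MID")

    for i in range(n + 1):
        for j in range(1, L + 1):
            if i >= 1:
                V["M"][j][i] = log_e_M[j][x[i-1]] + best(j - 1, i - 1, ("M", j))
                V["I"][j][i] = log_e_I[j][x[i-1]] + best(j, i - 1, ("I", j))
            V["D"][j][i] = best(j - 1, i, ("D", j))

    return best(L, n, ("M", L + 1))  # transition into the end state E
```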

20 / 21 Forward Equations

  F^M_j(i) = \log\frac{e_{M_j}(x_i)}{q_{x_i}} + \log\big[ a_{M_{j-1} M_j} \exp(F^M_{j-1}(i-1)) + a_{I_{j-1} M_j} \exp(F^I_{j-1}(i-1)) + a_{D_{j-1} M_j} \exp(F^D_{j-1}(i-1)) \big]

  F^I_j(i) = \log\frac{e_{I_j}(x_i)}{q_{x_i}} + \log\big[ a_{M_j I_j} \exp(F^M_j(i-1)) + a_{I_j I_j} \exp(F^I_j(i-1)) + a_{D_j I_j} \exp(F^D_j(i-1)) \big]

  F^D_j(i) = \log\big[ a_{M_{j-1} D_j} \exp(F^M_{j-1}(i)) + a_{I_{j-1} D_j} \exp(F^I_{j-1}(i)) + a_{D_{j-1} D_j} \exp(F^D_{j-1}(i)) \big]

exp() is needed because we must sum probabilities while storing logs (this can still be fast; see Durbin p. 78).
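The standard way to evaluate these log-of-sums stably is the log-sum-exp trick (factor out the largest term, as on Durbin p. 78). A small sketch; the variable names in the usage comment are hypothetical:

```python
import math

def log_sum_exp(terms):
    """Stable log(sum_i exp(t_i)): subtract the max so the remaining
    exponentials are all <= 1 and cannot overflow."""
    m = max(terms)
    if m == float("-inf"):  # all terms are log(0)
        return m
    return m + math.log(sum(math.exp(t - m) for t in terms))

# e.g. one step of the F^D_j(i) recurrence, with log transition probs:
# F_D = log_sum_exp([log_a_MD + F_M, log_a_ID + F_I, log_a_DD + F_D_prev])
```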

21 / 21 Aligning a Sequence with a Model (Multiple Alignment)
Given a string x, use Viterbi to find the most likely path π and use that state sequence as the alignment. More detail is in Durbin, Section 6.5, which also discusses building an initial multiple alignment and HMM simultaneously via Baum-Welch.