Pair Hidden Markov Models


Scribe: Rishi Bedi
Lecturer: Serafim Batzoglou
January 29, 2015

1 Recap of HMMs

An HMM is specified by:

alphabet: Σ = {b_1, ..., b_M}
set of states: Q = {1, ..., K}
transition probabilities: A = [a_ij]
initial state probabilities: a_0i
emission probabilities: e_i(b_k)

2 Pair Hidden Markov Models

Pair HMMs give a probability distribution over certain sequences of pairs of observations. In the sequence alignment setting, we have the two sequences of a pairwise alignment. Question: what is our sequence of pairs of observations? Each pair of observations is either letter/letter, letter/gap, or gap/letter. Consider the following HMM:

((Σ_1 ∪ {η}) × (Σ_2 ∪ {η}), Q, A, a_0, e)

So in every state, we have the choice of emitting two letters, or a letter paired with the empty string (denoted by η).

2.1 Example Model

Let's walk through the following pair HMM (given in the lecture slides):
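The slide figure is not reproduced in these notes. As a stand-in, here is a minimal Python parameterization of the three-state model described below (a match state M and two gap states I and J), with illustrative values for the gap-open probability δ and gap-extend probability ε; the emission values follow the scoring example in Section 2.5.3. Names and numbers are assumptions for illustration, not the exact figures from the slides.

```python
ALPHABET = "ACGT"
delta = 0.05   # gap-open probability (M -> I or M -> J); illustrative value
eps   = 0.1    # gap-extend probability (I -> I or J -> J); illustrative value

# Emission table for the match state M: P[a][b] = P(a, b), entries summing to 1.
# Here: 0.2 for each identical pair, the remaining 0.2 spread over the 12 mismatches.
P = {a: {b: (0.2 if a == b else 0.2 / 12) for b in ALPHABET} for a in ALPHABET}

# Emission distributions for the gap states: Q(a) = P(a, eta) = P(eta, a).
Q = {a: 0.25 for a in ALPHABET}

# Transition probabilities between the states M, I, J.
trans = {
    ("M", "M"): 1 - 2 * delta, ("M", "I"): delta, ("M", "J"): delta,
    ("I", "I"): eps, ("I", "M"): 1 - eps,
    ("J", "J"): eps, ("J", "M"): 1 - eps,
}
```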

In this model, every emission is a pair of symbols. Define P as a table over pairs of letters (for every pair of letters X, Y, define P(X, Y)), with the entries summing to 1. The states M, I, and J are defined by what they emit: M emits a pair of letters (according to P), I emits a letter of x against a gap, and J emits a gap against a letter of y. We define transitions between the states as shown on the slides, noting that δ and ε are small and that δ < ε (so opening a gap happens with smaller probability, and thus incurs a larger penalty, than extending a gap).

2.2 Stepping through the example model

What does this have to do with sequences? Let's look at an example alignment using the pair HMM defined above. Consider this alignment:

AG-TCA
AGGTC-

The state path for this alignment is M M J M M I. Given the state path, we can assign a probability to this alignment:

P(A,A) · (1−2δ) · P(G,G) · δ · P_J(η,G) · (1−ε) · P(T,T) · (1−2δ) · P(C,C) · δ · P_I(A,η)

We get a different probability for every alignment. This procedure can always be used to go from an alignment to a state path, and from there to the probability of the alignment.

2.3 Adapting Viterbi Decoding

Now we consider a harder question: can we find the optimal alignment using this state model? What is the connection between this state model and the traditional alignment algorithms we have been discussing? And how does the probability of a pair of letters correspond to matches and mismatches?

For ordinary HMMs, we used the Viterbi algorithm for this task (finding the state sequence of maximum probability, which here defines the best alignment); it was a straightforward dynamic-programming algorithm. It is tempting to apply the same Viterbi algorithm to pair HMMs. The naive adaptation would be, for every position i, to let V_k(i) be the optimal way to emit the first i letters of x and the first i letters of y, ending in state k. But that doesn't quite work, because we don't know that the first i letters of x align precisely with the first i letters of y. Instead, we index by a position i in x and a position j in y. Assume inductively that the Viterbi values are correctly computed for all shorter prefixes of x and y; we can then compute, for a given pair of positions (i, j) and a given state, the optimal Viterbi value. Denote Q(x_i) = P(x_i, η) and Q(y_j) = P(η, y_j). Now we can state a corrected Viterbi algorithm for pair HMMs.
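The recurrence itself appeared on the lecture slides and is not reproduced in these notes. As a stand-in, here is a minimal sketch of the standard pair-HMM Viterbi recurrence (following Durbin et al.), written in probability space using the P, Q, delta, eps parameters from the sketch above; the function name is illustrative, initial-state probabilities are omitted as in the example, and no traceback is kept.

```python
def pair_viterbi(x, y, P, Q, delta, eps):
    """Return the probability of the best alignment (state path) of x and y.

    V_M[i][j]: best probability of emitting x[:i], y[:j], ending with x[i-1] aligned to y[j-1];
    V_I / V_J: same, but ending with x[i-1] (resp. y[j-1]) emitted against a gap.
    """
    n, m = len(x), len(y)
    V_M = [[0.0] * (m + 1) for _ in range(n + 1)]
    V_I = [[0.0] * (m + 1) for _ in range(n + 1)]
    V_J = [[0.0] * (m + 1) for _ in range(n + 1)]
    V_M[0][0] = 1.0  # start with nothing emitted

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                V_M[i][j] = P[x[i-1]][y[j-1]] * max(
                    (1 - 2 * delta) * V_M[i-1][j-1],
                    (1 - eps) * V_I[i-1][j-1],
                    (1 - eps) * V_J[i-1][j-1],
                )
            if i > 0:  # emit x[i-1] against a gap
                V_I[i][j] = Q[x[i-1]] * max(delta * V_M[i-1][j],
                                            eps * V_I[i-1][j])
            if j > 0:  # emit y[j-1] against a gap
                V_J[i][j] = Q[y[j-1]] * max(delta * V_M[i][j-1],
                                            eps * V_J[i][j-1])

    return max(V_M[n][m], V_I[n][m], V_J[n][m])
```

For the example above, pair_viterbi("AGTCA", "AGGTC", P, Q, delta, eps) returns the probability of the best-scoring alignment; keeping traceback pointers would recover the state path, i.e., the alignment itself. In practice one works in log space to avoid underflow, which is exactly the transformation discussed in Section 2.5.1.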

We can define forward and backward algorithms similarly. For example, the forward variable for state M is

F^M(i, j) = P(x_1 ... x_i, y_1 ... y_j, x_i aligned to y_j),

the probability that the first i letters of x align to the first j letters of y with x_i aligned to y_j. It is given by P(x_i, y_j) times the sum of the three terms that the pair-HMM Viterbi recurrence for V^M(i, j) maximizes over.

2.4 Probabilistic Interpretation

Consider the product over all sequence characters, Π_i Q(x_i) · Π_j Q(y_j). This is a very small constant: it is the probability of emitting every letter against a gap (i.e., nothing matches), emitting x by itself and then emitting y by itself. Imagine taking the probability of an alignment and dividing it by this quantity. Since it is a constant, it does not favor any one alignment over another: the optimal alignment is unchanged if you divide the probability of every alignment by this number. But we now have something interesting: a probabilistic interpretation of alignment. This denominator is the baseline model under which the two sequences were generated independently of each other (the null hypothesis, in a sense). If you take the alignment probability and divide it by this baseline probability, and the quotient is > 1, it is more likely than not that the sequences were generated by the pair HMM (and have characters in common); if it is < 1, it is more likely than not that the sequences are in fact independent and were emitted randomly, not by the pair HMM. Rather than dividing by everything at the end, we will divide along the way (every time we see an x_i, divide by Q(x_i), and likewise for y_j).

2.5 Connection to Needleman-Wunsch

Can we find a connection between Viterbi for pair HMMs and Needleman-Wunsch? Given an NW scoring function, how do we set the pair-HMM Viterbi parameters to make their outputs match? Maximizing a product of probabilities, as we do in the Viterbi algorithm, is the same as maximizing the sum of the logs of the probabilities. Then the dynamic program looks very similar, and we can connect the scoring parameters to the Viterbi parameters.

2.5.1 Transforming pair Viterbi to NW with affine gaps

1. Convert products to sums of logs.
2. The log of a number close to 1 is very close to 0, so remove such small terms.
3. Every time you open a gap you pay log(δ).
4. Every time you extend a gap you pay log(ε).

Viterbi has a few extra small terms we dropped, but the DP is virtually identical.

2.5.2 Setting N-W Parameters

We can use the following procedure to give the Needleman-Wunsch parameters real meaning:

1. Hand-curate (true) alignments that we trust.
2. Count how frequently each transition (between states) occurs, and the expected gap length.
3. Compute the maximum likelihood parameters from those counts.

If we don't necessarily have true alignments at our disposal, but we have sequences we believe align, we can use Baum-Welch, as discussed in a previous lecture. A sketch of the counting step follows.
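The counting step can be sketched as follows. This is a minimal illustration under the assumption that each trusted alignment has already been converted to a state path over {M, I, J} (as in Section 2.2); the function name and input format are hypothetical.

```python
from collections import Counter

def estimate_from_state_paths(paths):
    """Maximum-likelihood estimates of delta and eps from trusted alignments.

    `paths` is a list of state paths over {"M", "I", "J"}, e.g. ["MMJMMI", ...].
    delta is the fraction of transitions out of M that open a gap (split evenly
    between I and J); eps is the fraction of transitions out of I/J that extend
    the gap.  The expected gap length is then 1 / (1 - eps).
    Assumes each state type actually appears in the data.
    """
    trans = Counter()
    for path in paths:
        for prev, cur in zip(path, path[1:]):
            trans[(prev, cur)] += 1

    out_of_M = sum(c for (p, _), c in trans.items() if p == "M")
    out_of_gap = sum(c for (p, _), c in trans.items() if p in "IJ")

    delta = (trans[("M", "I")] + trans[("M", "J")]) / (2 * out_of_M)
    eps = (trans[("I", "I")] + trans[("J", "J")]) / out_of_gap
    return delta, eps
```

The emission tables P(a, b) and Q(a) would be estimated the same way, by counting aligned letter pairs and gapped letters in the trusted alignments.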

2.5.3 Scoring Example

The implied substitution score is

S(x_i, y_j) = log [ P(x_i, y_j) / (Q(x_i) Q(y_j)) ]

Suppose P(A,A) = P(C,C) = P(G,G) = P(T,T) = 0.2 and P(x_i, y_j) = 0.2/12 for x_i ≠ y_j (there are 12 different ways to have a mismatch). We can set Q(A) = Q(C) = Q(G) = Q(T) = 0.25. Then the match score is m = log(0.2 / 0.25²) ≈ 1.16 and the mismatch score is s = log((0.2/12) / 0.25²) ≈ −1.32.

Consider an alignment with 80% matches and 20% mismatches. That yields an average score per position of about 0.8·1.16 − 0.2·1.32 ≈ +0.66, clearly more aligned than not. The break-even point (score of 0) is at roughly 52.5% matches. Implication: we need slightly more matches than mismatches for the sequences to be considered a positive alignment.

2.6 Conditional Random Fields

We don't really care about generating x. So let's give up that ability in favor of a much richer model for P(π | x), the probability of a parse given a sequence:

P(π | x) = exp( Σ_{i=1..|x|} w^T F(π_i, π_{i−1}, x, i) ) / Σ_{π'} exp( Σ_{i=1..|x|} w^T F(π'_i, π'_{i−1}, x, i) )

The denominator sums the scores of all possible parses (it forces the quantity to be a probability); i.e., it is the sum over all possible numerators. We are really only interested in the exponent, because the probability is monotonic in the exponent. Here:

w is a vector of weights for the different parameters of the model.
F is a vector of feature counts, each depending on the pair of states (current state, previous state), the sequence of observations, and the position we are in.
w^T F is the dot product of w and F.

A small sketch of this scoring is given below.
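To make the formula concrete, here is a minimal brute-force sketch in Python. The feature function w_dot_F and the state set are hypothetical stand-ins for whatever weighted features one chooses; the only point is the structure numerator / sum-over-all-parses.

```python
import math
from itertools import product

def crf_prob(pi, x, states, w_dot_F):
    """P(pi | x) for a linear-chain CRF, normalized by brute-force enumeration.

    `pi` is a tuple of states with len(pi) == len(x); `w_dot_F(prev, cur, x, i)`
    returns the scalar w^T F(pi_i, pi_{i-1}, x, i) for position i (1-indexed).
    """
    def score(path):
        # sum_{i=1}^{|x|} w^T F(pi_i, pi_{i-1}, x, i), with a dummy start state.
        return sum(w_dot_F(prev, cur, x, i + 1)
                   for i, (prev, cur) in enumerate(zip(("start",) + tuple(path), path)))

    Z = sum(math.exp(score(p)) for p in product(states, repeat=len(x)))
    return math.exp(score(pi)) / Z
```

In practice the normalizer Z is computed with a forward-style dynamic program rather than by enumerating all K^|x| parses, and decoding uses a Viterbi-style DP, just as for HMMs.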

2.6.1 CRFs versus HMMs

CRFs are richer than HMMs. In HMMs, the parameters must obey the constraint of being probabilities; this is not so for CRFs. CRFs can also contain features that are not possible to include in HMMs: in an HMM we were only allowed to consider x_i (we could not look at x_{i−1} or x_{i−2}, because that would amount to a double emission). With a CRF there are no longer emissions, so this restriction no longer applies, and you can look at as many previous observations as you want.

CRFs require training data with knowledge of the underlying alignments; i.e., there is no equivalent of the Baum-Welch algorithm for CRFs. CRF feature selection ultimately relies on intuition. One example: we might look for a codon pattern in genes (i.e., MMS MMS MMS).

There is a connection between HMMs and CRFs: to convert an HMM into a CRF, the CRF's features would be indicator functions, equal to 1 for the specific transition and emission at each position and 0 for everything else, with the corresponding weights set to the logs of the HMM's transition and emission probabilities (a short check of this follows).
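A one-line check of this conversion (a sketch; the initial-state term is folded into the first transition):

w^T F(π_i, π_{i−1}, x, i) = log a_{π_{i−1} π_i} + log e_{π_i}(x_i)

so that

exp( Σ_{i=1..|x|} w^T F(π_i, π_{i−1}, x, i) ) = Π_{i=1..|x|} a_{π_{i−1} π_i} · e_{π_i}(x_i) = P(x, π),

and normalizing over all parses recovers exactly the HMM's P(π | x).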