Hidden Markov Models

Hidden Markov Models (trellis diagram: hidden states 1, 2, …, K at each position; observations x_1 x_2 x_3 … x_K)

Viterbi, Forward, Backward

VITERBI
Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
Iteration: V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl
Termination: P(x, π*) = max_k V_k(N)

FORWARD
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration: f_l(i) = e_l(x_i) Σ_k f_k(i-1) a_kl
Termination: P(x) = Σ_k f_k(N)

BACKWARD
Initialization: b_k(N) = 1 for all k
Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1)
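
As a concrete illustration of these recurrences, here is a minimal Python sketch of Viterbi (in log space) and forward (in probability space). The parameter layout (dictionaries log_a0, log_a, log_e and their probability-space counterparts a0, a, e) is an assumed interface for illustration, not something defined on these slides.

import math

def viterbi(x, states, log_a0, log_a, log_e):
    # log-space Viterbi; log_a0[k] = log a_0k, log_a[k][l] = log a_kl,
    # log_e[k][b] = log e_k(b) are assumed dictionary layouts
    V = [{k: log_a0[k] + log_e[k][x[0]] for k in states}]          # V_k(1)
    for i in range(1, len(x)):
        V.append({l: log_e[l][x[i]] + max(V[i-1][k] + log_a[k][l] for k in states)
                  for l in states})
    return max(V[-1].values())                                     # log P(x, pi*)

def forward(x, states, a0, a, e):
    # forward algorithm in probability space (adequate for short sequences;
    # long sequences need scaling or log-space summation)
    f = [{k: a0[k] * e[k][x[0]] for k in states}]                  # f_k(1)
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i-1][k] * a[k][l] for k in states)
                  for l in states})
    return sum(f[-1].values())                                     # P(x)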

Variants of HMMs

Higher-order HMMs

How do we model memory longer than one time point?
First order: P(π_{i+1} = l | π_i = k) = a_kl
Second order: P(π_{i+1} = l | π_i = k, π_{i-1} = j) = a_jkl

A second-order HMM with K states is equivalent to a first-order HMM with K² states. (Figure: a two-state model with states H and T expands into four composite states HH, HT, TH, TT; for example, a_HT(prev = H) becomes a_HHT, the transition from state HH to state HT, and a_HT(prev = T) becomes a_THT, the transition from state TH to state HT.)
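
A small sketch of this state-expansion argument: given second-order transition probabilities a_jkl, build the equivalent first-order transition matrix over composite states (previous, current). The function and variable names are illustrative assumptions, not from the slides.

from itertools import product

def second_to_first_order(states, a2):
    # a2[(j, k, l)] = P(pi_{i+1} = l | pi_i = k, pi_{i-1} = j)
    pair_states = list(product(states, states))        # K^2 composite states (prev, cur)
    A = {s: {t: 0.0 for t in pair_states} for s in pair_states}
    for (j, k) in pair_states:
        for l in states:
            # composite transition (j, k) -> (k, l) inherits a_jkl; all other
            # composite transitions out of (j, k) keep probability 0
            A[(j, k)][(k, l)] = a2[(j, k, l)]
    return pair_states, A

For the coin example above, states = ["H", "T"] yields the four composite states HH, HT, TH, TT.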

Similar Algorithms to 1st Order

For P(π_{i+1} = l | π_i = k, π_{i-1} = j), keep Viterbi values indexed by the last two states:
V_lk(i) = e_l(x_i) max_j { V_kj(i-1) a_jkl }
Time? Space?

Modeling the Duration of States

(Two-state model: X with self-transition probability p and exit probability 1-p; Y with self-transition probability q and exit probability 1-q.)

Length distribution of a region of state X: P(l_X = d) = (1-p) p^{d-1}, a geometric distribution with mean E[l_X] = 1/(1-p).

This inflexible, geometric length distribution is a significant disadvantage of HMMs. Several solutions exist for modeling different length distributions.
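
A quick sanity check of the geometric claim (the value p = 0.9 and the sample size are illustrative, not from the slide):

import random

def sample_duration(p):
    # stay in X with probability p at each step; leave with probability 1-p
    d = 1
    while random.random() < p:
        d += 1
    return d

p = 0.9
samples = [sample_duration(p) for _ in range(100_000)]
print(sum(samples) / len(samples), "vs expected", 1 / (1 - p))     # both close to 10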

Example: exon lengths in genes

Solution 1: Chain several states

(Figure: several copies of state X chained in a row, one with self-transition probability p, before transitioning to Y.)

l_X = C + geometric with mean 1/(1-p), where C is the number of chained states, so a minimum length is enforced.

Disadvantage: still very inflexible.

Solution 2: Negative binomial distribution

(Figure: n copies of state X, X^(1) … X^(n), each with self-transition probability p and probability 1-p of advancing to the next copy, before transitioning to Y.)

Duration in X: m turns, where
During the first m-1 turns, exactly n-1 arrows to the next state are followed
During the m-th turn, an arrow to the next state is followed

P(l_X = m) = C(m-1, n-1) (1-p)^{(n-1)+1} p^{(m-1)-(n-1)} = C(m-1, n-1) (1-p)^n p^{m-n}
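
A short check of this length distribution (n = 3 and p = 0.9 are illustrative values; the mean of this negative binomial is n/(1-p)):

from math import comb

def negbin_pmf(m, n, p):
    # P(l_X = m) = C(m-1, n-1) (1-p)^n p^(m-n), defined for m >= n
    return comb(m - 1, n - 1) * (1 - p) ** n * p ** (m - n)

n, p = 3, 0.9
support = range(n, 2000)                                   # truncate the infinite support
print(sum(negbin_pmf(m, n, p) for m in support))           # ~1.0
print(sum(m * negbin_pmf(m, n, p) for m in support), "vs", n / (1 - p))   # ~30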

Example: genes in prokaryotes. EasyGene, a prokaryotic gene-finder (Larsen TS, Krogh A), models lengths with a negative binomial with n = 3.

Solution 3: Duration modeling

Upon entering a state:
1. Choose a duration d according to a probability distribution
2. Generate d letters according to the emission probabilities
3. Take a transition to the next state according to the transition probabilities

(Figure: state F emits x_i … x_{i+d-1}, with duration d < D chosen with probability P_F(d).)

Disadvantage: increased complexity of Viterbi, by a factor of O(D) in time and O(1) in space, where D is the maximum duration of a state. (Warning: Rabiner's tutorial claims O(D²) and O(D) increases.)

Viterbi with duration modeling

(Figure: states F and L with duration distributions P_F(d) and P_L(d), d < D, emitting runs x_i … x_{i+d-1} and x_j … x_{j+d-1}.)

Recall the original iteration:
V_l(i) = e_l(x_i) max_k V_k(i-1) a_kl

New iteration:
V_l(i) = max_k max_{d=1..D_l} V_k(i-d) P_l(d) a_kl Π_{j=i-d+1..i} e_l(x_j)

(Precompute cumulative emission values so the inner product over j is not recomputed from scratch for every d.)
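
A sketch of the duration-modeling iteration in log space. The parameter layout (log_pi for initial probabilities, log_a, log_e, log_dur[l][d] for the duration pmf, a common maximum duration Dmax) is an assumed interface, and the begin state is handled through log_pi rather than a dedicated state 0.

NEG = float("-inf")

def viterbi_durations(x, states, log_pi, log_a, log_e, log_dur, Dmax):
    N = len(x)
    V = {l: [NEG] * (N + 1) for l in states}
    for i in range(1, N + 1):
        for l in states:
            emit = 0.0      # cumulative log emission prob of x[i-d .. i-1], built incrementally
            best = NEG
            for d in range(1, min(Dmax, i) + 1):
                emit += log_e[l][x[i - d]]
                if i - d == 0:
                    prev = log_pi[l]                                    # segment starts the sequence
                else:
                    prev = max(log_a[k][l] + V[k][i - d] for k in states)
                best = max(best, prev + log_dur[l][d] + emit)
            V[l][i] = best
    return max(V[l][N] for l in states)

Relative to plain Viterbi, the only change is the inner loop over d, which is the O(D) time factor mentioned above.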

Learning Re-estimate the parameters of the model based on training data

Two learning scenarios

1. Estimation when the right answer is known
Examples:
GIVEN: a genomic region x = x_1 … x_1,000,000 where we have good (experimental) annotations of the CpG islands
GIVEN: the casino player allows us to observe him one evening as he changes dice, producing 10,000 rolls

2. Estimation when the right answer is unknown
Examples:
GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition
GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x | θ)

1. When the states are known

Given x = x_1 … x_N for which the true path π = π_1 … π_N is known, define:

A_kl = # times the k → l transition occurs in π
E_k(b) = # times state k in π emits b in x

We can show that the maximum likelihood parameters θ (those that maximize P(x | θ)) are:

a_kl = A_kl / Σ_i A_ki
e_k(b) = E_k(b) / Σ_c E_k(c)
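
These count-and-normalize estimates are easy to compute directly; the sketch below also takes an optional pseudocount, anticipating the pseudocount slides further down. The function and argument names are assumptions for illustration.

def ml_estimate(x, pi, states, alphabet, pseudo=0.0):
    # A_kl and E_k(b) as defined above, optionally initialized with pseudocounts
    A = {k: {l: pseudo for l in states} for k in states}
    E = {k: {b: pseudo for b in alphabet} for k in states}
    for i in range(len(x) - 1):
        A[pi[i]][pi[i + 1]] += 1
    for xi, ki in zip(x, pi):
        E[ki][xi] += 1
    # normalize (with pseudo = 0, a state never visited in pi would divide by zero)
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e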

1. When the states are known

Intuition: when we know the underlying states, the best estimate is the normalized frequency of transitions and emissions that occur in the training data.

Drawback: given little data, there may be overfitting. P(x | θ) is maximized, but θ is unreasonable: 0 probabilities, BAD.

Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1; a_FL = 0
e_F(1) = e_F(3) = e_F(6) = .2; e_F(2) = .3; e_F(4) = 0; e_F(5) = .1

Pseudocounts

Solution for small training sets: add pseudocounts

A_kl = # times the k → l transition occurs in π, plus r_kl
E_k(b) = # times state k in π emits b in x, plus r_k(b)

r_kl and r_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts encode a strong prior belief.
Small pseudocounts (ε < 1): just to avoid 0 probabilities.

Pseudocounts. Example: the dishonest casino

We will observe the player for one day, 600 rolls. Reasonable pseudocounts:

r_0F = r_0L = r_F0 = r_L0 = 1
r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20 (strong belief that fair is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5 (wait and see for loaded)

The numbers above are arbitrary; assigning priors is an art.

2. When the states are hidden

We don't know the true A_kl, E_k(b).

Idea:
We estimate our best guess of what A_kl and E_k(b) are (or we start with random / uniform values).
We update the parameters of the model based on our guess.
We repeat.

2. When the states are hidden

Starting with our best guess of a model M with parameters θ, and given x = x_1 … x_N for which the true path π = π_1 … π_N is unknown, we can get to a provably more likely parameter set θ', i.e. a θ' that increases the probability P(x | θ).

Principle: EXPECTATION MAXIMIZATION
1. Estimate A_kl, E_k(b) from the training data
2. Update θ according to A_kl, E_k(b)
3. Repeat 1 & 2 until convergence

Estimating new parameters

To estimate A_kl (assume θ = θ_CURRENT in all formulas below):

At each position i of sequence x, find the probability that transition k → l is used:

P(π_i = k, π_{i+1} = l | x) = [1/P(x)] P(π_i = k, π_{i+1} = l, x_1 … x_N) = Q / P(x)

where
Q = P(x_1 … x_i, π_i = k, π_{i+1} = l, x_{i+1} … x_N)
  = P(π_{i+1} = l, x_{i+1} … x_N | π_i = k) P(x_1 … x_i, π_i = k)
  = P(π_{i+1} = l, x_{i+1} x_{i+2} … x_N | π_i = k) f_k(i)
  = P(x_{i+2} … x_N | π_{i+1} = l) P(x_{i+1} | π_{i+1} = l) P(π_{i+1} = l | π_i = k) f_k(i)
  = b_l(i+1) e_l(x_{i+1}) a_kl f_k(i)

So: P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ_CURRENT)

Estimating new parameters

So A_kl is the expected number of times transition k → l is used, given the current θ:

A_kl = Σ_i P(π_i = k, π_{i+1} = l | x, θ) = [1/P(x | θ)] Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)

(Picture: x_1 … x_{i-1}, then k emitting x_i, transition a_kl, then l emitting x_{i+1}, then x_{i+2} … x_N; f_k(i) covers the prefix and b_l(i+1) the suffix.)

Similarly,

E_k(b) = [1/P(x | θ)] Σ_{i : x_i = b} f_k(i) b_k(i)

The Baum-Welch Algorithm

Initialization: pick the best-guess model parameters (or arbitrary ones)

Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b), given θ_CURRENT
4. Calculate the new model parameters θ_NEW: a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ_NEW)
   (GUARANTEED TO BE HIGHER BY EXPECTATION-MAXIMIZATION)

Until P(x | θ) does not change much
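
A compact sketch of one Baum-Welch iteration in probability space (no scaling, so only suitable for short sequences; the parameter layout mirrors the forward sketch earlier and is an assumption):

def baum_welch_iteration(x, states, alphabet, a0, a, e):
    N = len(x)
    # 1. forward
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for i in range(1, N):
        f.append({l: e[l][x[i]] * sum(f[i-1][k] * a[k][l] for k in states) for l in states})
    Px = sum(f[-1].values())                                        # P(x | theta_CURRENT)
    # 2. backward
    b = [None] * N
    b[N-1] = {k: 1.0 for k in states}
    for i in range(N - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i+1]] * b[i+1][l] for l in states) for k in states}
    # 3. expected counts A_kl and E_k(b) (pseudocounts could be added here)
    A = {k: {l: sum(f[i][k] * a[k][l] * e[l][x[i+1]] * b[i+1][l] for i in range(N - 1)) / Px
             for l in states} for k in states}
    E = {k: {c: sum(f[i][k] * b[i][k] for i in range(N) if x[i] == c) / Px
             for c in alphabet} for k in states}
    # 4. new parameters a_kl, e_k(b)
    a_new = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e_new = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    # 5. return the likelihood of the current parameters to monitor convergence
    return a_new, e_new, Px

Iterating this until Px stops improving gives the full algorithm.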

The Baum-Welch Algorithm

Time complexity: (# iterations) × O(K²N)

Guaranteed to increase the log-likelihood P(x | θ)
Not guaranteed to find the globally best parameters; converges to a local optimum, depending on initial conditions
Too many parameters / too large a model: overtraining

Alternative: Viterbi Training

Initialization: same as Baum-Welch

Iteration:
1. Perform Viterbi to find π*
2. Calculate A_kl, E_k(b) according to π*, plus pseudocounts
3. Calculate the new parameters a_kl, e_k(b)
Until convergence

Notes:
Not guaranteed to increase P(x | θ)
Guaranteed to increase P(x | θ, π*)
In general, worse performance than Baum-Welch

Pair-HMMs and CRFs Slide Credits: Chuong B. Do

Quick recap of HMMs

Formally, an HMM is a tuple (Σ, Q, A, a_0, e):
alphabet: Σ = {b_1, …, b_M}
set of states: Q = {1, …, K}
transition probabilities: A = [a_ij]
initial state probabilities: a_0i
emission probabilities: e_i(b_k)

Example: the dishonest casino (states FAIR and LOADED, each with self-transition probability 0.95 and switching probability 0.05).
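
For concreteness, the tuple for the dishonest casino can be written out directly. The loaded-die emission probabilities below are the usual textbook values and an assumption here, since this slide only shows the transition probabilities.

Sigma = ["1", "2", "3", "4", "5", "6"]
Q     = ["FAIR", "LOADED"]
A     = {"FAIR":   {"FAIR": 0.95, "LOADED": 0.05},
         "LOADED": {"FAIR": 0.05, "LOADED": 0.95}}
a0    = {"FAIR": 0.5, "LOADED": 0.5}                      # assumed uniform start
e     = {"FAIR":   {b: 1/6 for b in Sigma},
         "LOADED": {"1": 0.1, "2": 0.1, "3": 0.1, "4": 0.1, "5": 0.1, "6": 0.5}}

These dictionaries plug directly into the forward and Baum-Welch sketches above.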

Pair-HMMs

Consider the HMM = ((Σ_1 ∪ {η}) × (Σ_2 ∪ {η}), Q, A, a_0, e).

Instead of always emitting a pair of letters, in some states we may emit a letter paired with η (the empty string). For simplicity, assume η is never emitted for both observation sequences simultaneously. Call the two observation sequences x and y.

Application: sequence alignment

Consider the following three-state pair-HMM:
M emits an aligned pair (x_i, y_j) with probability P(x_i, y_j)
I emits (x_i, η) with probability Q(x_i)
J emits (η, y_j) with probability Q(y_j)
Transitions: M stays in M with probability 1 - 2δ and moves to I or J with probability δ each; I and J loop with probability ε and return to M with probability 1 - ε (direct I ↔ J transitions appear in the figure as optional).
For all c ∈ Σ, P(η, c) = P(c, η) = Q(c).

QUESTION: What are the interpretations of P(c, d) and Q(c) for c, d ∈ Σ?
QUESTION: What does this model have to do with alignments?
QUESTION: What is the average length of a gapped region in alignments generated by this model? The average length of matched regions?

Recap: Viterbi for single-sequence HMMs

(Trellis: hidden states 1, …, K at each position; observations x_1 x_2 x_3 … x_K.)

Algorithm:
V_k(i) = max_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1 … π_{i-1}, x_i, π_i = k)
Compute using dynamic programming!

(Broken) Viterbi for pair-HMMs

In the single-sequence case, we defined
V_k(i) = max_{π_1 … π_{i-1}} P(x_1 … x_{i-1}, π_1 … π_{i-1}, x_i, π_i = k) = e_k(x_i) max_j a_jk V_j(i-1)

In the pairwise case, (x_1, y_1) … (x_{i-1}, y_{i-1}) no longer correspond to the first i-1 letters of x and y.

(Fixed) Viterbi for pair-HMMs

Consider the special case of the alignment pair-HMM above (states M, I, J, with P(η, c) = P(c, η) = Q(c) for all c ∈ Σ):

V^M(i, j) = P(x_i, y_j) · max{ (1 - 2δ) V^M(i-1, j-1), (1 - ε) V^I(i-1, j-1), (1 - ε) V^J(i-1, j-1) }
V^I(i, j) = Q(x_i) · max{ δ V^M(i-1, j), ε V^I(i-1, j) }
V^J(i, j) = Q(y_j) · max{ δ V^M(i, j-1), ε V^J(i, j-1) }

Similar recurrences hold for the forward/backward algorithms (see Durbin et al. for details).

QUESTION: What is the computational complexity of the DP?
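
A log-space sketch of these three coupled recurrences (assuming, for simplicity, that the model starts in M; logP[(a, b)] and logQ[a] are assumed lookup tables holding log P(a, b) and log Q(a)):

import math

NEG = float("-inf")

def pair_viterbi(x, y, logP, logQ, delta, eps):
    n, m = len(x), len(y)
    VM = [[NEG] * (m + 1) for _ in range(n + 1)]
    VI = [[NEG] * (m + 1) for _ in range(n + 1)]
    VJ = [[NEG] * (m + 1) for _ in range(n + 1)]
    VM[0][0] = 0.0                                # start in state M (a simplifying assumption)
    l2d, l1e = math.log(1 - 2 * delta), math.log(1 - eps)
    ld, le = math.log(delta), math.log(eps)
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                VM[i][j] = logP[(x[i-1], y[j-1])] + max(l2d + VM[i-1][j-1],
                                                        l1e + VI[i-1][j-1],
                                                        l1e + VJ[i-1][j-1])
            if i > 0:
                VI[i][j] = logQ[x[i-1]] + max(ld + VM[i-1][j], le + VI[i-1][j])
            if j > 0:
                VJ[i][j] = logQ[y[j-1]] + max(ld + VM[i][j-1], le + VJ[i][j-1])
    return max(VM[n][m], VI[n][m], VJ[n][m])      # log-probability of the best alignment

The two nested loops over i and j make this O(|x| · |y|) time and space.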

Connection to NW with affine gaps

V^M(i, j) = P(x_i, y_j) · max{ (1 - 2δ) V^M(i-1, j-1), (1 - ε) V^I(i-1, j-1), (1 - ε) V^J(i-1, j-1) }
V^I(i, j) = Q(x_i) · max{ δ V^M(i-1, j), ε V^I(i-1, j) }
V^J(i, j) = Q(y_j) · max{ δ V^M(i, j-1), ε V^J(i, j-1) }

QUESTION: How would the optimal alignment change if we divided the probability of every single alignment by Π_{i=1}^{|x|} Q(x_i) · Π_{j=1}^{|y|} Q(y_j)?

Connection to NW with affine gaps

V^M(i, j) = [P(x_i, y_j) / (Q(x_i) Q(y_j))] · max{ (1 - 2δ) V^M(i-1, j-1), (1 - ε) V^I(i-1, j-1), (1 - ε) V^J(i-1, j-1) }
V^I(i, j) = max{ δ V^M(i-1, j), ε V^I(i-1, j) }
V^J(i, j) = max{ δ V^M(i, j-1), ε V^J(i, j-1) }

Account for the extra terms along the way.

Connection to NW with affine gaps

log V^M(i, j) = log[P(x_i, y_j) / (Q(x_i) Q(y_j))] + max{ log(1 - 2δ) + log V^M(i-1, j-1), log(1 - ε) + log V^I(i-1, j-1), log(1 - ε) + log V^J(i-1, j-1) }
log V^I(i, j) = max{ log δ + log V^M(i-1, j), log ε + log V^I(i-1, j) }
log V^J(i, j) = max{ log δ + log V^M(i, j-1), log ε + log V^J(i, j-1) }

Take logs, and ignore a couple of terms.

Connection to NW with affine gaps

M(i, j) = S(x_i, y_j) + max{ M(i-1, j-1), I(i-1, j-1), J(i-1, j-1) }
I(i, j) = max{ d + M(i-1, j), e + I(i-1, j) }
J(i, j) = max{ d + M(i, j-1), e + J(i, j-1) }

Rename!
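
After the renaming, this is exactly Needleman-Wunsch with affine gaps. A minimal sketch follows; the scoring values in the example call are illustrative and not derived from any particular P and Q.

NEG = float("-inf")

def nw_affine(x, y, S, d, e):
    # M/I/J matrices as above; d = gap open score, e = gap extend score (both <= 0)
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    I = [[NEG] * (m + 1) for _ in range(n + 1)]
    J = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                M[i][j] = S(x[i-1], y[j-1]) + max(M[i-1][j-1], I[i-1][j-1], J[i-1][j-1])
            if i > 0:
                I[i][j] = max(d + M[i-1][j], e + I[i-1][j])
            if j > 0:
                J[i][j] = max(d + M[i][j-1], e + J[i][j-1])
    return max(M[n][m], I[n][m], J[n][m])

print(nw_affine("ACGT", "AGT",
                lambda a, b: 2 if a == b else -1,          # S(x_i, y_j)
                d=-5, e=-1))                               # prints 1 (three matches, one gapped residue)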