Hidden Markov Models (I)


GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (I)
a. The model
b. Decoding: the Viterbi algorithm

Hidden Markov models

A Markov chain of states; at each state there is a set of possible observables (symbols). The states themselves are not directly observable, i.e., they are hidden.

E.g., casino fraud, with a fair die and a loaded die:
Fair: P(1) = ... = P(6) = 1/6; stay fair with probability 0.95, switch to loaded with probability 0.05.
Loaded: P(1) = ... = P(5) = 1/10, P(6) = 1/2; stay loaded with probability 0.9, switch to fair with probability 0.1.

Three major problems:
- The most probable state path (decoding)
- The likelihood of the observations (evaluation)
- Parameter estimation for HMMs (learning)
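To make the model concrete, here is a minimal Python sketch (not from the original slides) of how the dishonest-casino HMM could be written down; the variable names are illustrative only.

```python
# Hypothetical encoding of the dishonest-casino HMM; names are illustrative.
states = ("F", "L")                      # F = fair die, L = loaded die
transitions = {                          # a_kl = P(next state is l | current state is k)
    "F": {"F": 0.95, "L": 0.05},
    "L": {"F": 0.10, "L": 0.90},
}
emissions = {                            # e_k(b) = P(observing symbol b | state k)
    "F": {str(d): 1/6 for d in range(1, 7)},
    "L": {**{str(d): 1/10 for d in range(1, 6)}, "6": 1/2},
}
```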

A biological example: CpG islands

The higher rate of methyl-C mutating to T in CpG dinucleotides generally leads to lower CpG content in the genome, except in some biologically important regions, e.g., promoters; such regions are called CpG islands.

The conditional probabilities P±(N' | N) were collected from ~60,000 bp of human genome sequence, where '+' stands for CpG islands and '-' for non-CpG islands.

P+      A      C      G      T
A     .180   .274   .426   .120
C     .171   .368   .274   .188
G     .161   .339   .375   .125
T     .079   .355   .384   .182

P-      A      C      G      T
A     .300   .205   .285   .210
C     .322   .298   .078   .302
G     .248   .246   .298   .208
T     .177   .239   .292   .292

Task 1: Given a sequence x, determine whether it is a CpG island.

One solution: compute the log-odds ratio scored by the two Markov chains:

S(x) = log [ P(x | model +) / P(x | model -) ]

where P(x | model +) = P+(x_2 | x_1) P+(x_3 | x_2) ... P+(x_L | x_{L-1})
and   P(x | model -) = P-(x_2 | x_1) P-(x_3 | x_2) ... P-(x_L | x_{L-1}).

[Figure: histogram of the length-normalized scores S(x) (number of sequences vs. score); CpG-island sequences are shown dark shaded.]
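As an illustration, a short Python sketch of the log-odds scorer using the table values above; the function name and test sequences are made up for the example, and in practice one would length-normalize S(x) as in the histogram.

```python
import math

# Transition probabilities P+(next | prev) and P-(next | prev) from the tables above.
P_plus = {
    "A": {"A": .180, "C": .274, "G": .426, "T": .120},
    "C": {"A": .171, "C": .368, "G": .274, "T": .188},
    "G": {"A": .161, "C": .339, "G": .375, "T": .125},
    "T": {"A": .079, "C": .355, "G": .384, "T": .182},
}
P_minus = {
    "A": {"A": .300, "C": .205, "G": .285, "T": .210},
    "C": {"A": .322, "C": .298, "G": .078, "T": .302},
    "G": {"A": .248, "C": .246, "G": .298, "T": .208},
    "T": {"A": .177, "C": .239, "G": .292, "T": .292},
}

def log_odds(x: str) -> float:
    """S(x) = sum over i of log[ P+(x_i | x_{i-1}) / P-(x_i | x_{i-1}) ]."""
    return sum(math.log(P_plus[a][b] / P_minus[a][b]) for a, b in zip(x, x[1:]))

print(log_odds("CGCGCGCG"))   # clearly positive: CpG-island-like
print(log_odds("ATATATAT"))   # negative: background-like
```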

Task 2: For a long genomic sequence x, label the CpG islands, if there are any.

Approach 1: Adapt the method for Task 1 by computing the log-odds score in a window of, say, 100 bp around every nucleotide and plotting it.

Problems with this approach:
- It does not do well when CpG islands have sharp boundaries and variable lengths.
- There is no effective way to choose a good window size.

Approach 2: Use a hidden Markov model.

The model has two states: '+' for CpG island and '-' for non-CpG island.

Transitions: a_{++} = 0.65, a_{+-} = 0.35, a_{-+} = 0.30, a_{--} = 0.70.
Emissions:   e_+(A) = .170, e_+(C) = .368, e_+(G) = .274, e_+(T) = .188;
             e_-(A) = .372, e_-(C) = .198, e_-(G) = .112, e_-(T) = .338.

These numbers are made up here; in practice they are fixed by learning from training examples.

Notation: a_kl is the transition probability from state k to state l; e_k(b) is the emission probability of seeing symbol b while in state k.

The probability that sequence x is emitted along a state path π is:

P(x, π) = ∏_{i=1..L} e_{π_i}(x_i) a_{π_i, π_{i+1}}

Example:
i: 1 2 3 4 5 6 7 8 9
x: T G C G C G T A C
π: - - + + + + - - -

P(x, π) = 0.338 × 0.70 × 0.112 × 0.30 × 0.368 × 0.65 × 0.274 × 0.65 × 0.368 × 0.65 × 0.274 × 0.35 × 0.338 × 0.70 × 0.372 × 0.70 × 0.198.

Then the probability of observing sequence x under the model is P(x) = Σ_π P(x, π), which is also called the likelihood of the model.
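A small sketch (with the made-up two-state parameters above) that reproduces this product; as on the slide, begin/end transitions are ignored.

```python
# Two-state CpG model from the slide; '+' = CpG island, '-' = background.
emissions = {
    "+": {"A": .170, "C": .368, "G": .274, "T": .188},
    "-": {"A": .372, "C": .198, "G": .112, "T": .338},
}
transitions = {
    "+": {"+": 0.65, "-": 0.35},
    "-": {"+": 0.30, "-": 0.70},
}

def joint_prob(x: str, path: str) -> float:
    """P(x, pi) = prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}} (no begin/end terms)."""
    p = 1.0
    for i, (sym, st) in enumerate(zip(x, path)):
        p *= emissions[st][sym]                     # emission at position i
        if i + 1 < len(path):
            p *= transitions[st][path[i + 1]]       # transition to the next state
    return p

print(joint_prob("TGCGCGTAC", "--++++---"))         # the 17-factor product shown above
```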

Decoding: Given an observed sequence x, what is the most probable state path, i.e., π* = argmax_π P(x, π)?

Q: Given a sequence x of length L, how many state paths are there?
A: N^L, where N is the number of states in the model. Because this grows exponentially with the input size, enumerating all possible state paths to compute P(x) is infeasible.

Viterbi algorithm

Let v_k(i) be the probability of the most probable path ending in state k at position i.

Initialization: v_0(0) = 1, v_k(0) = 0 for k > 0.
Recursion: v_k(i) = e_k(x_i) max_j [ v_j(i-1) a_jk ];  ptr_i(k) = argmax_j [ v_j(i-1) a_jk ].
Termination: P(x, π*) = max_k [ v_k(L) a_k0 ];  π*_L = argmax_k [ v_k(L) a_k0 ].
Traceback: π*_{i-1} = ptr_i(π*_i).
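A compact Python sketch of this recursion (an illustration under simplifying assumptions, not the slides' code): `init[k]` stands in for the begin transition a_0k, and the end transition a_k0 is dropped, matching the worked example on the next slide.

```python
def viterbi(x, states, a, e, init):
    """v_k(i) = e_k(x_i) * max_j [ v_j(i-1) * a_jk ], with pointer-based traceback."""
    v = [{k: init[k] * e[k][x[0]] for k in states}]   # initialization (position 1)
    ptr = []
    for sym in x[1:]:
        col, back = {}, {}
        for k in states:
            best_j = max(states, key=lambda j: v[-1][j] * a[j][k])
            back[k] = best_j
            col[k] = e[k][sym] * v[-1][best_j] * a[best_j][k]
        v.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])        # termination (no a_k0 here)
    path = [last]
    for back in reversed(ptr):                        # traceback
        path.append(back[path[-1]])
    return "".join(reversed(path)), v[-1][last]

# The two-state CpG model, with initial probabilities 0.5/0.5 as on the next slide:
e = {"+": {"A": .170, "C": .368, "G": .274, "T": .188},
     "-": {"A": .372, "C": .198, "G": .112, "T": .338}}
a = {"+": {"+": 0.65, "-": 0.35}, "-": {"+": 0.30, "-": 0.70}}
print(viterbi("AAGCG", ("+", "-"), a, e, {"+": 0.5, "-": 0.5}))
```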

Worked example with initial transitions a_0+ = a_0- = 0.5 and the two-state model above, using v_k(i) = e_k(x_i) max_j [ v_j(i-1) a_jk ]:

v_k(i)   ε       A       A       G        C         G
0        1       0       0       0        0         0
-        0     .186    .048    .0038    .00053    .000042
+        0     .085    .0095   .0039    .00093    .00038

Hidden state path (traceback): - - + + +

Casino Fraud: investigation results by Viterbi decoding

The log transformation for the Viterbi algorithm

Original recursion: v_k(i) = e_k(x_i) max_j [ v_j(i-1) a_jk ].

Define ã_jk = log a_jk, ẽ_k(x_i) = log e_k(x_i), ṽ_k(i) = log v_k(i). Then:

ṽ_k(i) = ẽ_k(x_i) + max_j [ ṽ_j(i-1) + ã_jk ]
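The same idea in code, as a minimal score-only variant of the Viterbi sketch above (traceback would be added exactly as before):

```python
import math

def viterbi_log_score(x, states, a, e, init):
    """Log-space recursion: v~_k(i) = e~_k(x_i) + max_j [ v~_j(i-1) + a~_jk ]."""
    loga = {j: {k: math.log(a[j][k]) for k in a[j]} for j in a}
    loge = {k: {b: math.log(e[k][b]) for b in e[k]} for k in e}
    v = {k: math.log(init[k]) + loge[k][x[0]] for k in states}
    for sym in x[1:]:
        v = {k: loge[k][sym] + max(v[j] + loga[j][k] for j in states) for k in states}
    return max(v.values())    # log P(x, pi*), end transition omitted
```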

GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (II)
a. The model likelihood: forward algorithm, backward algorithm
b. Posterior decoding


How do we calculate the probability of observing sequence x under the model, P(x) = Σ_π P(x, π)?

Let f_k(i) be the probability contributed by all paths from the beginning up to (and including) position i, with the state at position i being k. Then the following recurrence holds:

f_k(i) = [ Σ_j f_j(i-1) a_jk ] e_k(x_i)

[Figure: trellis of states 0..N over positions x_1 ... x_L, with all transitions feeding into each state at each position.]

Again, a silent state 0 is introduced for ease of presentation.

Forward algorithm
Initialization: f_0(0) = 1, f_k(0) = 0 for k > 0.
Recursion: f_k(i) = e_k(x_i) Σ_j f_j(i-1) a_jk.
Termination: P(x) = Σ_k f_k(L) a_k0.
Time complexity: O(N²L), where N is the number of states and L is the sequence length.
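A direct Python sketch of the forward recursion; as in the earlier sketches, `init[k]` plays the role of a_0k and the closing transition a_k0 is omitted, so the termination step is simply a sum over the last column.

```python
def forward(x, states, a, e, init):
    """Forward table (f[i][k] ~ f_k(i+1) on the slide, 0-based i) and the likelihood P(x)."""
    f = [{k: init[k] * e[k][x[0]] for k in states}]             # initialization
    for sym in x[1:]:
        f.append({k: e[k][sym] * sum(f[-1][j] * a[j][k] for j in states)
                  for k in states})                             # recursion
    return f, sum(f[-1].values())                               # table, P(x)
```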

Let b_k(i) be the probability contributed by all paths that pass through state k at position i and emit the rest of the sequence:

b_k(i) = P(x_{i+1}, ..., x_L | π_i = k)

Backward algorithm
Initialization: b_k(L) = a_k0 for all k.
Recursion (i = L-1, ..., 1): b_k(i) = Σ_j a_kj e_j(x_{i+1}) b_j(i+1).
Termination: P(x) = Σ_k a_0k e_k(x_1) b_k(1).
Time complexity: O(N²L), where N is the number of states and L is the sequence length.
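A matching sketch of the backward recursion; since the forward sketch above has no explicit end state, the initialization uses b_k(L) = 1 rather than a_k0.

```python
def backward(x, states, a, e):
    """Backward table: b[i][k] corresponds to b_k(i+1) on the slide (0-based list index)."""
    L = len(x)
    b = [dict() for _ in range(L)]
    b[L - 1] = {k: 1.0 for k in states}                         # b_k(L) = 1 (no end state)
    for i in range(L - 2, -1, -1):                              # recursion, right to left
        b[i] = {k: sum(a[k][j] * e[j][x[i + 1]] * b[i + 1][j] for j in states)
                for k in states}
    return b
```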

Posterior decoding

P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)

Algorithm: for i = 1 to L, set π̂_i = argmax_k P(π_i = k | x).

Notes:
1. Posterior decoding can be useful when there are multiple, almost equally probable paths, or when a function is defined on the states.
2. The state path identified by posterior decoding may not be the most probable path overall; it may not even be a viable path (it can use transitions of probability zero).
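Putting the pieces together, a short sketch of posterior decoding that relies on the `forward` and `backward` sketches above (same simplifying assumptions, no end state):

```python
def posterior_decode(x, states, a, e, init):
    """pi^_i = argmax_k P(pi_i = k | x), with P(pi_i = k | x) = f_k(i) * b_k(i) / P(x)."""
    f, px = forward(x, states, a, e, init)     # from the forward sketch above
    b = backward(x, states, a, e)              # from the backward sketch above
    return [max(states, key=lambda k: f[i][k] * b[i][k] / px) for i in range(len(x))]
```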

GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (III)
- Viterbi training
- Baum-Welch algorithm
- Maximum likelihood
- Expectation maximization

Model building

Topology: requires domain knowledge.

Parameters:
- When states are labeled for the sequences of observables, simple counting suffices:
  a_kl = A_kl / Σ_l' A_kl'  and  e_k(b) = E_k(b) / Σ_b' E_k(b'),
  where A_kl and E_k(b) are the observed transition and emission counts (a counting sketch follows below).
- When states are not labeled:

Method 1 (Viterbi training):
1. Assign random parameters.
2. Use the Viterbi algorithm for labeling/decoding.
3. Do counting to collect new a_kl and e_k(b).
4. Repeat steps 2 and 3 until a stopping criterion is met.

Method 2: the Baum-Welch algorithm, described next.
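For the labeled case (and for step 3 of Viterbi training, where the Viterbi paths serve as labels), the counting is straightforward. Here is a minimal sketch; the optional pseudocount is my addition (pseudocounts are discussed later under maximum likelihood), not part of the slide.

```python
from collections import defaultdict

def estimate_from_labels(seqs, paths, states, alphabet, pseudo=1.0):
    """a_kl = A_kl / sum_l' A_kl' and e_k(b) = E_k(b) / sum_b' E_k(b'),
    where A and E are transition/emission counts (plus a pseudocount)."""
    A = {k: defaultdict(lambda: pseudo) for k in states}
    E = {k: defaultdict(lambda: pseudo) for k in states}
    for x, pi in zip(seqs, paths):
        for i, (sym, st) in enumerate(zip(x, pi)):
            E[st][sym] += 1                      # emission count E_k(b)
            if i + 1 < len(pi):
                A[st][pi[i + 1]] += 1            # transition count A_kl
    a = {k: {l: A[k][l] / sum(A[k][m] for m in states) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k][c] for c in alphabet) for b in alphabet} for k in states}
    return a, e
```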

Baum-Welch algorithm (expectation maximization)

An iterative procedure similar to Viterbi training.

The probability that the transition k → l is used at position i of sequence x^j:

P(π_i = k, π_{i+1} = l | x^j, θ) = f_k(i) a_kl e_l(x^j_{i+1}) b_l(i+1) / P(x^j)

Calculate the expected number of times the transition k → l is used by summing over all positions and over all training sequences:

A_kl = Σ_j (1 / P(x^j)) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)

Similarly, calculate the expected number of times that symbol b is emitted in state k:

E_k(b) = Σ_j (1 / P(x^j)) Σ_{i : x^j_i = b} f_k^j(i) b_k^j(i)
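As a sketch only, here is how these expected counts could be accumulated with the `forward` and `backward` functions from the earlier sketches (same simplifying assumptions, no end state); the M-step then renormalizes A and E exactly as in the counting formulas above.

```python
def expected_counts(seqs, states, alphabet, a, e, init):
    """One Baum-Welch E-step: expected transition counts A_kl and emission counts E_k(b)."""
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {b: 0.0 for b in alphabet} for k in states}
    for x in seqs:
        f, px = forward(x, states, a, e, init)      # forward table and P(x^j)
        bwd = backward(x, states, a, e)             # backward table
        for i in range(len(x)):
            for k in states:
                E[k][x[i]] += f[i][k] * bwd[i][k] / px           # expected emissions
                if i + 1 < len(x):
                    for l in states:                             # expected k -> l transitions
                        A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * bwd[i + 1][l] / px
    return A, E
```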

Maximum likelihood

Define L(θ) = P(x | θ). Estimate θ such that the distribution with the estimated θ best agrees with, or supports, the data observed so far:

θ_ML = argmax_θ L(θ)

E.g., there are red and black balls in a box. What is the probability p of picking a black ball? Do sampling (with replacement).

Maximum likelihood

Define L(θ) = P(x | θ). Estimate θ such that the distribution with the estimated θ best agrees with, or supports, the data observed so far:

θ_ML = argmax_θ L(θ)

When L(θ) is differentiable, set ∂L(θ)/∂θ at θ_ML to 0.

For example, we want to know the ratio of black to white balls in the box, in other words the probability p of picking a black ball. Sampling with replacement 100 times gives counts: white 91, black 9.

Likelihood (i.i.d.): L(p) = p^9 (1-p)^91.

dL(p)/dp = 9 p^8 (1-p)^91 - 91 p^9 (1-p)^90 = 0  ⇒  9(1-p) = 91p  ⇒  p_ML = 9/100 = 9%.

The ML estimate of p is just the observed frequency.

A proof that the observed frequencies give the ML estimate of the probabilities for a multinomial distribution

Let n_i be the count for outcome i, and let N = Σ_i n_i, so the observed frequencies are θ_i = n_i / N.

Claim: if θ_i^ML = n_i / N, then P(n | θ^ML) ≥ P(n | θ) for any θ.

Proof:
log [ P(n | θ^ML) / P(n | θ) ] = Σ_i n_i log(θ_i^ML / θ_i)
                               = N Σ_i θ_i^ML log(θ_i^ML / θ_i)
                               = N H(θ^ML || θ) ≥ 0,
where H(θ^ML || θ) is the relative entropy, which is never negative.

Maximum likelihood: pros and cons
- Consistent: in the limit of a large amount of data, the ML estimate converges to the true parameters from which the data were generated.
- Simple.
- Poor estimates when data are insufficient. E.g., if you roll a die fewer than 6 times, the ML estimate for some outcomes will be zero.

Pseudocounts: θ_i = (n_i + α_i) / (N + A), where A = Σ_i α_i.

Conditional probability and joint probability

Example (counting shapes in a figure): P(one) = 5/13, P(square) = 8/13, P(one, square) = 3/13,
P(one | square) = 3/8 = P(one, square) / P(square).

In general, P(D, M) = P(D | M) P(M) = P(M | D) P(D), which gives Bayes' rule:

P(M | D) = P(D | M) P(M) / P(D)

Conditional Probability and Conditional Independence

Bayes' rule: P(M | D) = P(D | M) P(M) / P(D)

Example: disease diagnosis/inference. P(Leukemia | Fever) = ?
Given: P(Fever | Leukemia) = 0.85, P(Fever) = 0.9, P(Leukemia) = 0.005.

P(Leukemia | Fever) = P(F | L) P(L) / P(F) = 0.85 × 0.005 / 0.9 ≈ 0.0047

Bayesian inference

Maximum a posteriori (MAP) estimate: θ_MAP = argmax_θ P(θ | x)

Expectation maximization

P(x, y | θ) = P(y | x, θ) P(x | θ)  ⇒  P(x | θ) = P(x, y | θ) / P(y | x, θ)

log P(x | θ) = log P(x, y | θ) - log P(y | x, θ)

Multiply both sides by P(y | x, θ^t) and sum over the hidden data y:

log P(x | θ) = Σ_y P(y | x, θ^t) log P(x, y | θ) - Σ_y P(y | x, θ^t) log P(y | x, θ)

Expectation: define Q(θ | θ^t) = Σ_y P(y | x, θ^t) log P(x, y | θ).

Then:

log P(x | θ) - log P(x | θ^t)
  = Q(θ | θ^t) - Q(θ^t | θ^t) + Σ_y P(y | x, θ^t) log [ P(y | x, θ^t) / P(y | x, θ) ]
  ≥ Q(θ | θ^t) - Q(θ^t | θ^t),

since the last sum is a relative entropy and hence non-negative.

Maximization: θ^{t+1} = argmax_θ Q(θ | θ^t).

EM explanation of the Baum-Welch algorithm

We would like to maximize P(x | θ) by choosing θ, but the state path π is a hidden variable; thus we use EM, with

Q(θ | θ^t) = Σ_π P(π | x, θ^t) log P(x, π | θ).

EM explanation of the Baum-Welch algorithm (continued)

Q(θ | θ^t) splits into a transition term (A-term) and an emission term (E-term).

The A-term is maximized by  a_kl^EM = A_kl / Σ_l' A_kl'.
The E-term is maximized by  e_k^EM(b) = E_k(b) / Σ_b' E_k(b'),

where A_kl and E_k(b) are the expected counts computed in the E-step.

GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (IV)
a. Profile HMMs
b. iPHMMs
c. GENSCAN
d. TMMOD

Interaction profile HMM (iphmm) Friedrich et al, Bioinformatics 2006

GENSCAN (generalized HMMs)
Chris Burge, PhD thesis, Stanford, 1997. http://genes.mit.edu/genscan.html

Four components:
- A vector π of initial probabilities
- A matrix T of state transition probabilities
- A set of length distributions f
- A set of sequence-generating models P

Generalized HMMs: at each state, the emission is not a single symbol (or residue) but a fragment of sequence. Decoding uses a modified Viterbi algorithm.

Initial state probabilities: taken as the frequency with which each functional unit occurs in actual genomic data. E.g., since roughly 80% of the genome consists of non-coding intergenic regions, the initial probability of the intergenic state N is 0.80.

[Figures: state transition probabilities; state length distributions.]

Training data: 2.5 Mb of human genomic sequence; 380 genes, including 142 single-exon genes, with 1492 exons and 1254 introns; 1619 cDNAs.

Open areas for research
- Model building: integration of domain knowledge, such as structural information, into profile HMMs; meta-learning?
- Biological mechanisms: DNA replication
- Hybrid models
- Generalized HMMs

TMMOD: An improved hidden Markov model for predicting transmembrane topology

TMHMM (Krogh, A., et al., JMB 305 (2001) 567-580)

[Figure: TMHMM state architecture, spanning the cytoplasmic side (loop cyt, cap cyt, globular), the membrane (cap - helix core - cap), and the non-cytoplasmic side (cap non-cyt, long loop non-cyt, globular).]

Accuracy of topology prediction: 78%.

[Figure: number of loops vs. loop length (0-100 residues), for cytoplasmic (cyto) vs. non-cytoplasmic (acyto) loops.]

Prediction results on the S-83 and S-160 data sets:

Model     Reg.  Data set  Correct topology  Correct location  Sensitivity  Specificity
TMMOD 1   (a)   S-83      65 (78.3%)        67 (80.7%)        97.4%        97.4%
TMMOD 1   (b)   S-83      51 (61.4%)        52 (62.7%)        71.3%        71.3%
TMMOD 1   (c)   S-83      64 (77.1%)        65 (78.3%)        97.1%        97.1%
TMMOD 2   (a)   S-83      61 (73.5%)        65 (78.3%)        99.4%        97.4%
TMMOD 2   (b)   S-83      54 (65.1%)        61 (73.5%)        93.8%        71.3%
TMMOD 2   (c)   S-83      54 (65.1%)        66 (79.5%)        99.7%        97.1%
TMMOD 3   (a)   S-83      70 (84.3%)        71 (85.5%)        98.2%        97.4%
TMMOD 3   (b)   S-83      64 (77.1%)        65 (78.3%)        95.3%        71.3%
TMMOD 3   (c)   S-83      74 (89.2%)        74 (89.2%)        99.1%        97.1%
TMHMM     -     S-83      64 (77.1%)        69 (83.1%)        96.2%        96.2%
PHDtm     -     S-83      (85.5%)           (88.0%)           98.8%        95.2%
TMMOD 1   (a)   S-160     117 (73.1%)       128 (80.0%)       97.4%        97.0%
TMMOD 1   (b)   S-160     92 (57.5%)        103 (64.4%)       77.4%        80.8%
TMMOD 1   (c)   S-160     117 (73.1%)       126 (78.8%)       96.1%        96.7%
TMMOD 2   (a)   S-160     120 (75.0%)       132 (82.5%)       98.4%        97.2%
TMMOD 2   (b)   S-160     97 (60.6%)        121 (75.6%)       97.7%        95.6%
TMMOD 2   (c)   S-160     118 (73.8%)       135 (84.4%)       98.4%        97.2%
TMMOD 3   (a)   S-160     120 (75.0%)       133 (83.1%)       97.8%        97.6%
TMMOD 3   (b)   S-160     110 (68.8%)       124 (77.5%)       94.5%        98.1%
TMMOD 3   (c)   S-160     135 (84.4%)       143 (89.4%)       98.3%        98.1%
TMHMM     -     S-160     123 (76.9%)       134 (83.8%)       97.1%        97.7%