Hidden Markov Models. Main source: Durbin et al., Biological Sequence Analysis (Cambridge University Press, 1998)


The occasionally dishonest casino
Two dice: a fair die A with P_A(1) = P_A(2) = ... = P_A(6) = 1/6, and a loaded die B with P_B(1) = ... = P_B(5) = 0.1 and P_B(6) = 0.5. The casino switches between them with probability P_{A->B} = P_{B->A} = 1/10.
Observed rolls: 13652656643662612564
Can we tell when the loaded die is used?
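The model is small enough to play with directly. Below is a minimal sketch in Python (function and variable names are my own, not from the slides) that generates rolls from this two-state process; the hidden state string is exactly what the decoding questions later ask us to recover.

```python
import random

TRANS = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
EMIT = {'A': dict.fromkeys('123456', 1/6),                  # fair die
        'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}     # loaded die

def simulate(n, start='A', seed=0):
    rng = random.Random(seed)
    state, rolls, states = start, [], []
    for _ in range(n):
        states.append(state)
        rolls.append(rng.choices(list(EMIT[state]),
                                 weights=list(EMIT[state].values()))[0])
        state = rng.choices(list(TRANS[state]),
                            weights=list(TRANS[state].values()))[0]
    return ''.join(rolls), ''.join(states)

rolls, states = simulate(20)
print(rolls)    # what the observer sees
print(states)   # hidden: which die was in play - what we want to infer
```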

CpG islands
DNA stretches (100-1000 bp) with frequent CG dinucleotide pairs (contiguous on the same strand). They are rare overall, but appear in biologically significant parts of the genome.
Problem (1): Given a short genome sequence, decide if it comes from a CpG island.

Preliminaries: Markov Chains
A Markov chain is a triple (S, A, p):
- S: state set
- p: initial state probability vector {p(x_1 = s)} [alternatively, use a begin state]
- A: transition probability matrix, a_st = P(x_i = t | x_{i-1} = s)
Assumption: X = x_1 ... x_L is a random process with memory length 1, i.e., for all s_i in S:
P(x_i = s_i | x_1 = s_1, ..., x_{i-1} = s_{i-1}) = P(x_i = s_i | x_{i-1} = s_{i-1}) = a_{s_{i-1}, s_i}
Sequence probability: P(X) = p(x_1) Π_{i=2}^{L} a_{x_{i-1}, x_i}
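As a sketch of the sequence-probability formula (the toy chain and its numbers are illustrative, not from the slides); summing logs avoids the underflow the raw product would hit on long sequences:

```python
import math

def markov_log_prob(seq, init, trans):
    """log P(X) = log p(x_1) + sum_{i=2..L} log a_{x_{i-1}, x_i}."""
    logp = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(trans[prev][cur])
    return logp

# toy two-state chain (numbers are illustrative only)
init = {'R': 0.5, 'S': 0.5}
trans = {'R': {'R': 0.7, 'S': 0.3}, 'S': {'R': 0.4, 'S': 0.6}}
print(markov_log_prob('RRSS', init, trans))
```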

Markov Models
Transition probabilities outside CpG islands (-) and inside CpG islands (+). Rows give the previous base, columns the current base:

-     A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

+     A      C      G      T
A   0.180  0.274  0.425  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

Note the much higher C->G probability inside islands (0.274 vs. 0.078).

CpG islands: Fixed Window
Problem (1): Given a short genome sequence X, decide if it comes from a CpG island.
Solution: Model by a Markov chain. Let a+_st be the transition probability inside CpG islands and a-_st the transition probability outside them. Decide by the log-likelihood ratio score:

score(X) = log [ P(X | CpG island) / P(X | non-CpG island) ] = Σ_{i=1}^{n} log ( a+_{x_{i-1}, x_i} / a-_{x_{i-1}, x_i} )

or, normalized to bits per position:

bits_score(X) = (1/n) Σ_{i=1}^{n} log_2 ( a+_{x_{i-1}, x_i} / a-_{x_{i-1}, x_i} )
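A minimal sketch of this score, using the two tables above (here normalized per transition rather than per symbol, a cosmetic difference; the function name is my own):

```python
import math

PLUS = {'A': {'A': .180, 'C': .274, 'G': .425, 'T': .120},
        'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
        'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
        'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182}}
MINUS = {'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
         'C': {'A': .322, 'C': .298, 'G': .078, 'T': .302},
         'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
         'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292}}

def bits_score(x):
    """Length-normalized log2 odds; positive suggests a CpG island."""
    s = sum(math.log2(PLUS[a][b] / MINUS[a][b]) for a, b in zip(x, x[1:]))
    return s / (len(x) - 1)

print(bits_score('CGCGCGCG'))   # clearly positive
print(bits_score('ATATATAT'))   # clearly negative
```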

Discrimination of sequences via Markov Chains
Evaluated on 48 CpG islands (total length ~60K nt) and a similar amount of non-CpG sequence; the normalized scores of the two sets separate well (Durbin et al., Fig. 3.2).

CpG islands: the general case
Problem (2): Detect CpG islands in a long DNA sequence.
Naive solution - sliding windows: for each 1 ≤ k ≤ L-l, score the window X_k = (x_{k+1}, ..., x_{k+l}); a positive score flags a potential CpG island.
Disadvantages: What window length matches the (unknown) length of the islands? How do we identify the transitions into and out of islands?
Idea: Use Markov chains as before, with additional (hidden) states.

Hidden Markov Model (HMM)
An HMM is a triple M = (Σ, Q, Θ):
- Σ: alphabet of symbols. Example: {A, C, G, T}
- Q: finite set of states, each capable of emitting symbols. Example: Q = {A+, C+, G+, T+, A-, C-, G-, T-}
- Θ = (A, E): A: transition probabilities a_kl = P(π_i = l | π_{i-1} = k) for k, l in Q; E: emission probabilities e_k(b) = P(x_i = b | π_i = k) for k in Q, b in Σ
A path Π = π_1, ..., π_L is a sequence of states - itself a simple Markov chain (convention: π_0 = begin, π_{L+1} = end).
Given a sequence X = (x_1, ..., x_L):
P(X, Π) = a_{π_0, π_1} Π_{i=1}^{L} e_{π_i}(x_i) a_{π_i, π_{i+1}}
Exercise: express P(X, Π) in terms of counts.
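A direct transcription of the joint-probability formula, as a sketch (the end-state transition is dropped for simplicity, and begin probabilities of 1/2 for the casino model are an assumption for the toy check):

```python
import math

def joint_log_prob(x, pi, a, e):
    """log P(X, Pi); '0' is the begin state, end transition omitted for simplicity."""
    states = ['0'] + list(pi)
    logp = sum(math.log(a[k][l]) for k, l in zip(states, states[1:]))
    return logp + sum(math.log(e[k][b]) for k, b in zip(pi, x))

# toy check with the casino model; begin probabilities of 1/2 are assumed
a = {'0': {'A': 0.5, 'B': 0.5},
     'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
e = {'A': dict.fromkeys('123456', 1/6),
     'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}
print(joint_log_prob('666', 'BBB', a, e))   # log(0.5 * 0.5^3 * 0.9^2)
```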

Viterbi's Decoding Algorithm (finding the most probable state path)
Want: the path Π* maximizing P(X, Π).
Let v_k(i) = probability of the most probable path ending in state k at step i.
Init: v_0(0) = 1; v_k(0) = 0 for k > 0
Step: v_k(i+1) = e_k(x_{i+1}) max_l { v_l(i) a_lk }
End: P(X, Π*) = max_l { v_l(L) a_l0 }
Time complexity: O(Ln^2) for n states and L steps.
Π* itself is recovered using back pointers.
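A sketch of Viterbi in log space for the casino model (no explicit begin/end states here; the two states are instead assumed equally likely at the start, so the recursion matches the slide up to that boundary choice):

```python
import math

TRANS = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
EMIT = {'A': dict.fromkeys('123456', 1/6),
        'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}
START = {'A': 0.5, 'B': 0.5}

def viterbi(obs):
    # v[i][k] = log prob of best path ending in state k; back[i][k] = its predecessor
    v = [{k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in TRANS}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for k in TRANS:
            best = max(TRANS, key=lambda l: v[-1][l] + math.log(TRANS[l][k]))
            row[k] = v[-1][best] + math.log(TRANS[best][k]) + math.log(EMIT[k][x])
            ptr[k] = best
        v.append(row)
        back.append(ptr)
    path = [max(TRANS, key=lambda k: v[-1][k])]   # traceback from the best end state
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return ''.join(reversed(path))

print(viterbi('13652656643662612564'))  # stretches of 'B' should line up with runs of 6s
```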

The occasionally dishonest casino (2)
[figure: the roll sequence 13652656643662612564 shown with the decoded A/B die states]


HMM for CpG Islands
States: A+ C+ G+ T+ A- C- G- T-
Symbols: A C G T (each symbol can be emitted by its + state and its - state)
Path Π = π_1, ..., π_L: a sequence of states.
Transition probabilities: built from the + and - tables of the Markov Models slide above.
http://www.cs.huji.ac.il/~cbio/handouts/class4.ppt

Posterior State Probabilities
Goal: calculate P(π_i = k | X).
Our strategy: decompose the joint probability,
P(X, π_i = k) = P(x_1, ..., x_i, π_i = k) · P(x_{i+1}, ..., x_L | x_1, ..., x_i, π_i = k)
             = P(x_1, ..., x_i, π_i = k) · P(x_{i+1}, ..., x_L | π_i = k)
(the second step uses the Markov property), and then
P(π_i = k | X) = P(π_i = k, X) / P(X)
We need to compute these two terms - and P(X).

Forward Algorithm
Goal: calculate P(X) = Σ_Π P(X, Π).
Approximation: take the max path Π* from the Viterbi algorithm - not justified when several near-maximal paths exist.
Exact algorithm: the Forward algorithm. Define f_k(i) = P(x_1, ..., x_i, π_i = k).
Init: f_0(0) = 1; f_k(0) = 0 for k > 0
Step: f_k(i+1) = e_k(x_{i+1}) Σ_l f_l(i) a_lk
End: P(X) = Σ_l f_l(L) a_l0
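A sketch of the forward recursion for the casino model, again with assumed start probabilities in place of the begin/end states; a log-sum-exp helper keeps the sum over paths numerically stable:

```python
import math

TRANS = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
EMIT = {'A': dict.fromkeys('123456', 1/6),
        'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}
START = {'A': 0.5, 'B': 0.5}

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward_log_prob(obs):
    # f_k(i) in log space; summation replaces Viterbi's max
    f = {k: math.log(START[k]) + math.log(EMIT[k][obs[0]]) for k in TRANS}
    for x in obs[1:]:
        f = {k: math.log(EMIT[k][x]) +
                logsumexp([f[l] + math.log(TRANS[l][k]) for l in TRANS])
             for k in TRANS}
    return logsumexp(list(f.values()))           # log P(X)

print(forward_log_prob('13652656643662612564'))
```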

Backward Algorithm
Define b_k(i) = P(x_{i+1}, ..., x_L | π_i = k).
Init: b_k(L) = a_k0 for all k
Step: b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
End: P(X) = Σ_k a_0k e_k(x_1) b_k(1)

Posterior State Probabilities (2)
Goal: calculate P(π_i = k | X).
Recall: f_k(i) = P(x_1, ..., x_i, π_i = k) and b_k(i) = P(x_{i+1}, ..., x_L | π_i = k); each can be used to compute P(X).
P(X, π_i = k) = P(x_1, ..., x_i, π_i = k) · P(x_{i+1}, ..., x_L | π_i = k) = f_k(i) b_k(i)
P(π_i = k | X) = P(π_i = k, X) / P(X)
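Putting forward and backward together for the casino model, as a sketch (plain probabilities rather than logs, which is fine at this sequence length; without an end state the slide's init b_k(L) = a_k0 becomes b_k(L) = 1):

```python
STATES = 'AB'
TRANS = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
EMIT = {'A': dict.fromkeys('123456', 1/6),
        'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}
START = {'A': 0.5, 'B': 0.5}

def posteriors(obs):
    f = [{k: START[k] * EMIT[k][obs[0]] for k in STATES}]          # forward
    for x in obs[1:]:
        f.append({k: EMIT[k][x] * sum(f[-1][l] * TRANS[l][k] for l in STATES)
                  for k in STATES})
    b = [dict.fromkeys(STATES, 1.0)]                               # backward; b_k(L) = 1
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(TRANS[k][l] * EMIT[l][x] * b[0][l] for l in STATES)
                     for k in STATES})
    px = sum(f[-1][k] for k in STATES)                             # P(X)
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(obs))]

rolls = '13652656643662612564'
for i, p in enumerate(posteriors(rolls), start=1):
    print(i, rolls[i - 1], round(p['B'], 3))     # posterior prob. of the loaded die
```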

Dishonest Casino (3)
[figure: posterior probability of the fair die along the roll sequence; Durbin et al., p. 60]

Posterior Decoding
Now we have P(π_i = k | X). How do we decode?
1. Take π_i* = argmax_k P(π_i = k | X) at every position. Good when we are interested in the state at a particular point; however, the path of states π_1*, ..., π_L* may not be legal (it can use transitions of probability 0).
2. Define a function of interest g(k) on the states and compute G(i | X) = Σ_k P(π_i = k | X) g(k).
E.g., with g(k) = 1 for states k in S = {A+, C+, G+, T+} and 0 on the rest, G(i | X) is the posterior probability that symbol i comes from S, i.e., from a CpG island.
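Option 2 in code form, reusing posteriors() from the sketch above; with the indicator of the loaded state as g, G(i | X) is simply the posterior probability that roll i used the loaded die:

```python
g = {'A': 0.0, 'B': 1.0}                       # indicator of the loaded die
post = posteriors('13652656643662612564')      # from the forward-backward sketch
G = [sum(p[k] * g[k] for k in p) for p in post]
print([round(v, 2) for v in G])
```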

Parameter Estimation for HMMs
Input: X^1, ..., X^n independent training sequences.
Goal: estimate Θ = (A, E), the model parameters.
Log likelihood of the model: by independence,
P(X^1, ..., X^n | Θ) = Π_{j=1}^{n} P(X^j | Θ)
l(X^1, ..., X^n | Θ) = log P(X^1, ..., X^n | Θ) = Σ_{j=1}^{n} log P(X^j | Θ)
Case 1 - estimation when the state sequences are known:
A_kl = #(observed k -> l transitions)
E_k(b) = #(observed emissions of symbol b in state k)
Maximum likelihood estimators:
a_kl = A_kl / Σ_{l'} A_{kl'}
e_k(b) = E_k(b) / Σ_{b'} E_k(b')
Small-sample / prior-knowledge correction (Dirichlet priors):
A'_kl = A_kl + r_kl, E'_k(b) = E_k(b) + r_k(b)
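A sketch of Case 1 with pseudocounts (a single flat value r stands in for the r_kl and r_k(b) priors; the toy call at the end uses a labeled casino sequence):

```python
def estimate_known_paths(seqs, paths, states, symbols, r=1.0):
    """Case 1 sketch: counts + a flat pseudocount r, then normalize."""
    A = {k: dict.fromkeys(states, r) for k in states}
    E = {k: dict.fromkeys(symbols, r) for k in states}
    for x, pi in zip(seqs, paths):
        for k, l in zip(pi, pi[1:]):
            A[k][l] += 1                          # A_kl: observed k->l transitions
        for k, b in zip(pi, x):
            E[k][b] += 1                          # E_k(b): emissions of b in state k
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e

a, e = estimate_known_paths(['16661'], ['ABBBA'], 'AB', '123456')
print(a['B']['B'], e['B']['6'])
```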

Parameter Estimation in HMMs
Case 2 - estimation when the state sequences are unknown.
Input: X^1, ..., X^n independent training sequences.
Baum-Welch algorithm (1972): a special case of EM; convergence is guaranteed and monotone, but there may be many local optima.
Expectation step:
- Expected number of k -> l transitions (exercise):
  P(π_i = k, π_{i+1} = l | X, Θ) = [1/P(X)] f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
  A_kl = Σ_j [1/P(X^j)] Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
- Expected number of appearances of symbol b in state k (exercise):
  E_k(b) = Σ_j [1/P(X^j)] Σ_{i: x^j_i = b} f_k^j(i) b_k^j(i)
Maximization step: recompute new parameters from A, E using maximum likelihood, as in Case 1.
Repeat the two steps until the improvement is < ε.
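One Baum-Welch iteration for the casino model, as a sketch that follows the two E-step formulas literally (plain probabilities for brevity; start probabilities are held fixed rather than re-estimated):

```python
STATES = 'AB'
TRANS = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.1, 'B': 0.9}}
EMIT = {'A': dict.fromkeys('123456', 1/6),
        'B': {**dict.fromkeys('12345', 0.1), '6': 0.5}}
START = {'A': 0.5, 'B': 0.5}

def forward_backward(obs, trans, emit):
    f = [{k: START[k] * emit[k][obs[0]] for k in STATES}]
    for x in obs[1:]:
        f.append({k: emit[k][x] * sum(f[-1][l] * trans[l][k] for l in STATES)
                  for k in STATES})
    b = [dict.fromkeys(STATES, 1.0)]
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(trans[k][l] * emit[l][x] * b[0][l] for l in STATES)
                     for k in STATES})
    return f, b, sum(f[-1][k] for k in STATES)   # P(X)

def baum_welch_step(seqs, trans, emit, symbols='123456'):
    A = {k: dict.fromkeys(STATES, 0.0) for k in STATES}
    E = {k: dict.fromkeys(symbols, 0.0) for k in STATES}
    for x in seqs:
        f, b, px = forward_backward(x, trans, emit)
        for i in range(len(x) - 1):              # expected k->l transition counts
            for k in STATES:
                for l in STATES:
                    A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
        for i, c in enumerate(x):                # expected emission counts
            for k in STATES:
                E[k][c] += f[i][k] * b[i][k] / px
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_emit = {k: {c: E[k][c] / sum(E[k].values()) for c in symbols} for k in STATES}
    return new_trans, new_emit                   # M-step: renormalized parameters

trans, emit = TRANS, EMIT
for _ in range(5):                               # in practice: repeat until improvement < eps
    trans, emit = baum_welch_step(['13652656643662612564'], trans, emit)
print(emit['B']['6'])
```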

Profile HMMs (Haussler et al., 1993)
Ungapped alignment of X against a profile M: e_i(a) = probability of observing a at position i.
P(X | M) = Π_{i=1}^{L} e_i(x_i), or in log-odds form against background probabilities q:
Score(X | M) = Σ_{i=1}^{L} log [ e_i(x_i) / q_{x_i} ]
To handle indels, add insert states I_j and delete states D_j alongside the match states M_j, between the begin and end states.
Running example alignment (used on the following slides):
AG---C
A-AT-C
AG-AA-
--AAAC
AG---C

Profile HMMs (2)
Gapped alignment of X against a profile M: assume e_{I_j}(a) = q_a (background probability), so insert emissions contribute no log-odds.
A gap of length k then contributes to the log-odds score:
log(a_{M_j, I_j}) + log(a_{I_j, M_{j+1}}) + (k-1) log(a_{I_j, I_j})
i.e., a gap-open term plus (k-1) gap-extension terms - an affine gap penalty.
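The gap contribution as a two-line sketch; the transition values are illustrative placeholders, not trained parameters:

```python
import math

def insert_gap_score(k, a_MI=0.05, a_IM=0.5, a_II=0.5):
    """Log-odds contribution of a length-k insert at node j: open + (k-1) extensions."""
    return math.log(a_MI) + math.log(a_IM) + (k - 1) * math.log(a_II)

print(insert_gap_score(1), insert_gap_score(3))  # each extra position costs log(a_II)
```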

Profile HMM Transition Probabilities
From a match state M_i:   M_i -> M_{i+1},  M_i -> D_{i+1},  M_i -> I_i
From an insert state I_i: I_i -> M_{i+1},  I_i -> I_i,      I_i -> D_{i+1}
From a delete state D_i:  D_i -> M_{i+1},  D_i -> D_{i+1},  D_i -> I_i
Emission probabilities: e_{M_i}(a) for match states, e_{I_i}(a) for insert states.
http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt

Example
Suppose we are given the aligned sequences
**---*
AG---C
A-AT-C
AG-AA-
--AAAC
AG---C
where the match positions are marked by * in the header row.
http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt

Calculating A, E
Count the transitions (M-M, M-D, M-I, I-M, I-D, I-I, D-M, D-D, D-I) and the emissions at each position 0-3 of the profile, over the marked alignment:
**---*
AG---C
A-AT-C
AG-AA-
--AAAC
AG---C
http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt

Calculating A, E (2)
The resulting counts (columns are profile positions; position 0 is the begin state):

transitions  0  1  2  3
M-M          4  3  2  4
M-D          1  1  0  0
M-I          0  0  1  0
I-M          0  0  2  0
I-D          0  0  1  0
I-I          0  0  4  0
D-M          -  0  0  1
D-D          -  1  0  0
D-I          -  0  2  0

match emissions   0  1  2  3
A                 -  4  0  0
C                 -  0  0  4
G                 -  0  3  0
T                 -  0  0  0

insert emissions  0  1  2  3
A                 0  0  6  0
C                 0  0  0  0
T                 0  0  1  0
G                 0  0  0  0

http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt

Estimating Maximum Likelihood probabilities using fractions
Normalizing the emission counts position by position:

match emissions   0    1    2    3
A                 -    1    0    0
C                 -    0    0    1
G                 -    0    1    0
T                 -    0    0    0

insert emissions  0    1    2    3
A                .25  .25  .86  .25
C                .25  .25   0   .25
T                .25  .25  .14  .25
G                .25  .25   0   .25

(insert positions with no observed counts default to the uniform distribution)
http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt

Estimating ML probabilities (cont'd)
Normalizing the transition counts in the same way:

transition counts  0  1  2  3
M-M                4  3  2  4
M-D                1  1  0  0
M-I                0  0  1  0
I-M                0  0  2  0
I-D                0  0  1  0
I-I                0  0  4  0
D-M                -  0  0  1
D-D                -  1  0  0
D-I                -  0  2  0

ML estimates       0    1    2    3
M-M               .8   .75  .66  1.0
M-D               .2   .25   0    0
M-I                0    0   .33   0
I-M               .33  .33  .28  .33
I-D               .33  .33  .14  .33
I-I               .33  .33  .57  .33
D-M                -    0    0    1
D-D                -    1    0    0
D-I                -    0    1    0

(transitions with no observed counts again default to uniform over the three choices)
http://www.cs.huji.ac.il/~cbio/handouts/class6.ppt
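The counting procedure itself is mechanical; here is a sketch that rebuilds both count tables from the marked alignment (begin and end are treated as match states M_0 and M_4, matching the convention in the tables above):

```python
from collections import Counter

ALN = ['AG---C', 'A-AT-C', 'AG-AA-', '--AAAC', 'AG---C']
MATCH = [c == '*' for c in '**---*']

trans, emit = Counter(), Counter()
for row in ALN:
    path, pos = [('M', 0)], 0              # begin state, treated as M0
    for col, c in enumerate(row):
        if MATCH[col]:                     # match column: M if residue, D if gap
            pos += 1
            path.append(('M' if c != '-' else 'D', pos))
        elif c != '-':                     # residue in a non-match column: insert
            path.append(('I', pos))
        else:
            continue                       # gap in an insert column: nothing emitted
        if c != '-':
            emit[(path[-1][0], pos, c)] += 1
    path.append(('M', pos + 1))            # end state, treated as M4
    for (s, i), (t, _) in zip(path, path[1:]):
        trans[(f'{s}-{t}', i)] += 1        # transitions indexed by source position

print(trans[('M-M', 0)], trans[('I-I', 2)])      # 4 and 4, as in the table
print(emit[('M', 1, 'A')], emit[('I', 2, 'A')])  # 4 and 6
```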

HMM from a multiple alignment
[figure: a multiple alignment with the insert columns shaded]

... and a multiple alignment from a given HMM
Align each sequence to the profile separately.
[figure; right: a new sequence realigned with the model]
Inserts are left unaligned.

Searching with Profile HMMs
Compute the log-odds score log [ P(X | Model) / P(X | random) ], where P(X | random) = Π_i q_{x_i}.
V_j^M(i): log-odds of the best path matching x_1, ..., x_i to the submodel up to level j, ending with x_i emitted by state M_j
V_j^I(i): the same, but ending with x_i emitted by state I_j
V_j^D(i): score of the best path ending in D_j after x_i has been emitted (and x_{i+1} has not been emitted yet)

V_j^M(i) = log [ e_{M_j}(x_i) / q_{x_i} ] + max { V_{j-1}^M(i-1) + log(a_{M_{j-1}, M_j}),
                                                  V_{j-1}^I(i-1) + log(a_{I_{j-1}, M_j}),
                                                  V_{j-1}^D(i-1) + log(a_{D_{j-1}, M_j}) }
V_j^I(i) = log [ e_{I_j}(x_i) / q_{x_i} ] + max { V_j^M(i-1) + log(a_{M_j, I_j}),
                                                  V_j^I(i-1) + log(a_{I_j, I_j}),
                                                  V_j^D(i-1) + log(a_{D_j, I_j}) }
V_j^D(i) = max { V_{j-1}^M(i) + log(a_{M_{j-1}, D_j}),
                 V_{j-1}^I(i) + log(a_{I_{j-1}, D_j}),
                 V_{j-1}^D(i) + log(a_{D_{j-1}, D_j}) }

(When insert emissions equal the background, the e_{I_j} term vanishes, as noted above.)

Training a profile HMM from unaligned sequences
- Choose the length of the profile HMM and initialize the parameters.
- Train the model using Baum-Welch.
- Obtain a multiple alignment with the resulting profile, as before.
(Formulas, incl. forward/backward for profile HMMs, are in the handout.)

An Illustrative Study
HMMs in Computational Biology, Krogh, Brown, Mian, Sjölander, Haussler 1993.
Globin experiment: globins are heme-containing proteins involved in the storage and transport of oxygen.
Data: 625 globins from Swissprot, average length 145 AA. Training set: 400 sequences.
Built a model and ran it against the held-out test globins and all Swissprot proteins (~25K).

Representative globins
[figure: alignment of representative globins, from Bashford et al. 1987; A-H mark the alpha helices] (Ron Shamir, CG 08)

Realignment by the trained HMM
[figure] (Ron Shamir, CG 08)

Part of the final globin model
[figure]

Score vs. sequence length
[figure: NLL = -log P(seq | model) vs. sequence length, for all Swissprot sequences of < 300 AA]
The negative log-likelihood is strongly length dependent, so using log-odds scores is preferable.

Z-score distribution
Z-score: (S - E(S)) / sd(S).
On ~25K proteins, a cutoff of 5 misses only 2 of the 628 globins, with no false positives.

Discovering subfamilies
A 10-component HMM, trained on all 628 globins. Classifying each sequence with the model generated 7 nonempty clusters, with ~560 sequences falling into 4 biologically meaningful clusters.

Local alignment in HMMs
Add flanking states that model regions of unaligned sequence on either side of the match (Durbin et al., Ch. 5.5).
