Chapter 4: Hidden Markov Models


4.1 Introduction to HMM
Prof. Yechiam Yemini (YY), Computer Science Department, Columbia University

Overview
- Markov models of sequence structures
- Introduction to Hidden Markov Models (HMM)
- HMM algorithms; the Viterbi decoder
- Reading: Durbin et al., chapters 3-5

The Challenges
Biological sequences have modular structure:
- Genes: exons, introns
- Promoter regions: modules, promoters
- Proteins: domains, folds, structural parts, active parts

How do we identify informative regions?
- How do we find and map genes?
- How do we find and map promoter regions?

Mapping Protein Regions
[Figure: a protein structure annotated with interfaces and active components, including the ATP binding domain, the activation domain, and the binding-site residues.]

Statistical Sequence Analysis
Example: CpG islands indicate important regions.
- The dinucleotide CG (denoted CpG) is typically transformed by methylation into TG.
- Promoter/start regions of genes suppress methylation, which leads to higher CpG density.
- How do we find CpG islands?

Example: active protein regions are statistically similar.
- Evolution conserves structural motifs but varies sequences.
- Simple comparison techniques are insufficient: global/local alignment, consensus sequences.

The challenge: analyzing the statistical features of regions.

Review of Markovian Modeling
Recall: a Markov chain is described by transition probabilities
  π(n+1) = A π(n),
where π(i,n) = Prob[S(n) = i] is the state probability and A(i,j) = Prob[S(n+1) = j | S(n) = i] is the transition probability.

Markov chains describe statistical evolution:
- In time: evolutionary change depends on the previous state only.
- In space: change depends on neighboring sites only.
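The update π(n+1) = A π(n) can be sketched in a few lines of Python. The two-state transition matrix below is illustrative, not taken from the lecture; the loop iterates the update until the state distribution settles at the chain's stationary distribution.

```python
# Minimal sketch: evolving a Markov chain's state distribution, pi(n+1) = A pi(n).
# Written in row-vector form: pi'(j) = sum_i pi(i) * A(i, j),
# matching the convention A(i, j) = Prob[S(n+1) = j | S(n) = i].

def step(pi, A):
    """One update of the state distribution."""
    n = len(pi)
    return [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]

# Illustrative two-state chain (not from the slides).
A = [[0.9, 0.1],
     [0.3, 0.7]]

pi = [1.0, 0.0]          # start deterministically in state 0
for _ in range(50):      # iterate toward the stationary distribution
    pi = step(pi, A)
print(pi)                # close to [0.75, 0.25]
```

Solving π = πA directly for this matrix gives the same fixed point (0.75, 0.25), which the iteration approaches geometrically.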

From Markov To Hidden Markov Models (HMM)
Nature uses different statistics for evolving different regions:
- Gene regions: CpG islands, promoters, introns/exons
- Protein regions: active sites, interfaces, hydrophobic/hydrophilic stretches

How can we tell regions apart? Sample sequences have different statistics. Model regions as Markovian states emitting observed sequences.

Example: CpG islands
- Model: two connected Markov chains, one for CpG islands and one for normal sequence.
- The Markov chain is hidden; only sample sequences are seen.
- Detect transitions to/from the CpG chain.
- Similar to a dishonest casino: transitions from fair to biased dice.

HMM Basics
A Hidden Markov Model consists of:
- A Markov chain: states and transition probabilities A = [a(i,j)]
- Observable symbols for each state, O(i)
- A probability e(i,x) of emitting the symbol x at state i

Example (fair/biased coin, states F and B):
  transitions: a(F,F)=0.6, a(F,B)=0.4, a(B,B)=0.2, a(B,F)=0.8
  emissions:   e(F,H)=0.5, e(F,T)=0.5, e(B,H)=0.9, e(B,T)=0.1
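As a sketch, the fair/biased coin HMM can be written down and sampled from directly, producing a hidden state path and the symbol sequence it emits (the helper names here are mine, not from the lecture):

```python
import random

# The two-state coin HMM from the slides: fair (F) and biased (B).
trans = {'F': {'F': 0.6, 'B': 0.4}, 'B': {'B': 0.2, 'F': 0.8}}
emit  = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.9, 'T': 0.1}}
start = {'F': 0.5, 'B': 0.5}   # uniform start, as in the later slides

def draw(dist, rng):
    """Sample a key from a {outcome: probability} dict."""
    r, acc = rng.random(), 0.0
    for k, p in dist.items():
        acc += p
        if r < acc:
            return k
    return k   # guard against rounding in the cumulative sum

def generate(n, rng=None):
    """Generate a hidden state path and its emitted symbol sequence."""
    rng = rng or random.Random(0)
    states, symbols = [], []
    s = draw(start, rng)
    for _ in range(n):
        states.append(s)
        symbols.append(draw(emit[s], rng))
        s = draw(trans[s], rng)
    return ''.join(states), ''.join(symbols)

path, seq = generate(10)
print(path, seq)
```

Only `seq` would be observable in practice; `path` is exactly the hidden information the decoding problem later recovers.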

Coin Example
A two-state Markov chain {F, B}: F = fair coin, B = biased coin. Emission probabilities can be described inside the state boxes or through separate emission boxes. A biological analogue: transmembrane proteins, with hydrophilic/hydrophobic regions as the hidden states.

HMM Profile Example (Non-gapped)
One state per site, chained S -> site 1 -> ... -> site 6 -> E, with transition probability 1 between consecutive sites. The emission probabilities of each site are estimated from the columns of an alignment of 10 sequences:

  TATGAT  TATAAT  TATAAT  TAATAT  TATAAT
  TATTAT  GATAAT  GATACT  TACGAT  TATTAT

Per-site nucleotide counts (out of 10):

  Site: 1   2   3   4   5   6
  A:    0  10   1   5   9   0
  C:    0   0   1   0   1   0
  G:    2   0   0   2   0   0
  T:    8   0   8   3   0  10

yielding, e.g., site 1: G=.2, T=.8; site 2: A=1.0; site 3: A=.1, C=.1, T=.8; site 4: A=.5, G=.2, T=.3; site 5: A=.9, C=.1; site 6: T=1.0.
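Since every transition in the non-gapped profile is 1.0, the probability that the profile generates a 6-mer reduces to a product of per-site emission probabilities. A minimal sketch (function name is mine):

```python
# Scoring a 6-mer under a non-gapped profile: one state per site,
# transitions all 1.0, so only the emissions matter.

counts = {               # per-site nucleotide counts from the 10 aligned sequences
    'A': [0, 10, 1, 5, 9, 0],
    'C': [0, 0, 1, 0, 1, 0],
    'G': [2, 0, 0, 2, 0, 0],
    'T': [8, 0, 8, 3, 0, 10],
}
TOTAL = 10

def profile_prob(seq):
    """Product of per-site emission probabilities e(site, base) = count / total."""
    p = 1.0
    for site, base in enumerate(seq):
        p *= counts[base][site] / TOTAL
    return p

print(profile_prob('TATAAT'))   # consensus: 0.8 * 1.0 * 0.8 * 0.5 * 0.9 * 1.0 = 0.288
```

Any 6-mer containing a base with a zero count at some site (e.g. a C at site 1) scores exactly 0; real profile HMMs add pseudocounts to avoid this.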

How Do We Model Gaps?
A gap can result from a deletion or an insertion:
- Deletion: a hidden delete state that emits nothing (emits "-" with probability 1).
- Insertion: a hidden insert state that emits with background probabilities (e.g., A=.25, C=.25, G=.25, T=.25).
[Figure: the non-gapped profile above, extended in one version with an insert state (entered with probability .3, continued past with .7; emissions e.g. A=.6, C=.2, G=.2) and in another with a delete state (entered with probability .2, continued past with .8), explaining gapped training sequences such as TATGATTATAC-T.]

Profile HMM
Example alignment:

  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATC

A profile HMM attaches output probabilities to the match states, uses insert states for insertions, and connects everything with transition probabilities. Typical queries against the profile:
- What is the most likely path to generate ACATATC?
- How likely is ACATATC to be generated by this profile?

In General: HMM Sequence Profile
[Figure: the standard profile-HMM architecture: a start state S, a chain of match states M, delete states D above them, insert states i below them, and an end state E.]

HMM For CpG Islands
A CpG generator (the + states A+, C+, G+, T+) connected to a regular-sequence generator (the - states A-, C-, G-, T-). Transition probabilities within each chain (rows = current base, columns = next base):

  +    A      C      G      T
  A  0.180  0.274  0.426  0.120
  C  0.171  0.368  0.274  0.188
  G  0.161  0.339  0.375  0.125
  T  0.079  0.355  0.384  0.182

  -    A      C      G      T
  A  0.300  0.205  0.285  0.210
  C  0.322  0.298  0.078  0.302
  G  0.248  0.246  0.298  0.208
  T  0.177  0.239  0.292  0.292

Note the hallmark difference: the C->G transition has probability 0.274 in the + model but only 0.078 in the - model.
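One standard way to use these two tables (following Durbin et al., chapter 3; the lecture itself poses the question as an HMM decoding problem) is to score a window by its log-odds ratio under the + and - chains: positive scores look like CpG island, negative scores look like background. A sketch:

```python
import math

# + (CpG island) and - (background) transition tables, rows = current base.
plus = {'A': {'A': .180, 'C': .274, 'G': .426, 'T': .120},
        'C': {'A': .171, 'C': .368, 'G': .274, 'T': .188},
        'G': {'A': .161, 'C': .339, 'G': .375, 'T': .125},
        'T': {'A': .079, 'C': .355, 'G': .384, 'T': .182}}
minus = {'A': {'A': .300, 'C': .205, 'G': .285, 'T': .210},
         'C': {'A': .322, 'C': .298, 'G': .078, 'T': .302},
         'G': {'A': .248, 'C': .246, 'G': .298, 'T': .208},
         'T': {'A': .177, 'C': .239, 'G': .292, 'T': .292}}

def log_odds(seq):
    """Sum over adjacent pairs of log2( a+(x,y) / a-(x,y) ); positive => CpG-like."""
    return sum(math.log2(plus[a][b] / minus[a][b]) for a, b in zip(seq, seq[1:]))

print(log_odds('CGCG'))   # positive: dominated by the C->G ratio 0.274/0.078
print(log_odds('AAAA'))   # negative: A->A is likelier under the background model
```

This scores a whole window under one model or the other; finding where the island starts and ends is exactly what the hidden-state decoding on the following slides adds.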

Modeling Gene Structure With HMM
Genes are organized into sequential functional regions, and the regions have distinct statistical behaviors (e.g., around splice sites).

HMM Gene Models
HMM state = region; Markov transitions move between regions. Emissions are over {A, C, T, G}, and different regions have different emission probabilities (e.g., a region with A=.29, C=.31, G=.04, T=.36).

Computing Probabilities on HMM
A path is a sequence of states, e.g. X = FFBBBF.
- Path probability (coin model, uniform 0.5 start):
  P(X) = 0.5 · 0.6 · 0.4 · 0.2 · 0.2 · 0.8 ≈ 3.84 × 10⁻³
- Probability of a sequence emitted along a path, p(S|X):
  p(HHHHHH | FFBBBF) = p(H|F)² · p(H|B)³ · p(H|F) = (0.5)³(0.9)³ ≈ 0.091
Note: in practice one avoids long chains of multiplications and adds logarithms instead, to minimize error propagation.

The Three Computational Problems of HMM
- Decoding: what is the most likely sequence of transitions and emissions that generated a given observed sequence?
- Likelihood: how likely is an observed sequence to have been generated by a given HMM?
- Learning: how should transition and emission probabilities be learned from observed sequences?
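The two path computations above (the probability of a hidden path, and the probability of a sequence given that path) can be sketched directly for the coin model:

```python
# Scoring a given hidden path X and observed sequence S under the coin HMM.
trans = {'F': {'F': 0.6, 'B': 0.4}, 'B': {'B': 0.2, 'F': 0.8}}
emit  = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.9, 'T': 0.1}}
start = {'F': 0.5, 'B': 0.5}

def path_prob(X):
    """P(X): initial-state probability times the chain of transition probabilities."""
    p = start[X[0]]
    for a, b in zip(X, X[1:]):
        p *= trans[a][b]
    return p

def emission_prob(S, X):
    """P(S | X): product of per-position emission probabilities."""
    p = 1.0
    for sym, st in zip(S, X):
        p *= emit[st][sym]
    return p

print(path_prob('FFBBBF'))                 # 0.5*0.6*0.4*0.2*0.2*0.8 = 0.00384
print(emission_prob('HHHHHH', 'FFBBBF'))   # (0.5)**3 * (0.9)**3 = 0.091125
```

The product of the two is the joint probability P(S, X); decoding searches for the X that maximizes it.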

The Decoding Problem: Viterbi's Decoder
- Input: an observed sequence S.
- Output: a hidden path X maximizing P(S|X).
- Key idea (Viterbi): map to a dynamic programming problem. Describe the problem as optimizing a path over a grid of states versus observed positions; the DP search (a) computes the score of forward paths and (b) backtracks.
- Complexity: O(m²n), where m = number of states and n = sequence length.

Viterbi's Decoder
Let F(i,k) = the probability of the most likely path to state i generating S₁...S_k.
- Forward recursion: F(i,k+1) = e(i,S_{k+1}) · max_j { F(j,k) · a(j,i) }
- Initialization: F(0,0) = 1, F(i,0) = 0
- Backtracking: start with the highest F(i,n) and follow the maximizing choices backward.

Example: Dishonest Coin Tossing
What is the most likely sequence of transitions and emissions to explain the observation S = HHHHHH?

Viterbi table F(i,k), columns k = 1..6:

  k:      1     2     3     4     5      6
  F(B,k): .45   .09   .065  .019  .0093  .0028
  F(F,k): .25   .18   .054  .026  .0078  .0037

Backtracking from the largest final entry, F(F,6) ≈ .0037, yields the path X = BFBFBF with weight w = 0.5 · 0.9 · (0.4)³ · (0.36)² ≈ 3.7 × 10⁻³.

Example: CpG Islands
Given the observed sequence CGCG, what is the likely state sequence generating it?
[Figure: a Viterbi trellis with the eight states A+, G+, C+, T+, A-, G-, C-, T- as rows and positions 0-4 of CGCG as columns.]

Computational Note
Computing products of probabilities propagates numerical errors. Instead of multiplying probabilities, add log-likelihoods. Define f(i,k) = log F(i,k):
  f(i,k+1) = log e(i,S_{k+1}) + max_j { f(j,k) + log a(j,i) }
Or define the weight w(j,i,k) = log e(i,S_{k+1}) + log a(j,i) to get the standard DP formulation:
  f(i,k+1) = max_j { f(j,k) + w(j,i,k) }

Example
The most likely explanation of S = HHHHHH, using base-2 logarithms:
  lg e(F,H) = -1,     lg e(B,H) ≈ -0.15
  lg e(F,T) = -1,     lg e(B,T) ≈ -3.32
  lg a(F,F) ≈ -0.74,  lg a(F,B) ≈ -1.32,  lg a(B,B) ≈ -2.32,  lg a(B,F) ≈ -0.32

Weights w(j,i) for emitting H:
  F->F: -0.74 - 1    = -1.74     F->B: -1.32 - 0.15 = -1.47
  B->F: -0.32 - 1    = -1.32     B->B: -2.32 - 0.15 = -2.47
  S->F: -1    - 1    = -2.00     S->B: -1    - 0.15 = -1.15

First columns of the recursion f(i,k+1) = max_j { f(j,k) + w(j,i,k) }:
  f(F,1) = -2.00,  f(B,1) = -1.15
  f(F,2) = max(-1.15 - 1.32, -2.00 - 1.74) = -2.47
  f(B,2) = max(-1.15 - 2.47, -2.00 - 1.47) = -3.47
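The same Viterbi recursion carries over to log space essentially unchanged, which is what avoids underflow on long sequences. A sketch computing just the best-path log-score for the coin model (base-2 logs, as on the slide):

```python
import math

# Log-space Viterbi score: f(i, k+1) = lg e(i, S[k+1]) + max_j ( f(j, k) + lg a(j, i) ).
trans = {'F': {'F': 0.6, 'B': 0.4}, 'B': {'B': 0.2, 'F': 0.8}}
emit  = {'F': {'H': 0.5, 'T': 0.5}, 'B': {'H': 0.9, 'T': 0.1}}
start = {'F': 0.5, 'B': 0.5}
lg = math.log2

def viterbi_logscore(S, states=('F', 'B')):
    """Return the base-2 log-probability of the most likely hidden path for S."""
    f = {i: lg(start[i]) + lg(emit[i][S[0]]) for i in states}
    for sym in S[1:]:
        f = {i: lg(emit[i][sym]) + max(f[j] + lg(trans[j][i]) for j in states)
             for i in states}
    return max(f.values())

score = viterbi_logscore('HHHHHH')
print(score, 2 ** score)   # approx -8.07, i.e. a best-path probability of about 0.00373
```

Sums of logs stay in a comfortable numerical range even for sequences of millions of symbols, where the raw products would underflow to 0.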

Concluding Notes
Viterbi decoding recovers the hidden pathway of an observed sequence, and the hidden pathway explains the underlying structure. For example:
- identify CpG islands
- align a sequence against a profile
- determine gene structure

This leaves the two other HMM computational problems:
- How do we extract an HMM model from observed sequences?
- How do we compute the likelihood of a given sequence?