Hidden Markov Models

Slides revised and adapted for Bioinformática, Engª Biomédica/IST 2005, Ana Teresa Freitas

CG-Islands
Given 4 nucleotides, the probability of occurrence of each is ~1/4, so the probability of occurrence of any given dinucleotide is ~1/16.
However, the frequencies of dinucleotides in DNA sequences vary widely.
In particular, CG is typically underrepresented (the frequency of CG is typically < 1/16).

Why CG-Islands?
CG is the least frequent dinucleotide because the C in CG is easily methylated and then tends to mutate into T.
However, methylation is suppressed around genes in a genome, so CG appears at relatively high frequency within these CG-islands.
Finding the CG-islands in a genome is therefore an important problem.

Markov Sequence Models
Markov sequence models are like little machines that generate sequences.
Markov models can model sequences with specific probability distributions.
In an nth-order Markov sequence model, the probability distribution of the next letter depends on the previous n letters generated.
Markov models of order 5 or more are often needed to model DNA sequences well.

Markov Sequence Models
Markov models are composed of states and transitions, drawn as boxes and arrows.
States emit letters.
States have labels n letters long, showing the letter the state emits and the previous n-1 letters emitted by the model.
Transitions have probabilities, and the outgoing transitions from a state sum to 1.

Markov Sequence Models
There is a 1-1 correspondence between sequences and paths through a (non-hidden) Markov model.
The probability of a sequence is the product of the transition probabilities on the arcs.
The 0-order model is a special case, since it also has emission probabilities at its one state.

Example Markov Sequence Models
0-order model: 1 state, no labels.
1st-order model: 4 states, each labelled with the next letter.
2nd-order model: 20 states, each labelled with the previous and next letters.
[State diagrams of the 0-order, 1st-order and 2nd-order models, with transition probabilities such as Pr(A|A), Pr(T|A), Pr(G|AA); most transitions omitted for clarity.]

Example: 1st-order model
Let X = x_1 x_2 ... x_L be a sequence. The 1st-order Markov model generates it with probability
P(X) = P(x_1) * Π_{i=2..L} P(x_i | x_{i-1}),
or, including the begin state S and end state E (the model is in S at t = 0 and in E at t = L+1),
P(X) = Π_{i=1..L+1} P(x_i | x_{i-1}).
[State diagram with begin state S, end state E and one state per nucleotide; transitions omitted for clarity.]
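To make the 1st-order model concrete, here is a minimal Python sketch that computes P(X) = P(x_1) * Π P(x_i | x_{i-1}) for a DNA sequence. The start and transition probabilities below are illustrative placeholders, not values estimated from real data.

```python
# Minimal sketch of a 1st-order Markov chain over DNA (illustrative probabilities).
trans = {                       # trans[prev][next] = P(next | prev)
    'A': {'A': 0.3, 'C': 0.2, 'G': 0.3, 'T': 0.2},
    'C': {'A': 0.3, 'C': 0.3, 'G': 0.1, 'T': 0.3},
    'G': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
    'T': {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2},
}
start = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}   # P(x_1)

def markov_prob(seq, start, trans):
    """P(X) = P(x_1) * prod over i=2..L of P(x_i | x_{i-1})."""
    p = start[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        p *= trans[prev][nxt]
    return p

print(markov_prob("ACGCG", start, trans))
```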

Problem: Identifying a CpG island
Input: A short DNA sequence X = (x_1, ..., x_L) over the alphabet Σ = {A,C,G,T}.
Question: Decide whether X is a CpG island.
Use two Markov chain models:
one for CpG islands (the + model), and
one for non-CpG islands (the - model).

Problem: Identifying a CpG island
Let a+_{st} denote the transition probability from s to t (s,t ∈ Σ) inside a CpG island, and a-_{st} the corresponding transition probability outside a CpG island.
The log-likelihood score for a sequence X is
Score(X) = log [ P(X | CpG model) / P(X | non-CpG model) ] = Σ_{i=1..L} log ( a+_{x_{i-1},x_i} / a-_{x_{i-1},x_i} ).
The higher this score, the more likely it is that X is a CpG island.
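A small sketch of this scoring function, assuming the + and - transition probabilities are available as nested dictionaries. The numbers below are illustrative, not estimates from real CpG-island data.

```python
import math

# Hypothetical +/- transition matrices; real values would be estimated
# from known CpG-island and non-island training sequences.
a_plus  = {'A': {'A': 0.18, 'C': 0.27, 'G': 0.43, 'T': 0.12},
           'C': {'A': 0.17, 'C': 0.37, 'G': 0.27, 'T': 0.19},
           'G': {'A': 0.16, 'C': 0.34, 'G': 0.38, 'T': 0.12},
           'T': {'A': 0.08, 'C': 0.36, 'G': 0.38, 'T': 0.18}}
a_minus = {'A': {'A': 0.30, 'C': 0.21, 'G': 0.28, 'T': 0.21},
           'C': {'A': 0.32, 'C': 0.30, 'G': 0.08, 'T': 0.30},
           'G': {'A': 0.25, 'C': 0.24, 'G': 0.30, 'T': 0.21},
           'T': {'A': 0.18, 'C': 0.24, 'G': 0.29, 'T': 0.29}}

def cpg_score(x):
    """Score(X) = sum_i log( a+_{x_{i-1},x_i} / a-_{x_{i-1},x_i} )."""
    return sum(math.log(a_plus[p][q] / a_minus[p][q])
               for p, q in zip(x, x[1:]))

print(cpg_score("CGCGCGAT"))   # positive scores suggest a CpG island
```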

Problem: Locating CpG islands in a DNA sequence
Input: A long DNA sequence X = (x_1, ..., x_L) over the alphabet Σ = {A,C,G,T}.
Question: Locate the CpG islands along X.
The naive approach (see the sketch below):
extract a sliding window X_k = (x_{k+1}, ..., x_{k+l}) of a given length l (l << L) from the sequence X,
calculate Score(X_k) for each of the resulting subsequences, and
report subsequences that receive positive scores as potential CpG islands.

Problem: Locating CpG islands in a DNA sequence
Disadvantages of the naive approach:
we have no information about the lengths of the islands and must assume that the islands are at least l nucleotides long;
different windows may classify the same position differently.
A better solution: a unified model.
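A sketch of the naive sliding-window scoring described above, reusing a scoring function such as cpg_score from the previous sketch; the window length l is an arbitrary choice.

```python
def window_scores(x, l, score_fn):
    """Score every length-l window of x; positive windows are candidate CpG islands."""
    return [(k, score_fn(x[k:k + l])) for k in range(len(x) - l + 1)]

# Usage with the cpg_score sketch above and an arbitrary window length l = 8:
# candidates = [(k, s) for k, s in window_scores(dna, 8, cpg_score) if s > 0]
```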

CG-Islands and the Fair Bet Casino
The CG-islands problem can be modeled after a problem named The Fair Bet Casino.
The game is to flip coins, with only two possible outcomes: Head or Tail.
The Fair coin gives Heads and Tails with the same probability ½.
The Biased coin gives Heads with probability ¾.

The Fair Bet Casino (cont'd)
Thus, we define the probabilities:
P(H|F) = P(T|F) = ½
P(H|B) = ¾, P(T|B) = ¼
The crooked dealer changes between the Fair and Biased coins with probability 10%.

The Fair Bet Casino Problem
Input: A sequence x = x_1 x_2 x_3 ... x_n of coin tosses made by two possible coins (F or B).
Output: A sequence π = π_1 π_2 π_3 ... π_n, with each π_i being either F or B, indicating that x_i is the result of tossing the Fair or Biased coin respectively.

Fair Bet Casino Problem
Any observed outcome of coin tosses could have been generated by any sequence of states!
We need a way to grade different state sequences differently.
This is the Decoding Problem.

P(x|fair coin) vs. P(x|biased coin)
Suppose first that the dealer never changes coins. Some definitions:
P(x|fair coin): the probability of the dealer using the F coin and generating the outcome x.
P(x|biased coin): the probability of the dealer using the B coin and generating the outcome x.

P(x|fair coin) vs. P(x|biased coin)
P(x|fair coin) = P(x_1 ... x_n | fair coin) = Π_{i=1..n} p(x_i | fair coin) = (1/2)^n
P(x|biased coin) = P(x_1 ... x_n | biased coin) = Π_{i=1..n} p(x_i | biased coin) = (3/4)^k (1/4)^(n-k) = 3^k / 4^n
where k is the number of Heads in x.

P(x|fair coin) vs. P(x|biased coin)
Setting P(x|fair coin) = P(x|biased coin) gives
1/2^n = 3^k / 4^n, i.e. 2^n = 3^k, i.e. n = k log_2 3,
which holds when k = n / log_2 3 (k ≈ 0.63 n).
If k < n / log_2 3, the dealer most likely used a fair coin.

Log-odds Ratio
We define the log-odds ratio as follows (p+ and p- are the fair and biased emission probabilities):
log_2 ( P(x|fair coin) / P(x|biased coin) ) = Σ_{i=1..n} log_2 ( p+(x_i) / p-(x_i) ) = n - k log_2 3
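Because the log-odds ratio reduces to n - k·log_2 3, it is easy to compute directly; a short sketch (the toss string encoding with 'H'/'T' is an illustrative choice):

```python
import math

def coin_log_odds(x):
    """log2( P(x|fair) / P(x|biased) ) = n - k*log2(3), with k = number of Heads in x."""
    n, k = len(x), x.count('H')
    return n - k * math.log2(3)

tosses = "HTHHHTHTTH"           # k = 6 heads out of n = 10
print(coin_log_odds(tosses))    # positive -> fair coin more likely, negative -> biased
```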

Computing Log-odds Ratio in Sliding Windows
Consider a sliding window of the outcome sequence x_1 x_2 ... x_n and find the log-odds ratio for this short window.
[Plot of the log-odds value along the sequence: windows with values below 0 suggest the biased coin was most likely used; values above 0 suggest the fair coin.]

Hidden Markov Models: HMMs
There is no 1-1 correspondence between states and (sequence) symbols.
You can't always tell which state emitted a particular character: the states are hidden.
There can be more than one path that generates any given sequence.
This allows us to model more complex sequence distributions.

Hidden Markov Model (HMM)
An HMM can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ.
Each state has its own probability distribution, and the machine switches between states according to this probability distribution.
While in a certain state, the machine makes two decisions:
What state should I move to next?
What symbol from the alphabet Σ should I emit?

Why Hidden?
Observers can see the emitted symbols of an HMM but have no way to know which state the HMM is currently in.
Thus, the goal is to infer the most likely hidden states of an HMM from the given sequence of emitted symbols.
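The generative view above can be sketched in a few lines of Python. The function below takes transition and emission tables (such as the Fair Bet Casino tables given later) as arguments; the function name and the emit-then-move order are illustrative conventions, not prescribed by the slides.

```python
import random

def generate(n, A, E, start_state):
    """Emit n symbols: in each step, emit from the current state, then move to the next state."""
    state, out = start_state, []
    for _ in range(n):
        symbols, weights = zip(*E[state].items())
        out.append(random.choices(symbols, weights)[0])   # which symbol do I emit?
        nexts, probs = zip(*A[state].items())
        state = random.choices(nexts, probs)[0]           # which state do I move to?
    return ''.join(out)
```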

Example: CpG Island HMM
CpG-island DNA states: large C, G transition probabilities.
Normal DNA states: small C, G transition probabilities.
[Diagram: a CpG Island sub-model with states A+, C+, G+, T+ and a Normal DNA sub-model with states A-, C-, G-, T-, connected to begin (B) and end (E) states; most transitions omitted for clarity.]

Example: Modeling CpG Islands
In a CpG island, the probability that a C is followed by a G is much higher than in normal intragenic DNA sequence.
We can construct an HMM to model this by combining two HMMs: one for normal sequence and one for CpG-island sequence.
Transitions between the two sub-models allow the model to switch between CpG island and normal DNA.
Because there is more than one state that can generate a given character, the states are hidden when you only see the sequence.
For example, a C can be generated by either the C+ or the C- state in this model.

HMM Parameters
Σ: set of emission characters.
Examples: Σ = {H, T} (or {1, 0}) for coin tossing; Σ = {A,C,T,G} for DNA.
Q: set of hidden states, each emitting symbols from Σ.
Example: Q = {F, B} for coin tossing.

HMM Parameters (cont'd)
A = (a_kl): a |Q| x |Q| matrix with the probability of changing from state k to state l.
Example: a_FF = 0.9, a_FB = 0.1, a_BF = 0.1, a_BB = 0.9.
E = (e_k(b)): a |Q| x |Σ| matrix with the probability of emitting symbol b while in state k.
Example: e_F(0) = ½, e_F(1) = ½, e_B(0) = ¼, e_B(1) = ¾.

HMM for the Fair Bet Casino
The Fair Bet Casino in HMM terms:
Σ = {0, 1} (0 for Tails and 1 for Heads)
Q = {F, B} (F for the Fair coin, B for the Biased coin)

Transition probabilities A:
          Fair         Biased
Fair      a_FF = 0.9   a_FB = 0.1
Biased    a_BF = 0.1   a_BB = 0.9

Emission probabilities E:
          Tails (0)    Heads (1)
Fair      e_F(0) = ½   e_F(1) = ½
Biased    e_B(0) = ¼   e_B(1) = ¾

HMM for the Fair Bet Casino (cont'd)
[Diagram: HMM model for the Fair Bet Casino Problem.]
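These parameters translate directly into code; a minimal sketch using plain dictionaries (the variable names A, E and states are conventions reused in the later sketches):

```python
# Fair Bet Casino HMM parameters as plain Python dicts (sketch).
states   = ('F', 'B')                 # Q: Fair, Biased
alphabet = ('0', '1')                 # Sigma: 0 = Tails, 1 = Heads

A = {'F': {'F': 0.9, 'B': 0.1},       # transition probabilities a_kl
     'B': {'F': 0.1, 'B': 0.9}}

E = {'F': {'0': 0.5,  '1': 0.5},      # emission probabilities e_k(b)
     'B': {'0': 0.25, '1': 0.75}}
```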

Hidden Paths
A path π = π_1 ... π_n in the HMM is defined as a sequence of states.
Consider the path π = FFFBBBBBFFF and the sequence x = 01011101001.

x:               0    1    0    1    1    1    0    1    0    0    1
π:               F    F    F    B    B    B    B    B    F    F    F
P(x_i|π_i):      ½    ½    ½    ¾    ¾    ¾    ¼    ¾    ½    ½    ½
P(π_{i-1}→π_i):  ½   9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10

Here P(x_i|π_i) is the probability that x_i was emitted from state π_i, and P(π_{i-1}→π_i) is the transition probability from state π_{i-1} to state π_i.

P(x|π) Calculation
P(x|π) is the probability that sequence x was generated by the path π:
P(x|π) = P(π_0 → π_1) * Π_{i=1..n} P(x_i|π_i) * P(π_i → π_{i+1}) = a_{π_0,π_1} * Π_{i=1..n} e_{π_i}(x_i) * a_{π_i,π_{i+1}}
where π_0 and π_{n+1} represent the fictitious initial and terminal states (begin and end).
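Given the A and E dictionaries from the sketch above, P(x|π) can be computed by multiplying one emission and one transition probability per position. The sketch below simplifies the formula slightly: it uses a ½ start probability for either coin and omits the final end-state transition.

```python
def path_prob(x, pi, A, E, start):
    """P(x|pi) ~ P(pi_1) * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_{i+1}} (end-state transition omitted)."""
    p = start[pi[0]]                       # probability of the initial state
    for i in range(len(x)):
        p *= E[pi[i]][x[i]]                # emission e_{pi_i}(x_i)
        if i + 1 < len(pi):
            p *= A[pi[i]][pi[i + 1]]       # transition a_{pi_i,pi_{i+1}}
    return p

print(path_prob("01011101001", "FFFBBBBBFFF", A, E, {'F': 0.5, 'B': 0.5}))
```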

Three classic HMM problems
Evaluation: Given a model and an output sequence, what is the probability that the model generated that output?
Decoding: Given a model and an output sequence, what is the most likely state sequence (path) through the model that produced that sequence?
Learning: Given a model and a set of observed sequences, what should the model parameters be so that it has a high probability of generating those sequences?

Decoding Problem
Goal: Find an optimal hidden path of states given observations.
Input: A sequence of observations x = x_1 ... x_n generated by an HMM M(Σ, Q, A, E).
Output: A path π that maximizes P(x|π) over all possible paths.

Building a Manhattan for the Decoding Problem
Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem.
Every choice of π = π_1 ... π_n corresponds to a path in the graph.
The only valid direction in the graph is eastward.
This graph has |Q|^2 (n-1) edges.

Edit Graph for the Decoding Problem
[Diagram: the edit graph for the Decoding Problem.]

Decoding Problem vs. Alignment Problem
[Diagrams comparing the valid directions in the alignment problem with the valid directions in the decoding problem.]

Decoding Problem as Finding a Longest Path in a DAG
The Decoding Problem is reduced to finding a longest path in a directed acyclic graph (DAG).
Note: the length of a path is defined as the product of its edge weights, not their sum.

Decoding Problem (cont'd)
Every path in the graph has probability P(x|π).
The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths.
The Viterbi algorithm runs in O(n |Q|^2) time.

Decoding Problem: weights of edges
What is the weight w of the edge (k, i) → (l, i+1)?
Since P(x|π) = Π_{i=0..n-1} e_{π_{i+1}}(x_{i+1}) * a_{π_i,π_{i+1}},
the i-th term of the product is e_{π_{i+1}}(x_{i+1}) * a_{π_i,π_{i+1}}, which for π_i = k and π_{i+1} = l becomes e_l(x_{i+1}) * a_kl.
Therefore the weight of the edge (k, i) → (l, i+1) is w = e_l(x_{i+1}) * a_kl.

Decoding Problem and Dynamic Programming
Let s_{k,i} be the probability of the most probable path for the prefix (x_1, ..., x_i) that ends in state k. Then
s_{l,i+1} = max_{k∈Q} { s_{k,i} * (weight of the edge between (k,i) and (l,i+1)) }
          = max_{k∈Q} { s_{k,i} * a_kl * e_l(x_{i+1}) }
          = e_l(x_{i+1}) * max_{k∈Q} { s_{k,i} * a_kl }.

Decoding Problem (cont'd)
Initialization: s_{begin,0} = 1 and s_{k,0} = 0 for k ≠ begin.
For each i = 0, ..., L-1 and for each l ∈ Q, recursively calculate
s_{l,i+1} = e_l(x_{i+1}) * max_{k∈Q} { s_{k,i} * a_kl }.
Let π* be the optimal path. Then
P(x|π*) = max_{k∈Q} { s_{k,n} * a_{k,end} }.

Viterbi Algorithm
The value of the product can become extremely small, which leads to numerical underflow.
To avoid underflow, use log values instead:
s_{l,i+1} = log e_l(x_{i+1}) + max_{k∈Q} { s_{k,i} + log(a_kl) }
Initialization: s_{begin,0} = 0 and s_{k,0} = -∞ for k ≠ begin.
The score of the best path is Score(x, π*) = max_{k∈Q} { s_{k,n} + log(a_{k,end}) }.
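A compact log-space Viterbi sketch for the Fair Bet Casino HMM. For simplicity it assumes a uniform start distribution instead of explicit begin/end states, so it returns the best path and its log-probability without the final a_{k,end} term; the parameters are restated so the sketch is self-contained.

```python
import math

def viterbi(x, states, A, E, start):
    # s[i][k] = log-probability of the best path for x[:i+1] ending in state k
    s = [{k: math.log(start[k]) + math.log(E[k][x[0]]) for k in states}]
    back = []                                     # back-pointers for traceback
    for i in range(1, len(x)):
        col, ptr = {}, {}
        for l in states:
            best = max(states, key=lambda k: s[-1][k] + math.log(A[k][l]))
            col[l] = math.log(E[l][x[i]]) + s[-1][best] + math.log(A[best][l])
            ptr[l] = best
        s.append(col)
        back.append(ptr)
    last = max(states, key=lambda k: s[-1][k])    # best final state
    path = [last]
    for ptr in reversed(back):                    # follow back-pointers
        path.append(ptr[path[-1]])
    return ''.join(reversed(path)), s[-1][last]

states = ('F', 'B')
A = {'F': {'F': 0.9, 'B': 0.1}, 'B': {'F': 0.1, 'B': 0.9}}
E = {'F': {'0': 0.5, '1': 0.5}, 'B': {'0': 0.25, '1': 0.75}}
print(viterbi("01011101001", states, A, E, {'F': 0.5, 'B': 0.5}))
```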

Viterbi Algorithm Complexity
We calculate the values of O(|Q| · L) cells of the matrix, spending O(|Q|) operations per cell.
Overall time complexity: O(L · |Q|^2).
Space complexity: O(L · |Q|).