ROBI POLIKAR. ECE 402/504 Lecture: Hidden Markov Models. SIGNAL PROCESSING & PATTERN RECOGNITION, ROWAN UNIVERSITY


1 BIOINFORMATICS Lecture Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. SIGNAL PROCESSING & PATTERN RECOGNITION ROWAN UNIVERSITY These lecture notes are prepared by Robi Polikar. Unauthorized use, including duplication, even in part, is not allowed without an explicit written permission. Such permission will be given upon request for noncommercial educational purposes if you agree to all of the following: 1. Restrict the usage of this material for noncommercial and nonprofit educational purposes only; AND 2. The entire presentation is kept together as a whole, including this page and this entire notice; AND 3. You include the following link/reference on your site: Bioinformatics, 2011, Robi Polikar, Rowan University

2 THIS WEEK IN BIOINFORMATICS Markov models Markov chain Hidden Markov model Viterbi algorithm Forward algorithm Backward algorithm HMM parameter estimation: Baum-Welch algorithm Photo / diagram credits: Courtesy: National Human Genome Research Institute; CH N. Cristianini & M. W. Hahn, Introduction to Computational Genomics, Cambridge, 2007; DE R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological Sequence Analysis, Cambridge, 1998; ZB M. Zvelebil & J.O. Baum, Understanding Bioinformatics, Garland Science, 2008; RP Robi Polikar. Robi Polikar, All Rights Reserved, 2011.

3 NEED FOR BETTER METHODS? We have seen a number of bioinformatics applications so far, which were solved by relatively simple algorithms: Finding genes Determine the ORFs by looking at start and stop codons Detecting evolutionary changes Analyze sequence statistics, look for sudden changes Determine if two genes are related Local and global alignment algorithms The problem is that most real-world bioinformatics problems require somewhat more sophisticated approaches: Finding genes in eukaryotic species is virtually impossible with the simple ORF detection technique due to introns and exons, or for short genes. Detecting evolutionary changes simply by looking at sequence statistics is impossible for changes with smaller evolutionary time frames. Determining whether two genes are related is also very difficult, even with the most sophisticated BLAST algorithm, when the genes are relatively short. More sophisticated probabilistic models that can describe the complex genomic interactions are necessary. Model of choice for biological sequences: Markov models (Markov chains and hidden Markov models)

4 RECALL MULTINOMIAL MODEL The DNA sequence is generated by randomly drawing letters from the alphabet N_DNA = {A, C, G, T}, where the letters are assumed to be independent and identically distributed (i.i.d.). For any given sequence position i, we independently draw one of these four letters from the same distribution over the alphabet N_DNA. The constant distribution simply assigns a probability to each letter, p = (p_A, p_C, p_G, p_T), such that the probability of observing nucleotide x at position i of the sequence s is p_x = p(s(i) = x). Note that since there are a finite number of outcomes for each experiment (of drawing a letter from this distribution), this is a discrete distribution defined by its probability mass function:

P(X_1 = x_1, \ldots, X_k = x_k) = \frac{N!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} p_i = 1, \quad \sum_{i=1}^{k} x_i = N

where k is the number of possible outcomes (4, so that p_A + p_C + p_G + p_T = 1), x_i is the number of times outcome i (e.g., an A) is observed, p_i is the probability of success for the i-th outcome, and N is the number of trials (the sequence length).

5 RECALL MULTINOMIAL MODEL Because of the i.i.d. assumption, the multinomial distribution allows us to compute the likelihood of observing any given sequence simply by multiplying the individual probabilities. Given s = s_1 s_2 ... s_N,

P(s) = \prod_{i=1}^{N} p_{s_i}

Note that the i.i.d. assumption makes this an unrealistic model; we also know that the DNA sequence is not completely random. However, this model explains quite a bit of the behavior of DNA data, and finding the regions of DNA where this model is violated does in fact lead to interesting findings. We can easily evaluate the validity of this assumption by looking at the frequency distributions of the letters in specific regions, and checking whether such distributions change over the regions.
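As a quick illustration of this product, here is a minimal MATLAB sketch (not from the slides); the probability vector p and the example sequence s are assumed values for illustration only.

% Multinomial (i.i.d.) likelihood of a DNA sequence -- illustrative sketch
p  = [0.25 0.25 0.25 0.25];          % assumed probabilities for A, C, G, T
nt = 'ACGT';                         % alphabet, in the same order as p
s  = 'ACGGCTTA';                     % example sequence
[~, idx] = ismember(s, nt);          % map each letter of s to its index in the alphabet
Ps    = prod(p(idx));                % P(s) = product of per-letter probabilities
logPs = sum(log(p(idx)));            % log-likelihood avoids underflow for long sequences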

6 MARKOV SEQUENCE MODEL This is a more complex, but possibly more accurate, model based on Markov chains. A Markov chain is a series of discrete observations, called states, where the probability of observing the next state is given by fixed transition probabilities, collected in the transition matrix T. When used in the context of bioinformatics, the states are the individual nucleotides (or amino acids). In a Markov chain, the probability of observing any one of the finite outcomes depends on the previous observations; specifically, on the transition probability of observing a particular outcome after another one. If the process is in state x at step i, then the probability of being in state y at step i+1 is given by the transition probability p_{xy}:

p_{xy} = P(s_{i+1} = y \mid s_i = x)

The transition probabilities between every possible pair of states then determine the transition matrix.

7 MARKOV CHAINS Formally, a first-order Markov chain is a sequence of discrete-valued random variables, X_1, X_2, ..., that follows the Markov property:

P(X_N = x_N \mid X_1 = x_1, X_2 = x_2, \ldots, X_{N-1} = x_{N-1}) = P(X_N = x_N \mid X_{N-1} = x_{N-1})

i.e., the probability of observing the present state (outcome) depends only on the previous state, through the transition probability to move from that state to the current state: given the present state, future and past states are independent. The multinomial model can be interpreted as a zeroth-order Markov chain, where there is no dependence on the previous outcomes. A Markov chain of order m is a process where the next state depends on the m previous states. The probability of observing a particular sequence is then given by

P(s) = P(s_N \mid s_{N-1})\, P(s_{N-1} \mid s_{N-2}) \cdots P(s_2 \mid s_1)\, \pi(s_1) = \pi(s_1) \prod_{i=2}^{N} p_{s_{i-1} s_i}

where π(s_1) is the probability for the starting state to be s_1.
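A minimal MATLAB sketch of this computation (illustrative only; the transition matrix T and initial distribution pi0 below are assumed example values, not estimates from data):

% Probability of a DNA sequence under a first-order Markov chain -- sketch
nt  = 'ACGT';
pi0 = [0.25 0.25 0.25 0.25];            % assumed initial-state distribution
T   = [0.30 0.20 0.30 0.20;             % assumed transition matrix, rows = from A,C,G,T
       0.25 0.30 0.20 0.25;             % columns = to A,C,G,T; each row sums to 1
       0.20 0.30 0.30 0.20;
       0.25 0.25 0.25 0.25];
s = 'ACGGCT';                            % example sequence
[~, idx] = ismember(s, nt);              % letters -> state indices
logP = log(pi0(idx(1)));                 % pi(s_1)
for i = 2:numel(idx)
    logP = logP + log(T(idx(i-1), idx(i)));   % + log p_{s_{i-1} s_i}
end
P = exp(logP);                           % P(s) for short sequences; keep logP otherwise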

8 APPLICATIONS OF MARKOV MODELS Segmentation / gene finding Gene and protein sequences contain distinct regions with different chemical / genomic properties. Markov models can be used to precisely locate the boundaries of these regions. Is a particular (DNA/AA) sequence a gene? Does a greater-than-average occurrence of CG dinucleotides represent a CpG island? Does a given sequence belong to a particular gene / protein family? Does a particular pattern of repeats, unusual insertions, and over/under-representation of GC content represent a pathogenicity island (PAI)? Multiple alignment Aligning several sequences at once can be done by first creating a so-called profile HMM, a model that describes the probabilistic characteristics of all sequences, against which each of the sequences can then be aligned. Prediction of function Simple alignment does not lead to function prediction: two well-aligned AA sequences do not necessarily have the same function. However, profile HMMs can also be used to quickly determine protein function from the given sequences by making probabilistic predictions. There are also many non-bioinformatics applications: Speech processing: speaker or speech identification Financial data analysis Pattern recognition

9 MARKOV CHAINS For the 4-letter DNA nucleotide sequence, we have the following model. Given T and π, we can compute the probability of any given sequence. (Diagram, RP: a four-state Markov chain over A, C, G, T with transition arrows p_AA, p_AC, ..., p_TT between every pair of states.) The transition matrix is

T = \begin{pmatrix} p_{AA} & p_{AC} & p_{AG} & p_{AT} \\ p_{CA} & p_{CC} & p_{CG} & p_{CT} \\ p_{GA} & p_{GC} & p_{GG} & p_{GT} \\ p_{TA} & p_{TC} & p_{TG} & p_{TT} \end{pmatrix}

10 AN EXAMPLE: CPG ISLANDS CG islands, also called CpG islands, are short stretches of genomic regions with a higher frequency of the CG dinucleotide compared to other regions. The "p" in CpG indicates that the C and G are connected by a phosphodiester bond along one strand, to distinguish this dinucleotide from C-G base pairings across the two strands, which are normally connected by hydrogen bonds. In humans, the C in CG dinucleotides is commonly modified by a process called methylation, which tends to convert the C to a T, particularly at sites that are biologically unimportant, or at inactive genes to suppress their expression. As a result, CG dinucleotides appear less often than would be expected from the independent probabilities of C and G. But at biologically important places, such as promoters (start regions of genes), the methylation process is suppressed, leading to a higher concentration of CG dinucleotides than in other regions of the genome. Such regions are called CpG islands, and they are typically a few hundred to a few thousand bp long.

11 AN EXAMPLE: CPG ISLANDS About 56% of human genes and 47% of mouse genes are associated with CpG islands (Antequera and Bird, 1993). CpG islands are commonly defined as regions of DNA at least 200 bp in length that have a G+C content above 50% and a ratio of observed vs. expected CpGs close to or above 0.6. Q: Given a short sequence of nucleotides, how can we determine whether it comes from a CpG island? Q: Given a long sequence, how can we find the CpG islands in it?

12 MARKOV CHAIN Finding CpG islands can best be tackled by using a Markov chain, as we are interested in a probabilistic model that generates a particular ordering of nucleotides: a C followed by a G. Of course, CG pairs can appear in CpG islands as well as in other areas. How do we separate them? Let's denote the area that is in fact a CpG island with +, and other areas with −. Then we have two models. (Diagram, RP: two four-state Markov chains over A, C, G, T, one labeled + for CpG islands and one labeled − for other regions, each with its own full set of transition probabilities p_AA, p_AC, ..., p_TT.)

13 MARKOV CHAINS FOR CPG ISLANDS We can also represent these two models with tables, the transition probabilities of which come from maximum likelihood estimates:

p^+_{xy} = \frac{s^+_{xy}}{\sum_z s^+_{xz}}, \qquad p^-_{xy} = \frac{s^-_{xy}}{\sum_z s^-_{xz}}

where s^+_{xy} is the number of times nucleotide y follows nucleotide x in the + regions (and s^-_{xy} is the corresponding count in the − regions). Note that the normalization term in the denominator sums these numbers of occurrences over all possible choices of the second nucleotide. Then, given training data from CpG (+) and non-CpG (−) regions, we can compute the transition probabilities. Let's assume that we get a table of values for each region. (Tables: estimated transition probabilities for the + model and the − model, with rows indexed by the current nucleotide A, C, G, T and columns by the next nucleotide A, C, G, T.)

14 MARKOV CHAINS FOR CPG ISLANDS We can make some observations from these tables; e.g., note that the transition probability from C to G is much higher in the + region than it is in the − region, as we would expect. Also note that each row adds up to 1, as these are probabilities; i.e., the first row shows the probabilities of each nucleotide following an A. Given a particular sequence s = x_1 x_2 ... x_N, we can compute the probability of that sequence under each model. Let's call these p(s | +) and p(s | −). The log-odds ratio of the two can then be used to determine whether the sequence s comes from a CpG island or not:

S(s) = \log \frac{p(s \mid +)}{p(s \mid -)} = \sum_{i=2}^{N} \log \frac{p^+_{x_{i-1} x_i}}{p^-_{x_{i-1} x_i}} = \sum_{i=2}^{N} \log p^+_{x_{i-1} x_i} - \sum_{i=2}^{N} \log p^-_{x_{i-1} x_i}

where each term in the sum is a log-likelihood ratio.
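A minimal MATLAB sketch of this log-odds scoring (illustrative; plusT and minusT stand for the + and − transition-probability tables, whose actual numeric values from the slides' training data are not reproduced here):

% Log-odds score S(s) for CpG-island detection -- illustrative sketch
% plusT, minusT: assumed 4x4 transition matrices (rows/cols ordered A,C,G,T),
% e.g. estimated from labeled + and - training regions. (Save as cpg_logodds.m.)
function S = cpg_logodds(s, plusT, minusT)
    nt = 'ACGT';
    [~, idx] = ismember(upper(s), nt);            % letters -> indices 1..4
    S = 0;
    for i = 2:numel(idx)
        S = S + log2(plusT(idx(i-1), idx(i))) ...
              - log2(minusT(idx(i-1), idx(i)));   % log-likelihood ratio per step, in bits
    end
end
% Usage: sequences with S > 0 bits are more likely to come from a CpG island.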

15 MARKOV CHAINS FOR CPG ISLANDS So, how do we use this? First take a look at the histogram of scores generated by computing S(s) for all sequences in the + and − regions, normalized with respect to their length. (Figure, DE: histograms of length-normalized scores for the − and + sequences.) We can see a clear separation between the two groups at 0 bits. Then, those sequences whose score is greater than 0 are most likely to come from CpG regions. Let's take a look at the CGCG sequence. The score (sum of log-likelihood ratios) for this sequence is

S(CGCG) = \log \frac{p^+_{CG}}{p^-_{CG}} + \log \frac{p^+_{GC}}{p^-_{GC}} + \log \frac{p^+_{CG}}{p^-_{CG}} > 0

(the numeric values come from the + and − tables above); therefore, we conclude that the sequence CGCG most likely comes from a CpG island.

16 MARKOV CHAINS FOR GENE FINDING? Recall that we started this discussion with two questions: Q: Given a short sequence of nucleotides, how can we determine whether it comes from a CpG island? Q: Given a long sequence, how can we find the CpG islands in it? We answered the first one, but what about the second one? Note that this is essentially the same question as asking: given a long sequence, how can we find the genes in it? We can use a modified approach, essentially following the same process (a sketch follows below): Consider windows of (short) sequences, say of 100 nucleotides, around each nucleotide of the long sequence Calculate the log-odds-ratio-based scores for each window Identify the windows with positive scores Merge intersecting windows to then determine which segments of the long sequence correspond to the CpG islands (or genes, for that matter).
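A minimal MATLAB sketch of this windowing procedure, reusing the hypothetical cpg_logodds scoring function sketched above (the window size and the merging rule are illustrative assumptions):

% Window-based CpG island search -- illustrative sketch
% s: long nucleotide sequence; plusT, minusT: assumed transition tables
w = 100;                                  % window size (an assumed choice)
n = numel(s);
isIsland = false(1, n);
for c = 1:n                               % slide a window centered at each position
    lo = max(1, c - floor(w/2));
    hi = min(n, c + floor(w/2));
    if cpg_logodds(s(lo:hi), plusT, minusT) > 0
        isIsland(lo:hi) = true;           % positive-scoring windows are merged by OR-ing
    end
end
% Contiguous runs of true in isIsland are the predicted CpG islands (or gene-like segments).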

17 MARKOV CHAINS: WHAT IS NOT TO LIKE? Well, the problem is that CpG islands are of variable lengths, but have sharp boundaries. And also, why use a window of 100? If we use too small a window, then every occurrence of CG starts appearing as an island. If we use too wide a window, we miss the sharp boundaries where CpG islands begin and end. A better way is to combine the two Markov chains (one for each of the + and − regions) into a single model, where the combined model allows switching from one chain to the other with some specified probability. But this introduces additional complexities: we now have two states corresponding to the same nucleotide symbols. We resolve this by relabeling the states as A+, C+, G+, T+ and A-, C-, G-, T-. Our combined model may then look like (drum roll...)

18 MARKOV CHAINS → HIDDEN MARKOV MODELS (Figure, DE: the combined model with states A+, C+, G+, T+ and A-, C-, G-, T-.) Note that in addition to the transitions between nucleotides of the two different sets (+ and −), there is another complete set of transitions within each set (not shown in this figure, but as shown earlier for the Markov chain). The transition probabilities within each group are set close to their original values, with a small probability allowing a switch to the other group (e.g., from a + state to a − state). This reflects the fact that once the sequence is in one region (say, the + region), the next nucleotide is more likely to be in the same region than in the opposite region, but there is a small yet positive probability of switching to the other region. This is the hidden Markov model. But wait, what exactly is hidden here?

19 MARKOV CHAINS → HIDDEN MARKOV MODELS Note that there is no longer a one-to-one correspondence between the states and the symbols. In other words, it is no longer possible to tell what state (region, + or −) the system is in just by looking at the observed sequence. Specifically, when we receive the sequence, we receive ACGGC, not A+C+G-G-C+; hence we do not know whether each nucleotide is coming from the + region or the − region. The C could come either from the + or the − region. Hence, the states are hidden, as the sequence does not tell how it was generated.

20 HIDDEN MARKOV MODELS This brings us to the kinds of problems HMMs can help us solve. In general, three types of problems can be addressed by HMMs: 1. Evaluation: Given an HMM and a sequence of observations, s, what is the probability p(s) that the sequence s was generated by this HMM? 2. Decoding: Given an HMM and a sequence of observations, s, what is the most likely set of states (path) that produced s according to this HMM? Determining whether a given sequence came from a CpG island or not is a decoding problem. 3. Learning: Given an HMM and a sequence of observations, s, what are the model parameters (transition and emission probabilities) that will optimize, i.e., maximize, p(s)?

21 HIDDEN MARKOV MODELS A hidden Markov model (HMM) is formally described as follows: We have a set of states with a transition probability between each pair of states. Let's call the transition probability from state i to state j s_ij; needless to say, Σ_j s_ij = 1. In standard HMM terminology, the states are indicated with the letter π and the transition probabilities by a_ij; so state i is indicated as π_i. We have an alphabet of observable symbols; in our case these are either the nucleotides or the AAs. Again, in HMM terminology, the symbols are usually indicated with the letter x. For each state k and each symbol b, we have an emission probability

e_k(b) = P(x_i = b \mid \pi_i = k)

the probability of observing symbol b when the HMM is in state k. Necessarily, then, we have Σ_b e_k(b) = 1. For our nucleotide model, we have e_{A+}(A)=1, e_{A+}(C)=0, e_{A+}(G)=0, e_{A+}(T)=0. Similarly, e_{C+}(A)=0, e_{C+}(C)=1, e_{C+}(G)=0, e_{C+}(T)=0, and so forth for all other states. The emission probabilities, as well as the transition probabilities between + and − states, are computed from training data where the exact locations of the CpG / non-CpG transitions are known. This is called HMM training.
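For concreteness, here is a small MATLAB sketch (not from the slides) that builds the emission matrix for this 8-state CpG model, where each nucleotide state emits its own letter with probability 1; the state ordering is an assumption for illustration.

% Emission matrix for the 8-state CpG HMM -- illustrative sketch
% States (rows): A+, C+, G+, T+, A-, C-, G-, T-
% Symbols (columns): A, C, G, T
E = [eye(4); eye(4)];       % e_{A+}(A)=1, e_{C+}(C)=1, ..., e_{T-}(T)=1; all other entries 0
% Each row sums to 1, as required: sum(E, 2) returns a column of ones.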

22 HIDDEN MARKOV MODELS HMMs can conveniently be represented graphically. (Diagram, RP: two states i and j, with self-transition probabilities a_ii and a_jj, cross-transition probabilities a_ij and a_ji, and emission probabilities e_i(b) and e_j(b) attached to each state.)

23 HIDDEN MARKOV MODELS In our case, we have something like this (the numbers are simulated). (Diagram, RP: a two-state HMM with a CpG state and a non-CpG state; the simulated transition probabilities shown include 0.25 and 0.75, and each state has an emission probability table over A, C, G, T.)

24 HMM: A CLASSIC EXAMPLE The Occasionally Dishonest Casino Problem. A casino uses two sets of dice, one fair and the other loaded, according to the following model: Fair die: P(1) = P(2) = ... = P(6) = 1/6. Loaded die: P(1) = ... = P(5) = 1/10, P(6) = 1/2. (Diagram, RP: the casino stays with the fair die with probability 0.95, switches fair to loaded with probability 0.05, stays with the loaded die with probability 0.90, and switches loaded to fair with probability 0.10.) We observe a sequence of die rolls. Which die was used for each observation? In other words, what is the most likely path of states (fair vs. loaded) that produced the observed sequence?

25 HIDDEN MARKOV MODELS Given an HMM with the transition and emission probabilities, one can easily generate a sequence from this HMM: Choose the first state, π_1, according to the transition probabilities a_{0i}, where state 0 is considered the beginning state. In this first state, emit an observation (a nucleotide) according to the emission probability e_{π_1}. Choose a new state, π_2, according to the transition probabilities a_{π_1 i}, followed by an observation from this state. Continue in this manner until the entire sequence is generated. The joint probability of an observed sequence x and a state sequence π, i.e., of obtaining a particular sequence x on a particular path π, is then

P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{N} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}

where e_{\pi_i}(x_i) is the probability of emitting x_i while in state π_i, a_{\pi_i \pi_{i+1}} is the probability of transitioning from state π_i to state π_{i+1}, and a_{0\pi_1} is the transition probability from the start state (say, state 0) to the initial state π_1. Customarily, π_{N+1} = 0 (END).
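A minimal MATLAB sketch of this joint probability (illustrative; a is an assumed (K+1)x(K+1) transition matrix that includes the begin/end state 0 in its first row and column, and e is an assumed KxM emission matrix):

% Joint probability P(x, pi) of an observation sequence and a state path -- sketch
% a(1,:)    : transitions from the begin/end state 0 (MATLAB index 1)
% a(k+1,l+1): transition probability from state k to state l
% e(k,b)    : probability that state k emits symbol b
function logP = hmm_joint_logprob(x, path, a, e)
    N = numel(x);
    logP = log(a(1, path(1)+1));                           % a_{0,pi_1}
    for i = 1:N
        logP = logP + log(e(path(i), x(i)));               % emission e_{pi_i}(x_i)
        if i < N
            logP = logP + log(a(path(i)+1, path(i+1)+1));  % a_{pi_i, pi_{i+1}}
        else
            logP = logP + log(a(path(i)+1, 1));            % a_{pi_N, 0} (transition to END)
        end
    end
end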

26 HIDDEN MARKOV MODELS Given such a model, imagine how the sequence CGCG might have been generated. CGCG might have been generated by the state sequence C+G-C-G+. Then the probability of CGCG being generated by C+G-C-G+ is

P(CGCG, \pi) = a_{0,C^+} \cdot 1 \cdot a_{C^+,G^-} \cdot 1 \cdot a_{G^-,C^-} \cdot 1 \cdot a_{C^-,G^+} \cdot 1 \cdot a_{G^+,0}

where a_{0,C^+} is the transition from START and a_{G^+,0} is the transition to END. Note that the 1s come from the fact that the probability of a C being emitted by state C+ (or C-, for that matter) is 1, whereas the probability of its emitting any other letter is 0. Recall: e_{C±}(A)=0, e_{C±}(C)=1, e_{C±}(G)=0, e_{C±}(T)=0. The problem, of course, is that we normally do not know the path, i.e., the hidden states. We only observe the sequence CGCG. How do we know what path it came from? More specifically, given a sequence, what is the most likely path (set of states) that generated the observed sequence? In other words, we want to find

\pi^* = \arg\max_{\pi} P(x, \pi) = \arg\max_{\pi} P(\pi \mid x) \quad \text{(why?)}

27 HIDDEN MARKOV MODELS Going back to our CGCG example: CGCG could have come from C+G+C+G+, C-G-C-G-, C+G-C+G-, etc. Which is the most likely path that generated CGCG? Each state sequence has a different probability. Since transitioning from one state set to the other (+ ↔ −) usually has very low probability, the third one is the least likely. The middle sequence is also significantly less likely than the first one, because it includes two C→G transitions, which have a low probability in the − state (see the + and − transition tables given earlier). So, the most likely path is the first one. But wait, we have other problems. We just looked at three possible paths above. There are a lot more of them, and computing P(x, π) for each one is computationally infeasible. Solution?

28 DYNAMIC PROGRAMMING FOR HMMS Recall our problem: given a sequence, what is the most likely path (set of states) that generated the observed sequence? In other words, we want to find π* = argmax_π P(x, π). The most probable path π* can be found recursively. Let v_k(i) be the probability of the most probable path ending in state k with observation x_i, i.e., the probability of the most probable path π = π_1 ... π_i that generates the observations x_1, ..., x_i and ends in state π_i = k. If v_k(i-1) is known for observation i-1, it can be computed for observation i as follows:

v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]

where v_l(i) is the probability of the most probable path ending in state l with observation i, e_l(x_i) is the probability of symbol x_i being emitted in state l, and a_{kl} is the transition probability from state k to state l. This is known as the Viterbi algorithm.

29 VITERBI ALGORITHM So what does this equation say? It says two things: 1. The most probable path for generating the observations x_1, ..., x_i and ending in state l has to emit x_i in state l (this is where the e_l(x_i) comes in), and 2. It has to contain the most probable path for generating x_1, ..., x_{i-1} ending in some state k, followed by the transition from state k to state l (hence the transition probability a_{kl}). This is a dynamic programming problem, and as in previous cases, we need to maintain pointers to the best path. This algorithm is called the Viterbi decoding algorithm. Do all calculations in log-space to avoid underflow due to the repeated product of many small numbers.

Initialization: v_0(0) = 1; v_k(0) = 0 for k > 0
Main recursion, for i = 1, ..., N:
  v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]
  \text{ptr}_i(l) = \arg\max_k \left[ v_k(i-1)\, a_{kl} \right]
Termination:
  P(x, \pi^*) = \max_k \left[ v_k(N)\, a_{k0} \right]
  \pi_N^* = \arg\max_k \left[ v_k(N)\, a_{k0} \right] (the state k that maximizes this product)
Traceback, for i = N, ..., 1:
  \pi_{i-1}^* = \text{ptr}_i(\pi_i^*)
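A self-contained MATLAB sketch of this recursion in log-space (illustrative; it assumes a transition matrix a that includes the begin/end state as its first row and column, and an emission matrix e indexed by state and symbol):

% Viterbi decoding in log-space -- illustrative sketch
% x : 1xN vector of symbol indices
% a : (K+1)x(K+1) transition matrix; row/column 1 is the begin/end state 0
% e : KxM emission matrix, e(k,b) = P(symbol b | state k)
function path = viterbi_decode(x, a, e)
    N = numel(x);  K = size(e, 1);
    V   = -inf(K, N);                 % V(k,i) = log prob of best path ending in state k at i
    ptr = zeros(K, N);                % back-pointers
    for l = 1:K                       % initialization: one step from the begin state
        V(l, 1) = log(a(1, l+1)) + log(e(l, x(1)));
    end
    for i = 2:N                       % main recursion
        for l = 1:K
            [best, k] = max(V(:, i-1) + log(a(2:end, l+1)));
            V(l, i)   = log(e(l, x(i))) + best;
            ptr(l, i) = k;
        end
    end
    [~, last] = max(V(:, N) + log(a(2:end, 1)));   % termination: transition to END
    path = zeros(1, N);  path(N) = last;
    for i = N:-1:2                    % traceback
        path(i-1) = ptr(path(i), i);
    end
end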

30 HMMGENERATE() hmmgenerate — Hidden Markov model states and emissions (Statistics Toolbox). [seq,states] = hmmgenerate(len,TRANS,EMIS) takes a known Markov model, specified by transition probability matrix TRANS and emission probability matrix EMIS, and uses it to generate i) a random sequence seq of emission symbols and ii) a random sequence states of states. The length of both seq and states is len. TRANS(i,j) is the probability of transition from state i to state j. EMIS(k,l) is the probability that symbol l is emitted from state k. Note: the function hmmgenerate begins with the model in state 1 at step 0, prior to the first emission. The model then makes a transition to state i_1 with probability TRANS(1,i_1), and generates an emission k_1 with probability EMIS(i_1,k_1). hmmgenerate returns i_1 as the first entry of states, and k_1 as the first entry of seq. hmmgenerate(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmgenerate(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states. Since the model always begins at state 1, whose transition probabilities are in the first row of TRANS, in the following example the first entry of the output states is 1 with probability 0.95 and 2 with probability 0.05. Examples:
trans = [0.95,0.05; 0.10,0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq,states] = hmmgenerate(1000,trans,emis);   % generate 1000 emissions and the corresponding states according to trans and emis
[seq,states] = hmmgenerate(1000,trans,emis, 'Symbols',{'one','two','three','four','five','six'}, 'Statenames',{'fair';'loaded'})

31 HMMVITERBI() hmmviterbi — Hidden Markov model most probable state path (Statistics Toolbox). STATES = hmmviterbi(seq,TRANS,EMIS), given a sequence seq, calculates the most likely path through the hidden Markov model specified by transition probability matrix TRANS and emission probability matrix EMIS. TRANS(i,j) is the probability of transition from state i to state j. EMIS(i,k) is the probability that symbol k is emitted from state i. Note: the function hmmviterbi begins with the model in state 1 at step 0, prior to the first emission; hmmviterbi computes the most likely path based on the fact that the model begins in state 1. hmmviterbi(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmviterbi(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states. Examples:
trans = [0.95,0.05; 0.10,0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq,states] = hmmgenerate(1000,trans,emis);            % generate 1000 emissions and states according to trans and emis
estimatedstates = hmmviterbi(seq,trans,emis);           % estimate the states from the observations (seq) and the HMM matrices trans and emis
[seq,states] = hmmgenerate(1000,trans,emis, 'Statenames',{'fair';'loaded'});
estimatedstates = hmmviterbi(seq,trans,emis, 'Statenames',{'fair';'loaded'});

32 FORWARD ALGORITHM The probability of observing a particular sequence, P(x), under a given HMM can also be obtained using a similar algorithm, called the forward algorithm. Because multiple state paths can result in the identical sequence, the total probability is obtained by summing over all possible paths that result in the same sequence:

P(x) = \sum_{\pi} P(x, \pi)

Because the number of possible paths that result in any given sequence x increases exponentially with the length of the sequence, computing the probability of each is not practical. But an algorithm very similar to Viterbi, which replaces the maximization operation with a sum (of probabilities), can compute P(x) efficiently. The quantity corresponding to Viterbi's v_k(i) is replaced with f_k(i), which represents the probability of the observed sequence up to and including x_i, ending in state π_i = k:

f_k(i) = P(x_1, \ldots, x_i,\ \pi_i = k)

Initialization: f_0(0) = 1; f_k(0) = 0 for k > 0
Main recursion, for i = 1, ..., N:
  f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}
Termination:
  P(x) = \sum_k f_k(N)\, a_{k0}
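A MATLAB sketch of the forward algorithm under the same assumed matrix conventions as the Viterbi sketch above (illustrative; per-position scaling is used instead of log-space so the sums stay simple):

% Forward algorithm with per-position scaling -- illustrative sketch
% x : 1xN symbol indices; a : (K+1)x(K+1) with begin/end state first; e : KxM emissions
function [logPx, f] = hmm_forward(x, a, e)
    N = numel(x);  K = size(e, 1);
    f = zeros(K, N);  scale = zeros(1, N);
    f(:, 1) = a(1, 2:end)' .* e(:, x(1));           % f_l(1) = a_{0l} e_l(x_1)
    scale(1) = sum(f(:, 1));  f(:, 1) = f(:, 1) / scale(1);
    for i = 2:N
        for l = 1:K
            f(l, i) = e(l, x(i)) * sum(f(:, i-1) .* a(2:end, l+1));
        end
        scale(i) = sum(f(:, i));  f(:, i) = f(:, i) / scale(i);   % rescale to avoid underflow
    end
    logPx = sum(log(scale)) + log(f(:, N)' * a(2:end, 1));        % include transition to END
end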

33 BACKWARD ALGORITHM We have seen two algorithms so far: Viterbi, which finds the most likely path given the sequence and the HMM probabilities, and forward, which finds the probability of observing a given sequence. We may also be interested in the most probable state for an observation x_i, given the entire observed sequence x. More specifically, we want the probability that observation x_i came from state k given the entire observed sequence. Note the conditioning: this is the posterior probability, P(π_i = k | x), of state k at position i when the entire emitted sequence is known. Recall f_k(i) = P(x_1, ..., x_i, π_i = k) from the forward algorithm, and define the corresponding backward quantity

b_k(i) = P(x_{i+1}, \ldots, x_N \mid \pi_i = k)

Initialization (i = N): b_k(N) = a_{k0} for all k
Main recursion, for i = N-1, ..., 1:
  b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)
Termination (technically this step is not necessary, because P(x) is found by the forward algorithm):
  P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)

Posterior calculation: the probability of producing the entire sequence x_1, ..., x_N, with the i-th symbol produced by state k, is

P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_N \mid x_1, \ldots, x_i, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_N \mid \pi_i = k) = f_k(i)\, b_k(i)

because everything after position i depends only on the then-current state k (why?). Hence

P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}
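A matching MATLAB sketch of the backward recursion and the posterior computation (illustrative, same assumed conventions; it reuses the hypothetical hmm_forward sketch above, and the per-position normalization of f_k(i) b_k(i) is equivalent to dividing by P(x)):

% Backward algorithm and posterior state probabilities -- illustrative sketch
function post = hmm_posterior(x, a, e)
    N = numel(x);  K = size(e, 1);
    [~, f] = hmm_forward(x, a, e);                  % scaled forward matrix (sketch above)
    b = zeros(K, N);
    b(:, N) = a(2:end, 1);                          % b_k(N) = a_{k0}
    for i = N-1:-1:1
        for k = 1:K
            b(k, i) = sum(a(k+1, 2:end)' .* e(:, x(i+1)) .* b(:, i+1));
        end
        b(:, i) = b(:, i) / sum(b(:, i));           % rescale to avoid underflow
    end
    post = f .* b;
    post = post ./ sum(post, 1);                    % P(pi_i = k | x), column-normalized
end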

34 POSTERIOR DECODING So, when might we need these posterior probabilities? Sometimes we are interested not in the state sequence itself, but in some property of this sequence. Consider the function

G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)

where g(k) may be 0 for certain states and 1 for the others. Then G(i | x) is precisely the posterior probability of the i-th observation coming from a state in the specified set. Recall the CpG island example, where we can define g(k) = 1 for k ∈ {A+, C+, G+, T+} and g(k) = 0 for k ∈ {A-, C-, G-, T-}. Then G(i | x) is the posterior probability that the i-th observation is in a CpG island.

35 HMM PARAMETER ESTIMATION Of course, everything we have discussed so far assumes that we know the HMM parameters: specifically, the transition and emission probabilities. How do we obtain these probabilities? There are two cases: When the state sequence is known: maximum likelihood estimation (easier). When the state sequence is unknown: the Baum-Welch algorithm (well, by the method of elimination, not easier!).

36 HMM PARAMETER ESTIMATION: MAXIMUM LIKELIHOOD ESTIMATE If the sequence / path is known for all the examples, as is often the case when there is prelabeled training data, we can simply count the relative ratio of the number of times each particular transition and emission occurs. The maximum likelihood estimates (MLE) of these probabilities are defined as the values that maximize the log-likelihood of the observed sequences,

\mathcal{L}(x^1, \ldots, x^J; \theta) = \log P(x^1, \ldots, x^J \mid \theta) = \sum_{j=1}^{J} \log P(x^j \mid \theta)

and are given by

a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}

where θ represents all HMM parameters, A_{kl} is the number of transitions from state k to state l in the training data, and E_k(b) is the number of emissions of observation b from state k in the training data. All MLE estimates suffer from potential overfitting due to insufficient data: what if a particular transition or observation is not represented in the small training set? This leads to undefined probabilities (both numerator and denominator may be zero). Solution: add a fudge factor, a pseudocount r_{kl} and r_k(b), to A_{kl} and E_k(b), respectively. These can be based on prior knowledge / bias about the natural probabilities of these quantities.
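A minimal MATLAB sketch of these counting-based estimates with pseudocounts (illustrative; seq and states are integer-coded training data, and the single pseudocount value r is an assumed choice):

% Maximum likelihood HMM parameter estimates with pseudocounts -- sketch
% seq    : 1xN observed symbol indices (1..M); states : 1xN known state indices (1..K)
function [A_est, E_est] = hmm_mle(seq, states, K, M, r)
    A = r * ones(K, K);                      % pseudocounts r_kl for transitions
    E = r * ones(K, M);                      % pseudocounts r_k(b) for emissions
    for i = 1:numel(seq)
        E(states(i), seq(i)) = E(states(i), seq(i)) + 1;
        if i > 1
            A(states(i-1), states(i)) = A(states(i-1), states(i)) + 1;
        end
    end
    A_est = A ./ sum(A, 2);                  % a_kl   = A_kl / sum_l' A_kl'
    E_est = E ./ sum(E, 2);                  % e_k(b) = E_k(b) / sum_b' E_k(b')
end
% Usage (assumed example): [A_est, E_est] = hmm_mle(seq, states, 2, 6, 1);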

37 HMM PARAMETER ESTIMATION: BAUM-WELCH If the exact path / state sequence is not known, then the probabilities must be estimated using an iterative process: Estimate A_{kl} and E_k(b) by considering the probable paths for the training sequences based on the current values of a_{kl} and e_k(b). Then use the MLE formulas (see previous slide) to obtain the new values of a_{kl} and e_k(b). Iterate. This is the Baum-Welch algorithm, a special case of the more general (and well-known) approach called expectation maximization (the EM algorithm). It is guaranteed to reach a local maximum of the log-likelihood. However, since there may be, and usually are, many local maxima, the solution depends on the initial values of the iteration. BW computes A_{kl} and E_k(b) as the expected number of times each transition or emission will be seen, given the training sequences (of observations only). The probability that the transition k → l is used at position i of sequence x is

P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}

where θ represents all model parameters of the HMM. From this, we can obtain the total expected number of times 1) a_{kl} is used, and 2) letter b appears in state k, by summing over all positions and over all training sequences j:

A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1), \qquad E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i:\, x_i^j = b\}} f_k^j(i)\, b_k^j(i)

38 HMM PARAMETER ESTIMATION: BAUM-WELCH So then, here is the complete BW algorithm: Initialization: Pick arbitrary (random) model parameters. Main recursion: Set all A and E variables to their pseudocount values r. For each sequence in the training data, j = 1, ..., J: calculate f_k(i) for sequence j using the forward algorithm; calculate b_k(i) for sequence j using the backward algorithm; add the contribution of sequence j to A and to E using the expected-count formulas on the previous slide. Calculate the new model parameters of the HMM using the MLE formulas. Termination: Stop if the change in the log-likelihood

\mathcal{L}(x^1, \ldots, x^J; \theta) = \log P(x^1, \ldots, x^J \mid \theta) = \sum_{j=1}^{J} \log P(x^j \mid \theta)

is less than a threshold, or the maximum number of iterations is reached. (A sketch of one re-estimation step follows below.)
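A MATLAB sketch of one Baum-Welch re-estimation step for a single training sequence (illustrative; it assumes unscaled forward and backward matrices f, b and the sequence probability Px have already been computed, e.g., by unscaled variants of the forward/backward sketches above, uses an assumed pseudocount r, and, for simplicity, re-estimates only the state-to-state block of the transition matrix and the emission matrix):

% One Baum-Welch re-estimation step for a single sequence -- illustrative sketch
% x : 1xN symbol indices; a : (K+1)x(K+1) current transitions (begin/end state first)
% e : KxM current emissions; f, b : KxN unscaled forward/backward matrices; Px = P(x)
function [a_new, e_new] = baum_welch_step(x, a, e, f, b, Px, r)
    [K, M] = size(e);  N = numel(x);
    A = r * ones(K, K);  E = r * ones(K, M);
    for i = 1:N-1                                 % expected transition counts
        for k = 1:K
            for l = 1:K
                A(k, l) = A(k, l) + f(k, i) * a(k+1, l+1) * e(l, x(i+1)) * b(l, i+1) / Px;
            end
        end
    end
    for i = 1:N                                   % expected emission counts
        E(:, x(i)) = E(:, x(i)) + f(:, i) .* b(:, i) / Px;
    end
    a_new = a;                                    % begin/end transitions kept fixed in this sketch
    a_new(2:end, 2:end) = A ./ sum(A, 2);         % M-step: normalize expected counts
    e_new = E ./ sum(E, 2);
end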

39 HMMESTIMATE() hmmestimate — Hidden Markov model parameter estimates from emissions and states. [TRANS,EMIS] = hmmestimate(seq,states) calculates the maximum likelihood estimates of the transition, TRANS, and emission, EMIS, probabilities of a hidden Markov model for sequence seq with known states states. hmmestimate(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmestimate(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states. hmmestimate(...,'Pseudoemissions',PSEUDOE) specifies pseudocount emission values in the matrix PSEUDOE. Use this argument to avoid zero probability estimates for emissions with very low probability that might not be represented in the sample sequence. PSEUDOE should be a matrix of size m-by-n, where m is the number of states in the hidden Markov model and n is the number of possible emissions. If the emission of symbol k from state i does not occur in seq, you can set PSEUDOE(i,k) to be a positive number representing an estimate of the expected number of such emissions in the sequence seq. hmmestimate(...,'Pseudotransitions',PSEUDOTR) specifies pseudocount transition values. You can use this argument to avoid zero probability estimates for transitions with very low probability that might not be represented in the sample sequence. PSEUDOTR should be a matrix of size m-by-m, where m is the number of states in the hidden Markov model. If the transition from state i to state j does not occur in states, you can set PSEUDOTR(i,j) to be a positive number representing an estimate of the expected number of such transitions in the sequence states. Pseudotransitions and pseudoemissions: If the probability of a specific transition or emission is very low, the transition might never occur in the sequence states, or the emission might never occur in the sequence seq. In either case, the algorithm returns a probability of 0 for the given transition or emission in TRANS or EMIS. You can compensate for the absence of a transition with the 'Pseudotransitions' and 'Pseudoemissions' arguments. The simplest way to do this is to set the corresponding entry of PSEUDOE or PSEUDOTR to 1; for example, if the transition from state i to state j does not occur in states, set PSEUDOTR(i,j) = 1. This forces TRANS(i,j) to be positive. If you have an estimate for the expected number of transitions in a sequence of the same length as states, and the actual number of transitions from i to j that occur in states is substantially less than what you expect, you can set PSEUDOTR(i,j) to the expected number; this increases the value of TRANS(i,j). For transitions that do occur in states with the frequency you expect, set the corresponding entry of PSEUDOTR to 0, which does not increase the corresponding entry of TRANS. If you do not know the sequence of states, use hmmtrain to estimate the model parameters.
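The slide does not include an example; here is a short illustrative usage, continuing the dishonest-casino matrices used with hmmgenerate above:

% Estimate HMM parameters from labeled data -- illustrative usage
trans = [0.95,0.05; 0.10,0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq,states] = hmmgenerate(1000,trans,emis);        % labeled training data
[trans_est,emis_est] = hmmestimate(seq,states);     % MLE of TRANS and EMIS from known states
% With pseudocounts, to avoid zero estimates for rare transitions/emissions:
[trans_est2,emis_est2] = hmmestimate(seq,states, 'Pseudotransitions',ones(2), 'Pseudoemissions',ones(2,6));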

40 HMMTRAIN() hmmtrain — Hidden Markov model parameter estimates from emissions. [ESTTR,ESTEMIT] = hmmtrain(seq,TRGUESS,EMITGUESS) estimates the transition and emission probabilities for a hidden Markov model using the Baum-Welch algorithm. seq can be a row vector containing a single sequence, a matrix with one row per sequence, or a cell array with each cell containing a sequence. TRGUESS and EMITGUESS are initial estimates of the transition and emission probability matrices. TRGUESS(i,j) is the estimated probability of transition from state i to state j. EMITGUESS(i,k) is the estimated probability that symbol k is emitted from state i. hmmtrain(...,'Algorithm',ALGORITHM) specifies the training algorithm, 'BaumWelch' or 'Viterbi'. The default algorithm is 'BaumWelch'. hmmtrain(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmtrain(...,'Tolerance',TOL) specifies the tolerance used for testing convergence of the iterative estimation process. The default is 1e-4 and applies to any of i) the change in the log likelihood of the input sequence seq under the currently estimated transition and emission matrices; ii) the change in the norm of the transition matrix, normalized by the size of the matrix; iii) the change in the norm of the emission matrix, normalized by the size of the matrix. hmmtrain(...,'Maxiterations',MAXITER) specifies the maximum number of iterations for the estimation process. The default maximum is 100. hmmtrain(...,'Verbose',true) returns the status of the algorithm at each iteration. hmmtrain(...,'Pseudoemissions',PSEUDOE) specifies pseudocount emission values for the Viterbi training algorithm. Use this argument to avoid zero probability estimates for emissions with very low probability that might not be represented in the sample sequence. PSEUDOE should be a matrix of size m-by-n, where m is the number of states in the hidden Markov model and n is the number of possible emissions. If the emission of symbol k from state i does not occur in seq, you can set PSEUDOE(i,k) to be a positive number representing an estimate of the expected number of such emissions in the sequence seq. hmmtrain(...,'Pseudotransitions',PSEUDOTR) specifies pseudocount transition values for the Viterbi training algorithm. Use this argument to avoid zero probability estimates for transitions with very low probability that might not be represented in the sample sequence. PSEUDOTR should be a matrix of size m-by-m, where m is the number of states in the hidden Markov model. If the transition from state i to state j does not occur in the estimated state sequence, you can set PSEUDOTR(i,j) to be a positive number representing an estimate of the expected number of such transitions. If you know the states corresponding to the sequences, use hmmestimate to estimate the model parameters.
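Again, a short illustrative usage for the dishonest-casino model (the initial guesses below are assumptions, deliberately different from the true matrices):

% Baum-Welch training from emissions only -- illustrative usage
trans = [0.95,0.05; 0.10,0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
seq   = hmmgenerate(1000,trans,emis);                    % observations only; states unknown
trguess   = [0.90,0.10; 0.20,0.80];                      % assumed initial guesses
emitguess = [ones(1,6)/6; ones(1,6)/6];
[esttr,estemit] = hmmtrain(seq,trguess,emitguess, 'Maxiterations',500, 'Tolerance',1e-5);
% esttr and estemit converge to a local maximum of the likelihood; the quality of the
% estimates depends on the initial guesses, as noted on the Baum-Welch slides.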

41 LAB 3 Implement the Viterbi / Baum-Welch based HMM for the dishonest casino problem. 100% credit for self-implementation, 60% for using only the built-in Matlab HMM functionality. Project idea (UG): Find three relevant real-world bioinformatics problems that can be solved by the Viterbi algorithm, the forward algorithm, the backward algorithm / posterior decoding, or the Baum-Welch algorithm. Project idea: Find how HMMs can be used for pairwise (UG) / multiple alignment (G), identify at least three bioinformatics applications, and solve them using HMMs. Project idea (G/UG): Study profile HMMs (pHMMs), identify at least three relevant bioinformatics applications, and solve them using HMMs.


More information

Chapter 4: Hidden Markov Models

Chapter 4: Hidden Markov Models Chapter 4: Hidden Markov Models 4.1 Introduction to HMM Prof. Yechiam Yemini (YY) Computer Science Department Columbia University Overview Markov models of sequence structures Introduction to Hidden Markov

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, June 16, S. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence

Grundlagen der Bioinformatik, SS 08, D. Huson, June 16, S. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence rundlagen der Bioinformatik, SS 08,. Huson, June 16, 2008 89 8 Markov chains and Hidden Markov Models We will discuss: Markov chains Hidden Markov Models (HMMs) Profile HMMs his chapter is based on: nalysis,

More information

MACHINE LEARNING 2 UGM,HMMS Lecture 7

MACHINE LEARNING 2 UGM,HMMS Lecture 7 LOREM I P S U M Royal Institute of Technology MACHINE LEARNING 2 UGM,HMMS Lecture 7 THIS LECTURE DGM semantics UGM De-noising HMMs Applications (interesting probabilities) DP for generation probability

More information

Lecture 5: December 13, 2001

Lecture 5: December 13, 2001 Algorithms for Molecular Biology Fall Semester, 2001 Lecture 5: December 13, 2001 Lecturer: Ron Shamir Scribe: Roi Yehoshua and Oren Danewitz 1 5.1 Hidden Markov Models 5.1.1 Preface: CpG islands CpG is

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Pairwise alignment using HMMs

Pairwise alignment using HMMs Pairwise alignment using HMMs The states of an HMM fulfill the Markov property: probability of transition depends only on the last state. CpG islands and casino example: HMMs emit sequence of symbols (nucleotides

More information

Hidden Markov Models. x 1 x 2 x 3 x N

Hidden Markov Models. x 1 x 2 x 3 x N Hidden Markov Models 1 1 1 1 K K K K x 1 x x 3 x N Example: The dishonest casino A casino has two dice: Fair die P(1) = P() = P(3) = P(4) = P(5) = P(6) = 1/6 Loaded die P(1) = P() = P(3) = P(4) = P(5)

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

The Computational Problem. We are given a sequence of DNA and we wish to know which subsequence or concatenation of subsequences constitutes a gene.

The Computational Problem. We are given a sequence of DNA and we wish to know which subsequence or concatenation of subsequences constitutes a gene. GENE FINDING The Computational Problem We are given a sequence of DNA and we wish to know which subsequence or concatenation of subsequences constitutes a gene. The Computational Problem Confounding Realities:

More information

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island

More information

DNA Feature Sensors. B. Majoros

DNA Feature Sensors. B. Majoros DNA Feature Sensors B. Majoros What is Feature Sensing? A feature is any DNA subsequence of biological significance. For practical reasons, we recognize two broad classes of features: signals short, fixed-length

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction

What s an HMM? Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) Hidden Markov Models (HMMs) for Information Extraction Hidden Markov Models (HMMs) for Information Extraction Daniel S. Weld CSE 454 Extraction with Finite State Machines e.g. Hidden Markov Models (HMMs) standard sequence model in genomics, speech, NLP, What

More information

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models.

Hidden Markov Models, I. Examples. Steven R. Dunbar. Toy Models. Standard Mathematical Models. Realistic Hidden Markov Models. , I. Toy Markov, I. February 17, 2017 1 / 39 Outline, I. Toy Markov 1 Toy 2 3 Markov 2 / 39 , I. Toy Markov A good stack of examples, as large as possible, is indispensable for a thorough understanding

More information

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs

Bioinformatics 1--lectures 15, 16. Markov chains Hidden Markov models Profile HMMs Bioinformatics 1--lectures 15, 16 Markov chains Hidden Markov models Profile HMMs target sequence database input to database search results are sequence family pseudocounts or background-weighted pseudocounts

More information

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm

6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm 6.864: Lecture 5 (September 22nd, 2005) The EM Algorithm Overview The EM algorithm in general form The EM algorithm for hidden markov models (brute force) The EM algorithm for hidden markov models (dynamic

More information

Hidden Markov Methods. Algorithms and Implementation

Hidden Markov Methods. Algorithms and Implementation Hidden Markov Methods. Algorithms and Implementation Final Project Report. MATH 127. Nasser M. Abbasi Course taken during Fall 2002 page compiled on July 2, 2015 at 12:08am Contents 1 Example HMM 5 2 Forward

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

L23: hidden Markov models

L23: hidden Markov models L23: hidden Markov models Discrete Markov processes Hidden Markov models Forward and Backward procedures The Viterbi algorithm This lecture is based on [Rabiner and Juang, 1993] Introduction to Speech

More information

Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

More information

1 What is a hidden Markov model?

1 What is a hidden Markov model? 1 What is a hidden Markov model? Consider a Markov chain {X k }, where k is a non-negative integer. Suppose {X k } embedded in signals corrupted by some noise. Indeed, {X k } is hidden due to noise and

More information

Markov chains and Hidden Markov Models

Markov chains and Hidden Markov Models Discrete Math for Bioinformatics WS 10/11:, b A. Bockmar/K. Reinert, 7. November 2011, 10:24 2001 Markov chains and Hidden Markov Models We will discuss: Hidden Markov Models (HMMs) Algorithms: Viterbi,

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 8: Sequence Labeling Jimmy Lin University of Maryland Thursday, March 14, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Advanced Data Science

Advanced Data Science Advanced Data Science Dr. Kira Radinsky Slides Adapted from Tom M. Mitchell Agenda Topics Covered: Time series data Markov Models Hidden Markov Models Dynamic Bayes Nets Additional Reading: Bishop: Chapter

More information

Hidden Markov Models. Terminology and Basic Algorithms

Hidden Markov Models. Terminology and Basic Algorithms Hidden Markov Models Terminology and Basic Algorithms The next two weeks Hidden Markov models (HMMs): Wed 9/11: Terminology and basic algorithms Mon 14/11: Implementing the basic algorithms Wed 16/11:

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208

Hidden Markov Model. Ying Wu. Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 Hidden Markov Model Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1/19 Outline Example: Hidden Coin Tossing Hidden

More information

Statistical NLP: Hidden Markov Models. Updated 12/15

Statistical NLP: Hidden Markov Models. Updated 12/15 Statistical NLP: Hidden Markov Models Updated 12/15 Markov Models Markov models are statistical tools that are useful for NLP because they can be used for part-of-speech-tagging applications Their first

More information