ROBI POLIKAR. ECE 402/504 Lecture: Hidden Markov Models. SIGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY


BIOINFORMATICS Lecture 11-12 Hidden Markov Models ROBI POLIKAR 2011, All Rights Reserved, Robi Polikar. SIGNAL PROCESSING & PATTERN RECOGNITION LABORATORY @ ROWAN UNIVERSITY. These lecture notes are prepared by Robi Polikar. Unauthorized use, including duplication, even in part, is not allowed without an explicit written permission. Such permission will be given upon request for noncommercial educational purposes if you agree to all of the following: 1. Restrict the usage of this material for noncommercial and nonprofit educational purposes only; AND 2. The entire presentation is kept together as a whole, including this page and this entire notice; AND 3. You include the following link/reference on your site: Bioinformatics, 2011, Robi Polikar, Rowan University, http://engineering.rowan.edu/~polikar.

THIS WEEK IN BIOINFORMATICS: Markov models (Markov chain, hidden Markov model), Viterbi algorithm, forward algorithm, backward algorithm, and HMM parameter estimation (Baum-Welch algorithm). Photo / diagram credits: Courtesy of the National Human Genome Research Institute; CH: N. Cristianini & M. W. Hahn, Introduction to Computational Genomics, Cambridge, 2007; DE: R. Durbin, S. Eddy, A. Krogh & G. Mitchison, Biological Sequence Analysis, Cambridge, 1998; ZB: M. Zvelebil & J.O. Baum, Understanding Bioinformatics, Garland Sci., 2008; RP: Robi Polikar, All Rights Reserved 2011.

NEED FOR BETTER METHODS? We have seen a number of bioinformatics applications so far that were solved by relatively simple algorithms: finding genes (determine the ORFs by looking at start and stop codons); detecting evolutionary changes (analyze sequence statistics and look for sudden changes); determining whether two genes are related (local and global alignment algorithms). The problem is that most real-world bioinformatics problems require somewhat more sophisticated approaches. Finding genes in eukaryotic species is virtually impossible with the simple ORF detection technique, due to introns and exons, or for short genes. Detecting evolutionary changes simply by looking at sequence statistics is impossible for changes over smaller evolutionary time frames. Determining whether two genes are related is also very difficult, even with the most sophisticated BLAST algorithm, when the genes are relatively short. More sophisticated probabilistic models that can describe the complex genomic interactions are necessary. The model of choice for biological sequences: Markov models (Markov chains and hidden Markov models).

RECALL MULTINOMIAL MODEL: The DNA sequence is generated by randomly drawing letters from the alphabet N_DNA = {A, C, G, T}, where the letters are assumed to be independent and identically distributed (i.i.d.). For any given sequence position i, we independently draw one of these four letters from the same distribution over the alphabet N_DNA. The constant distribution simply assigns a probability to each letter, p = (p_A, p_G, p_C, p_T), such that the probability of observing any of the nucleotides at position i of the sequence s is p_x = p(s(i) = x). Note that since there are a finite number of outcomes for each experiment (of drawing a letter from this distribution), this is a discrete distribution defined by its probability mass function:

P(X_1 = x_1, \ldots, X_k = x_k) = \frac{N!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}, \qquad \sum_{i=1}^{k} p_i = 1, \quad \sum_{i=1}^{k} x_i = N

where k is the number of possible outcomes (4), x_i is the number of times outcome i (e.g., an A) is observed, p_i is the probability of success for the i-th outcome (so p_A + p_C + p_G + p_T = 1), and N is the number of trials (the sequence length).

RECALL MULTINOMIAL MODEL: Because of the i.i.d. assumption, the multinomial distribution allows us to compute the likelihood of observing any given sequence simply by multiplying the individual probabilities. Given s = s_1 s_2 ... s_N,

P(s) = \prod_{i=1}^{N} p_{s_i}

Note that the i.i.d. assumption makes this an unrealistic model; we also know that the DNA sequence is not completely random. However, this model explains quite a bit of the behavior of DNA data, and finding the regions of DNA where this model is violated does in fact lead to interesting findings. We can easily evaluate the validity of this assumption by looking at the frequency distributions of the letters in specific regions, and checking whether such distributions change over the regions.

MARKOV SEQUENCE MODEL: This is a more complex, but possibly more accurate, model based on Markov chains. A Markov chain is a series of discrete observations, called states, where the probability of observing the next state is given by fixed transition probabilities, collected in the transition matrix T. When used in the context of bioinformatics, the states are the individual nucleotides (or amino acids). In a Markov chain, the probability of observing any one of the finite outcomes depends on the previous observations; specifically, on the transition probabilities of observing a particular outcome after another one. If the process is in state x at step i, then the probability of being in state y at step i+1 is given by the transition probability p_{xy}:

p_{xy} = P(s_{i+1} = y | s_i = x)

The transition probabilities between every pair of states then determine the transition matrix.

MARKOV CHAINS: Formally, a first order Markov chain is a sequence of discrete valued random variables, X_1, X_2, ..., that follows the Markov property:

P(X_N = x_N | X_1 = x_1, X_2 = x_2, \ldots, X_{N-1} = x_{N-1}) = P(X_N = x_N | X_{N-1} = x_{N-1})

i.e., the probability of observing the present state (outcome) depends only on the previous state, through the transition probability to move from that state to the current state: given the present state, future and past states are independent. The multinomial model can be interpreted as a zeroth order Markov chain, where there is no dependence on the previous outcomes. A Markov chain of order m is a process where the next state depends on the m previous states. The probability of observing a particular sequence s = s_1 s_2 ... s_N is then given by

P(s) = P(s_N | s_{N-1}) P(s_{N-1} | s_{N-2}) \cdots P(s_2 | s_1) \pi(s_1) = \pi(s_1) \prod_{i=2}^{N} p_{s_{i-1} s_i}

where \pi(s_1) is the probability for the starting state to be s_1.
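As a small illustration of the formula above, the following MATLAB sketch computes the probability of a nucleotide sequence under a first-order Markov chain. It is a minimal sketch: it assumes the transition matrix T and the initial distribution pi0 are indexed in the order A, C, G, T, and the names markovSeqProb and nt2idx are hypothetical.

% Minimal sketch: probability of a sequence under a first-order Markov chain.
% Assumes rows/columns of T and entries of pi0 are ordered A, C, G, T.
function P = markovSeqProb(seq, T, pi0)
    nt2idx = @(c) find('ACGT' == c);         % map a nucleotide to an index 1..4
    idx = arrayfun(nt2idx, upper(seq));      % convert the sequence to indices
    P = pi0(idx(1));                         % probability of the starting state
    for i = 2:numel(idx)
        P = P * T(idx(i-1), idx(i));         % multiply by each transition probability
    end
end

For example, markovSeqProb('CGCG', T, pi0) evaluates pi0(C) * T(C,G) * T(G,C) * T(C,G).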

APPLICATIONS OF MARKOV MODELS: Segmentation / gene finding: gene and protein sequences contain distinct regions with different chemical / genomic properties, and Markov models can be used to precisely locate the boundaries of these regions. Is a particular (DNA/AA) sequence a gene? Does a greater-than-average occurrence of CG dinucleotides represent a CpG island? Does a given sequence belong to a particular gene / protein family? Does a particular pattern of repeats, unusual insertions, and over/under-representation of GC content represent a pathogenicity island (PAI)? Multiple alignment: aligning several sequences at once can be done by first creating a so-called profile HMM, a model that describes the probabilistic characteristics of all sequences, against which each of the sequences can then be aligned. Prediction of function: simple alignment does not lead to function prediction; two well-aligned AA sequences do not necessarily have the same function. However, profile HMMs can also be used to quickly determine protein function from the given sequences by making probabilistic predictions. There are also many non-bioinformatics applications: speech processing (speaker or speech identification), financial data analysis, pattern recognition.

MARKOV CHAINS: For the 4-letter DNA nucleotide alphabet, we have the following model: a fully connected four-state chain over {A, C, G, T}, with a transition probability p_{xy} on each edge (diagram, RP). Given T and \pi, we can compute the probability of any given sequence.

T = [ p_AA  p_AC  p_AG  p_AT
      p_CA  p_CC  p_CG  p_CT
      p_GA  p_GC  p_GG  p_GT
      p_TA  p_TC  p_TG  p_TT ]

AN EXAMPLE: CPG ISLANDS. CG islands, also called CpG islands, are short stretches of genomic regions with a higher frequency of the CG dinucleotide compared to other regions. The "p" in CpG indicates that the C and G are connected by a phosphodiester bond along the same strand, to distinguish this from C-G base pairing across the two strands, which is normally held by hydrogen bonds. In humans, the C in CG dinucleotides is commonly modified by a process called methylation, and methylated C's have a high tendency to mutate into T's, particularly at sites that are biologically unimportant, or at inactive genes whose expression is suppressed. As a result, CG dinucleotides appear less often than would be expected from the independent probabilities of C and G. But at biologically important places, such as promoters (the start regions of genes), the methylation process is suppressed, leading to a higher concentration of CG dinucleotides than in other regions of the genome. Such regions are called CpG islands, and they are typically 300-3000 bp long.

AN EXAMPLE: CPG ISLANDS. About 56% of human genes and 47% of mouse genes are associated with CpG islands (Antequera and Bird, 1993). CpG islands are commonly defined as regions of DNA at least 200 bp in length that have a G+C content above 50% and a ratio of observed to expected CpGs close to or above 0.6. Q: Given a short sequence of nucleotides, how can we determine whether it comes from a CpG island? Q: Given a long sequence, how can we find the CpG islands in it?

MARKOV CHAIN: Finding CpG islands can best be tackled by using a Markov chain, as we are interested in a probabilistic model that generates a particular ordering of nucleotides: a C followed by a G. Of course, CG pairs can appear in CpG islands as well as in other areas. How do we separate them? Let's denote the areas that are in fact CpG islands with "+", and other areas with "-". Then we have two models: two separate four-state Markov chains over {A, C, G, T}, one with transition probabilities p+_{xy} for the + regions and one with transition probabilities p-_{xy} for the - regions (diagram, RP).

MARKOV CHAINS FOR CPG ISLANDS: We can also represent these two models with tables, the transition probabilities of which come from maximum likelihood estimates:

p+_{xy} = s+_{xy} / \sum_z s+_{xz},    p-_{xy} = s-_{xy} / \sum_z s-_{xz}

where s+_{xy} is the number of times nucleotide y follows nucleotide x in the + regions (and similarly for s-_{xy} in the - regions). Note that the normalization term in the denominator sums these numbers of occurrences over all possible choices of the second nucleotide. Then, given training data from CpG (+) and non-CpG (-) regions, we can compute the transition probabilities. Let's assume that we get the following values for each region:

+     A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

-     A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292
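The maximum likelihood estimates above are just normalized dinucleotide counts. The sketch below is a minimal illustration that computes such a table from a set of training sequences known to come from one type of region (e.g., the + regions); the function name estimateTransitions and the cell-array input are assumptions of the sketch, and the row normalization relies on MATLAB implicit expansion (R2016b or later).

% Minimal sketch: ML estimate of a 4x4 transition table from labeled training
% sequences (a cell array of strings, all from the same type of region).
function T = estimateTransitions(seqs)
    counts = zeros(4, 4);                        % counts(x,y): times y follows x
    for j = 1:numel(seqs)
        idx = arrayfun(@(c) find('ACGT' == c), upper(seqs{j}));
        for i = 2:numel(idx)
            counts(idx(i-1), idx(i)) = counts(idx(i-1), idx(i)) + 1;
        end
    end
    T = counts ./ sum(counts, 2);                % normalize each row to sum to 1
end

Calling this once with sequences from the + regions and once with sequences from the - regions would produce two tables of the form shown above.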

MARKOV CHAINS FOR CPG ISLANDS: We can make some observations from these tables; e.g., note that the transition probability from C to G is much higher in the + region than it is in the - region, as we would expect. Also note that each row adds up to 1, as these are probabilities; i.e., the first row shows the probabilities of each nucleotide following an A (see the + and - tables on the previous slide). Given a particular sequence s = [x_1, x_2, ..., x_N], we can compute the probability of that sequence under each model. Let's call these p(s|+) and p(s|-). The log-odds ratio of the two can then be used to determine whether the sequence s comes from a CpG island or not:

S(s) = \log \frac{p(s|+)}{p(s|-)} = \sum_{i=2}^{N} \log \frac{p^+_{x_{i-1} x_i}}{p^-_{x_{i-1} x_i}} = \sum_{i=2}^{N} \left( \log p^+_{x_{i-1} x_i} - \log p^-_{x_{i-1} x_i} \right)

i.e., a sum of log-likelihood ratios.
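The log-odds score above is a simple sum over consecutive dinucleotides. A minimal MATLAB sketch is given below; it assumes the + and - transition tables are stored as 4x4 matrices indexed in the order A, C, G, T, and the function name cpgScore is hypothetical. Using log2 gives the score in bits.

% Minimal sketch: log-odds score of a sequence under the + and - Markov chains.
function S = cpgScore(seq, Tplus, Tminus)
    idx = arrayfun(@(c) find('ACGT' == c), upper(seq));     % A,C,G,T -> 1..4
    S = 0;
    for i = 2:numel(idx)
        S = S + log2(Tplus(idx(i-1), idx(i))) - log2(Tminus(idx(i-1), idx(i)));
    end
end

A positive score suggests the sequence is more likely to have come from a CpG island than from the background model.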

MARKOV CHAINS FOR CPG ISLANDS: So, how do we use this? First take a look at the histogram of the scores S(s), normalized with respect to sequence length, computed for all sequences in the + and - regions [histogram of length-normalized scores for the - and + groups, DE]. We can see a clear separation between the two groups at 0 bits. Then, those sequences whose score is greater than 0 are most likely to come from CpG regions. Let's take a look at the sequence CGCG. The score (sum of log-likelihood ratios, in bits) for this sequence is

S(CGCG) = \log_2 \frac{0.27}{0.08} + \log_2 \frac{0.34}{0.25} + \log_2 \frac{0.27}{0.08} \approx 1.75 + 0.44 + 1.75 > 0

(using the C-to-G and G-to-C entries of the + and - tables, rounded to two decimals); therefore, we conclude that the sequence CGCG most likely comes from a CpG island.

MARKOV CHAINS FOR GENE FINDING? Recall that we started this discussion with two questions: Q: Given a short sequence of nucleotides, how can we determine whether it comes from a CpG island? Q: Given a long sequence, how can we find the CpG islands in it? We answered the first one, but what about the second one? Note that this is essentially the same question as asking: given a long sequence, how can we find the genes in it? We can use a modified approach, essentially following the same process: consider windows of (short) sequences, say of 100 nucleotides, around each nucleotide of the long sequence; calculate the log-odds-ratio based score for each window; identify the windows with positive scores; merge intersecting windows to then determine which segments of the long sequence correspond to CpG islands (or genes, for that matter).
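A minimal sketch of the windowed scoring step, assuming a scoring function such as the cpgScore sketch above; the window size of 100 follows the slide, and longSeq, Tplus and Tminus are placeholder names.

% Minimal sketch: score overlapping windows of a long sequence and flag positives.
w = 100;                                        % window length (nucleotides)
N = numel(longSeq);
scores = nan(1, N - w + 1);
for start = 1:(N - w + 1)
    scores(start) = cpgScore(longSeq(start:start+w-1), Tplus, Tminus);
end
isCandidate = scores > 0;                       % windows with positive log-odds score
% Overlapping positive windows would then be merged into candidate CpG islands.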

MARKOV CHAINS: WHAT'S NOT TO LIKE? Well, the problem is that CpG islands are of variable lengths, but have sharp boundaries. And also, why use a window of 100? If we use too small a window, then every occurrence of CG starts appearing as an island. If we use too wide a window, we miss the sharp boundaries where CpG islands begin and end. A better way is to combine the two Markov chains (one for each of the + and - regions) into a single model, where the combined model allows switching from one chain to the other with some specified probability. But this introduces an additional complexity: we now have two states corresponding to the same nucleotide symbol. We resolve this by relabeling the states as A+, C+, G+, T+ and A-, C-, G-, T-. Our combined model may then look like (drum roll...)

MARKOV CHAINS TO HIDDEN MARKOV MODELS (DE): Note that in addition to the transitions between states of the two different sets (+ and -), there is another complete set of transitions within each set, not all of which are shown in the figure, just as in the Markov chains shown earlier. The transition probabilities within each group are set close to their original values, with a small probability of switching to the other group (e.g., from a + state to a - state). This reflects the fact that once the sequence is in one region (say, the + region), the next nucleotide is more likely to be in the same region than in the opposite region, but with a small yet positive probability of switching to the other region. This is the hidden Markov model... but wait, what exactly is hidden here?

MARKOV CHAINS TO HIDDEN MARKOV MODELS: Note that there is no longer a one-to-one correspondence between the states and the symbols. In other words, it is no longer possible to tell which state (region, + or -) the system is in just by looking at the observed sequence. Specifically, when we receive the sequence, we receive ACGGC, not A+C+G-G-C+; hence we do not know whether each nucleotide is coming from the + region or the - region. The C could come from either the + or the - region. Hence, the states are hidden, as the sequence does not tell us how it was generated.

HIDDEN MARKOV MODELS: This brings us to the kinds of problems HMMs can help us solve. In general, three types of problems can be addressed by HMMs: 1. Evaluation: Given an HMM and a sequence of observations s, what is the probability p(s) that the sequence s was generated by this HMM? 2. Decoding: Given an HMM and a sequence of observations s, what is the most likely set of states (path) that produced s according to this HMM? Determining whether a given sequence came from a CpG island or not is a decoding problem. 3. Learning: Given an HMM and a sequence of observations s, what are the model parameters (transition and emission probabilities) that will optimize, i.e., maximize, p(s)?

HIDDEN MARKOV MODELS: A hidden Markov model (HMM) is formally described as follows. We have a set of states with a transition probability between each pair of states. In standard HMM terminology, the states are indicated with the letter \pi and the transition probabilities by a_{ij}; so state i is indicated as \pi_i, and the transition probability from state i to state j is a_{ij}. Needless to say, \sum_j a_{ij} = 1. We have an alphabet of observable symbols; in our case these are either the nucleotides or the AAs. Again in HMM terminology, the symbols are usually indicated with the letter x. For each state k and each symbol b, we have an emission probability

e_k(b) = P(x_i = b | \pi_i = k)

the probability of observing symbol b when the HMM is in state k. Necessarily, then, we have \sum_b e_k(b) = 1. For our nucleotide model, we have e_{A+}(A)=1, e_{A+}(C)=0, e_{A+}(G)=0, e_{A+}(T)=0. Similarly, e_{C+}(A)=0, e_{C+}(C)=1, e_{C+}(G)=0, e_{C+}(T)=0, and so forth for all other states. The emission probabilities, as well as the transition probabilities between the + and - states, are computed from training data where the exact locations of CpG / non-CpG transitions are known. This is called HMM training.

HIDDEN MARKOV MODELS: HMMs can conveniently be represented graphically [diagram, RP: two states i and j, each with an emission distribution e_i(b) and e_j(b), self-transition probabilities a_ii and a_jj, and cross-transition probabilities a_ij and a_ji].

HIDDEN MARKOV MODELS: In our case, we have something like the following two-state model (the numbers are simulated): a CpG state with emission probabilities A: 0.161, C: 0.339, G: 0.375, T: 0.125, and a Non-CpG state with emission probabilities A: 0.248, C: 0.246, G: 0.298, T: 0.208; the CpG state switches to the Non-CpG state with probability 0.25 (staying with probability 0.75), and the Non-CpG state switches back with probability 0.15 (staying with probability 0.85) (diagram, RP).
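In the matrix form used by the MATLAB HMM functions discussed later, this simplified two-state model might be written as follows; the assignment of the switching probabilities (0.25 and 0.15) follows the reading of the diagram given above and should be treated as illustrative.

% The two-state CpG / Non-CpG model above in matrix form (illustrative values).
trans = [0.75, 0.25;       % CpG:     stay in CpG, switch to Non-CpG
         0.15, 0.85];      % Non-CpG: switch to CpG, stay in Non-CpG
emis  = [0.161, 0.339, 0.375, 0.125;    % CpG state emissions:     A, C, G, T
         0.248, 0.246, 0.298, 0.208];   % Non-CpG state emissions: A, C, G, T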

HMM: A CLASSIC EXAMPLE. The Occasionally Dishonest Casino Problem: a casino uses two sets of dice, one fair and the other loaded, according to the following model (diagram, RP). The Fair die emits each face 1-6 with probability 0.167, while the Loaded die emits faces 1-5 with probability 0.100 each and face 6 with probability 0.500. The casino stays with the fair die with probability 0.95 and switches to the loaded die with probability 0.05; it stays with the loaded die with probability 0.90 and switches back to the fair die with probability 0.10. We observe the following sequence: 2145451214616562215566212156261661554621545136412521456126454215. Which die was used for each observation? In other words, what is the most likely path of states (fair vs. loaded) that produced the above sequence of observations?

HIDDEN MARKOV MODELS: Given an HMM with the transition and emission probabilities, one can easily generate a sequence from this HMM: choose the first state, \pi_1, according to the transition probabilities a_{0i}, where state 0 is considered the beginning state; in this first state, emit an observation (a nucleotide) according to the emission probability e_{\pi_1}; choose a new state, \pi_2, according to the transition probabilities a_{\pi_1 i}, followed by an observation from this state; continue in this manner until the entire sequence is generated. The joint probability of an observed sequence x and a state sequence \pi, i.e., of obtaining a particular sequence x on a particular path \pi, is then

P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{N} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}

where a_{0\pi_1} is the transition probability from the start state (state 0) to the initial state \pi_1, e_{\pi_i}(x_i) is the probability of emitting x_i while in state \pi_i, and a_{\pi_i \pi_{i+1}} is the probability of transitioning from state \pi_i to state \pi_{i+1}. Customarily, \pi_{N+1} = 0 (END).

HIDDEN MARKOV MODELS: Given such a model, imagine how the sequence CGCG might have been generated. For example, CGCG might have been generated by the state sequence C+G-C-G+. Then the probability of CGCG being generated by C+G-C-G+ is

P(CGCG, C+G-C-G+) = a_{0,C+} \cdot 1 \cdot a_{C+,G-} \cdot 1 \cdot a_{G-,C-} \cdot 1 \cdot a_{C-,G+} \cdot 1 \cdot a_{G+,0}

where a_{0,C+} is the transition from START and a_{G+,0} is the transition to END. Note that the 1's come from the fact that the probability of a C being emitted by state C+ (or C-, for that matter) is 1, whereas the probability of emitting any other letter is 0. Recall: e_{C+/-}(A)=0, e_{C+/-}(C)=1, e_{C+/-}(G)=0, e_{C+/-}(T)=0. The problem, of course, is that we normally do not know the path, i.e., the hidden states. We only observe the sequence CGCG. How do we know what path it came from? More specifically, given a sequence, what is the most likely path (set of states) that generated the observed sequence? In other words, we want to find

\pi^* = \arg\max_\pi P(x, \pi) = \arg\max_\pi P(\pi | x)     (why?)
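The joint probability formula can be evaluated directly once a path is fixed. The sketch below is a minimal illustration for integer-indexed states and symbols; it replaces the explicit START/END transitions (a_{0,\pi_1} and a_{\pi_N,0}) with an initial state distribution p0, which is an assumption of the sketch, and the name jointProb is hypothetical.

% Minimal sketch: joint probability P(x, path) for a fixed state path, given a
% transition matrix trans and an emission matrix emis (states and symbols are
% integer-indexed). The START/END transitions are replaced by an initial
% distribution p0 for simplicity.
function P = jointProb(x, path, trans, emis, p0)
    P = p0(path(1)) * emis(path(1), x(1));
    for i = 2:numel(x)
        P = P * trans(path(i-1), path(i)) * emis(path(i), x(i));
    end
end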

HIDDEN MARKOV MODELS: Going back to our CGCG example: CGCG could have come from C+G+C+G+, C-G-C-G-, C+G-C+G-, etc. Which is the most likely path that generated CGCG? Each state sequence has a different probability. Since transitioning from one state set to the other (+ to -, or - to +) usually has very low probability, the third path is the least likely. The middle sequence is also significantly less likely than the first one, because it includes two C-to-G transitions, which have a low probability in the - model (0.078 in the - table above, versus 0.274 in the + table). So the most likely path is the first one. But wait, we have other problems: we just looked at three possible paths above. There are many more of them, and computing P(x, \pi) for each one is computationally infeasible. Solution?

DYNAMIC PROGRAMMING FOR HMMS: Recall our problem: given a sequence, what is the most likely path (set of states) that generated the observed sequence? In other words, we want to find \pi^* = \arg\max_\pi P(x, \pi). The most probable path \pi^* can be found recursively. Let v_k(i) be the probability of the most probable path ending in state k with observation i, i.e., the probability of the most probable path \pi = \pi_1 \ldots \pi_i that generates the observations x_1, ..., x_i and ends in state \pi_i = k. If v_k(i-1) is known for observation i-1, it can be computed for observation i as follows:

v_l(i) = e_l(x_i) \max_k [ v_k(i-1) \, a_{kl} ]

where v_l(i) is the probability of the most probable path ending in state l with observation i, e_l(x_i) is the probability of symbol x_i being emitted in state l, and a_{kl} is the transition probability from state k to state l. This is known as the Viterbi algorithm.

So what does this equation say? It says two things: 1. The most probable path for generating the observations x_1, ..., x_i and ending in state l has to emit x_i in state l (this is where the e_l(x_i) comes in); and 2. It has to contain the most probable path for generating x_1, ..., x_{i-1} ending in some state k, followed by the transition from state k to state l (hence the transition probability a_{kl}). This is a dynamic programming problem, and as in previous cases, we need to maintain pointers to the best path. This algorithm is called the Viterbi decoding algorithm. Do all calculations in log-space to avoid underflow due to the repeated product of many small numbers.

VITERBI ALGORITHM
Initialize: v_0(0) = 1; v_k(0) = 0 for k > 0.
Main recursion, for i = 1, ..., N:
  v_l(i) = e_l(x_i) \max_k [ v_k(i-1) \, a_{kl} ]
  ptr_i(l) = \arg\max_k [ v_k(i-1) \, a_{kl} ]   (pointer to the state k that maximizes this product)
Termination:
  P(x, \pi^*) = \max_k [ v_k(N) \, a_{k0} ]
  \pi^*_N = \arg\max_k [ v_k(N) \, a_{k0} ]
Traceback, for i = N, ..., 1:
  \pi^*_{i-1} = ptr_i(\pi^*_i)
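A minimal log-space implementation of this recursion is sketched below. It assumes integer-indexed states and symbols, uses a uniform initial distribution in place of the explicit begin-state transitions a_{0k}, and models no explicit end state; the name viterbiSketch is hypothetical (this is not the toolbox function hmmviterbi described later).

% Minimal log-space Viterbi sketch. States and symbols are integer-indexed; a
% uniform initial distribution replaces the begin-state transitions a_{0k}, and
% no explicit end state is modeled (assumptions of this sketch).
function path = viterbiSketch(x, trans, emis)
    K = size(trans, 1); N = numel(x);
    logT = log(trans); logE = log(emis);
    V = -inf(K, N); ptr = zeros(K, N);
    V(:, 1) = log(1/K) + logE(:, x(1));              % initialization
    for i = 2:N
        for l = 1:K
            [best, k] = max(V(:, i-1) + logT(:, l)); % max over previous states k
            V(l, i)  = logE(l, x(i)) + best;
            ptr(l, i) = k;                           % pointer for traceback
        end
    end
    path = zeros(1, N);
    [~, path(N)] = max(V(:, N));                     % termination (no end state here)
    for i = N:-1:2
        path(i-1) = ptr(path(i), i);                 % traceback
    end
end

For the dishonest casino model, viterbiSketch(seq, trans, emis) would return an estimated fair/loaded path, analogous to the hmmviterbi example below.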

HMMGENERATE(): hmmgenerate, Hidden Markov model states and emissions (Statistics Toolbox). [seq,states] = hmmgenerate(len,TRANS,EMIS) takes a known Markov model, specified by transition probability matrix TRANS and emission probability matrix EMIS, and uses it to generate i) a random sequence seq of emission symbols and ii) a random sequence states of states. The length of both seq and states is len. TRANS(i,j) is the probability of transition from state i to state j. EMIS(k,l) is the probability that symbol l is emitted from state k. Note: the function hmmgenerate begins with the model in state 1 at step 0, prior to the first emission. The model then makes a transition to state i_1 with probability T(1,i_1), and generates an emission a_{k_1} with probability E(i_1,k_1). hmmgenerate returns i_1 as the first entry of states, and a_{k_1} as the first entry of seq. hmmgenerate(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmgenerate(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states. Since the model always begins at state 1, whose transition probabilities are in the first row of TRANS, in the following example the first entry of the output states is 1 with probability 0.95 and 2 with probability 0.05.

Examples:
trans = [0.95,0.05; 0.10,0.90];
emis = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq,states] = hmmgenerate(1000,trans,emis); % Generate a sequence of 1000 emissions and the corresponding states according to the trans and emis matrices.
[seq,states] = hmmgenerate(1000,trans,emis, 'Symbols',{'one','two','three','four','five','six'}, 'Statenames',{'fair';'loaded'})

HMMVITERBI(): hmmviterbi, Hidden Markov model most probable state path (Statistics Toolbox). STATES = hmmviterbi(seq,TRANS,EMIS), given a sequence seq, calculates the most likely path through the hidden Markov model specified by transition probability matrix TRANS and emission probability matrix EMIS. TRANS(i,j) is the probability of transition from state i to state j. EMIS(i,k) is the probability that symbol k is emitted from state i. Note: the function hmmviterbi begins with the model in state 1 at step 0, prior to the first emission. hmmviterbi computes the most likely path based on the fact that the model begins in state 1. hmmviterbi(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmviterbi(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states.

Examples:
trans = [0.95,0.05; 0.10,0.90];
emis = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq,states] = hmmgenerate(1000,trans,emis); % Generate a sequence of 1000 emissions and states according to the trans and emis matrices.
estimatedstates = hmmviterbi(seq,trans,emis); % Estimate the states from the observations (seq) and the HMM matrices trans and emis.
[seq,states] = hmmgenerate(1000,trans,emis, 'Statenames',{'fair';'loaded'});
estimatedstates = hmmviterbi(seq,trans,emis, 'Statenames',{'fair';'loaded'});

FORWARD ALGORITHM: The probability of observing a particular sequence, P(x), under a given HMM can also be obtained using a similar algorithm, called the forward algorithm. Because multiple state paths can generate the identical sequence, the total probability is obtained by summing over all possible paths that result in the same sequence:

P(x) = \sum_\pi P(x, \pi)

Because the number of possible paths that result in any given sequence x increases exponentially with the length of the sequence, computing the probability of each one is not practical. But an algorithm very similar to Viterbi, which replaces the maximization operation with a sum (of probabilities), can be used. The quantity corresponding to Viterbi's v_k(i) is replaced with f_k(i), which represents the probability of the observed sequence up to and including x_i, ending in state \pi_i = k:

f_k(i) = P(x_1, \ldots, x_i, \pi_i = k)

FORWARD ALGORITHM
Initialize: f_0(0) = 1; f_k(0) = 0 for k > 0.
Main recursion, for i = 1, ..., N:
  f_l(i) = e_l(x_i) \sum_k f_k(i-1) \, a_{kl}
Termination:
  P(x) = \sum_k f_k(N) \, a_{k0}
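A minimal forward-algorithm sketch under the same assumptions as the Viterbi sketch above (integer-indexed states and symbols, uniform initial distribution, no end state). It works in plain probability space without scaling, so it is only suitable for short sequences; the name forwardSketch is hypothetical.

% Minimal forward-algorithm sketch (same assumptions as the Viterbi sketch).
% No scaling is applied, so this will underflow on long sequences.
function [Px, f] = forwardSketch(x, trans, emis)
    K = size(trans, 1); N = numel(x);
    f = zeros(K, N);
    f(:, 1) = (1/K) * emis(:, x(1));                     % initialization
    for i = 2:N
        f(:, i) = emis(:, x(i)) .* (trans' * f(:, i-1)); % sum over previous states
    end
    Px = sum(f(:, N));                                   % termination: P(x)
end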

BACKWARD ALGORITHM: We have seen two algorithms so far: 1. Viterbi: finds the most likely path, given the sequence and the HMM probabilities. 2. Forward: finds the probability of observing a given sequence. We may also be interested in the most probable state for an observation x_i, given the entire observed sequence x. More specifically, we want the probability that observation x_i came from state k given the entire observed sequence. Note the conditioning: this is the posterior probability, P(\pi_i = k | x), of state k at position i when the entire emitted sequence is known. For this we need the probability of producing the rest of the sequence after position i, given that the i-th symbol was produced by state k:

b_k(i) = P(x_{i+1}, \ldots, x_N | \pi_i = k)

BACKWARD ALGORITHM
Initialize (i = N): b_k(N) = a_{k0} for all k.
Main recursion, for i = N-1, ..., 1:
  b_k(i) = \sum_l a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)
Termination (technically this step is not necessary, because P(x) is already found by the forward algorithm):
  P(x) = \sum_l a_{0l} \, e_l(x_1) \, b_l(1)

Posterior calculation: the probability of producing the entire sequence x_1, ..., x_N with the i-th symbol produced by state k is

P(x, \pi_i = k) = P(x_1, \ldots, x_i, \pi_i = k) \, P(x_{i+1}, \ldots, x_N | x_1, \ldots, x_i, \pi_i = k) = f_k(i) \, b_k(i)

because everything after position i depends only on the then-current state k (why?). Hence

P(\pi_i = k | x) = P(x, \pi_i = k) / P(x) = f_k(i) \, b_k(i) / P(x)
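A matching backward-algorithm sketch, followed by the posterior computation f_k(i) b_k(i) / P(x). With no explicit end state the initialization becomes b_k(N) = 1 rather than a_{k0}; as before, this is an unscaled sketch for short sequences, it reuses the forwardSketch above, and posteriorSketch is a hypothetical name.

% Minimal backward sketch plus posterior state probabilities P(pi_i = k | x).
% No end state is modeled, so b_k(N) = 1 (an assumption of this sketch).
function [post, b] = posteriorSketch(x, trans, emis)
    K = size(trans, 1); N = numel(x);
    b = ones(K, N);                                          % initialization
    for i = N-1:-1:1
        b(:, i) = trans * (emis(:, x(i+1)) .* b(:, i+1));    % backward recursion
    end
    [Px, f] = forwardSketch(x, trans, emis);                 % forward values and P(x)
    post = (f .* b) / Px;                                    % post(k,i) = P(pi_i = k | x)
end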

POSTERIOR DECODING: So, when might we need to know these posterior probabilities? Sometimes we are interested not in the state sequence itself, but in some property of this sequence. Consider the function

G(i | x) = \sum_k P(\pi_i = k | x) \, g(k)

where g(k) may be 0 for certain states and 1 for certain other states. Then G(i | x) is precisely the posterior probability of the i-th observation coming from a state in the specified set. Recall the CpG island example, where we can define g(k) = 1 for k in {A+, C+, G+, T+} and g(k) = 0 for k in {A-, C-, G-, T-}. Then G(i | x) is the posterior probability that the i-th observation is in a CpG island.
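With the posterior matrix from the sketch above, G(i | x) reduces to a weighted column sum; g is assumed to be a 0/1 column vector over the states (e.g., 1 for the + states), and the variable names are illustrative.

% G(i|x) from the posterior matrix post(k,i) = P(pi_i = k | x) and an indicator
% vector g over the states.
G = g' * post;      % 1xN vector: posterior probability that symbol i comes
                    % from a state with g(k) = 1 (e.g., a CpG-island state)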

HMM PARAMETER ESTIMATION: Of course, everything we have discussed so far assumes that we know the HMM parameters, specifically the transition and emission probabilities. How do we obtain these probabilities? There are two cases. When the state sequence is known: maximum likelihood estimation (easier). When the state sequence is unknown: the Baum-Welch algorithm (well, by the method of elimination: not easier!).

HMM PARAMETER ESTIMATION: MAXIMUM LIKELIHOOD ESTIMATE. If the sequence / path is known for all the examples, as is often the case when there is prelabeled training data, we can simply count the relative ratio of the number of times each particular transition and emission occurs. The maximum likelihood estimates (MLE) of these probabilities are defined as the values that maximize the log-likelihood of the observed sequences,

L(x^1, \ldots, x^J | \theta) = \log P(x^1, \ldots, x^J | \theta) = \sum_{j=1}^{J} \log P(x^j | \theta)

and are given by

a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}

where \theta represents all HMM parameters, A_{kl} is the number of transitions from state k to state l in the training data, and E_k(b) is the number of emissions of symbol b from state k in the training data. All MLE estimates suffer from potential overfitting due to insufficient data: what if a particular transition or observation is not represented in the small training set? This leads to undefined probabilities (both numerator and denominator may be zero). Solution: add a fudge factor, a pseudocount r_{kl} and r_k(b), to A_{kl} and E_k(b), respectively. The pseudocounts can be based on prior knowledge / bias about the natural probabilities of these quantities.
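A minimal count-based sketch of this estimator for a single labeled sequence is given below; x holds symbol indices, path holds the known state indices, and rA, rE are pseudocount matrices (pass zeros for the plain MLE). The name mleSketch is hypothetical; the toolbox function hmmestimate described later provides the same functionality.

% Minimal sketch: count-based ML estimate when the state path is known.
% rA (KxK) and rE (KxM) are pseudocount matrices; use zeros for the plain MLE.
function [A, E] = mleSketch(x, path, rA, rE)
    Acnt = rA; Ecnt = rE;                          % start from the pseudocounts
    for i = 1:numel(x)
        Ecnt(path(i), x(i)) = Ecnt(path(i), x(i)) + 1;              % emission count
        if i > 1
            Acnt(path(i-1), path(i)) = Acnt(path(i-1), path(i)) + 1; % transition count
        end
    end
    A = Acnt ./ sum(Acnt, 2);                      % normalize each row
    E = Ecnt ./ sum(Ecnt, 2);
end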

HMM PARAMETER ESTIMATION: BAUM-WELCH. If the exact path / state sequence is not known, then the probabilities must be estimated using an iterative process: estimate A_{kl} and E_k(b) by considering the probable paths for the training sequences based on the current values of a_{kl} and e_k(b); then use the ML formulas (see the previous slide) to obtain new values of a_{kl} and e_k(b); iterate. This is the Baum-Welch algorithm, a special case of the more general (and well known) approach called Expectation Maximization (the EM algorithm). It is guaranteed to reach a local maximum of the log-likelihood. However, since there may be, and usually are, many local maxima, the solution depends on the initial values of the iteration. BW computes A_{kl} and E_k(b) as the expected number of times each transition or emission is used, given the training sequences (of observations only). The probability that the transition from k to l is used at position i of sequence x is

P(\pi_i = k, \pi_{i+1} = l | x, \theta) = \frac{f_k(i) \, a_{kl} \, e_l(x_{i+1}) \, b_l(i+1)}{P(x)}

where \theta represents all model parameters of the HMM. From this, we can obtain the total expected number of times that 1) the transition from k to l is used, and 2) letter b is emitted in state k, by summing over all positions and over all training sequences j:

A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i) \, a_{kl} \, e_l(x^j_{i+1}) \, b_l^j(i+1), \qquad E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{i : x^j_i = b} f_k^j(i) \, b_k^j(i)

HMM PARAMETER ESTIMATION: BAUM-WELCH. So then, here is the complete BW algorithm.
Initialize: Pick arbitrary (random) model parameters.
Main recursion:
  Set all the A and E variables to their pseudocount values r.
  For each sequence in the training data, j = 1, ..., J:
    Calculate f_k(i) for sequence j using the forward algorithm.
    Calculate b_k(i) for sequence j using the backward algorithm.
    Add the contribution of sequence j to A and E using the expected-count expressions on the previous slide.
  Calculate the new model parameters of the HMM using the ML formulas a_{kl} = A_{kl} / \sum_{l'} A_{kl'} and e_k(b) = E_k(b) / \sum_{b'} E_k(b').
Termination: Stop if the change in the log-likelihood L(x^1, \ldots, x^J | \theta) = \sum_{j=1}^{J} \log P(x^j | \theta) is less than a threshold, or the maximum number of iterations is reached.
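A compact sketch of these steps for a single training sequence is given below. It reuses the assumptions of the forward/backward sketches above (uniform initial distribution, no end state, no scaling, no pseudocounts), so it should be read as an illustration of the update equations for short sequences rather than a usable implementation; baumWelchSketch is a hypothetical name, and the toolbox function hmmtrain described later provides a full implementation.

% Compact Baum-Welch sketch for a single training sequence x (symbol indices),
% reusing the assumptions of the forward/backward sketches above: uniform
% initial distribution, no end state, no scaling, no pseudocounts.
function [A, E] = baumWelchSketch(x, A, E, nIter)
    K = size(A, 1); N = numel(x);
    for it = 1:nIter
        % E-step: forward and backward probabilities under the current model
        f = zeros(K, N); f(:, 1) = (1/K) * E(:, x(1));
        for i = 2:N, f(:, i) = E(:, x(i)) .* (A' * f(:, i-1)); end
        Px = sum(f(:, N));
        b = ones(K, N);
        for i = N-1:-1:1, b(:, i) = A * (E(:, x(i+1)) .* b(:, i+1)); end
        % Expected transition and emission counts
        Acnt = zeros(K, K); Ecnt = zeros(size(E));
        for i = 1:N-1
            Acnt = Acnt + A .* (f(:, i) * (E(:, x(i+1)) .* b(:, i+1))') / Px;
        end
        for i = 1:N
            Ecnt(:, x(i)) = Ecnt(:, x(i)) + f(:, i) .* b(:, i) / Px;
        end
        % M-step: re-estimate the model parameters
        A = Acnt ./ sum(Acnt, 2);
        E = Ecnt ./ sum(Ecnt, 2);
    end
end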

HMMESTIMATE(): hmmestimate, Hidden Markov model parameter estimates from emissions and states. [TRANS,EMIS] = hmmestimate(seq,states) calculates the maximum likelihood estimate of the transition (TRANS) and emission (EMIS) probabilities of a hidden Markov model for sequence seq with known states states. hmmestimate(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmestimate(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states. hmmestimate(...,'Pseudoemissions',PSEUDOE) specifies pseudocount emission values in the matrix PSEUDOE. Use this argument to avoid zero probability estimates for emissions with very low probability that might not be represented in the sample sequence. PSEUDOE should be a matrix of size m-by-n, where m is the number of states in the hidden Markov model and n is the number of possible emissions. If the emission of symbol k from state i does not occur in seq, you can set PSEUDOE(i,k) to be a positive number representing an estimate of the expected number of such emissions in the sequence seq. hmmestimate(...,'Pseudotransitions',PSEUDOTR) specifies pseudocount transition values. You can use this argument to avoid zero probability estimates for transitions with very low probability that might not be represented in the sample sequence. PSEUDOTR should be a matrix of size m-by-m, where m is the number of states in the hidden Markov model. If the transition from state i to state j does not occur in states, you can set PSEUDOTR(i,j) to be a positive number representing an estimate of the expected number of such transitions in the sequence states. Pseudotransitions and Pseudoemissions: If the probability of a specific transition or emission is very low, the transition might never occur in the sequence states, or the emission might never occur in the sequence seq. In either case, the algorithm returns a probability of 0 for the given transition or emission in TRANS or EMIS. You can compensate for the absence of a transition or emission with the 'Pseudotransitions' and 'Pseudoemissions' arguments. The simplest way to do this is to set the corresponding entry of PSEUDOE or PSEUDOTR to 1. For example, if the transition from i to j does not occur in states, set PSEUDOTR(i,j) = 1. This forces TRANS(i,j) to be positive. If you have an estimate for the expected number of transitions from i to j in a sequence of the same length as states, and the actual number of such transitions that occur is substantially less than what you expect, you can set PSEUDOTR(i,j) to the expected number. This increases the value of TRANS(i,j). For transitions that do occur in states with the frequency you expect, set the corresponding entry of PSEUDOTR to 0, which does not increase the corresponding entry of TRANS. If you do not know the sequence of states, use hmmtrain to estimate the model parameters.
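For completeness, a minimal usage example in the style of the earlier hmmgenerate / hmmviterbi examples; the matrices are the dishonest-casino values used earlier, and the variable names are illustrative.

trans = [0.95, 0.05; 0.10, 0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
[seq, states] = hmmgenerate(1000, trans, emis);   % generate data with known states
[transEst, emisEst] = hmmestimate(seq, states);   % ML estimates from the known states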

HMMTRAIN(): hmmtrain, Hidden Markov model parameter estimates from emissions. [ESTTR,ESTEMIT] = hmmtrain(seq,TRGUESS,EMITGUESS) estimates the transition and emission probabilities for a hidden Markov model using the Baum-Welch algorithm. seq can be a row vector containing a single sequence, a matrix with one row per sequence, or a cell array with each cell containing a sequence. TRGUESS and EMITGUESS are initial estimates of the transition and emission probability matrices. TRGUESS(i,j) is the estimated probability of transition from state i to state j. EMITGUESS(i,k) is the estimated probability that symbol k is emitted from state i. hmmtrain(...,'Algorithm',ALGORITHM) specifies the training algorithm, 'BaumWelch' or 'Viterbi'. The default algorithm is 'BaumWelch'. hmmtrain(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions. hmmtrain(...,'Tolerance',TOL) specifies the tolerance used for testing convergence of the iterative estimation process. The default is 1e-4 and applies to any of: i) the change in the log likelihood of the input sequence seq under the currently estimated values of the transition and emission matrices; ii) the change in the norm of the transition matrix, normalized by the size of the matrix; iii) the change in the norm of the emission matrix, normalized by the size of the matrix. hmmtrain(...,'Maxiterations',MAXITER) specifies the maximum number of iterations for the estimation process. The default maximum is 100. hmmtrain(...,'Verbose',true) returns the status of the algorithm at each iteration. hmmtrain(...,'Pseudoemissions',PSEUDOE) specifies pseudocount emission values for the Viterbi training algorithm. Use this argument to avoid zero probability estimates for emissions with very low probability that might not be represented in the sample sequence. PSEUDOE should be a matrix of size m-by-n, where m is the number of states in the hidden Markov model and n is the number of possible emissions. If the emission of symbol k from state i does not occur in seq, you can set PSEUDOE(i,k) to be a positive number representing an estimate of the expected number of such emissions in the sequence seq. hmmtrain(...,'Pseudotransitions',PSEUDOTR) specifies pseudocount transition values for the Viterbi training algorithm. Use this argument to avoid zero probability estimates for transitions with very low probability that might not be represented in the sample sequence. PSEUDOTR should be a matrix of size m-by-m, where m is the number of states in the hidden Markov model. If the transition from state i to state j does not occur in states, you can set PSEUDOTR(i,j) to be a positive number representing an estimate of the expected number of such transitions in the sequence states. If you know the states corresponding to the sequences, use hmmestimate to estimate the model parameters.
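Again for completeness, a minimal usage example; the initial guesses below are illustrative values only, and the result of hmmtrain will generally depend on them, since the algorithm converges to a local maximum of the likelihood.

trans = [0.95, 0.05; 0.10, 0.90];
emis  = [1/6 1/6 1/6 1/6 1/6 1/6; 1/10 1/10 1/10 1/10 1/10 1/2];
seq = hmmgenerate(1000, trans, emis);             % observations only, states unknown
trGuess   = [0.90, 0.10; 0.20, 0.80];             % rough initial transition estimates
emitGuess = [ones(1,6)/6; 0.15, 0.15, 0.15, 0.15, 0.15, 0.25];  % rough emission estimates
[estTR, estEMIT] = hmmtrain(seq, trGuess, emitGuess);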

LAB 3: Implement the Viterbi / Baum-Welch based HMM for the dishonest casino problem. 100% credit for self-implementation; 60% for using only the built-in MATLAB HMM functionality. Project Idea (UG): Find three relevant real-world bioinformatics problems that can be solved by the Viterbi algorithm, the forward algorithm, the backward algorithm / posterior decoding, or the Baum-Welch algorithm. Project Idea: Find out how HMMs can be used for pairwise (UG) / multiple (G) alignment, identify at least three bioinformatics applications, and solve them using HMMs. Project Idea: Study profile HMMs (pHMMs), identify at least three relevant bioinformatics applications, and solve them using HMMs (G/UG).