
Hidden Markov Models

Hidden Markov models (HMMs) were developed in the early 1970s and were at first mostly applied to computerized speech recognition. They were first described in a series of statistical papers by L. E. Baum et al. and began to be used extensively in bioinformatics (to model gene and protein behavior) in the mid 1980s.

General Idea: Similar to a Markov chain, a hidden Markov model is a statistical model for a sequence of observations that depend on each other in a systematic way. A hidden Markov model consists of two parts: a sequence of states $q_1, q_2, q_3, \ldots$ and a sequence of emitted observations $O_1, O_2, O_3, \ldots$. Generally, it is assumed that the observations can be observed (duh!) but that the states cannot be observed in practice. That is the "hidden" part of "hidden Markov model".

[Figure: the hidden states $q_1, \ldots, q_5$ form a chain (transitions), and each state $q_t$ emits an observation $O_t$ (emissions).]

The state variables $q$ perform transitions from one state to another according to a discrete finite Markov chain. Each state emits observations $O$ from a finite alphabet with a probability distribution that may depend on the state $q$. To fully describe a hidden Markov model, one needs the following components:

- A set of $N$ states $\{S_1, S_2, \ldots, S_N\}$.
- An alphabet of $M$ distinct observation symbols $A = \{a_1, a_2, \ldots, a_M\}$.
- The transition probability matrix for the states, $P = (p_{ij})$ with $p_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$.
- The emission probabilities, which may depend on the current state $q_t = S_i$: for each state $S_i$ and $a \in A$ define $b_i(a) = P(S_i \text{ emits symbol } a)$. These emission probabilities may be arranged in an $N \times M$ matrix $B$.
- The initial distribution vector $\pi = (\pi_1, \ldots, \pi_N)$ for the first state, with $\pi_i = P(q_1 = S_i)$.

Example: You have two coins: a fair coin in your right hand and a biased coin that comes up heads with probability 0.8 in your left hand. You toss the coins repeatedly (always starting with the right hand), switching hands according to a Markov chain with transition probability matrix $P$ (rows and columns ordered right, left), and call out the results.

$$P = \begin{pmatrix} 0.2 & 0.8 \\ 0.7 & 0.3 \end{pmatrix}$$

(a) What are the states and the observations in this example? Write down the state space and the observation alphabet.
(b) How many (free) parameters does this HMM have? Write them down.
(c) Suppose the coins are tossed three times and we observe THT. What is the probability of this outcome under this model?
(d) Given this outcome, find the most likely sequence of hidden states that emitted it.

The collection of parameters of an HMM is denoted $\lambda = (P, B, \pi)$. In practice, the values of these parameters are usually unknown. Sometimes a structure, for instance the form of $B$, may be inferred (apart from the parameter values) from the biological setting.
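For concreteness, here is this model written out in plain R, together with a brute-force computation of the probability in part (c) that simply sums over all $2^3$ hidden state paths. This is a sketch for illustration only (the object names pi0, P, B are my own, not part of the notes); the forward algorithm introduced later computes the same quantity far more efficiently.

```r
# The coin-tossing HMM as plain R objects, plus a brute-force evaluation of
# P(THT) by summing over every possible hidden state path.

states  <- c("R", "L")                  # right hand (fair coin), left hand (biased coin)
symbols <- c("H", "T")

pi0 <- c(R = 1, L = 0)                  # we always start with the right hand
P <- matrix(c(0.2, 0.8,
              0.7, 0.3), nrow = 2, byrow = TRUE,
            dimnames = list(states, states))
B <- matrix(c(0.5, 0.5,                 # fair coin:   P(H) = P(T) = 0.5
              0.8, 0.2), nrow = 2, byrow = TRUE,
            dimnames = list(states, symbols))

obs <- c("T", "H", "T")

# Sum pi(q1) b_{q1}(O1) p_{q1 q2} b_{q2}(O2) p_{q2 q3} b_{q3}(O3) over all paths
prob <- 0
for (q1 in states) for (q2 in states) for (q3 in states) {
  prob <- prob + pi0[q1] * B[q1, obs[1]] *
                 P[q1, q2] * B[q2, obs[2]] *
                 P[q2, q3] * B[q3, obs[3]]
}
unname(prob)                            # probability of observing T, H, T
```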

Today, hidden Markov models are used very frequently in biological applications. Examples of their use include:

Gene finding: A chromosome consists of regions that code for proteins (coding regions) and regions whose function is still largely unknown today (non-coding regions). However, DNA has a grammatical structure; that is, the composition of base pairs is slightly different in coding and non-coding regions. For instance, it is known that the beginning regions of mammalian genes (promoters) are rich in CG combinations. The program GENSCAN relies on a hidden Markov model to divide a given nucleotide sequence into coding and non-coding regions.

Studying copy number variations: During replication or reproduction, whole regions of the genome are sometimes deleted or duplicated. That leads to some genes being present more than once in a genome. The number of times a gene is present is called the copy number of that gene. There may be evolutionary pressure for or against genes in higher copy numbers. Copy numbers can vary amongst individuals and may vary even between identical twins. Microarray technology can be used to measure copy number variation, and hidden Markov models are used to reconstruct which portions of the genome are duplicated or deleted in a specific individual.

Protein family characterization: A protein family is a group of evolutionarily related proteins. Proteins in the same family often have similar three-dimensional structures, similar functions, and similar amino acid sequences. Currently, there are over 60,000 defined protein families. How can a hidden Markov model help in defining protein families? Starting with training sequences of proteins that have known similarities in sequence and/or function, a hidden Markov model is fitted that describes how the sequences evolved from each other. The parameters of the HMM are estimated from the training data. Knowing the statistical properties of the model allows one to perform hypothesis tests for a new protein of unknown family:

$H_0$: the new protein does not belong to the family
versus
$H_a$: the new protein does belong to the family

The architecture of the HMM used for protein family modeling is described in great detail in Krogh et al. (J. Mol. Biol., 1994). Consider the following four amino acid sequences that may have evolved from a common ancestor:

Protein 1   M P L H L T Q D E L D V
Protein 2   I P H H F A Q D E L S S
Protein 3   I P L H A A Y Q N L S W
Protein 4   V V T H M A Q N F V D L

Evolutionary events that any amino acid in the ancestral sequence may have undergone include:

- Match: the same distribution of amino acids in all sequences.
- Deletion: the amino acid is deleted.
- Insertion: a new amino acid is inserted between two existing ones.

Consider the following HMM for a short ancestral sequence with only four amino acids.

[Figure: a profile HMM with a Begin state, match states M1-M4, delete states D1-D4, insert states I1-I5, and an End state.]

Here, the M-states are the match states. Each match state has a distinct distribution over the 20-letter amino acid alphabet. Corresponding to each match state, there is a deletion state D that emits a dummy symbol $\delta$ (or "-", which stands for deletion of an amino acid). On either side of a match state there is an insertion state I, which again generates an amino acid from the 20-letter amino acid alphabet, but with a distribution characterized by the state I. The Begin and End states do not produce emissions.

(a) How many states does this HMM have?
(b) What are the model parameters P and B in this example?
(c) What would the model parameters have to be for this model to produce completely random sequences? Always the same sequence?
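Since the figure itself is not reproduced here, the following short R sketch (an illustration, not part of the notes) simply enumerates the state set it shows:

```r
# State set of the length-4 profile HMM from the figure above.
match_states  <- paste0("M", 1:4)       # match states M1..M4
delete_states <- paste0("D", 1:4)       # delete states D1..D4 (emit the dummy symbol "-")
insert_states <- paste0("I", 1:5)       # insert states I1..I5
profile_states <- c("Begin", match_states, delete_states, insert_states, "End")
length(profile_states)                  # total number of states in the model
```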

Hidden Markov Models in Gene Finding

In most (non-bacterial) genomes, genes are split into several coding and non-coding regions. The beginning and end of coding sequences within the genome are flanked by start and stop codons. The start codon is preceded by a promoter region. This is a DNA sequence to which the RNA polymerase (which facilitates the copying of DNA into RNA) can bind. The DNA sequences at the splice sites, where two different regions meet, are quite characteristic (usually GT at the 5' end and AG at the 3' end), and properties of the nucleotide sequence (for instance CG content) can differ between these regions.

Interactions of proteins with promoter sites can block the promoter and effectively make it impossible for the RNA polymerase to bind. This can cause the gene to be (at least temporarily) turned off. This concept will become very important for us when we study microarray technology and its applications in a few weeks.

When DNA is transcribed into RNA, it is first transcribed into pre-mRNA, which contains both the introns and the exons. RNA splicing is the process of removing the introns and joining the exons to form a continuous coding sequence.

The goal of a gene finding algorithm is to automatically detect the coding regions, given a long sequence of DNA. The algorithm may classify the regions into upstream, start codon, intron, exon, stop codon, downstream and intergenic regions. There are several different HMM-based algorithms whose goal is to identify coding regions in DNA. GENSCAN uses the hidden state structure shown in the figure to the right: the states move from non-coding intergenic regions to promoters to start codons and then to one or more exons (interspersed with introns if there is more than one exon). The model specifies the probabilities with which each state (blue circle or red diamond in the figure) emits the nucleotides A, T, C, G. (Burge & Karlin, J. Mol. Biol., 1997)

Statistical Inference for Hidden Markov Models

There are three different types of questions that can be asked (and answered) in the context of hidden Markov models.

(1) Given the parameters $\lambda = (P, B, \pi)$ of the model, efficiently calculate the probability of some given output sequence. One algorithm that can efficiently compute $P(O \mid \lambda)$ is called the Forward Algorithm.

(2) Given the parameters $\lambda = (P, B, \pi)$ of the model, find the sequence of hidden states that is most likely to have generated a specific sequence of observations. The algorithm that performs this task is called the Viterbi Algorithm. It finds $\operatorname{argmax}_Q P(Q \mid O)$.

(3) The hardest task is to estimate the values of the model parameters, while not knowing the hidden states, so as to maximize the likelihood of a given sequence of observations. In effect, both the model parameters and the hidden states have to be estimated from the data in order to make the model likelihood as large as possible:

$$\operatorname{argmax}_\lambda P(O \mid \lambda)$$

This task is sometimes referred to in the literature as machine learning. The general solution for this problem is called the EM algorithm ("E" stands for expectation and "M" stands for maximization). The special case of the EM algorithm for hidden Markov models is called the Baum-Welch method.

The Forward Algorithm

Recall that the task of the forward algorithm is to compute the probability of a specific sequence of outputs (assuming the model structure of the HMM is fixed and the model parameters are known). Naively, this could be done by using the law of total probability and considering all possible hidden state sequences $Q$:

$$P(O \mid \lambda) = \sum_{\text{all } Q} P(O \mid Q, \lambda)\, P(Q \mid \lambda)$$

However, considering all possible state sequences that may have led to an observed emission sequence $O$ quickly makes the search space very large, particularly if the observation sequence is long. It would require summing $N^T$ products of $2T$ terms each ($N$ is the size of the state space and $T$ is the length of the sequence of observations). Thus, an iterative procedure is clearly preferable.
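As a quick, illustrative order-of-magnitude check (the numbers are chosen arbitrarily, not taken from the notes): with $N = 4$ hidden states and an observation sequence of length $T = 100$, tiny by genomic standards, the naive sum already ranges over

$$N^T = 4^{100} = 2^{200} \approx 1.6 \times 10^{60}$$

state sequences, whereas the forward algorithm described next needs on the order of $T N^2 = 1600$ operations.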

The forward algorithm iteratively computes the joint probability that at time $t$ the Markov chain is in state $S_i \in S$ and that the observations along the way were $O_1, \ldots, O_t$:

$$\alpha_i(t) = P(O_1, O_2, \ldots, O_t, q_t = S_i \mid \lambda), \quad t = 1, 2, \ldots, T$$

This allows for an efficient computation of the terminal probabilities

$$\alpha_i(T) = P(O_1, O_2, \ldots, O_T, q_T = S_i \mid \lambda)$$

The probability of observing any particular sequence $O = (O_1, \ldots, O_T)$ of emissions can then be written as

$$P(O) = \sum_{i=1}^{N} \alpha_i(T)$$

This calculation still requires recursive computation of $N$ $\alpha$'s, but it is much less costly than an exhaustive search over all possible state sequences.

To recursively compute $\alpha_i(T)$ for $i = 1, \ldots, N$, we need an initialization:

$$\alpha_i(1) = $$

We also need an induction step in which $\alpha_i(t+1)$ is expressed as a function of the previous $\alpha$'s. Then

$$P(O) = \sum_{i=1}^{N} \alpha_i(T)$$

The forward algorithm provides a solution to problem (1), calculating the probability of a given output sequence $O$. It can be carried out in $T N^2$ computations.
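For checking your answers once the blanks are filled in, here is a direct R implementation of the recursion. It assumes the standard textbook form of the missing pieces, namely the initialization $\alpha_i(1) = \pi_i\, b_i(O_1)$ and the induction step $\alpha_j(t+1) = b_j(O_{t+1}) \sum_{i=1}^{N} \alpha_i(t)\, p_{ij}$, and it reuses the pi0, P, B objects from the coin example sketch above.

```r
# Forward recursion (sketch): alpha[i, t] = P(O_1, ..., O_t, q_t = S_i | lambda)
forward_probs <- function(pi0, P, B, obs) {
  N <- nrow(P); T_len <- length(obs)
  alpha <- matrix(0, nrow = N, ncol = T_len, dimnames = list(rownames(P), NULL))
  alpha[, 1] <- pi0 * B[, obs[1]]                       # initialization
  for (t in 1:(T_len - 1)) {
    for (j in 1:N) {                                    # induction step
      alpha[j, t + 1] <- B[j, obs[t + 1]] * sum(alpha[, t] * P[, j])
    }
  }
  alpha
}

# P(O) is the sum of the terminal forward probabilities:
sum(forward_probs(pi0, P, B, c("T", "H", "T"))[, 3])
```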

Example: Recall the coin tossing example from earlier. In this example, the state space was $S = \{\text{right}, \text{left}\}$, the emission alphabet was $A = \{H, T\}$, and the given model parameters were

$$\pi = (1, 0), \qquad P = \begin{pmatrix} 0.2 & 0.8 \\ 0.7 & 0.3 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.5 & 0.5 \\ 0.8 & 0.2 \end{pmatrix}$$

(rows and columns of $P$ ordered R, L; rows of $B$ ordered R, L and columns ordered H, T). Use the forward algorithm to compute the probability of observing the sequence THT.

The Backward Algorithm

In the forward algorithm, we started by considering the possible values of the first state and worked our way forward from there. Similarly, one could also start by considering the possible values of the last state $q_T$ and work backwards. That is, we find the probability of the ending sequence $(O_{t+1}, \ldots, O_T)$ given the hidden state occupied at time $t$. Define

$$\beta_i(t) = P(O_{t+1}, \ldots, O_T \mid q_t = S_i, \lambda)$$

Note that in this definition $\beta$ is a conditional probability, whereas the $\alpha$'s were defined as joint probabilities. The initial $\beta$ is defined as

$$\beta_i(T) = 1, \quad \text{for all } i = 1, \ldots, N$$

The $\beta$-terms are now defined recursively, backwards in time:

$$\begin{aligned}
\beta_i(t) &= P(O_{t+1}, \ldots, O_T \mid q_t = S_i, \lambda) \\
&= \sum_{j=1}^{N} P(O_{t+1}, \ldots, O_T, q_{t+1} = S_j \mid q_t = S_i, \lambda) \\
&= \sum_{j=1}^{N} P(O_{t+1}, \ldots, O_T \mid q_t = S_i, q_{t+1} = S_j, \lambda)\, P(q_{t+1} = S_j \mid q_t = S_i, \lambda) \\
&= \sum_{j=1}^{N} P(O_{t+1}, \ldots, O_T \mid q_{t+1} = S_j, \lambda)\, P(q_{t+1} = S_j \mid q_t = S_i, \lambda) \\
&= \sum_{j=1}^{N} P(O_{t+2}, \ldots, O_T \mid q_{t+1} = S_j, \lambda)\, P(O_{t+1} \mid q_{t+1} = S_j, \lambda)\, P(q_{t+1} = S_j \mid q_t = S_i, \lambda) \\
&= \sum_{j=1}^{N} \beta_j(t+1)\, b_j(O_{t+1})\, p_{ij}
\end{aligned}$$
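The last line of this derivation translates directly into code. The sketch below mirrors the forward sketch above (same matrix conventions, same caveat that it is an illustration rather than the notes' own code):

```r
# Backward recursion (sketch): beta[i, t] = P(O_{t+1}, ..., O_T | q_t = S_i, lambda)
backward_probs <- function(P, B, obs) {
  N <- nrow(P); T_len <- length(obs)
  beta <- matrix(0, nrow = N, ncol = T_len, dimnames = list(rownames(P), NULL))
  beta[, T_len] <- 1                                    # beta_i(T) = 1 for all i
  for (t in (T_len - 1):1) {
    for (i in 1:N) {                                    # sum over the next state j
      beta[i, t] <- sum(beta[, t + 1] * B[, obs[t + 1]] * P[i, ])
    }
  }
  beta
}

backward_probs(P, B, c("T", "H", "T"))
```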

The Viterbi Algorithm

Recall that the goal of the Viterbi algorithm is to find the most likely sequence of states that produced a given output sequence. Here, we assume that the parameters $\lambda$ of the hidden Markov model are known. We want to find

$$\operatorname{argmax}_Q P(Q \mid O, \lambda) = \operatorname{argmax}_Q P(Q, O \mid \lambda)$$

Just like the forward and backward algorithms, the Viterbi algorithm is defined recursively. Let $v_i(t)$ be the probability of the most likely state sequence for the first $t$ observations that ends in state $S_i$. That is,

$$v_i(t) = \max_{1 \le j \le N} \left( P(O_t \mid q_t = S_i)\, p_{ji}\, v_j(t-1) \right)$$

The sequences are initialized by defining

$$v_i(1) = P(O_1 \mid q_1 = S_i)\, \pi_i$$

The Viterbi path $x_1, \ldots, x_T$ is defined as the sequence of states that achieves these maxima. That is,

$$x_T = \operatorname{argmax}_{1 \le i \le N} v_i(T), \qquad x_t = \operatorname{argmax}_{1 \le j \le N} \left( P(O_t \mid q_t = S_i)\, p_{ji}\, v_j(t-1) \right)$$

In practice, one stores these maximizing indices (back-pointers) while computing the $v_i(t)$ and then traces them backwards from $x_T$. As for the forward and backward algorithms, the complexity of this algorithm is $O(TN^2)$.

Example: Recall the coin tossing example from earlier. In this example, the state space was $S = \{\text{right}, \text{left}\}$, the emission alphabet was $A = \{H, T\}$, and the given model parameters were

$$\pi = (1, 0), \qquad P = \begin{pmatrix} 0.2 & 0.8 \\ 0.7 & 0.3 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.5 & 0.5 \\ 0.8 & 0.2 \end{pmatrix}$$

Use the Viterbi algorithm to find the most likely state sequence that gave rise to the sequence THT.
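A direct implementation of this recursion, with explicit back-pointers for the traceback, might look as follows (a sketch built on the pi0, P, B objects from the coin example above; you can use it to check your answer for THT):

```r
# Viterbi recursion (sketch): v[i, t] is the probability of the most likely
# state sequence for the first t observations that ends in state S_i.
viterbi_path <- function(pi0, P, B, obs) {
  N <- nrow(P); T_len <- length(obs); states <- rownames(P)
  v   <- matrix(0,  nrow = N, ncol = T_len)
  ptr <- matrix(0L, nrow = N, ncol = T_len)             # back-pointers
  v[, 1] <- pi0 * B[, obs[1]]                           # v_i(1) = P(O_1 | q_1 = S_i) * pi_i
  for (t in 2:T_len) {
    for (i in 1:N) {
      cand <- v[, t - 1] * P[, i] * B[i, obs[t]]        # v_j(t-1) * p_ji * b_i(O_t)
      v[i, t]   <- max(cand)
      ptr[i, t] <- which.max(cand)
    }
  }
  path <- integer(T_len)
  path[T_len] <- which.max(v[, T_len])                  # x_T = argmax_i v_i(T)
  for (t in (T_len - 1):1) path[t] <- ptr[path[t + 1], t + 1]
  states[path]
}

viterbi_path(pi0, P, B, c("T", "H", "T"))
```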

The EM Algorithm

Recall that the goal of the EM algorithm is to estimate the (possibly numerous) parameters of a hidden Markov model from the sequence of observations, without knowing the explicit sequence of hidden states. The original EM algorithm was developed for maximum likelihood estimation of parameters in probability models with missing data (Dempster, Laird, Rubin, 1977). Consider the following scenario: We have a probability model that generates observations $x$. The model has parameters $\theta$. There may be some missing data $y$. In the hidden Markov model context we have:

Notation    Description         HMM analog
$x$         observed data       output sequence $O_1, \ldots, O_T$
$y$         missing data        state sequence $q_1, \ldots, q_T$
$\theta$    model parameters    $\lambda = (\pi, P, B)$

The likelihood of a set of parameter values $\theta$ is defined as the probability of observing a given set of outcomes given those parameter values:

$$L(\theta \mid x) = P(x \mid \theta)$$

Since joint probabilities are usually computed as products with many terms, it is in many cases more convenient to work with log-likelihood functions rather than with likelihood functions directly. A maximum likelihood parameter estimate is the value $\hat{\theta}$ that maximizes the likelihood function. The same value also maximizes the log-likelihood function, since $\log(x)$ is monotone. That is, we want to find $\hat{\theta}$ to maximize

$$\log P(x \mid \theta) = \log \sum_y P(x, y \mid \theta)$$

Here the sum is taken over all possible values of the missing data.

We will now provide a heuristic justification (not a strict proof) of how the EM algorithm works. First, note that $P(x, y \mid \theta) = P(y \mid x, \theta)\, P(x \mid \theta)$ and write

$$\log P(x \mid \theta) = \log P(x, y \mid \theta) - \log P(y \mid x, \theta)$$

Suppose $\theta_t$ is a current estimate (not necessarily the optimal estimate) of the parameter vector $\theta$. Multiply both sides of the above equation by $P(y \mid x, \theta_t)$ and sum over all possible values of $y$. On the left side of the equation nothing changes, since

$$\sum_y \log P(x \mid \theta)\, P(y \mid x, \theta_t) = \log P(x \mid \theta) \sum_y P(y \mid x, \theta_t) = \log P(x \mid \theta) \cdot 1 = \log P(x \mid \theta)$$

On the other side, we'll have

$$\log P(x \mid \theta) = \sum_y P(y \mid x, \theta_t) \log P(x, y \mid \theta) \;-\; \sum_y P(y \mid x, \theta_t) \log P(y \mid x, \theta)$$

Give the first term in this equation a name:

$$Q(\theta \mid \theta_t) = \sum_y P(y \mid x, \theta_t) \log P(x, y \mid \theta)$$

One can show that maximizing $Q(\theta \mid \theta_t)$ with respect to $\theta$ never decreases the likelihood function $P(x \mid \theta)$ (compared to $P(x \mid \theta_t)$). Note that $Q(\theta \mid \theta_t)$ is the expected value of $\log P(x, y \mid \theta)$ with respect to the distribution of $y$ given $x$ and $\theta_t$.

The name of the EM algorithm comes from the two steps that are now carried out in an alternating fashion:

Initialization: Pick an initial value for $\theta$, say $\theta_0$.
E-step: For the current $\theta$ estimate, say $\theta_t$, compute $Q(\theta \mid \theta_t)$, that is, the expectation of $\log P(x, y \mid \theta)$ over $y$.
M-step: Maximize $Q(\theta \mid \theta_t)$ with respect to $\theta$. Call the new argmax $\theta_{t+1}$.
Repeat: Check whether the termination criterion is met; if not, go back to the E-step with the updated $\theta_{t+1}$.

The EM algorithm can be shown to converge to a local maximum of the likelihood function $P(x \mid \theta)$. Note that a local maximum is not necessarily equal to the global maximum. It is usually a good idea to restart the algorithm with different initial values $\theta_0$. The algorithm should be terminated either after a fixed number of steps or once the likelihood function does not change much anymore (percent change less than some threshold).

The Baum-Welch Algorithm

In the specific context of hidden Markov models, the observed data are the outputs $O = (O_1, \ldots, O_T)$, the missing data are the states $q = (q_1, \ldots, q_T)$, and the model parameters are $\lambda = (\pi, P, B)$. Next, we need to introduce two new quantities. Define

$$\gamma_i(t) = P(q_t = S_i \mid O, \lambda)$$

Note that because of the Markov property we can write

$$P(O, q_t = S_i \mid \lambda) = $$

so that

$$\gamma_i(t) = $$

Also define

$$\xi_{ij}(t) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)$$

In a similar way one can show that

$$\xi_{ij}(t) = \frac{\alpha_i(t)\, p_{ij}\, b_j(O_{t+1})\, \beta_j(t+1)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i(t)\, p_{ij}\, b_j(O_{t+1})\, \beta_j(t+1)}$$

Note that both $\gamma_i(t)$ and $\xi_{ij}(t)$ involve the forward probabilities $\alpha_i(t)$ as well as the backward probabilities $\beta_j(t)$. Why do we need all this complicated notation? Let's take a look at what the two new quantities $\gamma$ and $\xi$ represent.

$\gamma_i(t)$ is the probability that the hidden state is $S_i$ at time $t$, given the complete sequence of observations and the current set of parameters.

$\xi_{ij}(t)$ is the probability that from time $t$ to time $t+1$ the hidden states transition from $i$ to $j$, given the complete sequence of observations and the current set of parameters.

To estimate the initial distribution $\pi$, we would like an estimate of the probability that the hidden chain is in state $i$ at time $t = 1$. Note that

$$\hat{\pi}_i = \gamma_i(1) = P(q_1 = S_i \mid O, \lambda)$$

where $\lambda$ is the current set of parameter estimates. Furthermore, $\sum_{t=1}^{T-1} \gamma_i(t)$ is the expected number of times the hidden chain will be in state $i$, whereas $\sum_{t=1}^{T-1} \xi_{ij}(t)$ is the expected number of times the hidden chain transitions from $i$ to $j$. Finally,

$$\hat{p}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}, \qquad \hat{b}_j(o) = \frac{\sum_{t:\, O_t = o} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}$$
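Putting the pieces together, a single Baum-Welch update (one E-step followed by one M-step) can be sketched in a few lines of R, reusing the forward_probs() and backward_probs() sketches from above. As before, this is an illustration under those assumptions, not the notes' reference implementation.

```r
# One Baum-Welch update: compute gamma and xi from the current parameters,
# then re-estimate pi, P and B with the formulas above.
baum_welch_step <- function(pi0, P, B, obs) {
  N <- nrow(P); T_len <- length(obs); symbols <- colnames(B)
  alpha  <- forward_probs(pi0, P, B, obs)
  beta   <- backward_probs(P, B, obs)
  prob_O <- sum(alpha[, T_len])                            # P(O | lambda)

  gamma <- alpha * beta / prob_O                           # gamma_i(t) = P(q_t = S_i | O, lambda)
  xi <- array(0, dim = c(N, N, T_len - 1))                 # xi_ij(t) for t = 1, ..., T-1
  for (t in 1:(T_len - 1)) {
    xi[, , t] <- (alpha[, t] %o% (B[, obs[t + 1]] * beta[, t + 1])) * P / prob_O
  }

  pi_new <- gamma[, 1]                                     # pi-hat_i = gamma_i(1)
  P_new  <- apply(xi, c(1, 2), sum) /                      # expected transitions i -> j
            rowSums(gamma[, 1:(T_len - 1), drop = FALSE])  # expected visits to state i
  B_new  <- sapply(symbols, function(o)                    # b-hat_j(o)
              rowSums(gamma[, obs == o, drop = FALSE])) / rowSums(gamma)
  list(pi = pi_new, P = P_new, B = B_new)
}
```

Iterating this update until $P(O \mid \lambda)$ stops changing appreciably is exactly the loop summarized next.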

Initialization: Pick an initial value for $\lambda$, say $\lambda_0$. Often, the entries of $\pi$, $P$, and $B$ are chosen to all be equally likely. If an entry is chosen to be zero, it will remain zero during the algorithm.

E-step: For the current $\lambda$ estimate, say $\lambda_t$, first compute $\alpha_i(t)$ and $\beta_i(t)$ (for $i = 1, \ldots, N$ and $t = 1, \ldots, T$). Then use them to compute $\gamma_i(t)$ and $\xi_{ij}(t)$ (for $i, j = 1, \ldots, N$ and $t = 1, \ldots, T$).

M-step: Compute $\hat{\pi}_i$, $\hat{p}_{ij}$, and $\hat{b}_i(o)$ for $i, j = 1, \ldots, N$ and $o \in A$.

Repeat: Compute the model likelihood $P(O \mid \lambda)$. Check whether the termination criterion is met; if not, go back to the E-step with the updated $\lambda_{t+1}$.

The HMM package in R

For hidden Markov models, numerous R packages have become available in the first decade of this century for simulating data and running the algorithms we discussed (forward, backward, Viterbi, Baum-Welch/EM). The most prominent examples of such packages are HMM and HiddenMarkov. The different packages have many overlapping functionalities.

Example: The HMM package can simulate data from a hidden Markov model with specified parameters. Let the state space be $S = \{1, 2\}$ and the emission alphabet $A = \{a, b, c\}$. Specify the initial state distribution $\pi$, the transition probability matrix $P$ and the emission probability matrix $B$ as follows:

$$\pi = (0.5, 0.5), \qquad P = \begin{pmatrix} 0.2 & 0.8 \\ 0.7 & 0.3 \end{pmatrix}, \qquad B = \begin{pmatrix} 0.1 & 0.2 & 0.7 \\ 0.4 & 0.4 & 0.2 \end{pmatrix}$$

In R, one initializes the model (i.e., defines the parameters, the state space, and the emission alphabet) with the command initHMM(). Then the command simHMM() can be used to simulate a sequence of n observations generated from the HMM previously defined. The results are random; that means if you repeat the command, you may get different values.
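For instance, the example above can be set up and simulated with the HMM package roughly as follows (a sketch; the object names hmm and sim are arbitrary):

```r
# install.packages("HMM")   # if the package is not installed yet
library(HMM)

hmm <- initHMM(States  = c("1", "2"),
               Symbols = c("a", "b", "c"),
               startProbs    = c(0.5, 0.5),
               transProbs    = matrix(c(0.2, 0.8,
                                        0.7, 0.3), nrow = 2, byrow = TRUE),
               emissionProbs = matrix(c(0.1, 0.2, 0.7,
                                        0.4, 0.4, 0.2), nrow = 2, byrow = TRUE))

sim <- simHMM(hmm, length = 20)    # simulate 20 steps of the chain
sim$observation                    # the simulated emission sequence
sim$states                         # the (normally hidden) state sequence
```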

Given the parameters of an HMM (in the form of an initHMM() object), the forward() function computes, for each time point $t$, the joint probability of the outputs observed up to time $t$ and the hidden state at time $t$, i.e., the forward probabilities $\alpha_i(t)$. The probabilities are reported on a log scale. The backward() function computes the probabilities of the future outputs $O_{t+1}, \ldots, O_T$ given the hidden state at time $t$, i.e., the backward probabilities $\beta_i(t)$.

Recall that the viterbi() function finds the most likely sequence of states that generated a given sequence of observations. The user has to provide the model parameters in the form of an initHMM() object for this algorithm.

The baumWelch() function estimates the parameters $(\pi, P, B)$ of the model, given an output sequence and initial parameter estimates; the most likely state sequence under the fitted model can then be obtained with viterbi(). The initial parameter estimates are often noninformative matrices, in which every transition or emission is equally likely.
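Continuing with the hmm object and the simulated sequence sim from the previous sketch, the four functions can be called roughly as follows (again a sketch; in particular, the rough starting model handed to baumWelch() is my own choice, not prescribed by the notes):

```r
logf <- forward(hmm, sim$observation)    # log forward probabilities alpha_i(t)
logb <- backward(hmm, sim$observation)   # log backward probabilities beta_i(t)

# P(O | lambda): sum of the terminal forward probabilities, back on the original scale
sum(exp(logf[, ncol(logf)]))

viterbi(hmm, sim$observation)            # most likely hidden state sequence

# Baum-Welch from a rough starting model (any valid probabilities can serve as a start)
hmm0 <- initHMM(c("1", "2"), c("a", "b", "c"),
                transProbs    = matrix(c(0.5, 0.5,
                                         0.4, 0.6), nrow = 2, byrow = TRUE),
                emissionProbs = matrix(c(0.3, 0.3, 0.4,
                                         0.4, 0.3, 0.3), nrow = 2, byrow = TRUE))
fit <- baumWelch(hmm0, sim$observation, maxIterations = 100)
fit$hmm                                  # the estimated pi, P and B
```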