Topics in Probability Theory and Stochastic Processes Steven R. Dunbar. Examples of Hidden Markov Models


Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Topics in Probability Theory and Stochastic Processes
Steven R. Dunbar

Examples of Hidden Markov Models

Rating: Student: contains scenes of mild algebra or calculus that may require guidance.

Section Starter Question

Key Concepts

1. Hidden Markov Models are useful representations of situations ranging from bioinformatics to speech recognition and language processing.

2. A Hidden Markov Model consists of a Markov chain among states and the expression of a signal or observation from each state. The states are hidden.

3. With Hidden Markov Models we usually have only the observations or signals, not all the necessary information for a complete representation. From the observations, we wish to find the most likely states. The words "most likely" indicate that we must consider possible measures of optimality. So Hidden Markov Models are a modeling and statistical problem, and in some ways an inverse problem. That accounts for calling these Hidden Markov Models and not considering them from the point of view of Markov processes.

Vocabulary

1. In a Markov chain process, if each state emits a random signal or observation from a set of possible signals while the process states themselves are unobservable, then we say the process is a hidden Markov chain model.

2. A standard mathematical example of a general Hidden Markov Model is an urn and ball model.

Mathematical Ideas

Toy Examples of Hidden Markov Models

A Variable Factory

A production process in a factory is either in a good state (call it state 0) or in a poor state (state 1). If the process is in state 0 during some period then, independent of the past, with probability 0.9 it will be in state 0 during the next period and with probability 0.1 it will be in state 1. Once in state 1, it remains in that state forever. Suppose the factory produces a single item in each period and that each item produced when the process is in state 0 is of acceptable quality with probability 0.99, while each item produced when the process is in state 1 is of acceptable quality with probability 0.96.

If the status, either acceptable or unacceptable, of each successive item is observable, while the process states are unobservable, then we say the process is a hidden Markov chain model. The state of the factory is 0 or 1 and the signal is the quality of the item produced, with value either a or u, depending on whether the item is acceptable or unacceptable. The transition probabilities of the underlying Markov chain are

          0     1
A =  0 ( 0.9   0.1 )
     1 ( 0     1   )

The signal probabilities are

P[a | 0] = 0.99,  P[u | 0] = 0.01,
P[a | 1] = 0.96,  P[u | 1] = 0.04,

so the emission matrix is

          a      u
B =  0 ( 0.99   0.01 )
     1 ( 0.96   0.04 )

Suppose that the probability of starting in state 0 is 0.8, so π = [0.8, 0.2]. Suppose a sequence of three observed articles is (a, u, a). Then for each possible state sequence, the probability of that state sequence together with the corresponding observed sequence is in Table 1. Notice that, even apart from the 0 entries, some products share common factors that can combine to make the calculations more efficient.

State Sequence   Conditional Probability Product        Probability
000              (0.8)(0.99)(0.9)(0.01)(0.9)(0.99)      0.00635
001              (0.8)(0.99)(0.9)(0.01)(0.1)(0.96)      0.00068
010              (0.8)(0.99)(0.1)(0.04)(0.0)(0.99)      0
011              (0.8)(0.99)(0.1)(0.04)(1.0)(0.96)      0.00304
100              (0.2)(0.96)(0.0)(0.01)(0.9)(0.99)      0
101              (0.2)(0.96)(0.0)(0.01)(0.1)(0.96)      0
110              (0.2)(0.96)(1.0)(0.04)(0.0)(0.99)      0
111              (0.2)(0.96)(1.0)(0.04)(1.0)(0.96)      0.00737

Table 1: State sequence conditional probability products and the resulting probability of the state sequence together with the observed sequence (a, u, a).
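The calculation in Table 1 is small enough to check directly by brute force. The following short Python sketch is an illustration of my own, not part of the original notes; names such as joint_prob are my own choices. It enumerates all state sequences of length 3 and multiplies out the initial, transition, and emission probabilities for the observed sequence (a, u, a).

from itertools import product

# Variable factory model: states 0 (good) and 1 (poor)
A  = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.0, (1, 1): 1.0}              # transition probabilities
B  = {(0, 'a'): 0.99, (0, 'u'): 0.01, (1, 'a'): 0.96, (1, 'u'): 0.04}  # emission probabilities
pi = {0: 0.8, 1: 0.2}                                                   # initial distribution

obs = ['a', 'u', 'a']

def joint_prob(states, obs):
    """Probability of the state sequence together with the observations."""
    p = pi[states[0]] * B[(states[0], obs[0])]
    for t in range(1, len(obs)):
        p *= A[(states[t-1], states[t])] * B[(states[t], obs[t])]
    return p

total = 0.0
for states in product([0, 1], repeat=len(obs)):
    p = joint_prob(states, obs)
    total += p
    print(''.join(map(str, states)), round(p, 6))

print("P[O] =", round(total, 6))   # sum over all state sequences

Summing the final column over all eight state sequences gives the total probability of observing (a, u, a), which previews the scoring problem mentioned later in these notes.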

A Paleontological Temperature Model

We want to determine the average annual temperature at a particular place on earth over a sequence of years in the distant past. For simplicity, we consider that there were only two annual average temperatures, hot and cold. Suppose that modern evidence indicates the probability of a hot year followed by another hot year is 0.7 and the probability of a cold year followed by another cold year is 0.6, independent of the temperature in prior years. Assume that these probabilities held in the distant past as well. A probability transition matrix summarizing the information is

          H     C
A =  H ( 0.7   0.3 )
     C ( 0.4   0.6 )

Also suppose that current research indicates a correlation between the size of tree growth rings and temperature. Again for simplicity, we consider only three different tree ring sizes: small, designated S; medium, designated M; and large, designated L, the observable signal of the average annual temperature. Suppose that, based on available evidence, the probabilistic relationship between annual temperature and tree ring sizes is

          S     M     L
B =  H ( 0.1   0.4   0.5 )
     C ( 0.7   0.2   0.1 )

For this system, the state is the annual average temperature, either H or C. The transition from one state to another is a Markov chain. However, these are hidden states, since we cannot directly observe the temperature in the past. Although we cannot observe the state or temperature in the past, we can observe the size of tree rings. From this evidence, we would like to determine the most likely temperature state in past years.

The occasionally cheating casino

In a hypothetical dishonest casino, the casino uses a fair die most of the time, but occasionally the casino secretly switches to a loaded die, and later the casino switches back to the fair die. A probabilistic process determines the switching back and forth from loaded die to fair die and back again after each toss of the die, with the switch from fair to loaded occurring with probability 0.05 and from loaded to fair with probability 0.1. In addition, assume that the loaded die comes up six with probability 0.5 and each of the remaining five numbers with probability 0.1. The transition matrix is

          F      L
A =  F ( 0.95   0.05 )
     L ( 0.1    0.9  )

and the emission probability matrix is

          1      2      3      4      5      6
B =  F ( 1/6    1/6    1/6    1/6    1/6    1/6 )
     L ( 1/10   1/10   1/10   1/10   1/10   1/2 )

If you can see only the sequence of rolls (the sequence of observations or signals) you do not know which rolls used a loaded die and which used a fair die, because the casino hides the state. This is an example of a Hidden Markov Model.
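To make the hidden/observed distinction concrete, here is a minimal Python simulation sketch of the dishonest casino. This is my own illustration, not part of the original notes; it uses the transition and emission values above. It generates a hidden state sequence and the corresponding rolls; a player sees only the second line of output.

from random import choices

# Occasionally cheating casino: hidden state switches F <-> L after each toss;
# only the rolls are observable.
A = {'F': {'F': 0.95, 'L': 0.05},
     'L': {'F': 0.10, 'L': 0.90}}                 # transition probabilities
B = {'F': [1/6] * 6,                              # fair die
     'L': [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]}         # loaded die favors six

def simulate(n, start='F'):
    """Return (hidden states, observed rolls) for n tosses."""
    states, rolls = [], []
    state = start
    for _ in range(n):
        states.append(state)
        rolls.append(choices([1, 2, 3, 4, 5, 6], weights=B[state])[0])
        state = choices(['F', 'L'], weights=[A[state]['F'], A[state]['L']])[0]
    return states, rolls

states, rolls = simulate(30)
print(''.join(states))                   # the hidden sequence the casino knows
print(''.join(str(r) for r in rolls))    # the signal sequence a player sees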

Standard Mathematical Examples

Urn and ball model

A standard general mathematical example is an urn and ball model. There are N urns, each filled with colored balls, with M possible colors for the balls. Generate the observation sequence by initially choosing one of the N urns randomly according to an initial probability distribution, randomly selecting a ball, recording its color, replacing the ball, and then choosing a new urn according to a transition probability distribution associated with the current urn. At each time, the signal or observation is the color of the selected ball. The hidden states are the urns.

Coin Flip Models

Consider the following coin tossing experiment: You are in a room with a curtain through which you cannot see what is happening. On the other side of the curtain a person is tossing a coin, or maybe one of several coins. The other person will not tell us exactly what is happening, only the result of each coin flip. This is a sequence of hidden coin flips. Thus we can only use the results of the coin tosses, say O = HHHTTHH...T, with H for heads and T for tails.

Take first the case that the proportions of heads and tails are equal, statistically speaking, without any obvious patterns or organization to the occurrences of heads and tails. How do we build a Hidden Markov Model to best explain the observed sequence of heads and tails? Figure 2 shows the first possible model. This simplest model, the one-fair-coin model, has two states, each state directly associated with heads or tails. The probability of being in the state generating a head is 0.5, and equally for being in the state generating a tail. This model is not truly hidden because each observation directly defines the state. This is a degenerate example of a hidden Markov model, exactly the same as the classic stochastic process of repeated Bernoulli trials.

Figure 1: Schematic diagram of an urn and ball model with N = 3 urns and M = 6 colors. (The diagram shows Urn 0, Urn 1, and Urn 2, with transition probabilities a_ij among them and a sequence of observed ball colors.)

Figure 2: State diagram for the one-coin model. (One state emits H with probability 1, the other emits T with probability 1.)

A second possible Hidden Markov Model for the observations is a two-fair-coin model, see Figure 3. Associate each state with a fair coin, so the probability of generating a head in each state is p = 0.5. In this special case, called the two-fair-coins model, the probabilities associated with remaining in or leaving each of the two states form a probability transition matrix whose entries are unimportant, because the observable sequences from the two-fair-coins model are statistically indistinguishable in each of the states. That means the two-fair-coin model is indistinguishable from the one-fair-coin model in a statistical sense, and so this is another degenerate example of a Hidden Markov Model.

Other Hidden Markov Models can also account for an observed sequence with equal proportions of heads and tails. Take the two-compensating-biased-coins model as a model of what happens behind the curtain. The model has two different states corresponding to two different coins. In one state, the coin is biased toward heads, say with P[H] = p > 0.5. In the other state, the coin is biased toward tails, with P[H] = 1 - p < 0.5. The state transition probabilities all equal 0.5. See Figure 4. This could be accomplished by the person behind the curtain having two biased coins and a third fair coin, with the biased coins associated to the faces of the fair coin respectively. The person behind the barrier flips the fair coin to decide which biased coin to use and then flips the chosen biased coin to generate the observed outcome. Note that with the complementary biased coin probabilities indicated in Figure 4, the long term averages of heads or tails would be statistically indistinguishable from either the one-fair-coin model or the two-fair-coin model.
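A quick simulation makes this indistinguishability of first-order statistics plausible. The following Python sketch is my own illustration, with the bias value p = 0.7 an arbitrary choice. It estimates the long-run fraction of heads under the one-fair-coin model and under the two-compensating-biased-coins model with all switching probabilities equal to 0.5; both estimates come out near 0.5 even though the two models differ.

from random import random

def fair_coin_heads(n):
    """Fraction of heads in n flips of a single fair coin."""
    return sum(random() < 0.5 for _ in range(n)) / n

def compensating_coins_heads(n, p=0.7):
    """Fraction of heads from the two-compensating-biased-coins model:
    each flip, the hidden state is 0 or 1 with probability 1/2 (all
    transition probabilities equal 0.5), then the coin for that state
    is flipped, with P[H] = p in state 0 and P[H] = 1 - p in state 1."""
    heads = 0
    for _ in range(n):
        bias = p if random() < 0.5 else 1 - p
        heads += random() < bias
    return heads / n

n = 100_000
print(fair_coin_heads(n), compensating_coins_heads(n))   # both near 0.5

Higher-order statistics, such as the distribution of run lengths, are what can separate biased multi-coin models from the fair-coin model, as discussed below.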

Figure 3: Two-fair-coin model. (Both states emit H and T with probability 0.5; the arrows are labeled with the transition probabilities a_00, 1 - a_00, a_11, and 1 - a_11.)

It is clear that other more complicated models with three or more coins could also be constructed. In this special case, where the proportions of heads and tails are equal statistically speaking, without any obvious patterns or organization to the occurrences of heads and tails, there would have to be some compelling physical reason to choose a multiple-hidden-state model over the simpler and equivalent one-fair-coin model.

Now in another direction, imagine that the observed sequence is a very long sequence of heads, followed by another long sequence of tails of some random length, interrupted by a head, followed again by yet another long random sequence of tails. Then we might use a two-biased-coins model, with two biased coins and biased switching between the states of the coins, as a possible model for the observed sequence. Of course, such a sequence of many heads followed by many tails could conceivably come from a fair coin. The choice between a one-fair-coin model and a two-biased-coins model would be justified by the likelihoods of the observations under the models, or possibly by other external modeling considerations. However, other higher-order statistics of the two-biased-coins model, such as the probability of runs of heads, should be distinguishable from the one-fair-coin or the two-fair-coin model.

Another Hidden Markov Model could be the three-biased-coins model. In the first state, the coin is slightly biased toward heads; in the second state, the coin is slightly biased toward tails; in the third state, the coin has some other distribution, maybe fair, maybe biased in some direction.

Figure 4: Two-compensating-biased-coins model. (State 0 emits H with probability p and T with probability 1 - p; state 1 emits H with probability 1 - p and T with probability p; all transition probabilities equal 0.5.)

Figure 5: Two-biased-coins model. (State 0 emits H with probability p_0 and T with probability 1 - p_0; state 1 emits H with probability p_1 and T with probability 1 - p_1; the arrows are labeled with the transition probabilities a_00, 1 - a_00, a_11, and 1 - a_11.)

Figure 6: Three-coin model. (State i emits H with probability p_i and T with probability 1 - p_i, for i = 0, 1, 2; transition probabilities a_ij connect the three states.)

A Markov chain determines transition probabilities among the three states. See Figure 6. The sequence of observations depends on the biases and the transition probabilities. The simple statistics and higher-order statistics of the observations would be correspondingly influenced and would suggest the appropriateness of this choice of model.

Several important points emerge from the possibility of different models for the observed outputs of the coin tossing experiment behind the curtain. The first is that there is no mathematical reason to stop at 1, 2, or even 3 coin models. An important part of the modeling process is to decide on the number of states N for the model. Without some a priori information, this choice is often difficult to make and may involve physical intuition or even trial and error before settling on the appropriate model size.

Another important point is the length of the observation sequence. With too short an observation sequence, we may not be able to understand the number or kind of states. With insufficient data, some Hidden Markov Models may not be statistically distinguishable. The statistics of the observation sequence will also guide the choice of a model, as in the runs of heads and tails suggesting a two-biased-coins model over a one-fair-coin model.

A third point is the optimal estimation of the model parameters from the observations, that is, the probabilities of heads and tails in each state and the transition probabilities between states. The choice of what optimal means is a mathematical modeling choice in the broad sense. After making the choice of optimal, estimation of the parameters becomes a statistical problem.

Finally, this emphasizes the title of the subject as Hidden Markov Models. If the Hidden Markov Model is completely specified, then one might as well make a larger-state ordinary Markov chain from it. For instance, the one-coin model, when completely specified, would simply be the Markov chain of standard Bernoulli sequences, well studied in classic probability theory. The completely specified two-coin model, whether fair, compensating, or biased, easily becomes a four-state Markov chain. The three-coin model, again whether fair or biased, makes a 6-state Markov chain.
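To see concretely how a completely specified two-coin model becomes a four-state ordinary Markov chain, here is a small Python sketch of my own, with made-up parameter values rather than values from the notes. The enlarged chain has states (coin, face), and its transition probability from (i, x) to (j, y) is a_ij * b_j(y), which does not depend on the current face x.

import numpy as np

# A fully specified two-coin HMM (illustrative values only):
A = np.array([[0.8, 0.2],
              [0.3, 0.7]])         # a_ij = P[next coin j | current coin i]
p = np.array([0.9, 0.2])           # P[H | coin i]
B = np.column_stack([p, 1 - p])    # emission matrix, columns H, T

# Joint chain on the 4 states (coin, face) with face in {H, T}:
# P[(j, y) | (i, x)] = a_ij * b_j(y), independent of the current face x.
states = [(i, f) for i in (0, 1) for f in "HT"]
P = np.zeros((4, 4))
for r, (i, _) in enumerate(states):
    for c, (j, f) in enumerate(states):
        P[r, c] = A[i, j] * B[j, 0 if f == "H" else 1]

print(states)
print(P)                           # each row sums to 1: an ordinary Markov chain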

In all situations, the classic results of Markov chains would apply to predict long term averages and stationary distributions, rates of convergence to stationarity, and other consequences. Here, the problem is that we have only the observations or signals, not all the necessary information. From that, we wish to best determine the underlying states and probabilities. The word best indicates that we must consider possible measures of optimality. So this is a modeling and statistical problem, and in some ways an inverse problem. That accounts for calling these Hidden Markov Models and not considering them from the point of view of Markov chains.

Realistic Hidden Markov Models

CpG Islands

In the human genome the dinucleotide CG (frequently written CpG to distinguish it from the CG base-pair across two strands) is rarer than expected from the independent probabilities of C and G, for reasons of chemistry that transform the C into a T. For biologically important reasons, the chemical transformation is suppressed in short regions of the genome, such as around the promoters or start regions of many genes. In these regions, we see significantly more CpG dinucleotides than elsewhere. Such regions are called CpG islands. The islands are typically a few hundred to a few thousand bases long. Given a short stretch of genomic sequence, how would we decide if it comes from a CpG island or not? Second, given a long piece of sequence, how would we find the CpG islands in it, if any exist? Here, the model has two states, CpG islands and non-islands. In each state, the probabilities of expressing CpG base-pairs are different.
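The first question, deciding whether a short stretch looks like a CpG island, can be illustrated with a very simplified log-odds score. The following Python sketch is my own toy illustration, not the model in the references: it treats each overlapping dinucleotide independently and uses two made-up CG rates; a real treatment would use full Markov chain models over all sixteen dinucleotides, as in Durbin et al.

from math import log

# Toy sketch: compare how often the dinucleotide CG appears with the rate
# assumed inside and outside an island.  Both rates below are illustrative
# assumptions, not measured values.
P_CG_ISLAND = 0.05      # assumed chance a given dinucleotide is CG inside an island
P_CG_BACKGROUND = 0.01  # assumed chance outside an island

def log_odds(seq):
    """Log-likelihood ratio (island vs background), treating each
    overlapping dinucleotide independently as CG or not."""
    score = 0.0
    for i in range(len(seq) - 1):
        is_cg = seq[i:i+2] == "CG"
        p1 = P_CG_ISLAND if is_cg else 1 - P_CG_ISLAND
        p0 = P_CG_BACKGROUND if is_cg else 1 - P_CG_BACKGROUND
        score += log(p1 / p0)
    return score            # positive favors the island model

print(log_odds("ACGCGTACGCGG"))    # CG-rich: positive
print(log_odds("ATTTAAGTCATA"))    # CG-poor: negative

The second question, locating islands within a long sequence, is the decoding problem for the two-state Hidden Markov Model and needs the algorithms developed later in these notes.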

Profile HMMs in Bioinformatics

For a pair (or more) of proteins, an important question is: how are the proteins similar? The goals are to detect and measure overall similarity between protein amino acid sequences, to find proteins with similar functions in different organisms by finding similar subsequences of amino acids, called conserved sequences, and to detect conserved sequences and their evolution. Alignment is the method for answering these questions.

There are two types of alignment. A global alignment is an alignment of the full length of two sequences. A local alignment is an alignment of part of one sequence to part of another sequence. For possibly distantly related sequences, it might be more sensible to make local alignments of subregions of high similarity, not the whole sequence. A sample toy alignment is in Figure 7.

Figure 7: Toy example of gapped alignment of DNA sequences. (Two short DNA sequences are aligned, with gap characters "-" inserted so that matching columns line up.)

Alignment allows amino acid matches and mismatches along columns, with corresponding scores based on chemistry and biology. In order to make alignments, we also allow the introduction of gaps in either of the protein amino acid sequences. Introducing gaps when making alignments adds penalty scores.

A common task in bioinformatics is to obtain a cluster of related sequences, e.g. from a database, and then to align those sequences using multiple sequence alignment algorithms. The clustering reflects the insights of the biology community as to which proteins belong within the same family. The outcome of the clustering process is a set of distinct protein families. This is the first step in most phylogenetic analyses. Heuristic algorithms are generally used to create multiple sequence alignments. There are large databases of proteins and alignments, some created with Hidden Markov Models, some providing Hidden Markov Model data; see below. An example of an actual multiple sequence alignment (MSA) is in Figure 8.

A profile HMM (phmm) is a particular Hidden Markov Model with states, signals, transition matrix, and emission matrix summarizing a multiple sequence alignment. The goal is to use this Hidden Markov Model information about the MSA to align a new query sequence. Profile HMMs have three states for each alignment position, i.e. each column in the multiple sequence alignment. Three outcomes are possible when aligning each residue of the query sequence with the MSA: the query residue may align (match) with the next residue of the MSA; it may correspond to an insertion (new residue) relative to the MSA; or it may correspond to a deletion (a gap) relative to the MSA.

Figure 8: Alignment of acidic ribosomal protein P0 from several organisms.

There are heuristic rules assigning MSA columns as match states; for example, the MSA has a match column if less than half of the characters are gaps. The length of a phmm is the number of columns in the MSA assigned to match states. Each match state in the phmm has its corresponding set of emission probabilities, generated by counting the frequencies of each amino acid in the corresponding column. Insertions, i.e. portions of the query sequence that do not match anything in the multiple alignment, correspond to an insert state. As in the case of the match states, each insert state has its own set of emission probabilities. The insert state emission probabilities are typically generated using the distribution of amino acids over the entire MSA. A delete state is possible for each of the positions in the MSA. The delete state is an example of a silent state in the model, as it does not emit any residues.

Let l denote the number of match locations. Then the associated profile HMM has 3l + 3 states in the underlying Markov process, namely: a start state S; an end state E; l match states M_1, ..., M_l; l delete states D_1, ..., D_l; and l + 1 insert states I_0, ..., I_l.

Figure 9: Schematic diagram of the transitions in a profile HMM.
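The construction of match states and their emission probabilities can be sketched in a few lines of Python. This is my own toy illustration on a made-up four-sequence alignment, not code from the notes; it applies the less-than-half-gaps rule to pick match columns and then turns column counts into emission probabilities.

from collections import Counter

msa = ["ACQK-L",
       "AC-KYL",
       "GCQK-L",
       "ACQRYL"]            # a made-up toy alignment, one string per sequence

ncols = len(msa[0])
# Match columns: columns with fewer than half of their characters equal to '-'
match_columns = [j for j in range(ncols)
                 if sum(seq[j] == '-' for seq in msa) < len(msa) / 2]

# Match-state emission probabilities from column residue frequencies
emissions = []
for j in match_columns:
    residues = [seq[j] for seq in msa if seq[j] != '-']
    counts = Counter(residues)
    total = sum(counts.values())
    emissions.append({r: c / total for r, c in counts.items()})

print("match columns:", match_columns)
for j, e in zip(match_columns, emissions):
    print(j, e)

In practice, pseudocounts are added to these frequencies so that residues unseen in a column do not receive probability zero.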

Thus, a phmm typically has many more states than the previous examples of Hidden Markov Models, which have only a handful of states. The connections between the states are highly structured and more complicated than in the other examples. The set of emissions, 20 or fewer, is about the same order of magnitude as in the other examples.

A typical application of a profile HMM is the following. Start with a collection of protein families (clusters) F_1, ..., F_k, where all proteins within a family have the same length after assigning gaps as necessary. For each family F_i, construct a corresponding profile HMM, λ(F_i) in the notation of the next section. The objective is to assign a newly sequenced protein to one of the k families. Compute the likelihood P[O | λ(F_i)] of the gap-aligned new protein for each of the k profile HMMs. The new protein is then assigned to the family for which the likelihood is maximum. This is the scoring problem of Hidden Markov Models.

Language Analysis and Translation

Language translation is a classic application of Hidden Markov Models, originating with Cave and Neuwirth; see [4] for history and additional details. Suppose you do not understand English, but you do know something about punctuation. You obtain a large body of English text, such as the Brown Corpus. Henry Kučera and W. Nelson Francis at Brown University compiled The Brown University Standard Corpus of Present-Day American English as a general corpus (text collection) in the field of corpus linguistics. The Brown Corpus contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961. The Brown Corpus is one of many such corpora available through the Natural Language Toolkit, see nltk.org.

With knowledge of Hidden Markov Models, but no knowledge of English, you would like to determine some basic properties of this mysterious writing system. Can you partition the characters into sets so that characters in each set are different in some statistically significant way? First remove all punctuation and numbers and convert all letters to lower case. This leaves 26 distinct letters and the space, for a total of 27 symbols. The observations are the series of characters found in the resulting text. Then test the hypothesis that English text has an underlying Markov chain with two states. For each of these two hidden states, assume that the 27 symbols appear according to a fixed probability distribution. This sets up a Hidden Markov Model with N = 2 and M = 27, where the state transition probabilities and the observation probabilities from each state are unknown.
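A possible preprocessing step, reducing raw text to the 27-symbol observation alphabet, might look like the following Python sketch. This is my own illustration; the regular expressions and the sample sentence are arbitrary choices. The resulting character stream is the observation sequence for the N = 2, M = 27 model.

import re

def to_27_symbols(text):
    """Keep only the 26 lowercase letters and the space, giving 27 symbols."""
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)      # drop punctuation, digits, etc.
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

sample = "Now is the time for all good men, 1961-style, to come to the aid..."
obs = to_27_symbols(sample)
print(obs)            # a string over the 27-symbol alphabet
print(len(set(obs)))  # at most 27 distinct symbols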

Results of a case study [4], using the first 50,000 observations from the Brown Corpus of letters converted to lower case and the space, are in Table 2. Without making any assumption about the nature of the two states, the results sort into two familiar categories! The probabilities tell us that one hidden state contains the vowels while the other hidden state contains the consonants. Interestingly, space is more like a vowel and y is a consonant. The Hidden Markov Model deduces the statistically significant difference between vowels and consonants without knowing anything about the English language. Cave and Neuwirth obtain further results by allowing more than two hidden states. They are able to obtain and sensibly interpret the results for models with up to 12 hidden states. This example has further applications to automatic language translation.

This example suggests Hidden Markov Models may be applicable to cryptanalysis. In fact, Hidden Markov Models have been applied to secret messages such as Hamptonese, the Voynich Manuscript, and the Kryptos sculpture at CIA headquarters, but without too much success [4]. Partly the reasons for success or failure depend on the quality of the transcriptions, and partly on the assumption that the cipher text is a plaintext in an unknown language, and not steganography, or even just babbling (glossolalia).

Speech Recognition

A classic example and practical application of Hidden Markov Models is speech recognition, especially isolated word recognition. Hidden Markov Models were developed in the 1960s and 1970s for satellite communication. Andrew Viterbi made a key contribution to the theory in 1967. They were later adapted for language analysis, translation, and speech recognition in the 1970s and 1980s by Bell Labs and IBM [2]. Interest in HMMs for speech recognition seems to have peaked in the late 1980s.

Speech recognition takes place in several steps:

1. Feature analysis: a spectral or temporal analysis of the speech signal to decompose the continuous sound sample into discrete observations of speech sounds for the Hidden Markov Model.

Table 2: Emission probabilities of the 26 letters and the space from the two hidden states, from [4].

2. Unit matching: the speech signal parts are matched to words or phonemes with a Hidden Markov Model.

3. Lexical analysis: if the units are phonemes, combine the units into recognized words with either a deterministic or a probabilistic finite state network. If the units are words, this step can generally be eliminated.

4. Syntactic analysis: with a grammar, group words into proper sequences. For single words like yes or no, or for digit sequences, this step is minimal or completely eliminated.

5. Semantic analysis: interpret the words or word sequences for the task model.

Concentrating on the second step of unit matching, assume we have a vocabulary of V words to recognize. We have a training set of L tokens of each word. We also have an independent observation set. We use the observations from the set of L tokens to estimate the optimum parameters for each word, creating the model λ_v for the vth vocabulary word, 1 ≤ v ≤ V. For each unknown word with observation sequence O = O_0 O_1 ... O_{T-1}, and for each word model λ_v, we calculate P_v = P[O | λ_v]. We choose the word whose model probability is highest,

v* = argmax_{1 ≤ v ≤ V} P_v.

For example, we could train a Hidden Markov Model, say λ_0, to recognize the spoken word no and train another Hidden Markov Model, say λ_1, to recognize the spoken word yes. (This is the step we will later call the solution to Problem 3.) Then given an unknown spoken word, we can use the Hidden Markov Models to score this word against λ_0 and also against λ_1 to decide whether the spoken word is more likely no, yes, or neither. (This is the problem we will later call Problem 1.) Notice that this application does not uncover the hidden states (which we will later call Problem 2), but such information might provide additional insight into the underlying speech model.

The Hidden Markov Model approach to speech recognition is very efficient. For isolated word recognition with the Viterbi Algorithm, a vocabulary of V = 100 words with an N = 5 state model and 40 observations takes about 10^5 computations (additions/multiplications) for a single word recognition, since each word requires on the order of N^2 T = 25 * 40 = 1000 operations and there are V = 100 words to score.
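The decision rule v* = argmax_v P[O | λ_v] can be sketched directly. The Python below is my own illustration with made-up, untrained two-state models for yes and no over a three-symbol observation alphabet; the helper log_likelihood is the scaled forward recursion, which these notes develop later, included here only so the sketch runs.

import numpy as np

def log_likelihood(obs, pi, A, B):
    """log P[O | lambda] for a discrete HMM by the scaled forward recursion."""
    alpha = pi * B[:, obs[0]]
    log_p = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_p += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return log_p

# Hypothetical word models (parameters are arbitrary stand-ins, not trained):
model_yes = (np.array([0.6, 0.4]),
             np.array([[0.7, 0.3], [0.4, 0.6]]),
             np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]))
model_no  = (np.array([0.5, 0.5]),
             np.array([[0.6, 0.4], [0.5, 0.5]]),
             np.array([[0.2, 0.2, 0.6], [0.6, 0.3, 0.1]]))
models = {"yes": model_yes, "no": model_no}

obs = [0, 1, 2, 2, 1]    # a toy observation sequence of discrete symbols
scores = {w: log_likelihood(obs, *m) for w, m in models.items()}
print(max(scores, key=scores.get), scores)   # pick the highest-likelihood word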

It is hard to determine what current (2017) speech recognition applications are based on. Explanations are clouded with buzzwords and hype, with no theory. Common buzzwords surrounding current (2017) speech recognition are artificial intelligence, machine learning, neural networks, and deep learning, but there does not seem to be a connection to Hidden Markov Models.

Sources

The variable factory example is adapted from Sheldon M. Ross, Introduction to Probability Models, Section 4.11, Academic Press, 2006, 9th Edition. The paleontological temperature model is adapted from A Revealing Introduction to Hidden Markov Models, by Mark Stamp. The cheating casino and CpG Islands examples are adapted from Biological Sequence Analysis, by R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Chapter 3. The urn and ball model is adapted from An Introduction to Hidden Markov Models, by L. R. Rabiner and B. H. Juang. The language analysis example is adapted from A Revealing Introduction to Hidden Markov Models, by Mark Stamp. The speech recognition example is adapted from A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, by L. R. Rabiner.

Algorithms, Scripts, Simulations

Algorithm

Scripts

Problems to Work for Understanding

Reading Suggestion:

References

[1] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pages 4-16, January 1986. hidden markov models.

[2] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, February 1989. hidden Markov model, speech recognition.

[3] Sheldon M. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2006.

[4] Mark Stamp. A revealing introduction to hidden markov models. stamp/rua/hmm.pdf, December.



More information

Pattern Recognition with Hidden Markov Modells

Pattern Recognition with Hidden Markov Modells Pattern Recognition with Hidden Markov Modells Dynamic Programming at its Best Univ. Doz. Dr. Stefan Wegenkittl Fachhochschule Salzburg, Studiengang Informationstechnik & System-Management Stochastic Pattern

More information

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way

Designing Information Devices and Systems I Fall 2018 Lecture Notes Note Introduction to Linear Algebra the EECS Way EECS 16A Designing Information Devices and Systems I Fall 018 Lecture Notes Note 1 1.1 Introduction to Linear Algebra the EECS Way In this note, we will teach the basics of linear algebra and relate it

More information

Part 3: Parametric Models

Part 3: Parametric Models Part 3: Parametric Models Matthew Sperrin and Juhyun Park August 19, 2008 1 Introduction There are three main objectives to this section: 1. To introduce the concepts of probability and random variables.

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Hidden Markov Models. Three classic HMM problems

Hidden Markov Models. Three classic HMM problems An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Hidden Markov Models Slides revised and adapted to Computational Biology IST 2015/2016 Ana Teresa Freitas Three classic HMM problems

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

N-gram Language Modeling Tutorial

N-gram Language Modeling Tutorial N-gram Language Modeling Tutorial Dustin Hillard and Sarah Petersen Lecture notes courtesy of Prof. Mari Ostendorf Outline: Statistical Language Model (LM) Basics n-gram models Class LMs Cache LMs Mixtures

More information

Discrete Finite Probability Probability 1

Discrete Finite Probability Probability 1 Discrete Finite Probability Probability 1 In these notes, I will consider only the finite discrete case. That is, in every situation the possible outcomes are all distinct cases, which can be modeled by

More information

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics 1. Introduction. Jorge L. Ortiz Department of Electrical and Computer Engineering College of Engineering University of Puerto

More information

CS711008Z Algorithm Design and Analysis

CS711008Z Algorithm Design and Analysis .. Lecture 6. Hidden Markov model and Viterbi s decoding algorithm Institute of Computing Technology Chinese Academy of Sciences, Beijing, China . Outline The occasionally dishonest casino: an example

More information