Steven R. Dunbar
Department of Mathematics
203 Avery Hall
University of Nebraska-Lincoln
Lincoln, NE 68588-0130
http://www.math.unl.edu
Voice: 402-472-3731
Fax: 402-472-8466

Topics in Probability Theory and Stochastic Processes
Steven R. Dunbar

Notation and Problems of Hidden Markov Models

Rating: Mathematically Mature: may contain mathematics beyond calculus with proofs.
Section Starter Question

What is meant by an ergodic Markov chain?

Key Concepts

1. For Hidden Markov Models, the first problem is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produced the observed sequence? We can also view this problem as: how do we score or evaluate the model?

2. Given the model λ = (A, B, π) and a sequence of observations O, find an optimal state sequence, after defining optimal in some way for the model. In other words, we want to uncover the hidden part of the Hidden Markov Model.

3. The solution of Problem 3 attempts to optimize the model parameters to best describe how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence, since it trains the model. This training problem is the crucial one for most applications of Hidden Markov Models, since it creates the best models for real phenomena.

4. We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, and the choice of criterion determines the algorithm and the consequent revealed state sequence.
Vocabulary

1. The evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produces the observed sequence?

2. The estimation problem or discovery problem is the attempt to uncover the hidden state sequence.

3. The observed sequence used to solve Problem 3 is a training sequence, since it trains the model.

Mathematical Ideas

Elements of the Hidden Markov Model

1. The model has a finite number, say N, of states. Each state emits one of several possible signals having measurable, distinctive properties. In many practical situations, the states have a physical significance.

2. At each discrete clock time t, the model enters a new state, or possibly remains in the current state. The entered state depends on a transition probability distribution based on the previous state through the Markov chain property, that is, independently of earlier Markov chain states and signals. Often the Markov chain is ergodic, so that all states can connect to any other state. In some applications, other interconnections of states, e.g. increasing or left-to-right, are of interest.

3. After each transition, the model emits one of M observation output symbols or signals according to a probability distribution depending on the current state. Each state has a fixed probability distribution for its signals, regardless of how and when the chain enters the state. There are thus N such observation probability distributions. Sometimes we say M is the alphabet size.
The observation symbols correspond to the physical output of the modeled system.

Alternative names for Hidden Markov Models, especially in communications engineering and signal processing, are Markov sources or probabilistic functions of Markov chains, emphasizing the emission of observation signals from the states.

Example. The urn-and-ball model is a standard example of the previous elements. The model has N urns, each filled with a large number of colored balls. Each ball is one of M possible colors. The urn-and-ball model generates an observation sequence by choosing one of the N urns according to an initial probability distribution, selecting a ball from the initial urn, recording the color, replacing the ball, and then choosing a new urn according to the transition probability distribution associated with the current urn.

A Hidden Markov Model is a doubly stochastic process with an underlying stochastic process that is not observable but can only be glimpsed through another stochastic process that produces the sequence of observations. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by a Hidden Markov Model gives some information about the sequence of states.

The usual notation for a discrete-time, discrete-observation Hidden Markov Model is

T = length of the observation sequence,
N = number of states in the model,
M = number of observation symbols,
Q = {q_0, q_1, ..., q_{N-1}} = distinct states of the Markov process,
X = (x_0, x_1, ..., x_{T-1}) = sequence of states visited by the Markov chain,
V = {0, 1, 2, ..., M-1} = set of possible observations,
A = (a_{ij}), where a_{ij} = P[q_j | q_i], the state transition probability matrix,
B = (b_{ij}) = (b_i(j)), where b_i(j) = P[j | q_i], the observation probability matrix,
π = (π_j) = initial state distribution at time 0,
O = (O_0, O_1, ..., O_{T-1}) = the observation sequence.

Remark. The choice of indexing beginning at 0 is arbitrary. It is chosen here to conform with Stamp [4]. The choice of indexing beginning at 0 is consistent with the initial value of dynamical systems at time 0. It is also convenient for programming languages such as Python, Perl, and C that begin indexing at 0.
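In code, the model is just three arrays. The following Python sketch collects the triple λ = (A, B, π) and checks the row-stochastic conditions; the numerical values are made-up placeholders, not values from any example in this section.

    import numpy as np

    # A minimal container for a discrete HMM, lambda = (A, B, pi).
    # N = 2 states and M = 3 observation symbols; the numbers are
    # placeholders for illustration only.
    A = np.array([[0.7, 0.3],        # a_ij = P[q_j at t+1 | q_i at t]
                  [0.4, 0.6]])       # N x N, each row sums to 1
    B = np.array([[0.1, 0.4, 0.5],   # b_i(j) = P[observation j | q_i]
                  [0.7, 0.2, 0.1]])  # N x M, each row sums to 1
    pi = np.array([0.6, 0.4])        # initial state distribution at time 0

    # Row-stochastic sanity checks.
    assert np.allclose(A.sum(axis=1), 1.0)
    assert np.allclose(B.sum(axis=1), 1.0)
    assert np.isclose(pi.sum(), 1.0)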
Using the model, the process for generating an observation sequence O = (O_0, O_1, ..., O_{T-1}) is:

1. Set t = 0.
2. Choose an initial state x_t in {q_0, ..., q_{N-1}} according to the initial state distribution π.
3. Choose O_t according to b_i(j), the symbol probability distribution in state i = x_t.
4. Set t = t + 1.
5. Choose a new state x_t in {q_0, ..., q_{N-1}} according to the probability transition matrix A.
6. Return to step 3 if t < T; otherwise terminate the process.

    Markov process:   x_0 --A--> x_1 --A--> x_2 --A--> ... --A--> x_{T-1}
                       |          |          |                     |
                       B          B          B                     B
                       v          v          v                     v
    Observations:     O_0        O_1        O_2       ...         O_{T-1}

Figure 1: Schematic diagram of a Hidden Markov Model

The independence assumptions in this notation are

P[O_n = j | x_n = q_i] = b_i(j),
P[O_n = j | x_0, O_0, ..., x_{n-1}, O_{n-1}, x_n = q_i] = b_i(j).

The matrix A is an N x N row-stochastic matrix with probabilities a_{ij} independent of t, so we are assuming a stationary Markov process. The interpretation of A is

a_{ij} = P[state q_j at t+1 | state q_i at t].
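As an illustration, the six-step generation procedure translates nearly line for line into Python. This is a minimal sketch, assuming the placeholder arrays A, B, and pi from the previous sketch; the weighted sampling uses numpy's Generator.choice.

    import numpy as np

    rng = np.random.default_rng(2025)   # seeded for reproducibility

    def generate(A, B, pi, T):
        """Steps 1-6 above: return (states, observations) of length T."""
        N, M = B.shape
        states, observations = [], []
        x = rng.choice(N, p=pi)                        # step 2: initial state from pi
        for t in range(T):
            states.append(x)
            observations.append(rng.choice(M, p=B[x])) # step 3: emit O_t from b_x(.)
            x = rng.choice(N, p=A[x])                  # steps 4-5: transition via row x of A
        return states, observations

    # Using the placeholder arrays from the previous sketch:
    # X, O = generate(A, B, pi, T=10)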
The standard notation in Hidden Markov Models for the N x M matrix B is

b_i(j) = P[observation j at t | state q_i at t].

As with A, the matrix B is row stochastic and the probabilities are independent of t. The triple λ = (A, B, π) defines a Hidden Markov Model.

Although O = (O_0, O_1, O_2, ...) is not a Markov chain (explain why!), conditional on the current state x_n, the sequence O_n, x_{n+1}, O_{n+1}, ... of future signals and states is independent of the sequence x_0, O_0, ..., x_{n-1}, O_{n-1}.

As an example, for a state sequence of length 4 with corresponding observations

X = (x_0, x_1, x_2, x_3)
O = (O_0, O_1, O_2, O_3),

let π_{x_0} be the probability of starting in state x_0, b_{x_0}(O_0) the probability of initially observing O_0, and a_{x_0,x_1} the probability of transitioning from state x_0 to state x_1. Continuing and multiplying the conditional probabilities, the probability of this state sequence with these observations is

P[X, O] = π_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) a_{x_1,x_2} b_{x_2}(O_2) a_{x_2,x_3} b_{x_3}(O_3).

The three problems

1. Given the model λ = (A, B, π) and a sequence of observations O, find P[O | λ]. That is, we want to determine the likelihood of the observed sequence O, given the model.

Problem 1 is the evaluation problem: given a model and a sequence of observations, how can we compute the probability that the model produces the observed sequence? We can also view the problem as: how do we score or evaluate the model? If we think of the case in which we have several competing models, the solution of Problem 1 allows us to choose the model that best matches the observations.

2. Given the model λ = (A, B, π) and a sequence of observations O, find an optimal state sequence, after defining optimal in some way for the model. In other words, we want to uncover the hidden part of the Hidden Markov Model.
Problem 2 is the estimation problem or discovery problem, in which we attempt to uncover the hidden state sequence. We usually use an optimality criterion to discriminate which sequence best matches the observations. Two optimality criteria are common, and so the choice of criterion strongly influences the revealed state sequence. See the next section for specific examples of the two optimality criteria.

3. Given an observation sequence O and the dimensions N and M, find the model λ = (A, B, π) that maximizes the probability of O. This can be interpreted as training a model to best fit the observed data. We can also view this as a search in the parameter space represented by A, B and π.

The solution of Problem 3 attempts to optimize the model parameters so as best to describe how the observed sequence comes about. The observed sequence used to solve Problem 3 is a training sequence, since it trains the model. This training problem is the crucial one for most applications of Hidden Markov Models, since it creates the best models for real phenomena.

Rabiner [2] credits Jack Ferguson of the Institute for Defense Analysis with introducing the three fundamental problems for Hidden Markov Models.

Remark. The notation P[O | λ] is standard and traditional in the theory and application of Hidden Markov Models. It is similar to notation occurring in statistics with parametric tests, indicating that the model depends on the values of the parameters. The notation P[O | λ] is not strictly related to conditional probabilities in the mathematical probability sense. A mathematician's typical notation would indicate the parameter dependence with subscripts, so we would write P_λ[O] instead. However, the notation P[O | λ] is so strongly embedded in the literature of Hidden Markov Models that I will continue to use it, except occasionally in proofs where mathematical conditional probabilities do occur and the notation would mix meanings.

Example Computation

Consider the variable factory example from the previous section. For brevity, represent the factory good state with q_0 = 0 and the bad state with q_1 = 1. Suppose that for some particular three periods, the observation of factory outputs is a, u, a.
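In Python, the product formula for P[X, O] above becomes a short loop. The sketch below is an illustration, with the Variable Factory parameters read off from the factors appearing in Table 1, and with the observations a and u encoded as 0 and 1.

    import numpy as np

    # Variable Factory parameters, read off from the factors in Table 1:
    # good state 0, bad state 1; observations a -> 0, u -> 1.
    pi = np.array([0.8, 0.2])
    A = np.array([[0.9, 0.1],     # good stays good with probability 0.9
                  [0.0, 1.0]])    # the bad state is absorbing
    B = np.array([[0.99, 0.01],   # b_0(a), b_0(u)
                  [0.96, 0.04]])  # b_1(a), b_1(u)

    def joint_prob(states, obs):
        """P[X, O] = pi_{x_0} b_{x_0}(O_0) a_{x_0,x_1} b_{x_1}(O_1) ..."""
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(states)):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        return p

    a, u = 0, 1
    print(joint_prob([0, 0, 1], [a, u, a]))   # the P[001] calculation below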
We can exhaustively compute the probabilities of this observation sequence from the various state sequences; see Table 1. We want to determine the most likely state sequence of the Markov process over these three periods. We could reasonably define most likely as the state sequence with the highest probability from among all possible state sequences of length three. Later we will use Dynamic Programming to efficiently find this particular sequence. On the other hand, we might reasonably define most likely as the state sequence that maximizes the expected number of correct states. In the next section, we will use the probability calculations associated with Hidden Markov Models to find this sequence. The Dynamic Programming and Hidden Markov Model solutions are not necessarily the same.

As a single example, compute the probability of starting in state 0, expressing observation a, transitioning from state 0 to state 0, expressing observation u, transitioning to state 1, and finally expressing observation a:

P[001] = (0.8)(0.99)(0.9)(0.01)(0.1)(0.96) ≈ 0.000684.

Similarly, we can directly compute the probability of each of the remaining 7 possible state sequences of length 3. The results are in Table 1, with the probabilities in the last column normalized by the total probability of observing (a, u, a) so they sum to 1.

To find the optimal state sequence in the Dynamic Programming sense, choose the sequence with the highest probability, namely 111. To find the optimal sequence in the Hidden Markov Model sense, we choose the most probable symbol at each position. To this end, we sum the probabilities in Table 1 that have a 0 in the first position. Doing so, we find the normalized probability of 0 in the first position is 0.577, and hence the probability of a 1 in the first position is the complementary probability 0.423. The Hidden Markov Model approach therefore chooses the first element of the sequence to be 0. Repeat this calculation for each time step of the sequence, obtaining the probabilities in Table 2. From Table 2, the optimal sequence in the Hidden Markov Model sense is 0, 1, 1. The optimal Dynamic Programming solution differs from the optimal Hidden Markov Model sequence.
    Sequence   Probability                                    Normalized
    000        (0.8)(0.99)(0.9)(0.01)(0.9)(0.99) = 0.006351   0.364
    001        (0.8)(0.99)(0.9)(0.01)(0.1)(0.96) = 0.000684   0.039
    010        (0.8)(0.99)(0.1)(0.04)(0.0)(0.99) = 0.000000   0.000
    011        (0.8)(0.99)(0.1)(0.04)(1.0)(0.96) = 0.003041   0.174
    100        (0.2)(0.96)(0.0)(0.01)(0.9)(0.99) = 0.000000   0.000
    101        (0.2)(0.96)(0.0)(0.01)(0.1)(0.96) = 0.000000   0.000
    110        (0.2)(0.96)(1.0)(0.04)(0.0)(0.99) = 0.000000   0.000
    111        (0.2)(0.96)(1.0)(0.04)(1.0)(0.96) = 0.007373   0.423
    sum                                            0.017449   1.000

Table 1: State sequence probabilities for the Variable Factory with output (a, u, a).

    Period/Obs   0 (a)   1 (u)   2 (a)
    P[0]         0.577   0.403   0.364
    P[1]         0.423   0.597   0.636

Table 2: Hidden Markov Model probabilities.

Sources

This section is adapted from: A Revealing Introduction to Hidden Markov Models by Mark Stamp, December 11, 2015; and An Introduction to Hidden Markov Models by L. R. Rabiner and B. H. Juang. The calculations for the Variable Factory are extended from Sheldon M. Ross, Introduction to Probability Models, Section 4.11, Academic Press, 2006, 9th Edition.

Algorithms, Scripts, Simulations

Algorithm

Scripts
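A brute-force Python script can reproduce Tables 1 and 2. The sketch below, using the Variable Factory parameters read off from Table 1, enumerates all N^T = 8 state sequences, normalizes by the total probability of observing (a, u, a), and prints the per-period state probabilities. Such an exhaustive enumeration is feasible only for tiny models; later sections develop efficient algorithms.

    from itertools import product

    import numpy as np

    pi = np.array([0.8, 0.2])
    A = np.array([[0.9, 0.1], [0.0, 1.0]])
    B = np.array([[0.99, 0.01], [0.96, 0.04]])

    a, u = 0, 1
    obs = (a, u, a)
    N, T = 2, len(obs)

    def joint_prob(states, obs):
        p = pi[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(states)):
            p *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
        return p

    # Table 1: joint probability of every state sequence with the observations,
    # normalized by the total probability of observing (a, u, a).
    probs = {seq: joint_prob(seq, obs) for seq in product(range(N), repeat=T)}
    total = sum(probs.values())          # P[O | lambda]: the Problem 1 answer
    for seq, p in sorted(probs.items()):
        print(seq, round(p, 6), round(p / total, 3))

    # Table 2: probability of each state at each period.
    for t in range(T):
        row = [sum(p for s, p in probs.items() if s[t] == i) / total
               for i in range(N)]
        print(t, [round(r, 3) for r in row])

    # Dynamic Programming sense: the single most probable sequence, 111.
    print(max(probs, key=probs.get))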
Problems to Work for Understanding

1. Verify all calculations in Table 1.

2. Verify all calculations in Table 2.

3. Create similar exhaustive state sequence tables for the example of average annual temperatures with observations of tree-ring growth, with observation sequence S, M, S, L.

Reading Suggestion:

References

[1] L. R. Rabiner and B. H. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pages 4-16, January 1986. hidden markov models.

[2] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2):257-286, February 1989. hidden Markov model, speech recognition.

[3] Sheldon M. Ross. Introduction to Probability Models. Academic Press, 9th edition, 2006.

[4] Mark Stamp. A revealing introduction to hidden markov models. stamp/rua/hmm.pdf, December 11, 2015.