Hidden Markov Models Part 1: Introduction
CSE 6363 Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington
Modeling Sequential Data

Suppose that we have weather data for several days: x_1, x_2, ..., x_N. Each x_n is a binary value: x_n = 1 if it rains on day n, and x_n = 0 if it does not (we call that a "sunny day"). We want to learn a model that predicts whether it is going to rain on a certain day, based on this data. What options do we have? Lots, as usual in machine learning.
Predicting Rain: Assuming Independence

One option is to assume that the weather on any day is independent of the weather on any previous day. Thus:

p(x_n | x_1, ..., x_{n-1}) = p(x_n)

Then, how can we compute p(x_n)? If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to estimate how often it rains:

p(x = 1) = (1/N) Σ_{n=1}^{N} x_n

So, the probability that it rains on any day is simply the fraction of days in the training data on which it rained.
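A minimal sketch of this estimate; the weather sequence below is invented example data, not data from the lecture:

```python
# Under the independence assumption, the maximum-likelihood estimate of
# p(rain) is just the fraction of rainy days in the training data.
rain = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]  # x_n = 1 if it rained on day n

p_rain = sum(rain) / len(rain)
print(p_rain)  # 0.4
```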
Predicting Rain: Assuming Independence

If the weather of the past tells us nothing about the weather of the next day, then we can simply use the data to estimate how often it rains:

p(x = 1) = (1/N) Σ_{n=1}^{N} x_n

So, the probability that it rains on any day is simply the fraction of days in the training data on which it rained.

Advantages of this approach:
- Easy to apply.
- Only one parameter to estimate.

Disadvantages:
- Not using all the information in the data. Past weather does correlate with the weather of the next day.
Predicting Rain: Modeling Dependence

The other extreme is to assume that the weather on any day depends on the weather of the K previous days. Thus, we have to learn the whole conditional distribution:

p(x_n | x_1, ..., x_{n-1}) = p(x_n | x_{n-K}, ..., x_{n-1})

Advantages of this approach:
- Builds a more complex model, which can capture more information about how past weather influences the weather of the next day.

Disadvantages:
- The amount of data needed to reliably learn such a distribution is exponential in K. Even for relatively small values of K, like K = 5, you may need thousands of training examples to learn the probabilities reliably.
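The "exponential in K" claim can be made concrete: with binary weather, the model must estimate one conditional probability for every possible history of the previous K days, and there are 2^K such histories.

```python
# Number of distinct histories (x_n-K, ..., x_n-1) for a binary observation:
# each of the K previous days can be rainy or sunny, giving 2**K histories,
# and hence 2**K conditional probabilities to estimate.
for K in range(6):
    print(f"K = {K}: {2 ** K} histories to estimate probabilities for")
```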
Predicting Rain: Markov Chain

p(x_n | x_1, ..., x_{n-1}) = p(x_n | x_{n-K}, ..., x_{n-1})

This probabilistic model, where an observation depends on the preceding K observations, is called a K-th order Markov chain.
- K = 0 leads to a model that is too simple and inaccurate (the weather of any day does not depend on the weather of the previous days).
- A large value of K may require more training data than we have.
- Choosing a good value of K depends on the application, and on the amount of training data.
Predicting Rain: 1st Order Markov Chain

It is very common to use 1st order Markov chains to model temporal dependencies:

p(x_n | x_1, ..., x_{n-1}) = p(x_n | x_{n-1})

For the rain example, learning this model consists of estimating four values:
- p(x_n = 0 | x_{n-1} = 0): probability of a sunny day after a sunny day.
- p(x_n = 1 | x_{n-1} = 0): probability of a rainy day after a sunny day.
- p(x_n = 0 | x_{n-1} = 1): probability of a sunny day after a rainy day.
- p(x_n = 1 | x_{n-1} = 1): probability of a rainy day after a rainy day.
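These four values can be estimated by counting consecutive-day pairs. A small sketch, using an invented weather sequence:

```python
# Estimate the four transition probabilities of a 1st-order Markov chain
# by counting (previous day, next day) pairs in a binary weather sequence.
from collections import Counter

rain = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]      # invented example data
counts = Counter(zip(rain[:-1], rain[1:]))  # pairs (x_n-1, x_n)

for prev in (0, 1):
    total = counts[(prev, 0)] + counts[(prev, 1)]
    for nxt in (0, 1):
        p = counts[(prev, nxt)] / total
        print(f"p(x_n = {nxt} | x_n-1 = {prev}) = {p:.2f}")
```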
Visualizing a 1st Order Markov Chain

[Figure: two states, "Rainy day" and "Sunny day", with edges labeled p(rain after rain), p(sun after rain), p(rain after sun), p(sun after sun).]

This is called a state transition diagram. There are two states: rain and no rain. There are four transition probabilities, defining the probability of the next state given the previous one.
Hidden States

In our previous example, a state ("rainy day" or "sunny day") is observable: when that day comes, you can observe it and find out whether that day is rainy or sunny. In those cases, the learning problem can be how to predict future states, before we see them.

There are also cases where the states are hidden: we cannot directly observe the value of a state. However, we can observe some features that depend on the state, and those features can help us estimate it. In those cases, the learning problem can be how to figure out the values of the states, given the observations.
Tree Rings and Temperatures

Tree growth rings are visible in a cross-section of the tree trunk. Every year, the tree grows a new ring on the outside. (Image source: Wikipedia.)

Counting the rings can tell us the age of the tree. The width of each ring contains information about the weather conditions that year (temperature, moisture, ...).
Modeling Tree Rings

At this point, we stop worrying about the actual science of how exactly tree ring width correlates with climate. For the sake of illustration, we will make a simple assumption: the tree ring tends to be wider when the average temperature for that year is higher. So, the trunk of a 1,000-year-old tree gives us information about the mean temperature of each of the last 1,000 years.

How do we model that information? We have two sequences:
- A sequence of observations: the ring widths x_1, x_2, ..., x_N.
- A sequence of hidden states: the temperatures z_1, z_2, ..., z_N.

We want to find the most likely sequence of state values z_1, z_2, ..., z_N, given the observations x_1, x_2, ..., x_N.
Modeling Tree Rings

We have two sequences:
- A sequence of observations: the ring widths x_1, x_2, ..., x_N.
- A sequence of hidden states: the temperatures z_1, z_2, ..., z_N.

We want to find the most likely sequence of state values z_1, z_2, ..., z_N, given the observations x_1, x_2, ..., x_N.

Assume that we have training data: other sequences of tree ring widths, for which we know the corresponding temperatures. What can we learn from this training data? One approach is to learn p(z_n | x_n): the probability of the mean temperature z_n for some year, given the ring width x_n for that year. Then, for each z_n, we pick the value maximizing p(z_n | x_n). Can we build a better model than this?
Hidden Markov Model

The previous model simply estimated p(z | x). It ignored the fact that the mean temperature in a year depends on the mean temperature of the previous year. Taking that dependency into account, we can estimate temperatures with better accuracy. We can use the training data to learn a better model, as follows:
- Learn p(x | z): the probability of a tree ring width given the mean temperature for that year.
- Learn p(z_n | z_{n-1}): the probability of the mean temperature for a year given the mean temperature of the previous year.

Such a model is called a Hidden Markov Model.
Hidden Markov Model

A Hidden Markov Model (HMM) is a model of how sequential data evolves. An HMM makes the following assumptions:
- States are hidden.
- States are modeled as a 1st order Markov chain. That is:
  p(z_n | z_1, ..., z_{n-1}) = p(z_n | z_{n-1})
- Observation x_n is conditionally independent of all other states and observations, given the value of state z_n. That is:
  p(x_n | x_1, ..., x_{n-1}, x_{n+1}, ..., x_N, z_1, ..., z_N) = p(x_n | z_n)
Hidden Markov Model

Given the previous assumptions, an HMM consists of:
- A set of states s_1, ..., s_K.

In the tree ring example, the states can be intervals of temperatures. For example, s_k can be the state corresponding to the mean temperature (in Celsius) being in the [k, k+1) interval.
Hidden Markov Model

Given the previous assumptions, an HMM consists of:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k = p(z_1 = s_k).

π_k defines the probability that, when we are given a new sequence of observations x_1, ..., x_N, the initial state z_1 is equal to s_k. For the tree ring example, π_k can be defined as the probability that the mean temperature in the first year falls in the interval corresponding to s_k.
Hidden Markov Model

Given the previous assumptions, an HMM consists of:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k = p(z_1 = s_k).
- A state transition matrix A, of size K × K, where
  A_{k,j} = p(z_n = s_j | z_{n-1} = s_k)

The values A_{k,j} are called transition probabilities. For the tree ring example, A_{k,j} is the conditional probability that the mean temperature for a certain year falls in the j-th interval, if the mean temperature in the previous year falls in the k-th interval.
Hidden Markov Model

Given the previous assumptions, an HMM consists of:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k = p(z_1 = s_k).
- A state transition matrix A, of size K × K, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
- Observation probability functions, also called emission probabilities, defined as:
  φ_k(x) = p(x_n = x | z_n = s_k)

For the tree ring example, φ_k(x) is the probability of getting ring width x in a specific year, if the temperature for that year falls in the k-th interval.
Visualizing the Tree Ring HMM

Assumption: temperature is discretized to four values, so that we have four state values.

[Figure: state transition diagram with four vertices, one per state.] The vertices show the four states. The edges show legal transitions between states.
Visualizing the Tree Ring HMM

The edges show legal transitions between states. Each directed edge has its own probability (not shown here). This is a fully connected model, where any state can follow any other state. An HMM does not have to be fully connected.
Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):
- X is the sequence of observations x_1, ..., x_N.
- Z is the sequence of hidden state values z_1, ..., z_N.

p(X, Z) = p(x_1, ..., x_N, z_1, ..., z_N) = p(z_1, ..., z_N) Π_{n=1}^{N} p(x_n | z_n)

Why? Because of the assumption that x_n is conditionally independent of all other observations and states, given z_n.
Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):

p(X, Z) = p(z_1, ..., z_N) Π_{n=1}^{N} p(x_n | z_n)
        = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

Why? Because states are modeled as a 1st order Markov chain, so that p(z_n | z_1, ..., z_{n-1}) = p(z_n | z_{n-1}).
Joint Probability Model

A fully specified HMM defines a joint probability function p(X, Z):

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

- p(z_1) is computed using the values π_k.
- p(z_n | z_{n-1}) is computed using the transition matrix A.
- p(x_n | z_n) is computed using the observation probabilities φ_k(x).
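The factorization above translates directly into code. A minimal sketch, computed in log space for numerical stability; the values of π, A and φ below are a tiny invented two-state example, not the lecture's tree-ring model:

```python
import math

# p(X, Z) = p(z_1) * prod p(z_n | z_n-1) * prod p(x_n | z_n)
pi = [0.5, 0.5]                  # pi[k]     = p(z_1 = k)
A = [[0.9, 0.1], [0.2, 0.8]]     # A[k][j]   = p(z_n = j | z_n-1 = k)
phi = [[0.7, 0.3], [0.1, 0.9]]   # phi[k][x] = p(x_n = x | z_n = k)

def log_joint(X, Z):
    # log p(X, Z) for discrete observation sequence X and state sequence Z
    lp = math.log(pi[Z[0]]) + math.log(phi[Z[0]][X[0]])
    for n in range(1, len(X)):
        lp += math.log(A[Z[n - 1]][Z[n]]) + math.log(phi[Z[n]][X[n]])
    return lp

# p(X, Z) for X = (0, 1, 1), Z = (0, 0, 1): (0.5*0.7) * (0.9*0.3) * (0.1*0.9)
print(math.exp(log_joint([0, 1, 1], [0, 0, 1])))
```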
Modeling the Digit 2

Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2". Here is one possible model:
- We represent the shape of the digit "2" as five line segments.
- Each line segment corresponds to a hidden state. This gives us five hidden states.
- We will also have a special end state, which signifies "end of observations".
Modeling the Digit 2

Suppose that we want to model the motion of a hand, as it traces in the air the shape of the digit "2". Here is one possible model:
- We represent the shape of the digit "2" as five line segments.
- Each line segment corresponds to a hidden state. We end up with five states, plus the end state.

This HMM is a forward model: if z_n = s_k, then z_{n+1} = s_k or z_{n+1} = s_{k+1}. This is similar to the monotonicity rule in DTW.
Modeling the Digit 2

This HMM is a forward model: if z_n = s_k, then z_{n+1} = s_k or z_{n+1} = s_{k+1}. Therefore, A_{k,j} = 0, except when j = k or j = k + 1. Remember, A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).

The feature vector at each video frame n can be the displacement vector: the difference between the pixel location of the hand at frame n and the pixel location of the hand at frame n − 1.
Modeling the Digit 2

[Figure: the digit "2" traced as a sequence of displacement vectors, such as x_9 and x_10.]

So, each observation x_n is a 2D vector. It will be convenient to describe each x_n by two numbers:
- Its length l_n, measured in pixels.
- Its orientation θ_n, measured in degrees.

Lengths l_n come from a Gaussian distribution with mean μ_{l,k} and variance σ²_{l,k} that depend on the state s_k. Orientations θ_n come from a Gaussian distribution with mean μ_{θ,k} and variance σ²_{θ,k} that also depend on the state s_k.
Modeling the Digit 2

The decisions made so far are often made by a human designer of the system:
- The number of states.
- The topology of the model (fully connected, forward, or other variations).
- The features that we want to use.
- The way to model observation probabilities (e.g., using Gaussians, Gaussian mixtures, histograms, etc.).

Once those decisions have been made, the actual probabilities are typically learned from training data:
- The initial state probability function π_k = p(z_1 = s_k).
- The transition matrix A, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
- The observation probabilities φ_k(x) = p(x_n = x | z_n = s_k).
Modeling the Digit 2

The actual probabilities are typically learned from training data:
- The initial state probability function π_k = p(z_1 = s_k).
- The transition matrix A, where A_{k,j} = p(z_n = s_j | z_{n-1} = s_k).
- The observation probabilities φ_k(x) = p(x_n = x | z_n = s_k).

Before we see the algorithm for learning these probabilities, we will first see how we can use an HMM after it has been trained, that is, after all these probabilities have been estimated. To do that, we will look at an example where we simply specify these probabilities manually.
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

In this case, we have five states s_1, ..., s_5, plus the end state. How do we define π_k?
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

In this case, we have five states s_1, ..., s_5, plus the end state. How do we define π_k? The motion always starts at the first segment, so π_1 = 1, and π_k = 0 for k > 1.
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

How do we define the transition matrix A?
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

We need to decide on values for each A_{k,k}. In this model, we spend more time in states s_4 and s_5 than in the other states. This can be modeled by having higher values for A_{4,4} and A_{5,5} than for A_{1,1}, A_{2,2}, A_{3,3}. This way, if z_n = s_4, then z_{n+1} is more likely to also be s_4, and overall state s_4 lasts longer than states s_1, s_2, s_3.
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

We need to decide on values for each A_{k,k}. In this model, we spend more time in states s_4 and s_5 than in the other states. Here is a set of values that can represent that:
A_{1,1} = 0.4, A_{2,2} = 0.4, A_{3,3} = 0.4, A_{4,4} = 0.8, A_{5,5} = 0.7
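One way to see how these self-transition values control state duration: with self-transition probability A_{k,k}, the number of consecutive steps spent in a state is geometrically distributed, with expected duration 1 / (1 − A_{k,k}). A small sketch using the values above:

```python
# Expected number of consecutive time steps spent in each state: the
# duration is geometric with success probability (1 - a), so its mean
# is 1 / (1 - a), where a is the self-transition probability A_kk.
for k, a in enumerate([0.4, 0.4, 0.4, 0.8, 0.7], start=1):
    print(f"s_{k}: expected duration {1 / (1 - a):.2f} time steps")
```

So the choices above make state s_4 last about 5 steps on average, versus fewer than 2 for s_1, s_2, s_3.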
Defining the Probabilities

An HMM is defined by specifying:
- A set of states s_1, ..., s_K.
- An initial state probability function π_k.
- A state transition matrix A.
- Observation probability functions.

Here is the resulting transition matrix A (rows and columns ordered s_1, ..., s_5, end state):

0.4  0.6  0.0  0.0  0.0  0.0
0.0  0.4  0.6  0.0  0.0  0.0
0.0  0.0  0.4  0.6  0.0  0.0
0.0  0.0  0.0  0.8  0.2  0.0
0.0  0.0  0.0  0.0  0.7  0.3
0.0  0.0  0.0  0.0  0.0  0.0
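As a quick sanity check, each row of a transition matrix must be a probability distribution over the next state. A sketch verifying that for the matrix above (the end-state row is all zeros, since generation stops there):

```python
# Transition matrix for the digit-2 model: rows/columns ordered
# s_1, ..., s_5, end state.
A = [
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]

# Rows for s_1..s_5 sum to 1; transitions only go to the same or next state.
for k, row in enumerate(A[:5]):
    assert abs(sum(row) - 1.0) < 1e-9
    assert all(p == 0.0 for j, p in enumerate(row) if j not in (k, k + 1))
print("rows s_1..s_5 are valid distributions")
```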
Defining the Probabilities

As we said before, each observation x_n is a 2D vector, described by l_n and θ_n:
- l_n is the length, measured in pixels.
- θ_n is the orientation, measured in degrees.

We can model p(l_n) as a Gaussian N_l with:
- mean μ_l = 10 pixels,
- variance σ_l² = 1.5.
Both the mean and the variance do not depend on the state.

We can model θ_n as a Gaussian N_{θ,k} with:
- mean μ_{θ,k} that depends on the state s_k (obviously, each state corresponds to moving at a different orientation),
- variance σ_θ² = 10. This way, σ_θ does not depend on the state.
Defining the Probabilities

We define the observation probability functions φ_k as:

φ_k(x_n) = [1 / (σ_l √(2π))] e^{−(l_n − μ_l)² / (2σ_l²)} · [1 / (σ_θ √(2π))] e^{−(θ_n − μ_{θ,k})² / (2σ_θ²)}

For the parameters in the above formula, we (manually) pick these values:
- μ_l = 10 pixels, σ_l² = 1.5.
- σ_θ² = 10.
- μ_{θ,1} = 45 degrees.
- μ_{θ,2} = 0 degrees.
- μ_{θ,3} = −60 degrees.
- μ_{θ,4} = −120 degrees.
- μ_{θ,5} = 0 degrees.
Defining the Probabilities

As we said before, each observation x_n is a 2D vector, described by l_n and θ_n:
- l_n is the length, measured in pixels.
- θ_n is the orientation, measured in degrees.

We define the observation probability functions φ_k as:

φ_k(x_n) = N_l(l_n) · N_{θ,k}(θ_n)
         = [1 / (σ_l √(2π))] e^{−(l_n − μ_l)² / (2σ_l²)} · [1 / (σ_θ √(2π))] e^{−(θ_n − μ_{θ,k})² / (2σ_θ²)}

Note: in the above formula for φ_k(x_n), the only part that depends on the state s_k is μ_{θ,k}.
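A sketch of this emission density as code. The parameter values are the ones picked on the slides; treating the state-3 and state-4 orientation means as negative (downward) angles is an assumption based on the digit's downward strokes:

```python
import math

MU_L, VAR_L = 10.0, 1.5     # length: mean (pixels) and variance
VAR_THETA = 10.0            # orientation variance (shared by all states)
# Orientation mean per state; the negative signs for states 3 and 4 are an
# assumption (downward-pointing segments of the digit "2").
MU_THETA = [45.0, 0.0, -60.0, -120.0, 0.0]

def gaussian(x, mean, var):
    # Univariate Gaussian density with the given mean and variance.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def phi(k, length, theta):
    # p(x_n = (length, theta) | z_n = s_k), for k in 1..5: a product of a
    # state-independent length Gaussian and a state-dependent angle Gaussian.
    return gaussian(length, MU_L, VAR_L) * gaussian(theta, MU_THETA[k - 1], VAR_THETA)
```

For example, φ_1 peaks at a 10-pixel displacement oriented at 45 degrees, while φ_2 and φ_5 are identical, since they share the same orientation mean.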
An HMM as a Generative Model

If we have an HMM whose parameters have already been learned, we can use that HMM to generate data randomly sampled from the joint distribution defined by the HMM:

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

We will now see how to jointly generate a random observation sequence x_1, ..., x_N and a random hidden state sequence z_1, ..., z_N, based on the distribution p(X, Z) defined by the HMM.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ( )
Z = ( )

Step 1: pick a random z_1, based on the initial state probabilities π_k. Remember: π_k = p(z_1 = s_k). What values of z_1 are legal in our example?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ( )
Z = (s_1)

Step 1: pick a random z_1, based on the initial state probabilities π_k. Remember: π_k = p(z_1 = s_k). What values of z_1 are legal in our example? π_k > 0 only for k = 1. Therefore, it has to be that z_1 = s_1.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ( )
Z = (s_1)

Next step: pick a random x_1, based on the observation probability functions φ_k(x). Which φ_k should we use?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ( )
Z = (s_1)

Next step: pick a random x_1, based on φ_1, since z_1 = s_1. We choose l_1 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5. In Matlab, you can do this with this line:

l1 = randn(1)*sqrt(1.5) + 10
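A Python equivalent of the Matlab line above (note that `random.gauss` takes the standard deviation, so 1.5, being the variance, goes under a square root, just like in the Matlab version):

```python
import random

# Sample l_1 from a Gaussian with mean 10 pixels and variance 1.5.
l1 = random.gauss(10, 1.5 ** 0.5)
print(l1)
```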
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, ?))
Z = (s_1)

Next step: pick a random x_1, based on observation probability function φ_1(x). We choose l_1 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5. Result (obviously, it will differ each time): 8.4 pixels.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random x_1, based on observation probability function φ_1(x). We choose θ_1 randomly from Gaussian N_{θ,1}, with mean 45 degrees and variance 10. Result (obviously, it will differ each time): 54 degrees.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from? We should use p(z_2 | z_1 = s_1). Where is that stored?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1)

Next step: pick a random z_2. What distribution should we draw z_2 from? We should use p(z_2 | z_1 = s_1). Where is that stored? On the first row of the state transition matrix A.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1, s_2)

Next step: pick a random z_2, from distribution p(z_2 | z_1 = s_1). The relevant values are:
A_{1,1} = p(z_2 = s_1 | z_1 = s_1) = 0.4
A_{1,2} = p(z_2 = s_2 | z_1 = s_1) = 0.6
Picking randomly, we get z_2 = s_2.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54))
Z = (s_1, s_2)

Next step?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, ?))
Z = (s_1, s_2)

Next step: pick a random x_2, based on observation density φ_2(x). We choose l_2 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5. Result: 10.8 pixels.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2)

Next step: pick a random x_2, based on observation density φ_2(x). We choose θ_2 randomly from Gaussian N_{θ,2}, with mean 0 degrees and variance 10. Result: 2 degrees.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2)

Next step?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2))
Z = (s_1, s_2, s_2)

Next step: pick a random z_3, from distribution p(z_3 | z_2 = s_2). The relevant values are:
A_{2,2} = p(z_3 = s_2 | z_2 = s_2) = 0.4
A_{2,3} = p(z_3 = s_3 | z_2 = s_2) = 0.6
Picking randomly, we get z_3 = s_2.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2), (11.3, ?))
Z = (s_1, s_2, s_2)

Next step: pick a random x_3, based on observation density φ_2(x), since z_3 = s_2. We choose l_3 randomly from Gaussian N_l, with mean 10 pixels and variance 1.5. Result: 11.3 pixels.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2), (11.3, −3))
Z = (s_1, s_2, s_2)

Next step: pick a random x_3, based on observation density φ_2(x). We choose θ_3 randomly from Gaussian N_{θ,2}, with mean 0 degrees and variance 10. Result: −3 degrees.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2), (11.3, −3))
Z = (s_1, s_2, s_2, s_3)

Next step: pick a random z_4, from distribution p(z_4 | z_3 = s_2). The relevant values are:
A_{2,2} = p(z_4 = s_2 | z_3 = s_2) = 0.4
A_{2,3} = p(z_4 = s_3 | z_3 = s_2) = 0.6
Picking randomly, we get z_4 = s_3.
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2), (11.3, −3), ...)
Z = (s_1, s_2, s_2, s_3, ...)

Overall, this is an iterative process:
- Pick randomly a new state z_n = s_k, based on the state transition probabilities.
- Pick randomly a new observation x_n, based on observation density φ_k(x).

When do we stop?
Generating Random Data

p(X, Z) = p(z_1) [Π_{n=2}^{N} p(z_n | z_{n-1})] [Π_{n=1}^{N} p(x_n | z_n)]

X = ((8.4, 54), (10.8, 2), (11.3, −3), ...)
Z = (s_1, s_2, s_2, s_3, ...)

Overall, this is an iterative process:
- Pick randomly a new state z_n = s_k, based on the state transition probabilities.
- Pick randomly a new observation x_n, based on observation density φ_k(x).

We stop when the new state z_n is the end state.
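The whole iterative process above can be sketched in a few lines. This uses the digit-2 parameters from the earlier slides; the negative orientation means for s_3 and s_4 are an assumption about the digit's downward strokes:

```python
import random

# Transition rows for states s_1..s_5 over (s_1..s_5, end); the loop exits
# when the end state is reached, so no row is needed for it.
A = [
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3],
]
MU_THETA = [45.0, 0.0, -60.0, -120.0, 0.0]  # orientation mean per state
END = 5                                      # index of the end state

def generate():
    # Jointly sample (X, Z) from p(X, Z): start in s_1 (pi puts all its mass
    # there), emit a (length, theta) pair from the current state's Gaussians,
    # then draw the next state from the matching row of A.
    X, Z = [], []
    z = 0                                    # z_1 = s_1
    while z != END:
        Z.append(z + 1)                      # record 1-based state labels
        length = random.gauss(10.0, 1.5 ** 0.5)
        theta = random.gauss(MU_THETA[z], 10.0 ** 0.5)
        X.append((round(length, 1), round(theta)))
        z = random.choices(range(6), weights=A[z])[0]
    return X, Z

X, Z = generate()
print(Z)   # a nondecreasing run through s_1..s_5, e.g. longer stays in s_4, s_5
print(X)
```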
An Example of Synthetic Data

The textbook shows this figure as an example of how synthetic data is generated.
- The top row shows some of the training images used to train a model of the digit "2" (not identical to the model we described before, but along the same lines).
- The bottom row shows three examples of synthetic patterns, generated using the approach we just described.

What do you notice in the synthetic data?
An Example of Synthetic Data

The synthetic data is not very realistic. The problem is that some states last longer than they should, and some states last shorter than they should. For example:
- In the leftmost synthetic example, the top curve is too big relative to the rest of the pattern.
- In the middle synthetic example, the diagonal line at the middle is too long.
- In the rightmost synthetic example, the top curve is too small relative to the bottom horizontal line.
An Example of Synthetic Data

Why do we get this problem of disproportionate parts? As we saw earlier, each next state is chosen randomly, based on the transition probabilities. There is no "memory" to say that, e.g., if the top curve is big (or small), the rest of the pattern should be proportional to it. This is the price we pay for the Markovian assumption: the future is independent of the past, given the current state. The benefit of the Markovian assumption is efficient learning and classification algorithms, as we will see.
HMMs: Next Steps

We have seen how HMMs are defined:
- Set of states.
- Initial state probabilities.
- State transition matrix.
- Observation probabilities.

We have seen how an HMM defines a probability distribution p(X, Z). We have also seen how to generate random samples from that distribution. Next, we will see:
- How to use HMMs for various tasks, like classification.
- How to learn HMMs from training data.