Hidden Markov Models (HMMs)

Reading Assignments
- R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 3.10, hard copy).
- L. Rabiner, "A tutorial on HMMs and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, pp. 257-286, 1989 (hard copy).

Case Studies
- F. Samaria, "Face segmentation for identification using HMMs", British Machine Vision Conference, pp. 399-408, 1993 (on-line).
- A. Nefian and M. Hayes III, "Face recognition using an embedded HMM", Intel, 1999 (on-line).
Hidden Markov Models (HMMs)

Time dependencies
- HMMs are appropriate for problems that have an inherent temporality:
  * speech recognition
  * gesture recognition
  * human activity recognition
- A pattern is the result of a time process which has a number of states.
- States at time t are influenced directly by states in previous time steps.

Definition of first-order Markov models
- They are represented by a graph where every node corresponds to a state ω_i.
- The graph can be fully connected with self-loops.
- Links between nodes ω_i and ω_j are associated with a transition probability:

    P(ω(t+1) = ω_j / ω(t) = ω_i) = a_ij

  which is the probability of being in state ω_j at time t+1 given that the state at time t was ω_i (first-order model).
- The following constraint should be satisfied:

    Σ_j a_ij = 1  for all i

- Markov models are fully described by their transition probabilities a_ij.
How to compute the probability P(ω^T) of a sequence of states?

- Given a sequence of states ω^T = (ω(1), ω(2), ..., ω(T)), the probability that the model generated ω^T is equal to the product of the corresponding transition probabilities:

    P(ω^T) = Π_{t=1}^{T} P(ω(t) / ω(t-1))

  where P(ω(1) / ω(0)) ≡ P(ω(1)) is the prior probability of the first state.

- Example: if ω^6 = (ω_1, ω_4, ω_2, ω_2, ω_1, ω_4), then (see the sketch below)

    P(ω^6) = P(ω_1) P(ω_4/ω_1) P(ω_2/ω_4) P(ω_2/ω_2) P(ω_1/ω_2) P(ω_4/ω_1)
           = P(ω_1) a_14 a_42 a_22 a_21 a_14

- The last state ω(T) is called the absorbing state and is denoted ω_0 (i.e., a state which, once entered, is never left: a_00 = 1).
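A minimal Python sketch of this computation (function names are mine; the transition matrix and the uniform prior below are made-up values for illustration, only the state sequence comes from the example above):

```python
import numpy as np

# Hypothetical 4-state transition matrix (each row must sum to 1);
# a[i, j] = a_ij, with states w_1..w_4 stored at indices 0..3.
a = np.array([[0.1, 0.4, 0.2, 0.3],
              [0.3, 0.2, 0.4, 0.1],
              [0.2, 0.3, 0.3, 0.2],
              [0.4, 0.1, 0.2, 0.3]])

def chain_probability(a, states, prior):
    """P(w^T) = P(w(1)) * product of a_{w(t) w(t+1)} over consecutive pairs."""
    p = prior[states[0]]
    for i, j in zip(states, states[1:]):
        p *= a[i, j]
    return p

# The example sequence w^6 = (w1, w4, w2, w2, w1, w4), 0-indexed:
seq = [0, 3, 1, 1, 0, 3]
prior = np.full(4, 0.25)                 # assumed uniform prior on the first state
print(chain_probability(a, seq, prior))  # P(w1) a14 a42 a22 a21 a14
```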
Definition of first-order hidden Markov models
- We augment the model such that when it is in state ω(t) it also emits some symbol v(t) (a visible state) from a set of possible symbols.
- For every sequence of hidden states, there is an associated sequence of visible states:

    ω^T = (ω(1), ω(2), ..., ω(T)) -> V^T = (v(1), v(2), ..., v(T))

- When the model is in state ω_j at time t, the probability of emitting a visible state v_k at that time is denoted as

    P(v(t) = v_k / ω(t) = ω_j) = b_jk

- The following constraint should be satisfied:

    Σ_k b_jk = 1  for all j

Coin toss example
- You are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening.
- On the other side of the barrier is another person who is performing a coin-toss (or multiple-coin-toss) experiment.
- The other person will tell you only the result of the experiment, not how he obtained that result!

    V^T = HHTHTTHH...T = v(1), v(2), ..., v(T)

- Problem: build an HMM to explain the observed sequence of heads and tails.

1-fair-coin model
- There are 2 states, each associated with either heads (state 1) or tails (state 2).
- The observation sequence uniquely defines the states (the model is not hidden).
2-fair-coins model
- There are 2 states, but neither state is uniquely associated with heads or tails (each state is associated with a different fair coin).
- A third coin is used to decide which of the two fair coins to flip.

2-biased-coins model
- There are 2 states, with each state associated with a biased coin.
- A third coin is used to decide which of the biased coins to flip.

3-biased-coins model
- There are 3 states, with each state associated with a biased coin.
- We decide which coin to flip in some way (e.g., using other coins).
Hidden Markov models and finite-state machines
- The two models are basically equivalent!
- When the transitions from state to state are probabilistic, we call them HMMs.

Some definitions
- Causal HMM: the probabilities depend only upon previous states.
- Ergodic HMM: every one of the states has a non-zero probability of occurring given some starting state.

Central issues in HMMs
- Evaluation problem: determine the probability that a particular sequence of visible states V^T was generated by a given model.
- Decoding problem: given a sequence of visible states V^T, determine the most likely sequence of hidden states ω^T that led to those observations.
- Learning problem: given a set of visible observations, determine the parameters a_ij and b_jk.
Evaluation
- In practice, we have several HMMs, one for each class, and we classify a test pattern by choosing the model with the highest probability.

  (Figure: V^T is fed to each of HMM 1, HMM 2, ..., HMM N; each model outputs P(V^T), and a MAX unit selects the model with the highest value.)

- The probability that a model produces V^T can be computed using the theorem of total probability:

    P(V^T) = Σ_{r=1}^{r_max} P(V^T / ω_r^T) P(ω_r^T)

  where ω_r^T = (ω(1), ω(2), ..., ω(T)) is one of the possible sequences of hidden states and r_max = c^T for a model with c states ω_1, ω_2, ..., ω_c.

- The second term P(ω_r^T) can be written as follows:

    P(ω_r^T) = Π_{t=1}^{T} P(ω(t) / ω(t-1))

  where P(ω(1) / ω(0)) ≡ P(ω(1)).

- The first term P(V^T / ω_r^T) can be written as:

    P(V^T / ω_r^T) = Π_{t=1}^{T} P(v(t) / ω(t))
- Combining the two terms together:

    P(V^T) = Σ_{r=1}^{r_max} Π_{t=1}^{T} P(v(t) / ω(t)) P(ω(t) / ω(t-1))

Computational complexity
- Given a_ij and b_jk, it is straightforward to compute P(V^T).
- This computation, however, has O(T c^T) requirements! (see the brute-force sketch below)
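For concreteness, here is a minimal brute-force sketch of this sum (function name is mine). It enumerates all c^T hidden sequences, which is exactly the O(T c^T) cost noted above; on the same inputs it returns the same value as the forward recursion described next:

```python
import numpy as np
from itertools import product

def evaluate_naive(a, b, V, init_state):
    """P(V^T) by brute force: sum of P(V^T / w^T) P(w^T) over all c^T sequences."""
    c, T = a.shape[0], len(V)
    total = 0.0
    for seq in product(range(c), repeat=T):   # every possible hidden sequence w^T
        p, prev = 1.0, init_state             # known state at step 0
        for t, state in enumerate(seq):
            # one transition probability and one emission probability per step
            p *= a[prev, state] * b[state, V[t]]
            prev = state
        total += p
    return total
```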
Recursive computation of P(V^T) (HMM forward)

Input: V^T = (v(1), v(2), ..., v(T))

- Let α_i(t) represent the probability that the HMM is in hidden state ω_i at step t, having generated the first t elements of V^T:

    α_i(t) = P(v(1), v(2), ..., v(t), ω(t) = ω_i)

- We can compute α_j(t+1), j = 1, 2, ..., c, as follows:

    α_j(t+1) = P(v(1), ..., v(t), v(t+1), ω(t+1) = ω_j)
             = Σ_{i=1}^{c} P(v(1), ..., v(t), ω(t) = ω_i) P(v(t+1) / ω(t+1) = ω_j) P(ω(t+1) = ω_j / ω(t) = ω_i)

  or

    α_j(t+1) = [Σ_{i=1}^{c} α_i(t) a_ij] b_{j v(t+1)},   j = 1, 2, ..., c
Initialization (known initial state ω(1)):
    α_i(0) = 1 if ω_i = ω(1), and α_i(0) = 0 otherwise (prior state probability)

for (t = 1; t <= T; t++)
    for (j = 1; j <= c; j++)
        α_j(t) = [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}

P(V^T) = α_0(T)

- The complexity of this algorithm is only O(c^2 T)!!

An example
- Consider a model with four states ω_0, ω_1, ω_2, ω_3 (ω_0 absorbing) and five visible symbols v_0, ..., v_4, where the final symbol v_0 is emitted only in ω_0:

    a_ij = | 1.0  0.0  0.0  0.0 |      b_jk = | 1.0  0.0  0.0  0.0  0.0 |
           | 0.2  0.3  0.1  0.4 |             | 0.0  0.3  0.4  0.1  0.2 |
           | 0.2  0.5  0.2  0.1 |             | 0.0  0.1  0.1  0.7  0.1 |
           | 0.8  0.1  0.0  0.1 |             | 0.0  0.5  0.2  0.1  0.2 |

  (rows and columns of a_ij indexed by ω_0, ..., ω_3; columns of b_jk indexed by v_0, ..., v_4)

- Suppose the known initial state is ω_1 (so α_1(0) = 1) and the observed sequence is V^4 = (v_1, v_3, v_2, v_0).

t = 1:
    α_0(1) = [α_0(0) a_00 + α_1(0) a_10 + α_2(0) a_20 + α_3(0) a_30] P(v(1)/ω_0) = 0.2 · 0 = 0
    α_1(1) = [α_0(0) a_01 + α_1(0) a_11 + α_2(0) a_21 + α_3(0) a_31] P(v(1)/ω_1) = 0.3 · 0.3 = 0.09
    α_2(1) = [α_0(0) a_02 + α_1(0) a_12 + α_2(0) a_22 + α_3(0) a_32] P(v(1)/ω_2) = 0.1 · 0.1 = 0.01
    α_3(1) = [α_0(0) a_03 + α_1(0) a_13 + α_2(0) a_23 + α_3(0) a_33] P(v(1)/ω_3) = 0.4 · 0.5 = 0.2

- Similarly for t = 2, 3, 4; the final answer is P(V^T) = α_0(T) = 0.0011.
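A minimal sketch of the forward recursion in Python/NumPy (function name is mine), reproducing the example above; states ω_0..ω_3 and symbols v_0..v_4 are 0-indexed:

```python
import numpy as np

# a[i, j] = a_ij = P(w_j at t+1 / w_i at t); state w_0 is absorbing (a_00 = 1).
a = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.4],
              [0.2, 0.5, 0.2, 0.1],
              [0.8, 0.1, 0.0, 0.1]])

# b[j, k] = b_jk = P(v_k / w_j); the unique final symbol v_0 is emitted only by w_0.
b = np.array([[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.3, 0.4, 0.1, 0.2],
              [0.0, 0.1, 0.1, 0.7, 0.1],
              [0.0, 0.5, 0.2, 0.1, 0.2]])

def forward(a, b, V, init_state):
    """alpha[t, j] = P(v(1), ..., v(t), w(t) = w_j); rows t = 0 .. T."""
    T = len(V)
    alpha = np.zeros((T + 1, a.shape[0]))
    alpha[0, init_state] = 1.0                    # known initial state
    for t in range(1, T + 1):
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_{j v(t)}
        alpha[t] = (alpha[t - 1] @ a) * b[:, V[t - 1]]
    return alpha

V = [1, 3, 2, 0]                                  # V^4 = (v1, v3, v2, v0)
alpha = forward(a, b, V, init_state=1)
print(alpha[1])       # [0, 0.09, 0.01, 0.2], as computed above
print(alpha[-1, 0])   # P(V^T) = alpha_0(T), approximately 0.0011
```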
The backward algorithm (HMM backward)

Input: V^T = (v(1), v(2), ..., v(T))

- Let β_i(t) represent the probability that the HMM, given that it is in hidden state ω_i at step t, will generate the remainder of the target sequence, i.e., v(t+1), ..., v(T):

    β_i(t) = P(v(t+1), v(t+2), ..., v(T) / ω(t) = ω_i)

- We can compute β_i(t), i = 1, 2, ..., c, from the values at step t+1:

    β_i(t) = Σ_{j=1}^{c} P(v(t+2), ..., v(T) / ω(t+1) = ω_j) P(v(t+1) / ω(t+1) = ω_j) P(ω(t+1) = ω_j / ω(t) = ω_i)

  or

    β_i(t) = Σ_{j=1}^{c} β_j(t+1) b_{j v(t+1)} a_ij,   i = 1, 2, ..., c

Initialization (known final state ω(T)):
    β_i(T) = 1 if ω_i = ω(T), and β_i(T) = 0 otherwise

for (t = T-1; t >= 0; t--)
    for (i = 1; i <= c; i++)
        β_i(t) = Σ_{j=1}^{c} β_j(t+1) a_ij b_{j v(t+1)}

P(V^T) = β_i(0), where ω_i is the known initial state
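A matching sketch of the backward pass, written against the same conventions as the forward sketch above (function name is mine):

```python
import numpy as np

def backward(a, b, V, final_state):
    """beta[t, i] = P(v(t+1), ..., v(T) / w(t) = w_i); rows t = 0 .. T."""
    T = len(V)
    beta = np.zeros((T + 1, a.shape[0]))
    beta[T, final_state] = 1.0                    # known final (absorbing) state
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{j v(t+1)} * beta_j(t+1)
        beta[t] = a @ (b[:, V[t]] * beta[t + 1])
    return beta

# With a, b, V from the forward example: backward(a, b, V, final_state=0)[0, 1]
# equals P(V^T), approximately 0.0011, the same value the forward pass produced.
```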
Decoding
- We need to use an optimality criterion to solve this problem (there are several possible ways of solving it, since there are various optimality criteria we could use).

Algorithm 1: choose the states ω(t) that are individually most likely (i.e., maximize the expected number of correct individual states).

* If we define γ_i(t) = P(ω(t) = ω_i / V^T), then:

    γ_i(t) = P(ω(t) = ω_i, V^T) / P(V^T) = α_i(t) β_i(t) / P(V^T)

* Using γ_i(t), the individually most likely state ω(t) at time t is:

    ω(t) = arg max_i [γ_i(t)],   1 ≤ t ≤ T
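Given the α and β tables from the two sketches above, Algorithm 1 is a few lines; a minimal sketch (function name is mine):

```python
import numpy as np

def posterior_decode(alpha, beta):
    """Individually most likely states: argmax_i gamma_i(t) for t = 1 .. T."""
    gamma = alpha * beta
    # each row of alpha*beta sums to P(V^T), so normalising rows yields gamma
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma[1:].argmax(axis=1)   # skip row 0 (the step-0 initialisation)
```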
Algorithm 2 (easy): at each time step t, find the state that has the highest probability of having come from the previous step and generated the observed visible state v(t); this uses the forward algorithm with minor changes.

Initialization (known initial state ω(1)):
    α_i(0) = 1 if ω_i = ω(1), and α_i(0) = 0 otherwise
    Path = empty

for (t = 1; t <= T; t++) {
    for (j = 1; j <= c; j++)
        α_j(t) = [Σ_{i=1}^{c} α_i(t-1) a_ij] b_{j v(t)}
    j' = arg max_j [α_j(t)]
    Append ω_j' to Path
}
return Path
- There is no guarantee that the path is a valid one (local optimization).
- The path might imply a transition that is not allowed by the model.
Algorithm 3 (Viterbi algorithm, the most widely used): find the single best sequence, i.e., maximize P(ω^T / V^T).

* Equivalent to maximizing P(ω^T, V^T), since:

    P(ω^T / V^T) = P(ω^T, V^T) / P(V^T)

* We will compute the probability P(ω^T, V^T) recursively.

* Let us define δ_i(t) as the highest probability along a single path, at time t, with the path ending at ω_i:

    δ_i(t) = max_{ω(1), ω(2), ..., ω(t-1)} P(ω(1), ω(2), ..., ω(t-1), ω(t) = ω_i, v(1), v(2), ..., v(t))

* Using induction we have:

    δ_j(t) = [max_i δ_i(t-1) a_ij] b_{j v(t)},   1 ≤ j ≤ c

* To retrieve the best state sequence, we need to keep track of the argument i that maximizes the above equation:

    ψ_j(t) = arg max_i [δ_i(t-1) a_ij],   1 ≤ j ≤ c
Step 1: Initialization
    δ_i(1) = P(ω(1) = ω_i) b_{i v(1)},   ψ_i(1) = 0,   1 ≤ i ≤ c

Step 2: Recursion
    δ_j(t) = max_i [δ_i(t-1) a_ij] b_{j v(t)},   2 ≤ t ≤ T, 1 ≤ j ≤ c
    ψ_j(t) = arg max_i [δ_i(t-1) a_ij],   2 ≤ t ≤ T, 1 ≤ j ≤ c

Step 3: Termination
    P* = max_i [δ_i(T)]
    ω*(T) = arg max_i [δ_i(T)]

Step 4: Path backtracking
    ω*(t) = ψ_{ω*(t+1)}(t+1),   t = T-1, T-2, ..., 1
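A minimal sketch of the four steps in Python/NumPy, using the same conventions as the forward sketch (function name is mine; a known initial state stands in for the prior P(ω(1) = ω_i)):

```python
import numpy as np

def viterbi(a, b, V, init_state):
    """Single best hidden state sequence (w*(1), ..., w*(T)) for observations V."""
    c, T = a.shape[0], len(V)
    delta = np.zeros((T + 1, c))
    psi = np.zeros((T + 1, c), dtype=int)
    delta[0, init_state] = 1.0                     # Step 1: known initial state
    for t in range(1, T + 1):                      # Step 2: recursion
        scores = delta[t - 1][:, None] * a         # scores[i, j] = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)             # best predecessor of each j
        delta[t] = scores.max(axis=0) * b[:, V[t - 1]]
    path = [int(delta[T].argmax())]                # Step 3: best final state
    for t in range(T, 1, -1):                      # Step 4: backtracking
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# With a, b, V from the forward example: viterbi(a, b, V, init_state=1)
# returns the single most probable hidden path, ending in the absorbing state.
```

In practice the recursion is carried out in log-space (sums of log probabilities instead of products) to avoid numerical underflow on long sequences.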
Rate invariance
- Let's consider the problem of gesture recognition:
  * The duration of the same gesture can vary from person to person.
  * The duration of the same gesture can vary for the same person.
- HMMs address this issue:
  * Transition probabilities incorporate the probabilistic structure of the durations.
  * Post-processing can be used to delete repeated states (see the sketch below), e.g., (ω_1, ω_1, ω_3, ω_2, ω_2, ω_2) can be converted to (ω_1, ω_3, ω_2).
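The repeated-state post-processing is a one-liner; a small sketch (function name is mine):

```python
from itertools import groupby

def collapse_repeats(states):
    """Delete consecutive duplicates: (w1, w1, w3, w2, w2, w2) -> (w1, w3, w2)."""
    return [s for s, _ in groupby(states)]

print(collapse_repeats(["w1", "w1", "w3", "w2", "w2", "w2"]))  # ['w1', 'w3', 'w2']
```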
Learning
- Determine the probabilities a_ij and b_jk from a set of training examples (i.e., maximize the probability of the observation sequences).
- There is no known way to solve for a maximum-likelihood model analytically.

The forward-backward algorithm (Baum-Welch algorithm)

* Let's define ξ_ij(t) = P(ω(t) = ω_i, ω(t+1) = ω_j / V^T); then:

    ξ_ij(t) = α_i(t) a_ij b_{j v(t+1)} β_j(t+1) / P(V^T)

* We can write γ_i(t) as follows:

    γ_i(t) = Σ_{j=1}^{c} ξ_ij(t)

* The expected number of times that ω_i is visited:

    Σ_{t=1}^{T} γ_i(t)

* The expected number of transitions made from ω_i:

    Σ_{t=1}^{T-1} γ_i(t)
* The expected number of transitions from ω_i to ω_j:

    Σ_{t=1}^{T-1} ξ_ij(t)

* We can re-estimate a_ij as the ratio of the expected number of transitions from ω_i to ω_j, divided by the expected number of transitions out of state ω_i:

    â_ij = Σ_{t=1}^{T-1} ξ_ij(t) / Σ_{t=1}^{T-1} γ_i(t)

* We can re-estimate b_jk as the ratio of the expected number of times of being in state ω_j and observing v_k, divided by the expected number of times of being in state ω_j:

    b̂_jk = Σ_{t: v(t) = v_k} γ_j(t) / Σ_{t=1}^{T} γ_j(t)

  (a sketch of one re-estimation pass is given at the end of these notes)

Difficulties with using HMMs
- How do we decide on the number of states of the model?
- What about the size of the observation sequence?
  * It should be sufficiently long to guarantee that all state transitions appear a sufficient number of times.
  * A large amount of training data is necessary to learn the HMM parameters.
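Finally, a minimal sketch of one Baum-Welch re-estimation pass over a single training sequence, implementing the ξ/γ update formulas above (function name and array layout are mine; the α and β tables are assumed to come from the forward and backward sketches earlier; a practical implementation would iterate to convergence, pool many training sequences, and guard against zero expected counts):

```python
import numpy as np

def baum_welch_step(a, b, V, alpha, beta):
    """One re-estimation of a_ij and b_jk from a single sequence V."""
    c, T = a.shape[0], len(V)
    pV = float(alpha[0] @ beta[0])                 # P(V^T)
    # xi[t, i, j] = alpha_i(t) a_ij b_{j v(t+1)} beta_j(t+1) / P(V^T)
    xi = np.zeros((T, c, c))
    for t in range(T):
        xi[t] = alpha[t][:, None] * a * (b[:, V[t]] * beta[t + 1])[None, :] / pV
    gamma = alpha * beta / pV                      # gamma[t, i] = P(w(t) = w_i / V^T)
    # a_hat: expected i -> j transitions over expected transitions out of i
    a_hat = xi.sum(axis=0) / gamma[:T].sum(axis=0)[:, None]
    # b_hat: expected visits to j that emit v_k over expected visits to j
    obs = np.zeros((T, b.shape[1]))
    obs[np.arange(T), V] = 1.0                     # one-hot rows for v(1) .. v(T)
    b_hat = gamma[1:].T @ obs / gamma[1:].sum(axis=0)[:, None]
    return a_hat, b_hat
```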