Hidden Markov Models
Hamid R. Rabiee
Hidden Markov Models (HMMs)
In the previous slides we saw that, in many cases, the underlying behavior of nature can be modeled as a Markov process. However, most of the time these states are not directly observable; instead, there exist some observations that give information about the sequence of states. We assume that, conditioned on the state Z_k, the observation X_k is independent of all other states and observations.
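For reference, this conditional-independence assumption gives the usual HMM factorization of the joint distribution (a standard identity, written here in the slide's notation with hidden states Z_k and observations X_k; it is not stated explicitly on the slide):

P(X_1, \ldots, X_T, Z_1, \ldots, Z_T) = P(Z_1) \prod_{k=2}^{T} P(Z_k \mid Z_{k-1}) \prod_{k=1}^{T} P(X_k \mid Z_k)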
HMM Applications
Some applications of HMMs:
- Speech recognition and processing: recognizing spoken words and phrases; speech synthesis; many other applications in this field
- Text processing: parsing raw records into structured records; part-of-speech tagging
- Bioinformatics: protein sequence prediction
- Financial: stock market forecasts (price pattern prediction); comparison shopping services
Example (Tracking)
Recall the tracking problem discussed in the previous slides. We modeled the object movement by a Markov process as:
S_t = A_t S_{t-1} + v_t, \quad v_t \sim N(0, \Sigma), \quad S_t = (X_t, Y_t, \dot{X}_t, \dot{Y}_t)
We do not know the exact location of the object, but there exists some observable information about the state sequence (obtained via sensors over time). The observation equation depends on the sensor. A simple case is:
X_t = C S_t + D w_t, \quad w_t \sim N(0, \Sigma_e)
where C and D are two fixed matrices and w_t is the sensor error at time t (modeled as additive Gaussian noise). To solve the tracking problem we need to find P(S_1, S_2, \ldots \mid X_1, X_2, \ldots).
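As a rough illustration, the sketch below simulates this state-space model for a few time steps. All numeric values of A, C, D, and the noise covariances are made up for the example; they are not taken from the slides.

```python
import numpy as np

# Illustrative constant-velocity tracking model: state S_t = (X_t, Y_t, Xdot_t, Ydot_t).
# A, C, D and the noise covariances below are made-up example values.
dt = 1.0
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)   # state transition matrix
C = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)    # we only observe the position
D = np.eye(2)                                # sensor noise gain
Sigma   = 0.01 * np.eye(4)                   # process noise covariance
Sigma_e = 0.50 * np.eye(2)                   # sensor noise covariance

rng = np.random.default_rng(0)
S = np.zeros(4)                              # start at the origin, at rest
for t in range(5):
    S = A @ S + rng.multivariate_normal(np.zeros(4), Sigma)        # S_t = A S_{t-1} + v_t
    X = C @ S + D @ rng.multivariate_normal(np.zeros(2), Sigma_e)  # X_t = C S_t + D w_t
    print(f"t={t}  true position={S[:2]}  observation={X}")
```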
Specification of an HMM
Components:
- N: number of states
- Q = {q_1, q_2, \ldots, q_T}: set of states over time
- M: number of symbols (observables)
- O = {o_1, o_2, \ldots, o_T}: set of symbols over time
- A: state transition probability matrix, a_{ij} = P(q_{t+1} = j \mid q_t = i)
- B: observation probability distribution, b_j(k) = P(o_t = k \mid q_t = j), 1 \le k \le M
- \pi: initial state distribution
The full HMM is thus specified as a triplet: \lambda = (A, B, \pi)
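A minimal sketch of such a specification, using made-up numbers for a toy HMM with N = 2 states and M = 3 symbols (Python/NumPy, also used for the remaining sketches):

```python
import numpy as np

# Toy HMM lambda = (A, B, pi) with N = 2 states and M = 3 symbols (made-up numbers).
A  = np.array([[0.7, 0.3],          # a_ij = P(q_{t+1} = j | q_t = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],     # b_j(k) = P(o_t = k | q_t = j)
               [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])           # initial state distribution

# Each row of A and B, and pi itself, must sum to one.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)
```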
Central problems in HMM modelling
Problem 1 (Evaluation): probability of occurrence of a particular observation sequence, O = {o_1, \ldots, o_k}, given the model: P(O \mid \lambda).
- Complicated by the hidden states
- Useful in sequence classification
Problem 2 (Decoding): find the optimal state sequence to produce a given observation sequence, O = {o_1, \ldots, o_k}, given the model.
- Requires an optimality criterion
- Useful in recognition problems
Problem 3 (Learning): determine the optimum model, given a training set of observations.
Problem 1: Naïve solution
State sequence Q = (q_1, \ldots, q_T).
Assume independent observations: observations are mutually independent, given the hidden states. (The joint distribution of independent variables factorises into the marginal distributions of the independent variables.)
Observe that:
P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)
P(q \mid \lambda) = \pi_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}
And finally:
P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)
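A direct, brute-force transcription of these formulas is sketched below; it sums over all N^T state sequences, so it is only practical for tiny T and motivates the forward algorithm on the next slide. Integer-coded observations and the (A, B, pi) arrays from the earlier toy sketch are assumed.

```python
import itertools
import numpy as np

def naive_evaluation(A, B, pi, O):
    """P(O | lambda) by summing over all N**T state sequences (illustration only)."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):
        p = pi[q[0]] * B[q[0], O[0]]                    # pi_{q1} * b_{q1}(o1)
        for t in range(1, T):
            p *= A[q[t - 1], q[t]] * B[q[t], O[t]]      # a_{q_{t-1} q_t} * b_{q_t}(o_t)
        total += p
    return total

# Example: naive_evaluation(A, B, pi, O=[0, 2, 1]) with the toy model defined earlier.
```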
Problem 1: Efficient solution
Forward algorithm: a dynamic programming approach.
Define the auxiliary forward variable \alpha:
\alpha_t(i) = P(o_1, \ldots, o_t, q_t = i \mid \lambda)
\alpha_t(i) is the probability of observing the partial sequence of observables o_1, \ldots, o_t and being in state q_t = i at time t.
Recursive algorithm:
Initialise: \alpha_1(i) = \pi_i\, b_i(o_1)
Calculate: \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_j(o_{t+1})
Obtain: P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
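A compact sketch of this recursion (assuming integer-coded observations; a real implementation would rescale or work in log space to avoid underflow):

```python
import numpy as np

def forward(A, B, pi, O):
    """Forward algorithm: returns alpha (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                          # alpha_1(i) = pi_i b_i(o_1)
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]  # [sum_i alpha_t(i) a_ij] * b_j(o_{t+1})
    return alpha, alpha[-1].sum()                       # P(O | lambda) = sum_i alpha_T(i)
```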
Problem 1: Alternative solution
Backward algorithm: again dynamic programming.
Define the auxiliary backward variable \beta:
\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)
\beta_t(i) is the probability of observing the sequence of observables o_{t+1}, \ldots, o_T given state q_t = i at time t.
Recursive algorithm:
Initialise: \beta_T(i) = 1, \quad 1 \le i \le N
Calculate: \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
Terminate: P(O \mid \lambda) = \sum_{i=1}^{N} \pi_i\, b_i(o_1)\, \beta_1(i)
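The corresponding sketch of the backward recursion (same assumptions as the forward sketch):

```python
import numpy as np

def backward(A, B, pi, O):
    """Backward algorithm: returns beta (T x N) and P(O | lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])    # sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j)
    return beta, (pi * B[:, O[0]] * beta[0]).sum()      # P(O | lambda) = sum_i pi_i b_i(o_1) beta_1(i)
```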
Problem 2: Decoding
Choose the state sequence that maximises the probability of the observation sequence.
The Viterbi algorithm is an inductive algorithm that keeps the best state sequence at each instant; it utilizes dynamic programming.
State sequence to maximise: P(q_1, q_2, \ldots, q_T \mid O, \lambda)
Define the auxiliary variable \delta:
\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = i, o_1, o_2, \ldots, o_t \mid \lambda)
\delta_t(i) is the probability of the most probable path ending in state q_t = i.
Problem 2: Decoding
Recurrent property: \delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(o_{t+1})
Algorithm:
To get the state sequence, we need to keep track of the argument that maximises this for each t and j. This is done via the array \psi_t(j).
1. Initialise: \delta_1(i) = \pi_i\, b_i(o_1), \quad \psi_1(i) = 0, \quad 1 \le i \le N
2. Recursion: \delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}]\, b_j(o_t), \quad \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\, a_{ij}], \quad 2 \le t \le T, \ 1 \le j \le N
3. Terminate: P^{*} = \max_{1 \le i \le N} \delta_T(i), \quad q_T^{*} = \arg\max_{1 \le i \le N} \delta_T(i)
4. Backtrack the state sequence: q_t^{*} = \psi_{t+1}(q_{t+1}^{*}), \quad t = T-1, T-2, \ldots, 1
P^{*} gives the state-optimised probability; Q^{*} = \{q_1^{*}, q_2^{*}, \ldots, q_T^{*}\} is the optimal state sequence.
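A sketch of the Viterbi recursion and backtracking steps above (again assuming integer-coded observations; in practice log probabilities are used to avoid underflow):

```python
import numpy as np

def viterbi(A, B, pi, O):
    """Viterbi algorithm: returns the most probable state sequence Q* and its probability P*."""
    N, T = A.shape[0], len(O)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, O[0]]                          # delta_1(i) = pi_i b_i(o_1)
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A              # delta_{t-1}(i) * a_ij for all i, j
        psi[t] = scores.argmax(axis=0)                  # psi_t(j) = argmax_i delta_{t-1}(i) a_ij
        delta[t] = scores.max(axis=0) * B[:, O[t]]      # delta_t(j) = max_i [...] * b_j(o_t)
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                          # q_T* = argmax_i delta_T(i)
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]                     # backtrack: q_t* = psi_{t+1}(q_{t+1}*)
    return q, delta[-1].max()                           # Q*, P*
```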
Problem 3: Learning
Train the HMM to encode an observation sequence such that the HMM will identify a similar observation sequence in the future.
Maximum likelihood criterion: find \lambda = (A, B, \pi) that maximises P(O \mid \lambda).
General algorithm:
1. Initialise: \lambda_0
2. Compute a new model \lambda, using \lambda_0 and the observed sequence O
3. Set \lambda_0 \leftarrow \lambda and repeat steps 2 and 3 until: |\log P(O \mid \lambda) - \log P(O \mid \lambda_0)| < d
We do not cover the learning algorithms in this course.
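The outer loop of this procedure might look like the sketch below. The re-estimation step itself (e.g. Baum-Welch) is not covered in these slides, so `reestimate` is a hypothetical placeholder that returns an updated model and its log-likelihood.

```python
import numpy as np

def train(O, lambda0, reestimate, d=1e-4, max_iter=100):
    """Generic loop of the learning procedure described above.

    `reestimate` is a placeholder for the update step (e.g. Baum-Welch); it must
    return the new model (A, B, pi) and log P(O | lambda) under that model.
    """
    model, old_ll = lambda0, -np.inf
    for _ in range(max_iter):
        model, ll = reestimate(model, O)
        if abs(ll - old_ll) < d:            # stop when |log P(O|lambda) - log P(O|lambda_0)| < d
            break
        old_ll = ll
    return model
```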
Word Recognition Example
Typed word recognition; assume all characters are separated.
The character recognizer outputs the probability of the image being a particular character, P(image | character). For example, the figure shows recognizer scores for one character image: a: 0.5, b: 0.03, c: 0.005, ..., z: 0.31.
(Figure: the character is the hidden state; the character image is the observation.)
Word Recognition Example
Hidden states of the HMM = characters.
Observations = typed images of characters segmented from the image. Note that there are infinitely many possible observations.
Observation probabilities = character recognizer scores.
Transition probabilities will be defined differently in the two subsequent models.
Word Recognition Example
If a lexicon is given, we can construct a separate HMM model for each lexicon word, e.g.:
Amherst: a → m → h → e → r → s → t
Buffalo: b → u → f → f → a → l → o
Here recognition of the word image is equivalent to the problem of evaluating a few HMM models. This is an application of the Evaluation problem (a small scoring sketch follows).
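A small sketch of this use of the Evaluation problem: score the observation sequence under each word model with the forward algorithm and pick the best-scoring word. The `word_models` dictionary and the integer coding of observations are hypothetical placeholders, not part of the slides.

```python
import numpy as np

def log_score(A, B, pi, O):
    # Forward algorithm (as on the evaluation slide), returning log P(O | word model).
    alpha = pi * B[:, O[0]]
    for o in O[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum())

# word_models maps each lexicon word to its HMM (A, B, pi); O is the observed sequence
# of character-recognizer outputs. Both are hypothetical placeholders here.
# best_word = max(word_models, key=lambda w: log_score(*word_models[w], O))
```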
Word Recognition Example
We can construct a single HMM for all words:
- Hidden states = all characters in the alphabet.
- Transition probabilities and initial probabilities are calculated from a language model.
- Observations and observation probabilities are as before.
(Figure: a graph over character states such as a, f, m, o, r, t, b, h, e, s, v with language-model transitions.)
Here we have to determine the best sequence of hidden states, the one that most likely produced the word image (an application of the Decoding problem).
Acknowledgement
Thanks to Jafar Muhammadi for the preparation of these slides.
Further Reading
L. R. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257-286, 1989.
R. Dugad and U. B. Desai, "A tutorial on Hidden Markov models," Signal Processing and Artificial Neural Networks Laboratory, Dept. of Electrical Engineering, Indian Institute of Technology, Bombay, Technical Report No. SPANN-96.1, 1996.
Andrew W. Moore, HMM tutorial, www.autonlab.org/tutorials.