Hidden Markov Modelling

- Introduction
- Problem formulation
- Forward-Backward algorithm
- Viterbi search
- Baum-Welch parameter estimation
- Other considerations
  - Multiple observation sequences
  - Phone-based models for continuous speech recognition
  - Continuous density HMMs
  - Implementation issues

6.345 Automatic Speech Recognition HMM
Information Theoretic Approach to ASR

[Figure: speech generation (text generation → speech production) and speech recognition (acoustic processor → linguistic decoder), viewed as a noisy communication channel carrying the acoustic evidence A for the linguistic string W]

Recognition is achieved by maximizing the probability of the linguistic string, W, given the acoustic evidence, A, i.e., choose the linguistic sequence Ŵ such that

    P(Ŵ | A) = max_W P(W | A)
Information Theoretic Approach to ASR

From Bayes' rule:

    P(W | A) = P(A | W) P(W) / P(A)

Hidden Markov modelling (HMM) deals with the quantity P(A | W)

Change in notation:

    A → O
    W → λ
    P(A | W) → P(O | λ)
HMM: An Example

Consider 3 mugs (i = 0, 1, 2), each with a mixture of state stones labelled 1 and 2. The fractions for the i-th mug are a_i1 and a_i2, and a_i1 + a_i2 = 1.

Consider 2 urns, each with a mixture of black and white balls. The fractions for the i-th urn are b_i(B) and b_i(W); b_i(B) + b_i(W) = 1.

The parameter vector for this model is:

    λ = {a_01, a_02, a_11, a_12, a_21, a_22, b_1(B), b_1(W), b_2(B), b_2(W)}
HMM: An Example (cont'd)

Observation Sequence: O = {B, W, B, W, W, B}
State Sequence: Q = {q_1, q_2, q_3, q_4, q_5, q_6} (hidden)

Goal: Given the model λ and the observation sequence O, how can the underlying state sequence Q be determined?
Elements of a Discrete Hidden Markov Model

N: number of states in the model
    states, s = {s_1, s_2, ..., s_N}
    state at time t, q_t ∈ s
M: number of observation symbols (i.e., discrete observations)
    observation symbols, v = {v_1, v_2, ..., v_M}
    observation at time t, o_t ∈ v
A = {a_ij}: state transition probability distribution
    a_ij = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N
B = {b_j(k)}: observation symbol probability distribution in state j
    b_j(k) = P(v_k at t | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M
π = {π_i}: initial state distribution
    π_i = P(q_1 = s_i), 1 ≤ i ≤ N

Notationally, an HMM is typically written as: λ = {A, B, π}
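The elements above can be bundled into a small data structure. The following is a minimal Python sketch, not from the lecture: the helper name and the 2-state, 2-symbol numbers are illustrative assumptions, chosen in the spirit of the mugs-and-urns example.

```python
# Hypothetical sketch: hold lambda = (A, B, pi) as plain Python lists,
# checking that each distribution is properly stochastic (sums to one).

def make_hmm(A, B, pi, tol=1e-9):
    """Bundle HMM parameters after checking the stochastic constraints."""
    N = len(pi)
    assert len(A) == N and len(B) == N
    assert abs(sum(pi) - 1.0) < tol                  # initial distribution
    for row in A:                                    # each row a_i*
        assert len(row) == N and abs(sum(row) - 1.0) < tol
    for row in B:                                    # each row b_j(.)
        assert abs(sum(row) - 1.0) < tol
    return {"A": A, "B": B, "pi": pi}

# An illustrative 2-state, 2-symbol model (all numbers made up)
hmm = make_hmm(A=[[0.6, 0.4], [0.3, 0.7]],
               B=[[0.8, 0.2], [0.1, 0.9]],   # rows: b_j(symbol 0), b_j(symbol 1)
               pi=[0.5, 0.5])
```
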
HMM: An Example (cont'd)

For our simple example:

    π = {a_01, a_02},  A = [ a_11  a_12 ],  B = [ b_1(B)  b_1(W) ]
                           [ a_21  a_22 ]       [ b_2(B)  b_2(W) ]

[Figure: state diagrams of a 2-state and a 3-state model, with transition probabilities a_ij and emission distributions {b_i(B), b_i(W)}]
Generation of HMM Observations

1. Choose an initial state, q_1 = s_i, based on the initial state distribution, π
2. For t = 1 to T:
   - Choose o_t = v_k according to the symbol probability distribution in state s_i, b_i(k)
   - Transition to a new state q_{t+1} = s_j according to the state transition probability distribution for state s_i, a_ij
3. Increment t by 1, return to step 2 if t ≤ T; else, terminate

[Figure: state sequence q_1 → q_2 → ... → q_T (transitions a_0i, a_ij) emitting observations o_1, o_2, ..., o_T via b_i(k)]
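The generation procedure above can be sketched directly in Python. The model parameters below are illustrative toy numbers, not taken from the lecture.

```python
# Sketch of the HMM generation procedure: pick an initial state from pi,
# then repeatedly emit a symbol from b_q(.) and move according to a_q*.
import random

def generate(A, B, pi, T, rng):
    """Sample a state sequence and observation sequence of length T."""
    states, obs = [], []
    q = rng.choices(range(len(pi)), weights=pi)[0]        # step 1: initial state
    for _ in range(T):
        states.append(q)
        obs.append(rng.choices(range(len(B[q])), weights=B[q])[0])  # emit via b_q(k)
        q = rng.choices(range(len(A[q])), weights=A[q])[0]          # move via a_q*
    return states, obs

rng = random.Random(0)
states, obs = generate(A=[[0.6, 0.4], [0.3, 0.7]],
                       B=[[0.8, 0.2], [0.1, 0.9]],
                       pi=[1.0, 0.0],       # always start in state 0
                       T=6, rng=rng)
```
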
Representing the State Diagram by a Trellis

[Figure: trellis of states s_1, s_2, s_3 unrolled over times 0 through 4]

The dashed line represents a null transition, where no observation symbol is generated
Three Basic HMM Problems

1. Scoring: Given an observation sequence O = {o_1, o_2, ..., o_T} and a model λ = {A, B, π}, how do we compute P(O | λ), the probability of the observation sequence?
   ==> The Forward-Backward Algorithm
2. Matching: Given an observation sequence O = {o_1, o_2, ..., o_T}, how do we choose a state sequence Q = {q_1, q_2, ..., q_T} which is optimum in some sense?
   ==> The Viterbi Algorithm
3. Training: How do we adjust the model parameters λ = {A, B, π} to maximize P(O | λ)?
   ==> The Baum-Welch Re-estimation Procedures
Computation of P(O | λ)

    P(O | λ) = Σ_{all Q} P(O, Q | λ)
    P(O, Q | λ) = P(O | Q, λ) P(Q | λ)

Consider the fixed state sequence Q = q_1 q_2 ... q_T:

    P(O | Q, λ) = b_{q_1}(o_1) b_{q_2}(o_2) ... b_{q_T}(o_T)
    P(Q | λ) = π_{q_1} a_{q_1 q_2} a_{q_2 q_3} ... a_{q_{T-1} q_T}

Therefore:

    P(O | λ) = Σ_{q_1, q_2, ..., q_T} π_{q_1} b_{q_1}(o_1) a_{q_1 q_2} b_{q_2}(o_2) ... a_{q_{T-1} q_T} b_{q_T}(o_T)

Calculation requires on the order of 2T · N^T operations (there are N^T such sequences)
For N = 5, T = 100: ≈ 10^72 computations!
The Forward Algorithm

Let us define the forward variable, α_t(i), as the probability of the partial observation sequence up to time t and state s_i at time t, given the model, i.e.

    α_t(i) = P(o_1 o_2 ... o_t, q_t = s_i | λ)

It can easily be shown that:

    α_1(i) = π_i b_i(o_1), 1 ≤ i ≤ N

By induction:

    α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_ij ] b_j(o_{t+1}), 1 ≤ t ≤ T−1, 1 ≤ j ≤ N

and:

    P(O | λ) = Σ_{i=1}^{N} α_T(i)

Calculation is on the order of N²T. For N = 5, T = 100: ≈ 2500 computations, instead of 10^72
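The recursion above transcribes almost line for line into code. The following is a minimal sketch with made-up toy parameters (2 states, 2 symbols); the function name and numbers are assumptions, not from the lecture.

```python
# Direct transcription of the forward recursion:
#   initialization  alpha_1(i) = pi_i b_i(o_1)
#   induction       alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1})
#   termination     P(O|lambda) = sum_i alpha_T(i)

def forward(A, B, pi, obs):
    """Return the trellis alpha[t][i] = P(o_1..o_t, q_t = s_i | lambda)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]          # initialization
    for t in range(1, len(obs)):                                # induction
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    return alpha

A  = [[0.6, 0.4], [0.3, 0.7]]      # illustrative toy model
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 0]                 # observation symbols by index
alpha = forward(A, B, pi, obs)
prob = sum(alpha[-1])              # termination: P(O | lambda)
```

For a sequence this short, the result can be checked against the brute-force sum over all N^T state sequences from the previous slide.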
Forward Algorithm Illustration

[Figure: trellis slice from time t to t+1; α_{t+1}(j) is formed by summing α_t(i) a_ij over all predecessor states s_i (transitions a_1j, ..., a_jj, ..., a_Nj) and multiplying by b_j(o_{t+1})]
The Backward Algorithm

Similarly, let us define the backward variable, β_t(i), as the probability of the partial observation sequence from time t+1 to the end, given state s_i at time t and the model, i.e.

    β_t(i) = P(o_{t+1} o_{t+2} ... o_T | q_t = s_i, λ)

It can easily be shown that:

    β_T(i) = 1, 1 ≤ i ≤ N

By induction:

    β_t(i) = Σ_{j=1}^{N} a_ij b_j(o_{t+1}) β_{t+1}(j), t = T−1, T−2, ..., 1, 1 ≤ i ≤ N

and:

    P(O | λ) = Σ_{i=1}^{N} π_i b_i(o_1) β_1(i)
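The backward recursion can be sketched the same way as the forward one. Again, the toy parameters are illustrative assumptions; the check at the end uses the identity P(O | λ) = Σ_i π_i b_i(o_1) β_1(i) from the slide.

```python
# Backward recursion: beta_T(i) = 1, then
#   beta_t(i) = sum_j a_ij b_j(o_{t+1}) beta_{t+1}(j), run from t = T-1 down to 1.

def backward(A, B, obs):
    """Return beta[t][i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
    N = len(A)
    T = len(obs)
    beta = [[1.0] * N]                                  # beta_T(i) = 1
    for t in range(T - 2, -1, -1):                      # induction, backwards
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j]
                            for j in range(N)) for i in range(N)])
    return beta

A  = [[0.6, 0.4], [0.3, 0.7]]      # illustrative toy model
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 0]
beta = backward(A, B, obs)
prob = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```
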
Backward Procedure Illustration

[Figure: trellis slice from time t to t+1; β_t(i) is formed by summing a_ij b_j(o_{t+1}) β_{t+1}(j) over all successor states s_j (transitions a_i1, ..., a_ii, ..., a_iN)]
Finding Optimal State Sequences

One criterion chooses states, q_t, which are individually most likely
    This maximizes the expected number of correct states
Let us define γ_t(i) as the probability of being in state s_i at time t, given the observation sequence and the model, i.e.

    γ_t(i) = P(q_t = s_i | O, λ), with Σ_{i=1}^{N} γ_t(i) = 1, ∀t

Then the individually most likely state, q_t*, at time t is:

    q_t* = argmax_{1 ≤ i ≤ N} γ_t(i), 1 ≤ t ≤ T

Note that it can be shown that:

    γ_t(i) = α_t(i) β_t(i) / P(O | λ)
Finding Optimal State Sequences

The individual optimality criterion has the problem that the optimum state sequence may not obey state transition constraints
Another optimality criterion is to choose the state sequence which maximizes P(Q, O | λ); this can be found by the Viterbi algorithm
Let us define δ_t(i) as the highest probability along a single path, at time t, which accounts for the first t observations, i.e.

    δ_t(i) = max_{q_1, q_2, ..., q_{t−1}} P(q_1 q_2 ... q_{t−1}, q_t = s_i, o_1 o_2 ... o_t | λ)

By induction:

    δ_{t+1}(j) = [ max_i δ_t(i) a_ij ] b_j(o_{t+1})

To retrieve the state sequence, we must keep track of the state which gave the best path, at time t, to state s_i
We do this in a separate array ψ_t(i)
The Viterbi Algorithm

1. Initialization:
    δ_1(i) = π_i b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
2. Recursion:
    δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij] b_j(o_t), 2 ≤ t ≤ T, 1 ≤ j ≤ N
    ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t−1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
3. Termination:
    P* = max_{1 ≤ i ≤ N} δ_T(i)
    q_T* = argmax_{1 ≤ i ≤ N} δ_T(i)
4. Path (state-sequence) backtracking:
    q_t* = ψ_{t+1}(q_{t+1}*), t = T−1, T−2, ..., 1

Computation ≈ N²T
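The four steps above can be sketched as follows; the toy model is an illustrative assumption, not the numerical example from the next slide.

```python
# Viterbi sketch: delta holds the best-path score per state, psi the
# backpointers; backtracking recovers the single best state sequence.

def viterbi(A, B, pi, obs):
    """Return (best state sequence, P* = probability of that path)."""
    N, T = len(pi), len(obs)
    delta = [pi[i] * B[i][obs[0]] for i in range(N)]    # 1. initialization
    psi = [[0] * N]
    for t in range(1, T):                               # 2. recursion
        nd, bp = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: delta[i] * A[i][j])
            nd.append(delta[i_best] * A[i_best][j] * B[j][obs[t]])
            bp.append(i_best)
        delta = nd
        psi.append(bp)
    q = max(range(N), key=lambda i: delta[i])           # 3. termination
    p_star = delta[q]
    path = [q]
    for t in range(T - 1, 0, -1):                       # 4. backtracking
        q = psi[t][q]
        path.insert(0, q)
    return path, p_star

A  = [[0.6, 0.4], [0.3, 0.7]]
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 0]
path, p_star = viterbi(A, B, pi, obs)
```
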
The Viterbi Algorithm: An Example

[Figure: a 3-state HMM with its transition probabilities and per-state symbol probabilities P(a), P(b), together with the trellis for the observation sequence O = {a a b b}; each trellis node carries its δ score]
The Viterbi Algorithm: An Example (cont'd)

[Table: δ_t(i) scores and backpointers for states s_1, s_2, s_3 over the prefixes ∅, a, aa, aab, aabb of O = {a a b b}; at each node only the single best predecessor's score survives]
Matching Using the Forward-Backward Algorithm

[Table: the same trellis computed with the forward algorithm; each node's score α_t(i) sums the contributions of all predecessor states instead of keeping only the best one, so the forward scores are at least as large as the corresponding Viterbi scores]
Baum-Welch Re-estimation

Baum-Welch re-estimation uses EM to determine ML parameters
Define ξ_t(i, j) as the probability of being in state s_i at time t and state s_j at time t+1, given the model and observation sequence, i.e.

    ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ)

Then:

    ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / P(O | λ)

    γ_t(i) = Σ_{j=1}^{N} ξ_t(i, j)

Summing γ_t(i) and ξ_t(i, j) over time, we get:

    Σ_{t=1}^{T−1} γ_t(i) = expected number of transitions from s_i
    Σ_{t=1}^{T−1} ξ_t(i, j) = expected number of transitions from s_i to s_j
Baum-Welch Re-estimation Procedures

[Figure: trellis segment around times t and t+1 showing α_t(i), the transition a_ij with emission b_j(o_{t+1}), and β_{t+1}(j) — the three quantities combined in ξ_t(i, j)]
Baum-Welch Re-estimation Formulas

    π̄_i = expected number of times in state s_i at t = 1
        = γ_1(i)

    ā_ij = expected number of transitions from state s_i to s_j / expected number of transitions from state s_i
         = Σ_{t=1}^{T−1} ξ_t(i, j) / Σ_{t=1}^{T−1} γ_t(i)

    b̄_j(k) = expected number of times in state s_j with symbol v_k / expected number of times in state s_j
           = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)
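One full re-estimation pass, implementing the three formulas above for a single observation sequence, can be sketched as follows. The toy parameters are illustrative assumptions; the sketch recomputes the forward and backward trellises internally so it is self-contained.

```python
# One Baum-Welch iteration: compute alpha, beta, gamma, xi, then apply
# the re-estimation formulas for pi, a_ij, and b_j(k).

def reestimate(A, B, pi, obs):
    """Return re-estimated (A, B, pi) after one EM pass over obs."""
    N, T, M = len(pi), len(obs), len(B[0])
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]          # forward
    for t in range(1, T):
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                      for j in range(N)])
    beta = [[1.0] * N for _ in range(T)]                        # backward
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(N)) for i in range(N)]
    p = sum(alpha[-1])
    gamma = [[alpha[t][i] * beta[t][i] / p for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]                                        # pi_i = gamma_1(i)
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi

A  = [[0.6, 0.4], [0.3, 0.7]]
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 0]
new_A, new_B, new_pi = reestimate(A, B, pi, obs)
```

The EM guarantee on the next slide can be observed directly: the re-estimated model never decreases P(O | λ).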
Baum-Welch Re-estimation Formulas

If λ = (A, B, π) is the initial model, and λ̄ = (Ā, B̄, π̄) is the re-estimated model, then it can be proved that either:

1. The initial model, λ, defines a critical point of the likelihood function, in which case λ̄ = λ, or
2. Model λ̄ is more likely than λ in the sense that P(O | λ̄) > P(O | λ), i.e., we have found a new model λ̄ from which the observation sequence is more likely to have been produced.

Thus we can improve the probability of O being observed from the model if we iteratively use λ̄ in place of λ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood HMM.
Multiple Observation Sequences

Speech recognition typically uses left-to-right HMMs. These HMMs cannot be trained using a single observation sequence, because only a small number of observations are available to train each state. To obtain reliable estimates of model parameters, one must use multiple observation sequences. In this case, the re-estimation procedure needs to be modified.

Let us denote the set of K observation sequences as

    O = {O^(1), O^(2), ..., O^(K)}

where O^(k) = {o_1^(k), o_2^(k), ..., o_{T_k}^(k)} is the k-th observation sequence. Assuming that the observation sequences are mutually independent, we want to estimate the parameters so as to maximize

    P(O | λ) = Π_{k=1}^{K} P(O^(k) | λ) = Π_{k=1}^{K} P_k
Multiple Observation Sequences (cont'd)

Since the re-estimation formulas are based on frequencies of occurrence of various events, we can modify them by adding up the individual frequencies of occurrence for each sequence:

    ā_ij = Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} ξ_t^k(i, j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k−1} γ_t^k(i)
         = Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k−1} α_t^k(i) a_ij b_j(o_{t+1}^(k)) β_{t+1}^k(j) / Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k−1} α_t^k(i) β_t^k(i)

    b̄_j(l) = Σ_{k=1}^{K} Σ_{t: o_t^(k) = v_l} γ_t^k(j) / Σ_{k=1}^{K} Σ_{t=1}^{T_k} γ_t^k(j)
           = Σ_{k=1}^{K} (1/P_k) Σ_{t: o_t^(k) = v_l} α_t^k(j) β_t^k(j) / Σ_{k=1}^{K} (1/P_k) Σ_{t=1}^{T_k} α_t^k(j) β_t^k(j)
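The pooling above amounts to accumulating each sequence's numerator and denominator statistics, weighted by 1/P_k, before dividing once at the end. A sketch for the transition update only (toy parameters and sequences are illustrative assumptions):

```python
# Multi-sequence re-estimation of a_ij: per-sequence forward/backward
# statistics are accumulated (weighted by 1/P_k), then divided once.

def reestimate_A(A, B, pi, seqs):
    """Return the pooled re-estimate of the transition matrix A."""
    N = len(pi)
    num = [[0.0] * N for _ in range(N)]     # expected s_i -> s_j transitions
    den = [0.0] * N                         # expected transitions out of s_i
    for obs in seqs:
        T = len(obs)
        alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]      # forward
        for t in range(1, T):
            prev = alpha[-1]
            alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                          for j in range(N)])
        beta = [[1.0] * N for _ in range(T)]                    # backward
        for t in range(T - 2, -1, -1):
            beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                           for j in range(N)) for i in range(N)]
        pk = sum(alpha[-1])                                     # P_k
        for t in range(T - 1):
            for i in range(N):
                den[i] += alpha[t][i] * beta[t][i] / pk
                for j in range(N):
                    num[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                                  * beta[t + 1][j] / pk)
    return [[num[i][j] / den[i] for j in range(N)] for i in range(N)]

A  = [[0.6, 0.4], [0.3, 0.7]]
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
new_A = reestimate_A(A, B, pi, seqs=[[0, 1, 1, 0], [1, 0, 0], [0, 0, 1, 1]])
```

The b̄_j(l) update pools in exactly the same way, with the α_t^k(j) β_t^k(j) terms in place of the transition statistics.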
Phone-based HMMs

Word-based HMMs are appropriate for small vocabulary speech recognition. For large vocabulary ASR, sub-word-based (e.g., phone-based) models are more appropriate.
Phone-based HMMs (cont'd)

The phone models can have many states, and words are made up from a concatenation of phone models.
Continuous Density Hidden Markov Models

A continuous density HMM replaces the discrete observation probabilities, b_j(k), by a continuous PDF b_j(x)
A common practice is to represent b_j(x) as a mixture of Gaussians:

    b_j(x) = Σ_{k=1}^{M} c_jk N[x, µ_jk, Σ_jk], 1 ≤ j ≤ N

where c_jk is the mixture weight (c_jk ≥ 0, 1 ≤ j ≤ N, 1 ≤ k ≤ M, and Σ_{k=1}^{M} c_jk = 1, 1 ≤ j ≤ N), N is the normal density, and µ_jk and Σ_jk are the mean vector and covariance matrix associated with state j and mixture k.
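A minimal sketch of such a mixture density, restricted to diagonal covariances for simplicity (a common choice in practice, though the slide's formula allows full Σ_jk); all parameter values below are illustrative assumptions.

```python
# Gaussian-mixture observation density b_j(x) with diagonal covariances:
# b(x) = sum_k c_k * prod_d N(x_d; mu_{k,d}, var_{k,d})
import math

def gauss_diag(x, mu, var):
    """Multivariate normal with diagonal covariance, evaluated at x."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-0.5 * (xd - md) ** 2 / vd) / math.sqrt(2.0 * math.pi * vd)
    return p

def b(x, c, mus, vars_):
    """Mixture density: sum_k c_k N(x; mu_k, Sigma_k), Sigma_k diagonal."""
    return sum(ck * gauss_diag(x, mk, vk) for ck, mk, vk in zip(c, mus, vars_))

# Single 1-D standard-normal component: b([0]) should equal 1/sqrt(2*pi)
val = b([0.0], c=[1.0], mus=[[0.0]], vars_=[[1.0]])
```
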
Acoustic Modelling Variations

- Semi-continuous HMMs first compute a VQ codebook of size M
  - The VQ codebook is then modelled as a family of Gaussian PDFs
  - Each codeword is represented by a Gaussian PDF, and may be used together with others to model the acoustic vectors
  - From the CD-HMM viewpoint, this is equivalent to using the same set of M mixtures to model all the states; it is therefore often referred to as a Tied Mixture HMM
- All three methods have been used in many speech recognition tasks, with varying outcomes
- For large-vocabulary, continuous speech recognition with a sufficient amount (i.e., tens of hours) of training data, CD-HMM systems currently yield the best performance, but with a considerable increase in computation
Implementation Issues

- Scaling: to prevent underflow
- Segmental K-means training: to train observation probabilities by first performing a Viterbi alignment
- Initial estimates of λ: to provide robust models
- Pruning: to reduce search computation
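The scaling issue is worth illustrating: raw forward probabilities shrink geometrically with T and underflow for realistic utterance lengths. A standard remedy, sketched below with illustrative toy parameters, is to normalize α at every frame and accumulate the logs of the normalizers, so that log P(O | λ) is recovered exactly without underflow.

```python
# Scaled forward pass: at each frame, divide alpha by its sum c_t and
# accumulate log c_t; then log P(O|lambda) = sum_t log c_t.
import math

def forward_log(A, B, pi, obs):
    """Return log P(O | lambda) via the scaled forward recursion."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    logp = 0.0
    for t in range(1, len(obs) + 1):
        s = sum(alpha)                    # scale factor c_t
        logp += math.log(s)
        alpha = [a / s for a in alpha]    # renormalize to avoid underflow
        if t < len(obs):                  # propagate to the next frame
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                     for j in range(N)]
    return logp

A  = [[0.6, 0.4], [0.3, 0.7]]
B  = [[0.8, 0.2], [0.1, 0.9]]
pi = [0.5, 0.5]
logp = forward_log(A, B, pi, [0, 1] * 250)   # 500 frames: no underflow
```
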
References

- X. Huang, A. Acero, and H. Hon, Spoken Language Processing, Prentice-Hall, 2001.
- F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1997.
- L. Rabiner and B. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.