Hidden Markov Models
Lecture Notes, Speech Communication 2, SS 2004
Erhard Rank / Franz Pernkopf
Signal Processing and Speech Communication Laboratory, Graz University of Technology
Inffeldgasse 16c, A-8010 Graz, Austria
Tel.: +43 316 873 4436, E-Mail: rank@inw.tugraz.at
Word Recognition
Given: word dictionary W = {W_1, ..., W_L}; time series of features from an unknown word: X = {x_1, ..., x_N}
Wanted: the most probable spoken word W_l ∈ W
Target: minimization of the word error rate (WER)
Word Recognition/Dynamic Time Warping
(figure: alignment of an observed pattern to a reference pattern)
Word Recognition/Dynamic Time Warping
Pattern matching with dynamic time warping (DTW): dynamic programming over the grid of reference frames (length S) and observation frames (length N), complexity O(S·N).
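The DTW pattern match can be sketched as a small dynamic program. This is a minimal illustration, not code from the notes: it assumes 1-D features, an absolute-difference local cost, and a symmetric step pattern (match, insertion, deletion); the function name is my own.

```python
import numpy as np

def dtw_distance(ref, obs):
    """Accumulated DTW distance between a reference pattern of length S
    and an observed pattern of length N; O(S*N) time and space."""
    S, N = len(ref), len(obs)
    D = np.full((S + 1, N + 1), np.inf)   # accumulated-cost grid
    D[0, 0] = 0.0
    for i in range(1, S + 1):
        for j in range(1, N + 1):
            cost = abs(ref[i - 1] - obs[j - 1])   # local distance d(i, j)
            # best predecessor: diagonal (match), vertical, or horizontal step
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[S, N]

# A time-warped version of the same contour aligns with zero cost:
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))   # 0.0
```

The warping path itself could be recovered by backtracking through D, analogous to the Viterbi backtracking later in these notes.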
Word Recognition/Markov Models
Word recognition by maximizing the probability of the Markov model Θ_l of word W_l for the observed time series of feature vectors X:
l* = argmax_l P(Θ_l | X) = argmax_l P(X | Θ_l) P(Θ_l) / P(X)
Markov Model/Graz Weather
Transition probabilities from today's weather to tomorrow's weather:
                 Sunny   Rainy   Foggy   (tomorrow)
        Sunny     0.8     0.05    0.15
        Rainy     0.2     0.6     0.2
        Foggy     0.2     0.3     0.5
        (today)
(figure: state transition diagram with the states Sunny, Rainy, and Foggy, and arcs labeled by these transition probabilities)
Markov Model
A Markov model is specified by
- the set of states S = {s_1, s_2, ..., s_{N_s}}
and characterized by
- the prior probabilities π_i = P(q_1 = s_i): probabilities of s_i being the first state of a state sequence; collected in the vector π (the prior probabilities are often assumed equi-probable, π_i = 1/N_s)
- the transition probabilities a_{i,j} = P(q_{n+1} = s_j | q_n = s_i): the probability to go from state i to state j; collected in the matrix A.
The Markov model produces a state sequence Q = {q_1, ..., q_N}, q_n ∈ S, over time 1 ≤ n ≤ N.
Hidden Markov Model
Additionally, for a hidden Markov model we have
- Emission probabilities:
  - for continuous observations, e.g., x ∈ R^D: b_i(x) = p(x_n | q_n = s_i), the pdfs of the observation x_n at time n if the system is in state s_i; collected as a vector of functions B(x), often parametrized, e.g., by mixtures of Gaussians
  - for discrete observations, x ∈ {v_1, ..., v_K}: b_{i,k} = P(x_n = v_k | q_n = s_i), the probabilities of observing x_n = v_k if the system is in state s_i; collected in the matrix B
- Observation sequence: X = {x_1, x_2, ..., x_N}
The HMM parameters (for a fixed number of states N_s) thus are Θ = (A, B, π).
HMM/Graz Weather
The above weather model turns into a hidden Markov model if we cannot observe the weather directly. Suppose you were locked in a room for several days, and you can only observe whether a person is carrying an umbrella (v_1 = umbrella) or not (v_2 = no umbrella). Example emission probabilities could be:
        Weather   Probability of umbrella
        Sunny     b_{1,1} = 0.1
        Rainy     b_{2,1} = 0.8
        Foggy     b_{3,1} = 0.3
Since there are only two possible values for the discrete observation, the probabilities for no umbrella are b_{i,2} = 1 - b_{i,1}.
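For later use, the weather HMM can be written down directly. A small NumPy sketch (array layout and variable names are my choice; the numbers are the ones from the slides):

```python
import numpy as np

# States: 0 = Sunny, 1 = Rainy, 2 = Foggy; observations: 0 = umbrella, 1 = no umbrella.
pi = np.array([1/3, 1/3, 1/3])           # equi-probable priors
A = np.array([[0.80, 0.05, 0.15],        # a_ij = P(tomorrow = j | today = i)
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],                # b_ik = P(observation = k | state = i)
              [0.8, 0.2],
              [0.3, 0.7]])

# Sanity checks: each row is a probability distribution.
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

The same three arrays (π, A, B) are exactly the parameter set Θ = (A, B, π) defined on the previous slide.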
HMM/Discrete vs. Continuous Features
(figure: the same HMM with discrete features and emission probabilities b_{i,k} vs. continuous features and emission pdfs b_i(x))
Markov Model/Left-to-Right Models
(figure: state transition diagram of a left-to-right model, in which transitions lead only from each state to itself or to later states)
HMM/Trellis
Trellis: model description over time.
(figure: trellis with states 1, 2, 3 unrolled over time n = 1, 2, ..., N for the observation sequence x_1, x_2, ..., x_N; the arcs between consecutive time steps carry the transition probabilities a_{i,j}, and each node carries the emission probability b_{j,k})
HMM/Trellis Example
(figure: trellis for the weather HMM over n = 1, 2, 3 with observations x_1 = x_2 = x_3 = no umbrella, highlighting the path Sunny, Foggy, Sunny)
Joint likelihood for the observed sequence X and one state sequence (path) Q = {Sunny, Foggy, Sunny}:
P(X, Q | Θ) = π_Sunny b_Sunny,no-umb a_Sunny,Foggy b_Foggy,no-umb a_Foggy,Sunny b_Sunny,no-umb
            = 1/3 · 0.9 · 0.15 · 0.7 · 0.2 · 0.9 ≈ 0.0057
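The joint likelihood of one particular path is just the product of the probabilities along the trellis. A minimal sketch (function name my own) that reproduces the example above:

```python
import numpy as np

# Weather HMM (states: 0 = Sunny, 1 = Rainy, 2 = Foggy; obs: 0 = umbrella, 1 = no umbrella).
pi = np.array([1/3, 1/3, 1/3])
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.7]])

def path_likelihood(pi, A, B, states, obs):
    """P(X, Q | Theta) for a fixed state path Q and observation sequence X."""
    p = pi[states[0]] * B[states[0], obs[0]]
    for n in range(1, len(obs)):
        p *= A[states[n - 1], states[n]] * B[states[n], obs[n]]
    return p

# Path Sunny -> Foggy -> Sunny, all three observations "no umbrella":
p = path_likelihood(pi, A, B, states=[0, 2, 0], obs=[1, 1, 1])
# p = 1/3 * 0.9 * 0.15 * 0.7 * 0.2 * 0.9 = 0.00567
```

Note that this is the likelihood of one single path; the production probability P(X | Θ) of the next slides sums this quantity over all possible paths.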
HMM/Parameters
The parameters {π, A, B} are probabilities:
- positivity: π_i ≥ 0, a_{i,j} ≥ 0, b_{i,k} ≥ 0 (or b_i(x) ≥ 0)
- normalization conditions: Σ_{i=1}^{N_s} π_i = 1, Σ_{j=1}^{N_s} a_{i,j} = 1, Σ_{k=1}^{K} b_{i,k} = 1 (or ∫_X b_i(x) dx = 1)
Hidden Markov Models: 3 Problems
The three basic problems for HMMs:
1. Given an HMM with parameters Θ = (A, B, π), efficiently compute the production probability of an observation sequence X:
   P(X | Θ) = ?   (1)
2. Given the model Θ, what is the hidden state sequence Q* that best explains an observation sequence X?
   Q* = argmax_Q P(X, Q | Θ) = ?   (2)
3. How do we adjust the model parameters to maximize P(X | Θ)?
   Θ̂ = (Â, B̂, π̂) = ?,  P(X | Θ̂) = max_Θ P(X | Θ)   (3)
HMM/Production Probability
Problem 1: Production probability
Given: HMM parameters Θ; observed sequence X (length N)
Wanted: probability P(X | Θ) of X being produced by Θ
Probability of a certain state sequence:
P(Q | Θ) = P(q_1, ..., q_N | Θ) = π_{q_1} ∏_{n=2}^{N} a_{q_{n-1}, q_n}
Emission probabilities for the state sequence:
P(X | Q, Θ) = P(x_1, ..., x_N | q_1, ..., q_N, Θ) = ∏_{n=1}^{N} b_{q_n, x_n}
Joint probability of hidden state sequence and observation sequence:
P(X, Q | Θ) = P(X | Q, Θ) P(Q | Θ) = π_{q_1} b_{q_1, x_1} ∏_{n=2}^{N} a_{q_{n-1}, q_n} b_{q_n, x_n}
HMM/Production Probability
Production probability:
P(X | Θ) = Σ_{Q ∈ Q_N} P(X, Q | Θ) = Σ_{Q ∈ Q_N} ( π_{q_1} b_{q_1, x_1} ∏_{n=2}^{N} a_{q_{n-1}, q_n} b_{q_n, x_n} )
Direct evaluation has exponential complexity, about O(2N · N_s^N) operations. Instead, use a recursive algorithm with complexity O(N_s^2 · N), linear in the sequence length N:
HMM/Production Probability
Forward algorithm
Computation of the forward probabilities α_n(j) = P(x_1, ..., x_n, q_n = s_j | Θ):
Initialization: for all j = 1 ... N_s:
α_1(j) = π_j b_{j, x_1}
Recursion: for all n > 1 and all j = 1 ... N_s:
α_n(j) = ( Σ_{i=1}^{N_s} α_{n-1}(i) a_{i,j} ) b_{j, x_n}
Termination:
P(X | Θ) = Σ_{j=1}^{N_s} α_N(j)
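A sketch of the forward algorithm for the weather HMM, with a brute-force sum over all N_s^N paths as a cross-check (function and variable names are my own, not from the notes):

```python
import numpy as np
from itertools import product

pi = np.array([1/3, 1/3, 1/3])          # weather HMM from the slides
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],               # columns: 0 = umbrella, 1 = no umbrella
              [0.8, 0.2],
              [0.3, 0.7]])

def forward(pi, A, B, obs):
    """Forward probabilities alpha[n, j] = P(x_1..x_n, q_n = s_j | Theta)."""
    N, Ns = len(obs), len(pi)
    alpha = np.zeros((N, Ns))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for n in range(1, N):
        alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]   # recursion
    return alpha

obs = [1, 0, 0]                                        # no umbrella, umbrella, umbrella
P_forward = forward(pi, A, B, obs)[-1].sum()           # termination

# Brute force: sum P(X, Q | Theta) over all 3^3 = 27 paths.
P_brute = sum(
    pi[q[0]] * B[q[0], obs[0]]
    * np.prod([A[q[n - 1], q[n]] * B[q[n], obs[n]] for n in range(1, len(obs))])
    for q in product(range(3), repeat=len(obs))
)
assert np.isclose(P_forward, P_brute)
```

For this short sequence the brute-force sum is still feasible; for realistic N only the recursive computation is.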
HMM/Production Probability
Backward algorithm
Computation of the backward probabilities β_n(i) = P(x_{n+1}, ..., x_N | q_n = s_i, Θ):
Initialization: for all i = 1 ... N_s:
β_N(i) = 1
Recursion: for all n < N and all i = 1 ... N_s:
β_n(i) = Σ_{j=1}^{N_s} a_{i,j} b_{j, x_{n+1}} β_{n+1}(j)
Termination:
P(X | Θ) = Σ_{j=1}^{N_s} π_j b_{j, x_1} β_1(j)
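The backward algorithm can be sketched the same way; its termination formula must yield the same P(X | Θ) as the forward pass. A self-contained sketch for the weather HMM (names my own):

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])          # weather HMM from the slides
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],               # columns: 0 = umbrella, 1 = no umbrella
              [0.8, 0.2],
              [0.3, 0.7]])

def backward(A, B, obs):
    """Backward probabilities beta[n, i] = P(x_{n+1}..x_N | q_n = s_i, Theta)."""
    N = len(obs)
    Ns = A.shape[0]
    beta = np.ones((N, Ns))                             # initialization: beta_N = 1
    for n in range(N - 2, -1, -1):
        beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])  # recursion
    return beta

obs = [1, 0, 0]                                         # no umbrella, umbrella, umbrella
beta = backward(A, B, obs)
P_backward = (pi * B[:, obs[0]] * beta[0]).sum()        # termination
```

P_backward agrees with the value delivered by the forward algorithm for the same observation sequence.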
HMM/Production Probability
(figure: one recursion step of the forward algorithm, combining α_{n-1}(i) via the transitions a_{i,j} and the emission b_{j,x_n} into α_n(j); and one recursion step of the backward algorithm, combining β_{n+1}(j) via a_{i,j} and b_{j,x_{n+1}} into β_n(i))
At each time n,
α_n(j) β_n(j) = P(X, q_n = s_j | Θ)
is the joint probability of the observation sequence X and all state sequences (paths) passing through state s_j at time n, and
P(X | Θ) = Σ_{j=1}^{N_s} α_n(j) β_n(j)   for any n.
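The identity above is easy to verify numerically: Σ_j α_n(j) β_n(j) gives the same P(X | Θ) at every time step n. A self-contained check for the weather HMM (names my own):

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])          # weather HMM from the slides
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.7]])
obs = [1, 0, 0]                          # no umbrella, umbrella, umbrella
N, Ns = len(obs), len(pi)

# Forward pass.
alpha = np.zeros((N, Ns))
alpha[0] = pi * B[:, obs[0]]
for n in range(1, N):
    alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]

# Backward pass.
beta = np.ones((N, Ns))
for n in range(N - 2, -1, -1):
    beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])

# sum_j alpha_n(j) * beta_n(j) equals P(X | Theta) independently of n.
P = alpha[-1].sum()
for n in range(N):
    assert np.isclose((alpha[n] * beta[n]).sum(), P)
```

This property is the basis of the γ and ξ quantities used for parameter estimation later in the notes.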
HMM/State Sequence
Problem 2: Hidden state sequence
Given: HMM parameters Θ; observed sequence X (length N)
Wanted: a posteriori most probable state sequence Q*
Viterbi algorithm
A posteriori probabilities:
P(Q | X, Θ) = P(X, Q | Θ) / P(X | Θ)
Q* is the optimal state sequence if
P(X, Q* | Θ) = max_{Q ∈ Q_N} P(X, Q | Θ) =: P*(X | Θ)
The Viterbi algorithm computes
δ_n(j) = max_{Q ∈ Q_n} P(x_1, ..., x_n, q_1, ..., q_n | Θ) over partial paths ending in q_n = s_j
HMM/Viterbi Algorithm
Computation of the optimal state sequence
Initialization: for all j = 1 ... N_s:
δ_1(j) = π_j b_{j, x_1},  ψ_1(j) = 0
Recursion: for n > 1 and all j = 1 ... N_s:
δ_n(j) = max_i (δ_{n-1}(i) a_{i,j}) b_{j, x_n},
ψ_n(j) = argmax_i (δ_{n-1}(i) a_{i,j})
Termination:
P*(X | Θ) = max_j (δ_N(j)),  q_N* = argmax_j (δ_N(j))
Backtracking of the optimal state sequence:
q_n* = ψ_{n+1}(q_{n+1}*),  n = N-1, N-2, ..., 1
HMM/Viterbi Algorithm
The values δ_n(j) and ψ_n(j) can be collected in N_s × N matrices with one column per time step:
δ = (δ_n(j)),  ψ = (ψ_n(j))
Example: for our weather HMM Θ, find the most probable hidden weather sequence for the observation sequence
X = {x_1 = no umbrella, x_2 = umbrella, x_3 = umbrella}
1. Initialization (n = 1):
δ_1(Sunny) = π_Sunny b_Sunny,no-umb = 1/3 · 0.9 = 0.3,  ψ_1(Sunny) = 0
δ_1(Rainy) = π_Rainy b_Rainy,no-umb = 1/3 · 0.2 ≈ 0.0667,  ψ_1(Rainy) = 0
δ_1(Foggy) = π_Foggy b_Foggy,no-umb = 1/3 · 0.7 ≈ 0.233,  ψ_1(Foggy) = 0
HMM/Viterbi Algorithm
2. Recursion (n = 2): we calculate the likelihood of reaching state Sunny from each of the 3 possible predecessor states and choose the most likely one to go on with:
δ_2(Sunny) = max(δ_1(Sunny) a_Sunny,Sunny, δ_1(Rainy) a_Rainy,Sunny, δ_1(Foggy) a_Foggy,Sunny) b_Sunny,umb
           = max(0.3 · 0.8, 0.0667 · 0.2, 0.233 · 0.2) · 0.1 = 0.024,  ψ_2(Sunny) = Sunny
The likelihood is stored in δ_2, the most likely predecessor in ψ_2. The same procedure is executed for the states Rainy and Foggy:
δ_2(Rainy) = max(0.3 · 0.05, 0.0667 · 0.6, 0.233 · 0.3) · 0.8 = 0.056,  ψ_2(Rainy) = Foggy
δ_2(Foggy) = max(0.3 · 0.15, 0.0667 · 0.2, 0.233 · 0.5) · 0.3 = 0.035,  ψ_2(Foggy) = Foggy
HMM/Viterbi Algorithm
(figure: trellis for the recursion step n = 2, showing the three candidate transitions into state Sunny starting from δ_1(Sunny) = 0.3, for the observation sequence x_1 = no umbrella, x_2 = umbrella, x_3 = umbrella)
HMM/Viterbi Algorithm
Recursion (n = 3):
δ_3(Sunny) = max(δ_2(Sunny) a_Sunny,Sunny, δ_2(Rainy) a_Rainy,Sunny, δ_2(Foggy) a_Foggy,Sunny) b_Sunny,umb
           = max(0.024 · 0.8, 0.056 · 0.2, 0.035 · 0.2) · 0.1 ≈ 0.0019,  ψ_3(Sunny) = Sunny
δ_3(Rainy) = max(0.024 · 0.05, 0.056 · 0.6, 0.035 · 0.3) · 0.8 ≈ 0.0269,  ψ_3(Rainy) = Rainy
δ_3(Foggy) = max(0.024 · 0.15, 0.056 · 0.2, 0.035 · 0.5) · 0.3 ≈ 0.0052,  ψ_3(Foggy) = Foggy
HMM/Viterbi Algorithm
(figure: trellis after the recursion step n = 3, with δ_3(Sunny) ≈ 0.0019, δ_3(Rainy) ≈ 0.0269, δ_3(Foggy) ≈ 0.0052)
HMM/Viterbi Algorithm
3. Termination: the globally most likely path is determined, starting by looking for the last state of the most likely sequence:
P*(X | Θ) = max_j (δ_3(j)) = δ_3(Rainy) ≈ 0.0269,  q_3* = argmax_j (δ_3(j)) = Rainy
4. Backtracking: the best sequence of states can be read from the ψ vectors:
n = N - 1 = 2:  q_2* = ψ_3(q_3*) = ψ_3(Rainy) = Rainy
n = N - 2 = 1:  q_1* = ψ_2(q_2*) = ψ_2(Rainy) = Foggy
The most likely weather sequence is Q* = {q_1*, q_2*, q_3*} = {Foggy, Rainy, Rainy}.
HMM/Viterbi Algorithm
(figure: backtracking through the trellis along the ψ pointers, from δ_3(Rainy) ≈ 0.0269 back through Rainy at n = 2 to Foggy at n = 1)
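The whole worked example can be reproduced in a few lines of code. A Viterbi sketch for the weather HMM (array layout and names are my own):

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])          # 0 = Sunny, 1 = Rainy, 2 = Foggy
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],               # columns: 0 = umbrella, 1 = no umbrella
              [0.8, 0.2],
              [0.3, 0.7]])

def viterbi(pi, A, B, obs):
    """Most probable state path q* and its likelihood P*(X | Theta)."""
    N, Ns = len(obs), len(pi)
    delta = np.zeros((N, Ns))
    psi = np.zeros((N, Ns), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # initialization
    for n in range(1, N):
        trans = delta[n - 1][:, None] * A           # delta_{n-1}(i) * a_ij
        psi[n] = trans.argmax(axis=0)               # best predecessor per state
        delta[n] = trans.max(axis=0) * B[:, obs[n]]
    q = np.zeros(N, dtype=int)
    q[-1] = delta[-1].argmax()                      # termination
    for n in range(N - 2, -1, -1):                  # backtracking via psi
        q[n] = psi[n + 1, q[n + 1]]
    return q, delta[-1].max()

q, p = viterbi(pi, A, B, obs=[1, 0, 0])             # no umbrella, umbrella, umbrella
print(q, p)   # [2 1 1], i.e. Foggy, Rainy, Rainy, with P* ~= 0.0269
```

In practice the products of many small probabilities underflow, so implementations usually run this recursion with log probabilities (sums of logs instead of products).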
HMM/Parameter Estimation
Problem 3: Parameter estimation for HMMs
Given: HMM structure (N_s states, K observation symbols); training sequence X = {x_1, ..., x_N}
Wanted: optimal parameter values Θ̂ = {π̂, Â, B̂} with
P(X | Θ̂) = max_Θ P(X | Θ) = max_Θ Σ_{Q ∈ Q_N} P(X, Q | Θ)
Baum-Welch algorithm, an instance of the EM (expectation-maximization) algorithm: iterative optimization of the parameters Θ → Θ̂.
In the terminology of the EM algorithm we have
observable variables: the observation sequence X
hidden (latent) variables: the state sequence Q
HMM/Parameter Estimation
Transition probability for s_i → s_j at time n (for given Θ):
ξ_n(i, j) := P(q_n = s_i, q_{n+1} = s_j | X, Θ) = α_n(i) a_{i,j} b_{j, x_{n+1}} β_{n+1}(j) / P(X | Θ)
(figure: trellis segment with α_n(i) at state s_i, the transition a_{i,j} with emission b_{j, x_{n+1}}, and β_{n+1}(j) at state s_j)
State probability for s_i at time n (for given Θ):
γ_n(i) := P(q_n = s_i | X, Θ) = α_n(i) β_n(i) / P(X | Θ)
with
P(X | Θ) = Σ_{i=1}^{N_s} α_n(i) β_n(i)  and  γ_n(i) = Σ_{j=1}^{N_s} ξ_n(i, j)
(cf. forward/backward algorithm)
HMM/Parameter Estimation
Summing over time n gives expected numbers # (frequencies):
Σ_{n=1}^{N} γ_n(i) ... expected # of times in state s_i
Σ_{n=1}^{N-1} ξ_n(i, j) ... expected # of transitions from state s_i to state s_j
Baum-Welch update of the HMM parameters:
π̄_i = γ_1(i) ... probability of state s_i at time n = 1
ā_{i,j} = Σ_{n=1}^{N-1} ξ_n(i, j) / Σ_{n=1}^{N-1} γ_n(i) ... # of transitions from state s_i to state s_j / # of transitions out of state s_i
b̄_{j,k} = Σ_{n=1}^{N} γ_n(j) [x_n = v_k] / Σ_{n=1}^{N} γ_n(j) ... # of times in state s_j with v_k emitted / # of times in state s_j
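One Baum-Welch iteration for the weather HMM can be sketched directly from these formulas. This assumes a single training sequence and is illustration only (names my own); a real implementation iterates until P(X | Θ) converges:

```python
import numpy as np

pi = np.array([1/3, 1/3, 1/3])          # weather HMM as the starting point
A = np.array([[0.80, 0.05, 0.15],
              [0.20, 0.60, 0.20],
              [0.20, 0.30, 0.50]])
B = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.7]])
obs = np.array([1, 0, 0])                # training sequence (observation indices)
N, Ns = len(obs), len(pi)

# Forward and backward passes.
alpha = np.zeros((N, Ns))
alpha[0] = pi * B[:, obs[0]]
for n in range(1, N):
    alpha[n] = (alpha[n - 1] @ A) * B[:, obs[n]]
beta = np.ones((N, Ns))
for n in range(N - 2, -1, -1):
    beta[n] = A @ (B[:, obs[n + 1]] * beta[n + 1])
P = alpha[-1].sum()

# gamma_n(i) and xi_n(i, j).
gamma = alpha * beta / P
xi = np.zeros((N - 1, Ns, Ns))
for n in range(N - 1):
    xi[n] = alpha[n][:, None] * A * (B[:, obs[n + 1]] * beta[n + 1])[None, :] / P

# Baum-Welch updates.
pi_new = gamma[0]
A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
B_new /= gamma.sum(axis=0)[:, None]

# The re-estimated parameters are again proper distributions.
assert np.isclose(pi_new.sum(), 1.0)
assert np.allclose(A_new.sum(axis=1), 1.0)
assert np.allclose(B_new.sum(axis=1), 1.0)
```

The update step never decreases P(X | Θ), which is the EM convergence guarantee exploited by the Baum-Welch algorithm.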
HMM/Parameter Estimation
Gaussian (normally distributed) emission probabilities:
b_j(x) = N(x | μ_j, Σ_j)
Mixtures of Gaussians:
b_j(x) = Σ_{k=1}^{K} c_{jk} N(x | μ_{jk}, Σ_{jk}),  Σ_{k=1}^{K} c_{jk} = 1
Semi-continuous emission probabilities (all states share one set of Gaussians):
b_j(x) = Σ_{k=1}^{K} c_{jk} N(x | μ_k, Σ_k),  Σ_{k=1}^{K} c_{jk} = 1
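A mixture-of-Gaussians emission density can be evaluated directly from this formula. A minimal sketch (function name my own; full covariance matrices assumed):

```python
import numpy as np

def gmm_pdf(x, weights, means, covs):
    """b_j(x) = sum_k c_jk * N(x | mu_jk, Sigma_jk) for one state j."""
    total = 0.0
    for c, mu, Sigma in zip(weights, means, covs):
        d = len(mu)
        diff = x - mu
        # Gaussian normalization constant and quadratic form.
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
        total += c * norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))
    return total

# Single standard-normal component in 1-D: density at the mean is 1/sqrt(2*pi).
p = gmm_pdf(np.zeros(1), [1.0], [np.zeros(1)], [np.eye(1)])
```

For the semi-continuous case, all states would call this with the same shared means and covariances and only state-specific weights c_{jk}.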
HMM/Parameter Estimation
Problems encountered in HMM parameter estimation: many word models, HMM states, and parameters, but never enough training data!
Consequences:
- large variance of the estimated parameters
- large variance of the objective function P(X | Θ)
- vanishing statistics: zero-valued parameters â_{i,j}, b̂_{j,k}, Σ̂_k, Σ̂_{jk}, ...
Possible remedies (besides using more training data):
- fixing some parameter values
- tying parameter values of similar models
- interpolation of sensitive parameters by robust parameters
- smoothing of probability density functions
- defining limits for sensitive density parameters
HMM/Parameter Estimation
Parameter tying:
- simultaneous identification of parameters for similar models
- forces identical parameter values
- reduces the dimension of the parameter space
Example (state tying): automatic determination of states that can be tied, e.g., by mutual information.
(figure: state-tying example)
HMM/Parameter Estimation
Parameter interpolation: instead of fixed tying of states, interpolate the parameters of similar models:
P(X | Θ_R, Θ_S, r_R, r_S) = r_R P(X | Θ_R) + r_S P(X | Θ_S),  r_R + r_S = 1
- especially suited for semi-continuous emission pdfs
- the weights r_R, r_S can be chosen heuristically or included in the Baum-Welch algorithm
References
R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley & Sons, Inc., 1973.
S. Bengio, An Introduction to Statistical Machine Learning: EM for GMMs. Dalle Molle Institute for Perceptual Artificial Intelligence.
E. G. Schukat-Talamazzini, Automatische Spracherkennung. Vieweg-Verlag, 1995.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.