Automatic Speech Recognition (CS753)
Lecture 6: Hidden Markov Models (Part II)
Instructor: Preethi Jyothi
Aug 10, 2017
Recall: Computing Likelihood

Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.

Computing Likelihood: Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O | λ). Use the Forward Algorithm.
Recall: Decoding the best state sequence

Decoding: Given as input an HMM λ = (A, B) and a sequence of observations O = o_1, o_2, ..., o_T, find the most probable sequence of states Q = q_1 q_2 q_3 ... q_T. Use the Viterbi Algorithm.
Learning HMM Parameters

Learning: Given an observation sequence O and the set of possible states in the HMM, learn the HMM parameters A and B.

Standard algorithm for HMM training: the forward-backward or Baum-Welch algorithm. Before moving on to Baum-Welch, what is the Expectation Maximization (EM) algorithm?
EM Algorithm: Fitting Parameters to Data

Parameter θ determines Pr(x, z; θ), where x is observed and z is hidden.
Observed data: i.i.d. samples x_i, i = 1, ..., N.
Goal: find argmax_θ L(θ), where
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta)
Initial parameters: θ^0. Iteratively compute θ^ℓ as follows:
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1}) \log \Pr(x_i, z; \theta)
\theta^{\ell} = \arg\max_{\theta} Q(\theta, \theta^{\ell-1})
The estimate θ^ℓ cannot get worse over iterations because, for all θ:
L(\theta) - L(\theta^{\ell-1}) \geq Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1})
EM is guaranteed to converge to a local optimum [Wu83].
Coin example to illustrate EM

Coin 1: ρ1 = Pr(H)    Coin 2: ρ2 = Pr(H)    Coin 3: ρ3 = Pr(H)

Repeat:
  Toss Coin 1 privately
  if it shows H: toss Coin 2 twice
  else: toss Coin 3 twice

The following sequence is observed: HH, TT, HH, TT, HH. How do you estimate ρ1, ρ2 and ρ3?
Coin example to illustrate EM

Recall, for partially observed data, the likelihood is given by:
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)
where, for the coin example, each observation x_i ∈ X = {HH, HT, TH, TT} and the hidden variable z ∈ Z = {H, T}.
Coin example to illustrate EM

Recall, for partially observed data, the likelihood is given by:
L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)
where Pr(x, z; θ) = Pr(x | z; θ) Pr(z; θ), with
\Pr(z; \theta) = \begin{cases} \rho_1 & \text{if } z = \text{H} \\ 1 - \rho_1 & \text{if } z = \text{T} \end{cases}
\qquad
\Pr(x \mid z; \theta) = \begin{cases} \rho_2^{h} (1 - \rho_2)^{t} & \text{if } z = \text{H} \\ \rho_3^{h} (1 - \rho_3)^{t} & \text{if } z = \text{T} \end{cases}
(ρ1, ρ2, ρ3 are the head probabilities of Coins 1, 2, 3; h: number of heads in x, t: number of tails in x)
Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ1, ρ2, ρ3).

[EM Iteration, E-step] Compute the quantities involved in
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
where γ(z, x) = Pr(z | x; θ^{ℓ-1}), i.e., compute γ(z, x_i) for all z and all i.

Suppose θ^{ℓ-1} is ρ1 = 0.3, ρ2 = 0.4, ρ3 = 0.6:
What is γ(H, HH)?  = 0.16
What is γ(H, TT)?  ≈ 0.49
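To make the E-step concrete, here is a minimal Python sketch (not from the lecture) that computes the posterior γ(H, x) = Pr(z = H | x; θ^{ℓ-1}) directly from the joint probabilities defined above; the function name `gamma` and the string encoding of observations are illustrative choices.

```python
def gamma(x, rho1, rho2, rho3):
    """Posterior Pr(z = H | x) for an observed pair of tosses x, e.g. "HH"."""
    h, t = x.count("H"), x.count("T")
    joint_H = rho1 * rho2**h * (1 - rho2)**t        # Pr(x, z = H)
    joint_T = (1 - rho1) * rho3**h * (1 - rho3)**t  # Pr(x, z = T)
    return joint_H / (joint_H + joint_T)

print(gamma("HH", 0.3, 0.4, 0.6))  # 0.16
print(gamma("TT", 0.3, 0.4, 0.6))  # ~0.49
```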
Coin example to illustrate EM

Our observed data is: {HH, TT, HH, TT, HH}. Let's use EM to estimate θ = (ρ1, ρ2, ρ3).

[EM Iteration, M-step] Find the θ which maximises
Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)
\rho_1 = \frac{\sum_{i=1}^{N} \gamma(\text{H}, x_i)}{N}
\qquad
\rho_2 = \frac{\sum_{i=1}^{N} \gamma(\text{H}, x_i) \, h_i}{\sum_{i=1}^{N} \gamma(\text{H}, x_i)(h_i + t_i)}
\qquad
\rho_3 = \frac{\sum_{i=1}^{N} \gamma(\text{T}, x_i) \, h_i}{\sum_{i=1}^{N} \gamma(\text{T}, x_i)(h_i + t_i)}
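Putting the two steps together, a full EM iteration for the three-coin model is a short loop. The following sketch is not from the lecture: the posterior `gamma()` is repeated so the snippet runs on its own, the starting point is the θ^{ℓ-1} assumed on the E-step slide, and the iteration count is arbitrary.

```python
def gamma(x, rho1, rho2, rho3):
    """Posterior Pr(z = H | x) under the current parameters."""
    h, t = x.count("H"), x.count("T")
    joint_H = rho1 * rho2**h * (1 - rho2)**t
    joint_T = (1 - rho1) * rho3**h * (1 - rho3)**t
    return joint_H / (joint_H + joint_T)

def em_step(data, rho1, rho2, rho3):
    """One E-step (posteriors) followed by one M-step (re-estimation)."""
    g = [gamma(x, rho1, rho2, rho3) for x in data]   # gamma(H, x_i); gamma(T, x_i) = 1 - g_i
    h = [x.count("H") for x in data]
    t = [x.count("T") for x in data]
    rho1_new = sum(g) / len(data)
    rho2_new = (sum(gi * hi for gi, hi in zip(g, h))
                / sum(gi * (hi + ti) for gi, hi, ti in zip(g, h, t)))
    rho3_new = (sum((1 - gi) * hi for gi, hi in zip(g, h))
                / sum((1 - gi) * (hi + ti) for gi, hi, ti in zip(g, h, t)))
    return rho1_new, rho2_new, rho3_new

data = ["HH", "TT", "HH", "TT", "HH"]
theta = (0.3, 0.4, 0.6)          # initial parameters from the E-step slide
for _ in range(20):
    theta = em_step(data, *theta)
print(theta)                      # converges to a local optimum of L(theta)
```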
Coin example to illustrate EM

This was a very simple HMM (with observations from 2 states).

[State diagram: an initial ε-transition chooses state H with probability ρ1 or state T with probability 1−ρ1; state H emits H with probability ρ2 and T with probability 1−ρ2; state T emits H with probability ρ3 and T with probability 1−ρ3; the state remains the same after the first transition.]

γ estimated the distribution of this state. More generally, we will need the distribution of the state at each time step.

EM for general HMMs: the Baum-Welch algorithm (1972) predates the general formulation of EM (1977).
Baum-Welch Algorithm as EM

Observed data: N sequences, x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B.

[EM Iteration, E-step] Compute the quantities involved in Q(θ, θ^{ℓ-1}):
γ_{i,t}(j) = Pr(z_t = j | x_i; θ^{ℓ-1})
ξ_{i,t}(j, k) = Pr(z_{t-1} = j, z_t = k | x_i; θ^{ℓ-1})
Baum-Welch Algorithm as EM

Observed data: N sequences, x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation probabilities B.

[EM Iteration, M-step] Find the θ which maximises Q(θ, θ^{ℓ-1}):
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}
\qquad
B_{j,v} = \frac{\sum_{i=1}^{N} \sum_{t : x_{it} = v} \gamma_{i,t}(j)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
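As an illustration (not from the lecture), these M-step updates for a discrete-output HMM could be written as below, assuming each x_{it} is a symbol index and that the E-step posteriors gamma[i][t][j] = Pr(z_t = j | x_i) and xi[i][t][j][k] = Pr(z_{t-1} = j, z_t = k | x_i) (defined for t ≥ 1 in 0-based indexing) are already available as NumPy arrays; their efficient computation via the forward-backward recursions is covered next class. All names and array layouts are illustrative.

```python
import numpy as np

def m_step_discrete(sequences, gamma, xi, num_states, num_symbols):
    """sequences[i][t] is the symbol index observed at time t of sequence i."""
    A_num = np.zeros((num_states, num_states))
    B_num = np.zeros((num_states, num_symbols))
    B_den = np.zeros(num_states)
    for i, x in enumerate(sequences):
        for t in range(len(x)):
            if t >= 1:
                A_num += xi[i][t]                 # xi[i][t][j][k] = Pr(z_{t-1}=j, z_t=k | x_i)
            B_num[:, x[t]] += gamma[i][t]         # gamma[i][t][j] = Pr(z_t=j | x_i)
            B_den += gamma[i][t]
    A = A_num / A_num.sum(axis=1, keepdims=True)  # normalize over destination states k
    B = B_num / B_den[:, None]                    # normalize over output symbols v
    return A, B
```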
Gaussian Observation Model

So far we considered HMMs with discrete outputs. In acoustic models, HMMs output real-valued vectors. Hence, observation probabilities are defined using probability density functions.

A widely used model: the Gaussian distribution
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}
HMM emission/observation probabilities: b_j(x) = N(x | µ_j, σ_j²), where µ_j is the mean associated with state j and σ_j² is its variance.
For multivariate Gaussians, b_j(x) = N(x | µ_j, Σ_j), where Σ_j is the covariance matrix associated with state j.
BW for Gaussian Observation Model

Observed data: N sequences, x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation prob. B = {(µ_j, Σ_j)} for all j.

[EM Iteration, M-step] Find the θ which maximises Q(θ, θ^{ℓ-1}):
A: same as with discrete outputs.
\mu_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j) \, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
\qquad
\Sigma_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)(x_{it} - \mu_j)(x_{it} - \mu_j)^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
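Below is a minimal sketch (not part of the lecture) of these weighted mean and covariance updates, assuming the state posteriors gamma[i] (shape (T_i, num_states)) have already been computed in the E-step and features[i] is the (T_i, d) matrix of observation vectors for sequence i; variable names are illustrative.

```python
import numpy as np

def m_step_gaussian(features, gamma, num_states):
    """Re-estimate per-state Gaussian means and covariances from posteriors."""
    d = features[0].shape[1]
    mu = np.zeros((num_states, d))
    sigma = np.zeros((num_states, d, d))
    counts = np.zeros(num_states)                  # soft counts per state
    for x, g in zip(features, gamma):              # g has shape (T_i, num_states)
        counts += g.sum(axis=0)
        mu += g.T @ x                              # weighted sum of frames per state
    mu /= counts[:, None]
    for x, g in zip(features, gamma):
        for j in range(num_states):
            diff = x - mu[j]                       # (T_i, d)
            sigma[j] += (g[:, j, None] * diff).T @ diff
    sigma /= counts[:, None, None]
    return mu, sigma
```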
Gaussian Mixture Model A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal
Unimodal

[Figure: univariate Gaussian densities φ_{µ,σ²}(x) for (µ = 0, σ² = 0.2), (µ = 0, σ² = 1.0), (µ = 0, σ² = 5.0) and (µ = 2, σ² = 0.5), plotted over x ∈ [−5, 5], together with a second density surface plotted over a two-dimensional domain x, y ∈ [−3, 3].]
Gaussian Mixture Model

More generally, we use a mixture of Gaussians to model multiple modes in the data.
Mixture Models
Gaussian Mixture Model

Instead of b_j(x) = N(x | µ_j, Σ_j) in the single Gaussian case, b_j(x) now becomes:
b_j(x) = \sum_{m=1}^{M} c_{jm} \, \mathcal{N}(x \mid \mu_{jm}, \Sigma_{jm})
where c_{jm} is the mixing probability for Gaussian component m of state j, with
\sum_{m=1}^{M} c_{jm} = 1, \quad c_{jm} \geq 0
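For illustration, the GMM emission density b_j(x) for a single state j could be evaluated as in the sketch below (not from the lecture), using SciPy's multivariate normal density; the `weights`, `means`, and `covariances` arrays stand in for (c_{jm}, µ_{jm}, Σ_{jm}) and the example numbers are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_emission(x, weights, means, covariances):
    """b_j(x) for one state j: weights (M,), means (M, d), covariances (M, d, d)."""
    return sum(c * multivariate_normal.pdf(x, mean=mu, cov=cov)
               for c, mu, cov in zip(weights, means, covariances))

# Example: a 2-component mixture in 2 dimensions
weights = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [2.0, 2.0]])
covs = np.array([np.eye(2), 0.5 * np.eye(2)])
print(gmm_emission(np.array([1.0, 1.0]), weights, means, covs))
```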
BW for Gaussian Mixture Model

Observed data: N sequences, x_i = (x_{i1}, ..., x_{iT_i}), i = 1, ..., N, where x_{it} ∈ R^d.
Parameters θ: transition matrix A, observation prob. B = {(µ_{jm}, Σ_{jm}, c_{jm})} for all j, m.

[EM Iteration, M-step] Find the θ which maximises Q(θ, θ^{ℓ-1}):
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m) \, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
Mixing probabilities:
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
where γ_{i,t}(j) = Pr(q_t = j | x_i) and γ_{i,t}(j, m) is the corresponding posterior of being in state j with mixture component m at time t.
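A sketch (not from the lecture) of these GMM M-step updates, assuming the per-component posteriors γ_{i,t}(j, m) have already been computed in the E-step and are stored as gamma[i] with shape (T_i, num_states, num_components); names and array layouts are illustrative.

```python
import numpy as np

def m_step_gmm(features, gamma, num_states, num_components):
    """Re-estimate per-state, per-component means, covariances and mixing weights."""
    d = features[0].shape[1]
    counts = np.zeros((num_states, num_components))        # sum of gamma_{i,t}(j, m)
    mu = np.zeros((num_states, num_components, d))
    sigma = np.zeros((num_states, num_components, d, d))
    for x, g in zip(features, gamma):                      # g: (T_i, J, M), x: (T_i, d)
        counts += g.sum(axis=0)
        mu += np.einsum('tjm,td->jmd', g, x)               # weighted sums of frames
    mu /= counts[..., None]
    for x, g in zip(features, gamma):
        for j in range(num_states):
            for m in range(num_components):
                diff = x - mu[j, m]                        # (T_i, d)
                sigma[j, m] += (g[:, j, m, None] * diff).T @ diff
    sigma /= counts[..., None, None]
    c = counts / counts.sum(axis=1, keepdims=True)         # mixing probabilities c_{jm}
    return mu, sigma, c
```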
ASR Framework: Acoustic Models

[Pipeline figure: Acoustic Features → Acoustic Models (H) → Triphones → Context Transducer → Monophones → Pronunciation Model → Words → Language Model → Word Sequence]

Acoustic models are estimated using training data {x_i, y_i}, i = 1, ..., N, where x_i corresponds to a sequence of acoustic feature vectors and y_i corresponds to a sequence of words.

For each (x_i, y_i), a composite HMM is constructed using the HMMs that correspond to the triphone sequence in y_i:
Hello world
sil hh ah l ow w er l d sil
sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil
ASR Framework: Acoustic Models

Parameters of these composite HMMs are the parameters of the constituent triphone HMMs. These parameters are fit to the acoustic data {x_i}, i = 1, ..., N, using the Baum-Welch algorithm (EM).
Baum-Welch: In summary

[Every EM Iteration] Compute θ = {A_{j,k}, (µ_{jm}, Σ_{jm}, c_{jm})} for all j, k, m:
A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}
\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m) \, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
\qquad
\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^T}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}
c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}
How do we efficiently compute these quantities? Next class!