Automatic Speech Recognition (CS753)



1 Automatic Speech Recognition (CS753) Lecture 6: Hidden Markov Models (Part II) Instructor: Preethi Jyothi Aug 10, 2017

2 Recall: Computing Likelihood
Problem 1 (Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$.
Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.
Computing Likelihood: Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$. Use the Forward Algorithm.
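The Forward Algorithm named above can be sketched as follows for a discrete-output HMM. This is a minimal illustration, not the lecture's own code: the initial state distribution `pi` is an assumption (the slide's HMM is written only as (A, B)), and the two-state numbers are made up for the example.

```python
def forward_likelihood(pi, A, B, obs):
    """Return P(O | lambda) for a discrete-output HMM.

    pi[j]   : assumed initial probability of state j (hypothetical, not on the slide)
    A[j][k] : probability of moving from state j to state k
    B[j][v] : probability that state j emits symbol v
    obs     : observed symbol indices o_1 ... o_T
    """
    n = len(pi)
    # alpha[j] = P(o_1 ... o_t, q_t = j), initialised for t = 1
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        # Sum over all predecessor states, then account for emitting o
        alpha = [sum(alpha[j] * A[j][k] for j in range(n)) * B[k][o]
                 for k in range(n)]
    # Marginalise over the final state
    return sum(alpha)


# Made-up two-state, two-symbol HMM for illustration only
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood(pi, A, B, [0, 1, 0]))
```

The recursion costs O(T n^2) operations for n states, instead of summing over all n^T state sequences.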

3 Recall: Decoding best state sequence
Problem 1 (Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$.
Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.
Decoding: Given as input an HMM $\lambda = (A, B)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, find the most probable sequence of states $Q = q_1 q_2 q_3 \ldots q_T$. Use the Viterbi Algorithm.
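A minimal sketch of the Viterbi Algorithm for the decoding problem above. As an assumption for illustration, it takes an initial state distribution `pi` in addition to (A, B), and the example HMM values are made up.

```python
def viterbi(pi, A, B, obs):
    """Return (best state path, its joint probability with obs).

    pi, A, B follow the same hypothetical layout as a discrete HMM:
    pi[j] initial, A[j][k] transition, B[j][v] emission probabilities.
    """
    n = len(pi)
    # delta[j] = probability of the best path ending in state j at time t
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]
    backptrs = []
    for o in obs[1:]:
        # Best predecessor for each state k at this time step
        prev = [max(range(n), key=lambda j: delta[j] * A[j][k])
                for k in range(n)]
        delta = [delta[prev[k]] * A[prev[k]][k] * B[k][o] for k in range(n)]
        backptrs.append(prev)
    # Trace back from the best final state
    state = max(range(n), key=lambda j: delta[j])
    best_prob = delta[state]
    path = [state]
    for prev in reversed(backptrs):
        state = prev[state]
        path.append(state)
    path.reverse()
    return path, best_prob


# Made-up two-state, two-symbol HMM for illustration only
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi(pi, A, B, [0, 1, 0]))
```

The only change from the forward recursion is that the sum over predecessor states is replaced by a max, plus the back-pointers needed to recover the path.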

4 Learning HMM Parameters
Problem 1 (Likelihood): Given an HMM $\lambda = (A, B)$ and an observation sequence $O$, determine the likelihood $P(O \mid \lambda)$.
Problem 2 (Decoding): Given an observation sequence $O$ and an HMM $\lambda = (A, B)$, discover the best hidden state sequence $Q$.
Problem 3 (Learning): Given an observation sequence $O$ and the set of states in the HMM, learn the HMM parameters $A$ and $B$.
Learning: Given an observation sequence $O$ and the set of possible states in the HMM, learn the HMM parameters $A$ and $B$.
Standard algorithm for HMM training: the Forward-backward or Baum-Welch algorithm.
Before moving on to Baum-Welch, what is the Expectation Maximization algorithm?

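5 EM Algorithm: Fitting Parameters to Data
Parameter $\theta$ determines $\Pr(x, z; \theta)$, where $x$ is observed and $z$ is hidden.
Observed data: i.i.d. samples $x_i$, $i = 1, \ldots, N$.
Goal: Find $\arg\max_{\theta} L(\theta)$, where
$$L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta)$$
Initial parameters: $\theta^{0}$
Iteratively compute $\theta^{\ell}$ as follows:
$$Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \Pr(z \mid x_i; \theta^{\ell-1}) \log \Pr(x_i, z; \theta)$$
$$\theta^{\ell} = \arg\max_{\theta} Q(\theta, \theta^{\ell-1})$$
The estimate $\theta^{\ell}$ cannot get worse over iterations because, for all $\theta$:
$$L(\theta) - L(\theta^{\ell-1}) \geq Q(\theta, \theta^{\ell-1}) - Q(\theta^{\ell-1}, \theta^{\ell-1})$$
EM is guaranteed to converge to a local optimum [Wu83].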

6 Coin example to illustrate EM
Coin 1: $\rho_1 = \Pr(H)$. Coin 2: $\rho_2 = \Pr(H)$. Coin 3: $\rho_3 = \Pr(H)$.
Repeat:
  Toss Coin 1 privately.
  If it shows H: toss Coin 2 twice.
  Else: toss Coin 3 twice.
The following sequence is observed: HH, TT, HH, TT, HH
How do you estimate $\rho_1$, $\rho_2$ and $\rho_3$?

7 Coin example to illustrate EM
Recall, for partially observed data, the likelihood is given by:
$$L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)$$
where, for the coin example:
each observation $x_i \in \mathcal{X} = \{HH, HT, TH, TT\}$
the hidden variable $z \in \mathcal{Z} = \{H, T\}$

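8 Coin example to illustrate EM
Recall, for partially observed data, the likelihood is given by:
$$L(\theta) = \sum_{i=1}^{N} \log \Pr(x_i; \theta) = \sum_{i=1}^{N} \log \sum_{z} \Pr(x_i, z; \theta)$$
$$\Pr(x, z; \theta) = \Pr(x \mid z; \theta) \Pr(z; \theta)$$
where
$$\Pr(z; \theta) = \begin{cases} \rho_1 & \text{if } z = H \\ 1 - \rho_1 & \text{if } z = T \end{cases}$$
$$\Pr(x \mid z; \theta) = \begin{cases} \rho_2^{h} (1 - \rho_2)^{t} & \text{if } z = H \\ \rho_3^{h} (1 - \rho_3)^{t} & \text{if } z = T \end{cases}$$
$h$: number of heads in $x$, $t$: number of tails in $x$
Coin 1: $\rho_1 = \Pr(H)$. Coin 2: $\rho_2 = \Pr(H)$. Coin 3: $\rho_3 = \Pr(H)$.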

9 Coin example to illustrate EM
Our observed data is: {HH, TT, HH, TT, HH}
Let's use EM to estimate $\theta = (\rho_1, \rho_2, \rho_3)$.
[EM Iteration, E-step] Compute quantities involved in
$$Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)$$
where $\gamma(z, x) = \Pr(z \mid x; \theta^{\ell-1})$,
i.e., compute $\gamma(z, x_i)$ for all $z$ and all $i$.
Suppose $\theta^{\ell-1}$ is $\rho_1 = 0.3$, $\rho_2 = 0.4$, $\rho_3 = 0.6$:
What is $\gamma(H, HH)$? $= 0.16$
What is $\gamma(H, TT)$? $= 0.49$
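The two E-step values on this slide can be checked directly with Bayes' rule. The sketch below is illustrative only; the function names `joint` and `gamma` are hypothetical, and the joint probability follows the factorisation Pr(x, z; θ) = Pr(x | z; θ) Pr(z; θ) from the previous slide.

```python
def joint(x, z, rho1, rho2, rho3):
    """Pr(x, z; theta) for one observation x (e.g. "HH") and hidden Coin 1 toss z."""
    h, t = x.count("H"), x.count("T")
    if z == "H":
        # Coin 1 showed heads, so Coin 2 was tossed twice
        return rho1 * rho2**h * (1 - rho2)**t
    # Coin 1 showed tails, so Coin 3 was tossed twice
    return (1 - rho1) * rho3**h * (1 - rho3)**t


def gamma(z, x, theta):
    """Posterior Pr(z | x; theta), normalising the joint over both values of z."""
    return joint(x, z, *theta) / sum(joint(x, zz, *theta) for zz in "HT")


theta = (0.3, 0.4, 0.6)  # (rho1, rho2, rho3) from the slide
print(round(gamma("H", "HH", theta), 2))  # 0.16
print(round(gamma("H", "TT", theta), 2))  # 0.49
```

For HH the arithmetic is 0.3 * 0.16 / (0.3 * 0.16 + 0.7 * 0.36) = 0.048 / 0.3 = 0.16, and for TT it is 0.108 / 0.22, which is approximately 0.49.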

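10 Coin example to illustrate EM
Our observed data is: {HH, TT, HH, TT, HH}
Let's use EM to estimate $\theta = (\rho_1, \rho_2, \rho_3)$.
[EM Iteration, M-step] Find $\theta$ which maximises
$$Q(\theta, \theta^{\ell-1}) = \sum_{i=1}^{N} \sum_{z} \gamma(z, x_i) \log \Pr(x_i, z; \theta)$$
$$\rho_1 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)}{N}$$
$$\rho_2 = \frac{\sum_{i=1}^{N} \gamma(H, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(H, x_i)\,(h_i + t_i)}$$
$$\rho_3 = \frac{\sum_{i=1}^{N} \gamma(T, x_i)\, h_i}{\sum_{i=1}^{N} \gamma(T, x_i)\,(h_i + t_i)}$$
where $h_i$ and $t_i$ are the numbers of heads and tails in $x_i$.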
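Putting the E-step and M-step together gives one full EM iteration for the coin example. The following is a minimal sketch under the slide's model; the helper names (`joint_probs`, `em_step`, `log_likelihood`) are hypothetical, and the loop simply prints the estimates so the behaviour of the log-likelihood over iterations can be inspected.

```python
import math


def joint_probs(x, theta):
    """Return (Pr(x, z=H; theta), Pr(x, z=T; theta)) for one observation x."""
    rho1, rho2, rho3 = theta
    h, t = x.count("H"), x.count("T")
    return (rho1 * rho2**h * (1 - rho2)**t,
            (1 - rho1) * rho3**h * (1 - rho3)**t)


def log_likelihood(data, theta):
    """L(theta) = sum_i log sum_z Pr(x_i, z; theta)."""
    return sum(math.log(sum(joint_probs(x, theta))) for x in data)


def em_step(data, theta):
    """One EM iteration: compute gamma(H, x_i), then re-estimate rho1, rho2, rho3."""
    g = []  # g[i] = gamma(H, x_i); gamma(T, x_i) is 1 - g[i]
    for x in data:
        p_h, p_t = joint_probs(x, theta)
        g.append(p_h / (p_h + p_t))
    heads = [x.count("H") for x in data]
    tosses = [len(x) for x in data]  # h_i + t_i
    rho1 = sum(g) / len(data)
    rho2 = (sum(gi * hi for gi, hi in zip(g, heads))
            / sum(gi * ni for gi, ni in zip(g, tosses)))
    rho3 = (sum((1 - gi) * hi for gi, hi in zip(g, heads))
            / sum((1 - gi) * ni for gi, ni in zip(g, tosses)))
    return rho1, rho2, rho3


data = ["HH", "TT", "HH", "TT", "HH"]
theta = (0.3, 0.4, 0.6)  # starting point used on the slides
for iteration in range(1, 6):
    theta = em_step(data, theta)
    print(iteration, [round(p, 4) for p in theta],
          round(log_likelihood(data, theta), 4))
```

Because each M-step maximises Q, the printed log-likelihood should not decrease from one iteration to the next, in line with the inequality given for the general EM algorithm.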

11 Coin example to illustrate EM
[Figure: HMM for the coin example, with states H and T and arcs labeled ε/ρ1, ε/1-ρ1, H/ρ2, T/1-ρ2, H/ρ3, T/1-ρ3, ε/1, ε/1.]
This was a very simple HMM (with observations from 2 states).
State remains the same after the first transition.
γ estimated the distribution of this state.
More generally, will need the distribution of the state at each time step.
EM for general HMMs: the Baum-Welch algorithm (1972) predates the general formulation of EM (1977).

12 Baum-Welch Algorithm as EM
Observed data: $N$ sequences, $x_i = (x_{i1}, \ldots, x_{iT_i})$, $i = 1, \ldots, N$, where $x_{it} \in \mathbb{R}^d$
Parameters $\theta$: transition matrix $A$, observation probabilities $B$
[EM Iteration, E-step] Compute quantities involved in $Q(\theta, \theta^{\ell-1})$:
$$\gamma_{i,t}(j) = \Pr(z_t = j \mid x_i; \theta^{\ell-1})$$
$$\xi_{i,t}(j, k) = \Pr(z_{t-1} = j, z_t = k \mid x_i; \theta^{\ell-1})$$

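13 Baum-Welch Algorithm as EM
Observed data: $N$ sequences, $x_i = (x_{i1}, \ldots, x_{iT_i})$, $i = 1, \ldots, N$, where $x_{it} \in \mathbb{R}^d$
Parameters $\theta$: transition matrix $A$, observation probabilities $B$
[EM Iteration, M-step] Find $\theta$ which maximises $Q(\theta, \theta^{\ell-1})$:
$$A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}$$
$$B_{j,v} = \frac{\sum_{i=1}^{N} \sum_{t : x_{it} = v} \gamma_{i,t}(j)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$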
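The M-step updates for discrete outputs amount to normalising accumulated posteriors. The sketch below assumes the E-step quantities are already available; the function name and the nested-list layout (`gammas[i][t][j]`, `xis[i][t][j][k]`, with 0-based time and `xis[i][0]` unused) are hypothetical choices made for this illustration.

```python
def m_step_discrete(gammas, xis, seqs, n_states, n_symbols):
    """Re-estimate A and B from state posteriors.

    gammas[i][t][j] : gamma_{i,t}(j) for sequence i (t is 0-based here)
    xis[i][t][j][k] : xi_{i,t}(j, k), defined for t >= 1; xis[i][0] is ignored
    seqs[i][t]      : observed symbol index at time t of sequence i
    """
    A_num = [[0.0] * n_states for _ in range(n_states)]
    B_num = [[0.0] * n_symbols for _ in range(n_states)]
    for gamma, xi, x in zip(gammas, xis, seqs):
        for t, v in enumerate(x):
            for j in range(n_states):
                # Expected count of state j emitting symbol v
                B_num[j][v] += gamma[t][j]
                if t >= 1:
                    for k in range(n_states):
                        # Expected count of the transition j -> k
                        A_num[j][k] += xi[t][j][k]
    # Each row sum is exactly the denominator in the update equations
    A = [[a / sum(row) for a in row] for row in A_num]
    B = [[b / sum(row) for b in row] for row in B_num]
    return A, B


# Tiny made-up posteriors for one sequence of length 2
gammas = [[[0.5, 0.5], [0.2, 0.8]]]
xis = [[None, [[0.1, 0.4], [0.1, 0.4]]]]
A, B = m_step_discrete(gammas, xis, [[0, 1]], n_states=2, n_symbols=2)
print(A)
print(B)
```

Summing the numerator of B over all symbols v gives the total occupancy of state j, which is why the row sum can serve as the denominator.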

14 Gaussian Observation Model
So far we considered HMMs with discrete outputs.
In acoustic models, HMMs output real-valued vectors.
Hence, observation probabilities are defined using probability density functions.
A widely used model: the Gaussian distribution
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
HMM emission/observation probabilities: $b_j(x) = \mathcal{N}(x \mid \mu_j, \sigma_j^2)$, where $\mu_j$ is the mean associated with state $j$ and $\sigma_j^2$ is its variance.
For multivariate Gaussians, $b_j(x) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$, where $\Sigma_j$ is the covariance matrix associated with state $j$.

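15 BW for Gaussian Observation Model
Observed data: $N$ sequences, $x_i = (x_{i1}, \ldots, x_{iT_i})$, $i = 1, \ldots, N$, where $x_{it} \in \mathbb{R}^d$
Parameters $\theta$: transition matrix $A$, observation prob. $B = \{(\mu_j, \Sigma_j)\}$ for all $j$
[EM Iteration, M-step] Find $\theta$ which maximises $Q(\theta, \theta^{\ell-1})$:
$A$: same as with discrete outputs
$$\mu_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$
$$\Sigma_j = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)\,(x_{it} - \mu_j)(x_{it} - \mu_j)^{T}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$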
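The Gaussian updates are posterior-weighted averages. As a simplified illustration, the sketch below handles scalar features only (d = 1), so the covariance reduces to a variance; the function name and the `gammas[i][t][j]` layout are assumptions made for the example.

```python
def gaussian_m_step(gammas, seqs, j):
    """Posterior-weighted mean and variance for state j (scalar features only).

    gammas[i][t][j] : gamma_{i,t}(j), assumed already computed in the E-step
    seqs[i][t]      : scalar observation x_{it}
    """
    # Flatten all (weight, observation) pairs across sequences and time steps
    pairs = [(g[j], x)
             for gamma, xs in zip(gammas, seqs)
             for g, x in zip(gamma, xs)]
    total = sum(w for w, _ in pairs)  # shared denominator
    mean = sum(w * x for w, x in pairs) / total
    var = sum(w * (x - mean) ** 2 for w, x in pairs) / total
    return mean, var


# Made-up hard assignments: the first two frames belong to state 0, the last to state 1
gammas = [[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]
seqs = [[1.0, 3.0, 10.0]]
print(gaussian_m_step(gammas, seqs, 0))
print(gaussian_m_step(gammas, seqs, 1))
```

With hard 0/1 posteriors, as in this toy input, the updates reduce to the ordinary sample mean and variance of the frames assigned to each state.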

16 Gaussian Mixture Model A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal

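17 Unimodal
[Figure: Gaussian densities $\varphi_{\mu,\sigma^2}(x)$ plotted against $x$, with legend entries ($\mu = 0$, $\sigma^2 = 0.2$), ($\mu = 0$, $\sigma^2 = 1.0$), ($\mu = 0$, $\sigma^2 = 5.0$), ($\mu = 2$, $\sigma^2 = 0.5$).]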

18 Gaussian Mixture Model A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal More generally, we use a mixture of Gaussians to model multiple modes in the data

19 Mixture Models

20 Gaussian Mixture Model
A single Gaussian observation model assumes that the observed acoustic feature vectors are unimodal.
More generally, we use a mixture of Gaussians to model multiple modes in the data.
Instead of $b_j(x) = \mathcal{N}(x \mid \mu_j, \Sigma_j)$ in the single Gaussian case, $b_j(x)$ now becomes:
$$b_j(x) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(x \mid \mu_{jm}, \Sigma_{jm})$$
where $c_{jm}$ is the mixing probability for Gaussian component $m$ of state $j$:
$$\sum_{m=1}^{M} c_{jm} = 1, \qquad c_{jm} \geq 0$$
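A mixture emission density is a weighted sum of component densities. The sketch below uses scalar observations for brevity (the slides use vectors), and the function names and example parameters are made up for illustration.

```python
import math


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gmm_emission(x, weights, means, variances):
    """b_j(x) for one state: sum over m of c_jm * N(x | mu_jm, var_jm)."""
    # Mixing probabilities must be non-negative and sum to one
    if any(c < 0 for c in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixing probabilities must be >= 0 and sum to 1")
    return sum(c * gaussian_pdf(x, mu, var)
               for c, mu, var in zip(weights, means, variances))


# Made-up two-component mixture with modes near -2 and +2
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.5, 0.5]
for x in (-2.0, 0.0, 2.0):
    print(x, gmm_emission(x, weights, means, variances))
```

In this example the density is high near both means and low in between, which a single Gaussian cannot represent.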

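21 BW for Gaussian Mixture Model
Observed data: $N$ sequences, $x_i = (x_{i1}, \ldots, x_{iT_i})$, $i = 1, \ldots, N$, where $x_{it} \in \mathbb{R}^d$
Parameters $\theta$: transition matrix $A$, observation prob. $B = \{(\mu_{jm}, \Sigma_{jm}, c_{jm})\}$ for all $j, m$
[EM Iteration, M-step] Find $\theta$ which maximises $Q(\theta, \theta^{\ell-1})$:
$$\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}$$
$$\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^{T}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}$$
Mixing probabilities:
$$c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$
where $\gamma_{i,t}(j) = \Pr(q_t = j \mid x_i)$ and $\gamma_{i,t}(j, m)$ is the posterior probability of being in state $j$ with mixture component $m$ at time $t$ of sequence $i$.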

22 ASR Framework: Acoustic Models
[Figure: ASR pipeline. Acoustic Features -> Acoustic Models (H) -> Triphones -> Context Transducer -> Monophones -> Pronunciation Model -> Words -> Language Model -> Word Sequence.]
Acoustic models are estimated using training data: $\{x_i, y_i\}$, $i = 1, \ldots, N$, where $x_i$ corresponds to a sequence of acoustic feature vectors and $y_i$ corresponds to a sequence of words.
For each $x_i, y_i$, a composite HMM is constructed using the HMMs that correspond to the triphone sequence in $y_i$.
Example: "Hello world"
Phones: sil hh ah l ow w er l d sil
Triphones: sil sil/hh/ah hh/ah/l ah/l/ow l/ow/w ow/w/er w/er/l er/l/d l/d/sil sil
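The expansion from a phone sequence to the triphone sequence used to build the composite HMM can be sketched as follows. This is an illustration only: it assumes the left/center/right naming seen in labels such as sil/hh/ah, treats `sil` as context-independent, and pads with silence at both ends; the function name is hypothetical.

```python
def to_triphones(phones):
    """Map a phone sequence to left/center/right triphone labels."""
    # Pad with silence so the first and last phones always have a context
    padded = ["sil"] + list(phones) + ["sil"]
    labels = []
    for i in range(1, len(padded) - 1):
        center = padded[i]
        if center == "sil":
            # Silence is kept as a context-independent unit
            labels.append("sil")
        else:
            labels.append(f"{padded[i - 1]}/{center}/{padded[i + 1]}")
    return labels


phones = "sil hh ah l ow w er l d sil".split()
print(" ".join(to_triphones(phones)))
```

Each label selects one triphone HMM, and concatenating those HMMs in order gives the composite HMM for the utterance.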

23 ASR Framework: Acoustic Models
[Figure: ASR pipeline. Acoustic Features -> Acoustic Models (H) -> Triphones -> Context Transducer -> Monophones -> Pronunciation Model -> Words -> Language Model -> Word Sequence.]
Acoustic models are estimated using training data: $\{x_i, y_i\}$, $i = 1, \ldots, N$, where $x_i$ corresponds to a sequence of acoustic feature vectors and $y_i$ corresponds to a sequence of words.
For each $x_i, y_i$, a composite HMM is constructed using the HMMs that correspond to the triphone sequence in $y_i$.
Parameters of these composite HMMs are the parameters of the constituent triphone HMMs.
These parameters are fit to the acoustic data $\{x_i\}$, $i = 1, \ldots, N$, using the Baum-Welch algorithm (EM).

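24 Baum-Welch: In summary
[Every EM Iteration] Compute $\theta = \{A_{jk}, (\mu_{jm}, \Sigma_{jm}, c_{jm})\}$ for all $j, k, m$:
$$A_{j,k} = \frac{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \xi_{i,t}(j, k)}{\sum_{i=1}^{N} \sum_{t=2}^{T_i} \sum_{k'} \xi_{i,t}(j, k')}$$
$$\mu_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\, x_{it}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}$$
$$\Sigma_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)\,(x_{it} - \mu_{jm})(x_{it} - \mu_{jm})^{T}}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}$$
$$c_{jm} = \frac{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j, m)}{\sum_{i=1}^{N} \sum_{t=1}^{T_i} \gamma_{i,t}(j)}$$
How do we efficiently compute these quantities? Next class!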
