Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM)


Samudravijaya K, Tata Institute of Fundamental Research, Mumbai (chief@tifr.res.in), 09-JAN-2009.
The majority of the slides are taken from S. Umesh's tutorial on ASR (WiSSAP 2006).

Pattern Recognition
Input -> Signal Processing -> Pattern Matching -> Output, with a training path (Model Generation) feeding the pattern matcher used during testing.
GMM: static patterns. HMM: sequential patterns.

Basic Probability
Joint and conditional probability:
p(A, B) = p(A | B)\, p(B) = p(B | A)\, p(A)
Bayes rule:
p(A | B) = \frac{p(B | A)\, p(A)}{p(B)}
If the A_i are mutually exclusive and exhaustive events,
p(B) = \sum_i p(B | A_i)\, p(A_i), \qquad p(A | B) = \frac{p(B | A)\, p(A)}{\sum_i p(B | A_i)\, p(A_i)}

Normal Distribution
Many phenomena are described by the Gaussian pdf
p(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)    (1)
The pdf is parameterised by \theta = [\mu, \sigma^2], where \mu is the mean and \sigma^2 the variance. It is a convenient pdf: second-order statistics are sufficient.
Example: heights of pygmies follow a Gaussian pdf with \mu = 4 ft and std-dev \sigma = 1 ft; heights of bushmen follow a Gaussian pdf with \mu = 6 ft and \sigma = 1 ft.
Question: if we arbitrarily pick a person from a population, what is the probability of the height being a particular value?
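As a small illustration (not part of the original slides), the following Python sketch evaluates the Gaussian pdf of Eq. (1) for the pygmy/bushman example; the function name and the exact numeric values are assumptions for illustration only.

    # A minimal sketch of the pygmy/bushman example: evaluate N(x; mu, sigma^2)
    # for an observed height under the two assumed population models.
    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """N(x; mu, sigma^2) for scalar or array x."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

    height = 4 + 1 / 12.0                                 # 4'1" expressed in feet
    print(gaussian_pdf(height, mu=4.0, sigma=1.0))        # density under the pygmy model
    print(gaussian_pdf(height, mu=6.0, sigma=1.0))        # density under the bushman model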

If I arbitrarily pick a pygmy, say x, then
Pr(\text{height of } x = 4'1'') = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (4'1'' - 4')^2 \right)    (2)
Note: here the mean and variance are fixed; only the observation x changes.
Also note: Pr(x = 4'1'') > Pr(x = 5') and Pr(x = 4'1'') > Pr(x = 3').

Conversely: given that a person's height is 4'1'', the person is more likely to be a pygmy than a bushman.
If we observe the heights of many persons, say 3'6'', 4'1'', 3'8'', 4'5'', 4'7'', 4', 6'5'', and all are from the same population (i.e. either pygmies or bushmen), then we become more certain that the population is pygmy.
The more the observations, the better our decision will be.

Likelihood Function
x[0], x[1], ..., x[N-1]: a set of independent observations from a pdf parameterised by \theta.
Previous example: x[0], ..., x[N-1] are the observed heights and \theta is the mean of the density, which is unknown (\sigma^2 assumed known).
L(X; \theta) = p(x_0, ..., x_{N-1}; \theta) = \prod_{i=0}^{N-1} p(x_i; \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=0}^{N-1} (x_i - \theta)^2 \right)    (3)
L(X; \theta) is a function of \theta and is called the likelihood function.
Given x_0, ..., x_{N-1}, what can we say about the value of \theta, i.e. what is the best estimate of \theta?

Maximum Likelihood Estimation
Example: we know the height of a person, x[0] = 4'4''. It is most likely to have come from which pdf: \theta = 3', 4'6'' or 6'?
Take the maximum of L(x[0]; \theta = 3'), L(x[0]; \theta = 4'6'') and L(x[0]; \theta = 6'): choose \theta = 4'6''.
If \theta is just a parameter, we choose \arg\max_\theta L(x[0]; \theta).

Maximum Likelihood Estimator
Given x[0], x[1], ..., x[N-1] and a pdf parameterised by \theta = [\theta_1, \theta_2, ..., \theta_m]^T, we form the likelihood function
L(X; \theta) = \prod_{i=0}^{N-1} p(x_i; \theta), \qquad \hat{\theta}_{MLE} = \arg\max_\theta L(X; \theta)
For the height problem one can show \hat{\theta}_{MLE} = \frac{1}{N} \sum_i x_i: the estimate of the mean of the Gaussian is the sample mean of the measured heights.
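A small numerical sketch (not from the slides) of this claim: maximising the log-likelihood over a grid of candidate means recovers the sample mean. The data are synthetic and the names are illustrative assumptions.

    # Check numerically that the ML estimate of a Gaussian mean equals the sample mean.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=4.0, scale=1.0, size=50)       # synthetic heights, true mean 4 ft

    def log_likelihood(theta, x, sigma=1.0):
        return np.sum(-0.5 * ((x - theta) / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2))

    grid = np.linspace(2.0, 6.0, 4001)
    theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
    print(theta_hat, x.mean())                        # both are close to 4.0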

Bayesian Estimation
MLE: \theta is assumed unknown but deterministic.
Bayesian approach: \theta is assumed random with pdf p(\theta), the prior knowledge.
\underbrace{p(\theta | x)}_{\text{posterior}} = \frac{p(x | \theta)\, p(\theta)}{p(x)} \propto p(x | \theta) \underbrace{p(\theta)}_{\text{prior}}
Height problem: the unknown mean is random with a Gaussian pdf N(\gamma, \nu^2):
p(\mu) = \frac{1}{\sqrt{2\pi\nu^2}} \exp\left( -\frac{1}{2\nu^2} (\mu - \gamma)^2 \right)
Then
\hat{\mu}_{Bayesian} = \frac{\sigma^2 \gamma + n \nu^2 \bar{x}}{\sigma^2 + n \nu^2}
a weighted average of the sample mean and the prior mean.
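A minimal sketch of this weighted-average estimate, assuming known observation variance sigma^2 and a Gaussian prior N(gamma, nu^2); the data values and names are illustrative, not from the slides.

    # Bayesian estimate of a Gaussian mean: weighted average of prior mean and sample mean.
    import numpy as np

    def bayes_mean(x, sigma2, gamma, nu2):
        n = len(x)
        return (sigma2 * gamma + n * nu2 * np.mean(x)) / (sigma2 + n * nu2)

    x = np.array([3.5, 4.1, 3.8, 4.5, 4.7, 4.0])      # illustrative observed heights (ft)
    print(bayes_mean(x, sigma2=1.0, gamma=5.0, nu2=0.25))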

Gaussian Mixture Model
p(x) = \alpha\, N(x; \mu_1, \sigma_1) + (1 - \alpha)\, N(x; \mu_2, \sigma_2)
p(x) = \sum_{m=1}^{M} w_m\, N(x; \mu_m, \sigma_m), \qquad \sum_m w_m = 1
Characteristics of a GMM: just as ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
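A minimal sketch of evaluating a 1-D mixture density of this form; the weights, means and standard deviations below are illustrative assumptions, not values from the slides.

    # Evaluate p(x) = sum_m w_m N(x; mu_m, sigma_m^2) for a 1-D GMM.
    import numpy as np

    def gmm_pdf(x, weights, means, sigmas):
        x = np.atleast_1d(x)[:, None]                 # shape (n, 1)
        comp = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / np.sqrt(2 * np.pi * sigmas ** 2)
        return comp @ weights                         # shape (n,)

    w = np.array([0.3, 0.5, 0.2])                     # must sum to 1
    mu = np.array([-2.0, 0.0, 3.0])
    sd = np.array([0.5, 1.0, 0.8])
    print(gmm_pdf([0.0, 1.5], w, mu, sd))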

General Assumption in GMM
Assume that there are M components. Each component generates data from a Gaussian with mean \mu_m and covariance matrix \Sigma_m.

GMM
Consider the probability density function shown in solid blue (figure). It is useful to parameterise or model this seemingly arbitrary pdf.

Gaussian Mixture Model (contd.)
The pdf is actually a mixture of 3 Gaussians, i.e.
p(x) = c_1\, N(x; \mu_1, \sigma_1) + c_2\, N(x; \mu_2, \sigma_2) + c_3\, N(x; \mu_3, \sigma_3), \qquad \sum_i c_i = 1    (4)
pdf parameters: c_1, c_2, c_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3.

Observations from a GMM
Experiment: an urn contains balls of 3 different colours: red, blue and green. Behind a curtain, a person picks a ball from the urn.
If a red ball: generate x[i] from N(x; \mu_1, \sigma_1).
If a blue ball: generate x[i] from N(x; \mu_2, \sigma_2).
If a green ball: generate x[i] from N(x; \mu_3, \sigma_3).
We have access only to the observations x[0], x[1], ..., x[N-1]. Therefore
p(x[i]; \theta) = c_1\, N(x; \mu_1, \sigma_1) + c_2\, N(x; \mu_2, \sigma_2) + c_3\, N(x; \mu_3, \sigma_3)
but we do not know which component x[i] comes from!
Can we estimate \theta = [c_1\ c_2\ c_3\ \mu_1\ \mu_2\ \mu_3\ \sigma_1\ \sigma_2\ \sigma_3]^T from the observations?
\arg\max_\theta p(X; \theta) = \arg\max_\theta \prod_{i=1}^{N} p(x_i; \theta)    (5)

Estimation of the Parameters of a GMM
Easier problem: we know the component for each of the observations x[0], x[1], ..., x[12].
X_1 = \{x[0], x[3], x[5]\} belongs to p_1(x; \mu_1, \sigma_1)
X_2 = \{x[1], x[2], x[8], x[9]\} belongs to p_2(x; \mu_2, \sigma_2)
X_3 = \{x[4], x[6], x[7], x[10], x[11], x[12]\} belongs to p_3(x; \mu_3, \sigma_3)
From X_1 = \{x[0], x[3], x[5]\}:
\hat{c}_1 = \frac{3}{13}, \qquad \hat{\mu}_1 = \frac{1}{3}\{x[0] + x[3] + x[5]\}, \qquad \hat{\sigma}_1^2 = \frac{1}{3}\{ (x[0] - \hat{\mu}_1)^2 + (x[3] - \hat{\mu}_1)^2 + (x[5] - \hat{\mu}_1)^2 \}
In practice we do not know which observation comes from which pdf. How do we solve for \arg\max_\theta p(X; \theta)?

Incomplete and Complete Data
x[0], x[1], ..., x[N-1]: incomplete data.
Introduce another set of variables y[0], y[1], ..., y[N-1] such that y[i] = 1 if x[i] comes from p_1, y[i] = 2 if x[i] comes from p_2, and y[i] = 3 if x[i] comes from p_3.
Observations: x[0], ..., x[12]; missing component labels: y[0], ..., y[12].
y[i] is the missing (unobserved) data: the information about the component.
z = (x, y) is the complete data: the observations and which density they come from.
p(z; \theta) = p(x, y; \theta), \qquad \hat{\theta}_{MLE} = \arg\max_\theta p(z; \theta)
Question: but how do we find which observation belongs to which density?

Given observation x[0] and a parameter guess \theta^g, what is the probability of x[0] coming from the first component?
p(y[0] = 1 | x[0]; \theta^g) = \frac{p(y[0] = 1, x[0]; \theta^g)}{p(x[0]; \theta^g)}
= \frac{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, p(y[0] = 1)}{\sum_{j=1}^{3} p(x[0] | y[0] = j; \theta^g)\, p(y[0] = j)}
= \frac{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, c_1^g}{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, c_1^g + p(x[0] | y[0] = 2; \mu_2^g, \sigma_2^g)\, c_2^g + p(x[0] | y[0] = 3; \mu_3^g, \sigma_3^g)\, c_3^g}
All the parameters are known, so we can calculate p(y[0] = 1 | x[0]; \theta^g); similarly we calculate p(y[0] = 2 | x[0]; \theta^g) and p(y[0] = 3 | x[0]; \theta^g).
Which density? y[0] = \arg\max_i p(y[0] = i | x[0]; \theta^g): hard allocation.
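A minimal sketch of this posterior (responsibility) computation for one scalar observation, assuming an initial guess of weights, means and standard deviations; all numeric values are illustrative.

    # p(y = j | x; theta^g): posterior probability of each component given one observation.
    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

    def responsibilities(x, c, mu, sigma):
        joint = c * gaussian_pdf(x, mu, sigma)        # c_j * N(x; mu_j, sigma_j^2)
        return joint / joint.sum()                    # normalise over components

    c = np.array([1 / 3, 1 / 3, 1 / 3])               # illustrative initial guess
    mu = np.array([3.0, 5.0, 7.0])
    sigma = np.array([1.0, 1.0, 1.0])
    print(responsibilities(4.1, c, mu, sigma))        # soft assignment of x[0]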

Parameter Estimation for Hard Allocation
A table of posteriors p(y[j] = i | x[j]; \theta^g), i = 1, 2, 3, is computed for the samples x[0], ..., x[6]; each sample is then assigned to the component with the largest posterior.
Hard assignment: y[0] = 1, y[1] = 1, y[2] = 2, y[3] = 3, y[4] = 2, y[5] = 2, y[6] = 2.
Updated parameters: \hat{c}_1 = \frac{2}{7}, \hat{c}_2 = \frac{4}{7}, \hat{c}_3 = \frac{1}{7} (different from the initial guess!).
Similarly (for a Gaussian) find \hat{\mu}_i and \hat{\sigma}_i^2 for the i-th pdf.

Parameter Estimation for Soft Assignment
For each sample x[0], ..., x[6], compute the posteriors p(y[j] = 1 | x[j]; \theta^g), p(y[j] = 2 | x[j]; \theta^g), p(y[j] = 3 | x[j]; \theta^g).
Example: the probability of each sample belonging to component 1 is p(y[0] = 1 | x[0]; \theta^g), p(y[1] = 1 | x[1]; \theta^g), p(y[2] = 1 | x[2]; \theta^g), ...
The average probability that a sample belongs to component 1 is
\hat{c}_1^{new} = \frac{1}{N} \sum_{i} p(y[i] = 1 | x[i]; \theta^g) = \frac{2.2}{7} \ \text{(in the numerical example)}

Soft Assignment: Estimation of Means and Variances
Recall: the probability of sample j belonging to component i is p(y[j] = i | x[j]; \theta^g).
Soft assignment: the parameters are estimated by taking weighted averages!
\mu_1^{new} = \frac{\sum_{i=1}^{N} x_i\, p(y[i] = 1 | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = 1 | x[i]; \theta^g)}
(\sigma_1^2)^{new} = \frac{\sum_{i=1}^{N} (x_i - \mu_1^{new})^2\, p(y[i] = 1 | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = 1 | x[i]; \theta^g)}
These are the updated parameters, starting from the initial guess \theta^g.

Maximum Likelihood Estimation of the Parameters of a GMM
1. Make an initial guess of the parameters: \theta^g = (c_1^g, c_2^g, c_3^g, \mu_1^g, \mu_2^g, \mu_3^g, \sigma_1^g, \sigma_2^g, \sigma_3^g).
2. Knowing the parameters \theta^g, find the probability of sample x_i belonging to the j-th component, p(y[i] = j | x[i]; \theta^g), for i = 1, 2, ..., N (observations) and j = 1, 2, ..., M (components).
3. \hat{c}_j^{new} = \frac{1}{N} \sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)
4. \hat{\mu}_j^{new} = \frac{\sum_{i=1}^{N} x_i\, p(y[i] = j | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)}
5. (\hat{\sigma}_j^2)^{new} = \frac{\sum_{i=1}^{N} (x_i - \hat{\mu}_j^{new})^2\, p(y[i] = j | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)}
6. Go back to step 2 and repeat until convergence.
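A compact EM sketch for a 1-D GMM following steps 1-6 above. This is an illustrative single-script version (function names and initialisation strategy are assumptions; no safeguards for degenerate components).

    # EM for a 1-D Gaussian mixture: E-step computes responsibilities, M-step re-estimates
    # weights, means and variances; repeat until (approximate) convergence.
    import numpy as np

    def em_gmm_1d(x, M, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        N = len(x)
        c = np.full(M, 1.0 / M)                       # mixture weights (initial guess)
        mu = rng.choice(x, size=M, replace=False)     # initial means drawn from the data
        var = np.full(M, np.var(x))                   # initial variances
        for _ in range(n_iter):
            # E-step: p(y[i] = j | x[i]; theta^g)
            comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            joint = comp * c
            resp = joint / joint.sum(axis=1, keepdims=True)   # shape (N, M)
            # M-step: re-estimate parameters from the soft assignments
            Nj = resp.sum(axis=0)
            c = Nj / N
            mu = (resp * x[:, None]).sum(axis=0) / Nj
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        return c, mu, var

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)])
    print(em_gmm_1d(data, M=2))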


Practical Issues
Live demonstration: akaho/mixtureem.html
EM can get stuck in local optima of the likelihood.
EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is typically used prior to applying the EM algorithm.

Size of a GMM
The Bayesian Information Criterion (BIC) value of a GMM G can be defined as follows:
BIC(G | X) = \log p(X | \hat{G}) - \frac{d}{2} \log N
where \hat{G} represents the GMM with the ML parameter configuration, d is the number of parameters in G, and N is the size of the dataset.
The first term is the log-likelihood term; the second term is the model-complexity penalty term. BIC selects the best GMM, corresponding to the largest BIC value, by trading off these two terms.
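A minimal sketch of this criterion for a 1-D GMM, assuming d = 3M - 1 free parameters (M - 1 independent weights, M means, M variances); fit models of several sizes (e.g. with the EM sketch above) and keep the one with the largest BIC value.

    # BIC(G | X) = log p(X | G_hat) - (d / 2) log N for a fitted 1-D GMM.
    import numpy as np

    def gmm_loglik(x, c, mu, var):
        comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        return np.log((comp * c).sum(axis=1)).sum()

    def bic(x, c, mu, var):
        d = 3 * len(c) - 1                            # free parameters of a 1-D GMM
        return gmm_loglik(x, c, mu, var) - 0.5 * d * np.log(len(x))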

Source: Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N. C. Oza et al. (Eds.), LNCS 3541, 2005.
The BIC criterion can discover the true GMM size effectively, as shown in the figure (from the above source).

Maximum A Posteriori (MAP) Estimation
Sometimes it is difficult to get a sufficient number of examples for robust estimation of parameters. However, one may have access to a large number of similar examples, which can be utilised: adapt the target distribution from such a distribution.
For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.

MAP Adaptation
ML estimation: \hat{\theta}_{MLE} = \arg\max_\theta p(X | \theta)
MAP estimation:
\hat{\theta}_{MAP} = \arg\max_\theta p(\theta | X) = \arg\max_\theta \frac{p(X | \theta)\, p(\theta)}{p(X)} = \arg\max_\theta p(X | \theta)\, p(\theta)
p(\theta) is the a priori distribution of the parameters \theta. A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior.

Simple Implementation of MAP-GMMs
Source: Statistical Machine Learning from Data: GMM; Samy Bengio.

Simple Implementation
Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt the parameters to a new speaker using some adaptation data X.
Let \alpha \in [0, 1] be a parameter that describes the faith in the prior model.
Adapted weight of the j-th mixture:
\hat{w}_j = \gamma \left[ \alpha\, w_j^p + (1 - \alpha) \sum_i p(j | x_i) \right]
Here \gamma is a normalisation factor such that \sum_j \hat{w}_j = 1.

Simple Implementation (contd.)
Means:
\hat{\mu}_j = \alpha\, \mu_j^p + (1 - \alpha) \frac{\sum_i p(j | x_i)\, x_i}{\sum_i p(j | x_i)}
a weighted average of the sample mean and the prior mean.
Variances:
\hat{\sigma}_j = \alpha \left( \sigma_j^p + \mu_j^p {\mu_j^p}^T \right) + (1 - \alpha) \frac{\sum_i p(j | x_i)\, x_i x_i^T}{\sum_i p(j | x_i)} - \hat{\mu}_j \hat{\mu}_j^T
(here \sigma_j denotes the covariance of mixture j).
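A minimal sketch of this simple MAP adaptation for a 1-D GMM, following the three update rules above. The responsibilities p(j | x_i) are computed under the prior model; the function name and shapes are assumptions for illustration.

    # Simple MAP adaptation of a 1-D GMM: interpolate between prior parameters and
    # the statistics of the adaptation data x, with faith parameter alpha in [0, 1].
    import numpy as np

    def map_adapt_1d(x, alpha, w_p, mu_p, var_p):
        comp = np.exp(-0.5 * (x[:, None] - mu_p) ** 2 / var_p) / np.sqrt(2 * np.pi * var_p)
        resp = comp * w_p
        resp /= resp.sum(axis=1, keepdims=True)       # p(j | x_i) under the prior model
        Nj = resp.sum(axis=0)                         # sum_i p(j | x_i)
        w_new = alpha * w_p + (1 - alpha) * Nj
        w_new /= w_new.sum()                          # gamma: renormalise the weights
        mu_new = alpha * mu_p + (1 - alpha) * (resp * x[:, None]).sum(axis=0) / Nj
        second = (resp * x[:, None] ** 2).sum(axis=0) / Nj
        var_new = alpha * (var_p + mu_p ** 2) + (1 - alpha) * second - mu_new ** 2
        return w_new, mu_new, var_new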

HMM
The primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.
The acoustic manifestation of a phoneme is mostly determined by: the configuration of the articulators (jaw, tongue, lips); the physiology and emotional state of the speaker; the phonetic context.
An HMM models sequential patterns, and speech is a sequential pattern. Most text-dependent speaker recognition systems use HMMs; text verification involves verification/recognition of phonemes.

Phoneme Recognition
Consider two phoneme classes, /aa/ and /iy/. Problem: determine to which class a given sound belongs.
Processing of the speech signal results in a sequence of feature (observation) vectors o_1, ..., o_T (say, MFCC vectors).
We say the speech is /aa/ if p(aa | O) > p(iy | O). Using Bayes rule, compare
\underbrace{p(O | aa)}_{\text{acoustic model}} \underbrace{p(aa)}_{\text{prior prob.}} / p(O) \quad \text{vs.} \quad p(O | iy)\, p(iy) / p(O)
Given p(O | aa), p(aa), p(O | iy) and p(iy), which is more probable?

Parameter Estimation of the Acoustic Model
How do we find the density functions p_{aa}(\cdot) and p_{iy}(\cdot)? We assume a parametric model: p_{aa}(\cdot) parameterised by \theta_{aa}, p_{iy}(\cdot) parameterised by \theta_{iy}.
Training phase: collect many examples of /aa/ being said, compute the corresponding observations o_1, ..., o_{T_{aa}}, and use the maximum likelihood principle:
\hat{\theta}_{aa} = \arg\max_{\theta_{aa}} p(O; \theta_{aa})
Recall: if the pdf is modelled as a Gaussian mixture model, then we use the EM algorithm.

Modelling of a Phoneme
To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.
We can think of 3 distinct time periods: transition from the previous phoneme; steady state; transition to the next phoneme.
The features in the 3 time intervals are quite different, so we use different density functions to model the three intervals: p_{aa}^1(\cdot; \theta_{aa}^1), p_{aa}^2(\cdot; \theta_{aa}^2), p_{aa}^3(\cdot; \theta_{aa}^3).
We also need to model the time durations of these intervals: transition probabilities.

Stochastic Model (HMM)
(Figure: a 3-state HMM with self-transition probabilities such as a_{11}, and a spectral output density p(f) versus frequency f (Hz) attached to each state.)

HMM Model of a Phoneme
We use the term "state" for each of the three time periods. The probability of o_t from the j-th state, i.e. p_{aa}^j(o_t; \theta_{aa}^j), is denoted b_j(o_t).
(Figure: observations o_1, o_2, ..., o_{10} emitted by states 1, 2, 3 with densities p_1(\cdot), p_2(\cdot), p_3(\cdot).)
Which state density generated observation o_t? Only the observations are seen; the state sequence is hidden.
Recall: in a GMM, the mixture component is hidden.

Probability of the Observation
Recall: to classify, we evaluate Pr(O | \Lambda), where \Lambda are the parameters of the model. In the /aa/ vs. /iy/ calculation: Pr(O | \Lambda_{aa}) vs. Pr(O | \Lambda_{iy}).
Example: a 2-state HMM and 3 observations o_1, o_2, o_3. The model parameters are assumed known:
transition probabilities a_{11}, a_{12}, a_{21} and a_{22}, which model the time durations;
state densities b_1(o_t) and b_2(o_t); the b_j(o_t) are usually modelled as a single Gaussian with parameters \mu_j, \sigma_j^2, or by GMMs.

Probability of the Observation along one Path
With T = 3 observations and N = 2 states, there are 8 paths through the 2 states for the 3 observations.
Example: path P_1 through states 1, 1, 1.
Pr\{O | P_1, \Lambda\} = b_1(o_1)\, b_1(o_2)\, b_1(o_3)
Probability of path P_1: Pr\{P_1 | \Lambda\} = a_{01}\, a_{11}\, a_{11}
Pr\{O, P_1 | \Lambda\} = Pr\{O | P_1, \Lambda\}\, Pr\{P_1 | \Lambda\} = a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{11} b_1(o_3)

Probability of the Observation
Path (states for o_1, o_2, o_3) and the corresponding p(O, P_i | \Lambda):
P_1 (1,1,1): a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{11} b_1(o_3)
P_2 (1,1,2): a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{12} b_2(o_3)
P_3 (1,2,1): a_{01} b_1(o_1) \cdot a_{12} b_2(o_2) \cdot a_{21} b_1(o_3)
P_4 (1,2,2): a_{01} b_1(o_1) \cdot a_{12} b_2(o_2) \cdot a_{22} b_2(o_3)
P_5 (2,1,1): a_{02} b_2(o_1) \cdot a_{21} b_1(o_2) \cdot a_{11} b_1(o_3)
P_6 (2,1,2): a_{02} b_2(o_1) \cdot a_{21} b_1(o_2) \cdot a_{12} b_2(o_3)
P_7 (2,2,1): a_{02} b_2(o_1) \cdot a_{22} b_2(o_2) \cdot a_{21} b_1(o_3)
P_8 (2,2,2): a_{02} b_2(o_1) \cdot a_{22} b_2(o_2) \cdot a_{22} b_2(o_3)
p(O | \Lambda) = \sum_i P\{O, P_i | \Lambda\} = \sum_i P\{O | P_i, \Lambda\}\, P\{P_i | \Lambda\}
Forward algorithm: avoid repeated calculations. For example,
a_{01} b_1(o_1)\, a_{11} b_1(o_2) \cdot a_{11} b_1(o_3) + a_{02} b_2(o_1)\, a_{21} b_1(o_2) \cdot a_{11} b_1(o_3) = [a_{01} b_1(o_1)\, a_{11} b_1(o_2) + a_{02} b_2(o_1)\, a_{21} b_1(o_2)]\, a_{11} b_1(o_3)
(two multiplications by a_{11} b_1(o_3) are reduced to one).

Forward Algorithm Recursion
Let \alpha_1(t=1) = a_{01}\, b_1(o_1) and \alpha_2(t=1) = a_{02}\, b_2(o_1).
Recursion:
\alpha_1(t=2) = [a_{01} b_1(o_1) \cdot a_{11} + a_{02} b_2(o_1) \cdot a_{21}]\, b_1(o_2) = [\alpha_1(t=1)\, a_{11} + \alpha_2(t=1)\, a_{21}]\, b_1(o_2)

General Recursion in the Forward Algorithm
\alpha_j(t) = \left[ \sum_i \alpha_i(t-1)\, a_{ij} \right] b_j(o_t) = P\{o_1, o_2, ..., o_t, s_t = j | \Lambda\}
Note: \alpha_j(t) is the sum of the probabilities of all paths ending at state j at time t with partial observation sequence o_1, o_2, ..., o_t.
The probability of the entire observation sequence (o_1, o_2, ..., o_T) is therefore
p(O | \Lambda) = \sum_{j=1}^{N} P\{o_1, o_2, ..., o_T, s_T = j | \Lambda\} = \sum_{j=1}^{N} \alpha_j(T)
where N is the number of states.
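A minimal sketch of this forward recursion. It assumes entry probabilities a0[j], a transition matrix A[i, j] and precomputed state likelihoods B[t, j] = b_j(o_t); the numbers below are illustrative and not tied to any particular toolkit.

    # Forward algorithm: alpha[t, j] = (sum_i alpha[t-1, i] a_ij) * b_j(o_t).
    import numpy as np

    def forward(a0, A, B):
        T, N = B.shape
        alpha = np.zeros((T, N))
        alpha[0] = a0 * B[0]                          # alpha_j(1) = a_{0j} b_j(o_1)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[t]      # recursion over states
        return alpha, alpha[-1].sum()                 # p(O | Lambda) = sum_j alpha_j(T)

    a0 = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.1],                         # b_j(o_t) for t = 1..3, j = 1..2
                  [0.3, 0.4],
                  [0.2, 0.5]])
    alpha, pO = forward(a0, A, B)
    print(pO)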

Backward Algorithm
Analogous to the forward algorithm, but starting from the last time instant T.
Example:
a_{01} b_1(o_1)\, a_{11} b_1(o_2)\, a_{11} b_1(o_3) + a_{01} b_1(o_1)\, a_{11} b_1(o_2)\, a_{12} b_2(o_3) + ... = [a_{01} b_1(o_1)\, a_{11} b_1(o_2)]\,(a_{11} b_1(o_3) + a_{12} b_2(o_3))
\beta_1(t=2) = p\{o_3 | s_{t=2} = 1, \Lambda\}
= p\{o_3, s_{t=3} = 1 | s_{t=2} = 1, \Lambda\} + p\{o_3, s_{t=3} = 2 | s_{t=2} = 1, \Lambda\}
= p\{o_3 | s_{t=3} = 1, s_{t=2} = 1, \Lambda\}\, p\{s_{t=3} = 1 | s_{t=2} = 1, \Lambda\} + p\{o_3 | s_{t=3} = 2, s_{t=2} = 1, \Lambda\}\, p\{s_{t=3} = 2 | s_{t=2} = 1, \Lambda\}
= b_1(o_3)\, a_{11} + b_2(o_3)\, a_{12}

General Recursion in the Backward Algorithm
\beta_j(t): given that we are in state j at time t, the sum of the probabilities of all paths such that the partial sequence o_{t+1}, ..., o_T is observed.
\beta_i(t) = \sum_{j=1}^{N} \underbrace{a_{ij}\, b_j(o_{t+1})}_{\text{going from state } i \text{ to each state } j} \underbrace{\beta_j(t+1)}_{\text{prob. of observing } o_{t+2}, ..., o_T \text{ given state } j \text{ at } t+1} = p\{o_{t+1}, ..., o_T | s_t = i, \Lambda\}
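A companion sketch of the backward recursion, using the same assumed conventions (A, B) as the forward sketch above; beta_j(T) is initialised to 1.

    # Backward algorithm: beta[t, i] = sum_j a_ij b_j(o_{t+1}) beta[t+1, j].
    import numpy as np

    def backward(A, B):
        T, N = B.shape
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                # beta_j(T) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[t + 1] * beta[t + 1])    # recursion backwards in time
        return beta
        # p(O | Lambda) can also be recovered as (a0 * B[0] * beta[0]).sum()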

Estimation of the Parameters of the HMM Model
Given known model parameters \Lambda, we evaluated p(O | \Lambda), which is useful for classification; the forward or backward algorithm provides an efficient implementation.
Given a set of observation vectors o_t, how do we estimate the parameters of the HMM? We do not know which states the o_t come from, analogous to the GMM, where we do not know which component. We use a special case of EM, the Baum-Welch algorithm, together with the following relation from the forward/backward algorithms:
p(O | \Lambda) = \sum_{j=1}^{N} \alpha_j(T) = \sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t) \quad \text{for any } t

Parameter Estimation for a Known State Sequence
Assume each state is modelled as a single Gaussian:
\hat{\mu}_j = sample mean of the observations assigned to state j
\hat{\sigma}_j^2 = variance of the observations assigned to state j
Transition probability from state i to j = (number of times a transition was made from i to j) / (total number of times we made a transition from i)
In practice we do not know which state generated each observation, so we will do a probabilistic (soft) assignment.

Review of GMM Parameter Estimation
We do not know which component of the GMM generated an output observation. Given initial model parameters \Lambda^g and the observation sequence x_1, ..., x_T, find the probability that x_i comes from component j (soft assignment):
p[\text{comp} = j | x_i; \Lambda^g] = \frac{p[\text{comp} = j, x_i | \Lambda^g]}{p[x_i | \Lambda^g]}
So the re-estimation equations are:
\hat{c}_j = \frac{1}{T} \sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)
\hat{\mu}_j^{new} = \frac{\sum_{i=1}^{T} x_i\, p(\text{comp} = j | x_i; \Lambda^g)}{\sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)}
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{T} (x_i - \hat{\mu}_j)^2\, p(\text{comp} = j | x_i; \Lambda^g)}{\sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)}
A similar analogy holds for hidden Markov models.

Baum-Welch Algorithm
Here we do not know which observation o_t comes from which state s_i. Again, as with the GMM, we assume an initial guess of the parameters, \Lambda^g.
Then the probability of being in state i at time t and state j at time t+1 is
\tau_t(i, j) = p\{q_t = i, q_{t+1} = j | O, \Lambda^g\} = \frac{p\{q_t = i, q_{t+1} = j, O | \Lambda^g\}}{p\{O | \Lambda^g\}}
where p\{O | \Lambda^g\} = \sum_i \alpha_i(T).

Baum-Welch Algorithm (contd.)
From the ideas of the forward-backward algorithm, the numerator is
p\{q_t = i, q_{t+1} = j, O | \Lambda^g\} = \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)
So
\tau_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{p\{O | \Lambda^g\}}

Estimating the Transition Probability
Transition probability from state i to j = (number of times a transition was made from i to j) / (total number of times we made a transition from i)
\tau_t(i, j) is the probability of being in state i at time t and state j at time t+1. If we sum \tau_t(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So a revised estimate of the transition probability is
\hat{a}_{ij}^{new} = \frac{\sum_{t=1}^{T-1} \tau_t(i, j)}{\sum_{t=1}^{T-1} \underbrace{\sum_{j=1}^{N} \tau_t(i, j)}_{\text{all transitions out of } i \text{ at time } t}}

Estimating the State-Density Parameters
Analogous to the GMM question of which observation belonged to which component. Let \gamma_i(t) = \sum_j \tau_t(i, j) = \alpha_i(t)\, \beta_i(t) / p\{O | \Lambda^g\} be the probability of being in state i at time t. The new estimates of the state pdf parameters are (assuming a single Gaussian per state)
\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, o_t}{\sum_{t=1}^{T} \gamma_i(t)}
\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, (o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^T}{\sum_{t=1}^{T} \gamma_i(t)}
These are weighted averages, weighted by the probability of being in state i at time t.
Given the observations, the HMM model parameters are estimated iteratively, and p(O | \Lambda) is evaluated efficiently by the forward/backward algorithm.
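A sketch of one Baum-Welch re-estimation pass built from these quantities. It assumes alpha and beta come from the forward/backward sketches above, single-Gaussian state densities with diagonal covariance, and an observation matrix obs of shape (T, d); all names are illustrative, and the state likelihoods B would be recomputed from the new means and variances before the next iteration.

    # One Baum-Welch re-estimation pass: tau and gamma from alpha/beta, then updates.
    import numpy as np

    def reestimate(A, B, obs, alpha, beta):
        T, N = B.shape
        pO = alpha[-1].sum()
        # tau_t(i, j) = alpha_i(t) a_ij b_j(o_{t+1}) beta_j(t+1) / p(O | Lambda)
        tau = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :] / pO
        gamma = alpha * beta / pO                     # prob. of being in state i at time t
        A_new = tau.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        mu_new = (gamma[:, :, None] * obs[:, None, :]).sum(axis=0) / gamma.sum(axis=0)[:, None]
        diff = obs[:, None, :] - mu_new[None]         # shape (T, N, d)
        var_new = (gamma[:, :, None] * diff ** 2).sum(axis=0) / gamma.sum(axis=0)[:, None]
        return A_new, mu_new, var_new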

Viterbi Algorithm
Given the observation sequence, the goal is to find the corresponding state sequence that generated it; there are many possible combinations (N^T) of state sequences, i.e. many paths.
One possible criterion: choose the state sequence corresponding to the path with maximum probability, \max_i P\{O, P_i | \Lambda\}.
A word is represented as a sequence of phones; a phone is represented as a sequence of states. Optimal state sequence -> optimal phone sequence -> word sequence.

Viterbi Algorithm and Forward Algorithm
Recall the forward algorithm: we found the probability along each path and summed over all N^T possible paths, \sum_i p\{O, P_i | \Lambda\}.
Viterbi is just a special case of the forward algorithm: at each node, instead of the sum of the probabilities of all paths, choose the path with maximum probability.
In practice, p(O | \Lambda) is often approximated by the Viterbi score (instead of the forward algorithm).
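A minimal sketch of the Viterbi recursion under the same assumed conventions (a0, A, B) as the forward sketch above: the same recursion with max instead of sum, plus back-pointers to recover the best state sequence.

    # Viterbi: delta[t, j] = max_i delta[t-1, i] a_ij * b_j(o_t), with backtracking.
    import numpy as np

    def viterbi(a0, A, B):
        T, N = B.shape
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)             # back-pointers
        delta[0] = a0 * B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A        # delta_i(t-1) a_ij for all i, j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[t]
        path = [int(delta[-1].argmax())]              # best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))        # follow back-pointers
        return path[::-1], delta[-1].max()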

Viterbi Algorithm (figure)

Decoding
Recall: the desired transcription \hat{W} is obtained by maximising
\hat{W} = \arg\max_W p(W | O)
A search over all possible W is astronomically large! Viterbi search finds the most likely path through an HMM: the sequence of phones (states) which is most probable. Mostly, the most probable sequence of phones corresponds to the most probable sequence of words.

Training of a Speech Recognition System
HMM parameters are estimated using large databases (of the order of 100 hours of speech). The parameters are estimated using the maximum likelihood criterion.

Recognition (figure)

Speaker Recognition
The spectra (formants) of a given sound are different for different speakers. (Figure: spectra of 2 speakers for one frame of /iy/.)
Derive a speaker-dependent model of a new speaker by MAP adaptation of the speaker-independent (SI) model using a small amount of adaptation data; use it for speaker recognition.

References
Pattern Classification, R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Introduction to Statistical Pattern Recognition, K. Fukunaga, Academic Press.
The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience, 2nd edition.
Fundamentals of Speech Recognition, Lawrence Rabiner and Biing-Hwang Juang, Englewood Cliffs, NJ: PTR Prentice Hall (Signal Processing Series), 1993.

Hidden Markov Models for Speech Recognition, X. D. Huang, Y. Ariki and M. A. Jack, Edinburgh: Edinburgh University Press, 1990.
Statistical Methods for Speech Recognition, F. Jelinek, The MIT Press, Cambridge, MA.
Maximum Likelihood from Incomplete Data via the EM Algorithm, A. P. Dempster, N. M. Laird and D. B. Rubin, J. Royal Statistical Society B, 39(1), pp. 1-38, 1977.
Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. Speech and Audio Processing, 2(2), 1994.
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J. A. Bilmes, ICSI Technical Report.
Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N. C. Oza et al. (Eds.), LNCS 3541, 2005.
