Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM)


Samudravijaya K, Tata Institute of Fundamental Research, Mumbai (chief@tifr.res.in), 09-JAN-2009.
The majority of the slides are taken from S. Umesh's tutorial on ASR (WiSSAP 2006).

Pattern Recognition
Input -> Signal Processing -> Pattern Matching -> Output, with a training path (Model Generation) feeding the pattern matcher used during testing.
GMM: static patterns. HMM: sequential patterns.

Basic Probability
Joint and conditional probability:
p(A, B) = p(A | B)\, p(B) = p(B | A)\, p(A)
Bayes rule:
p(A | B) = \frac{p(B | A)\, p(A)}{p(B)}
If the A_i are mutually exclusive and exhaustive events,
p(B) = \sum_i p(B | A_i)\, p(A_i), \qquad p(A | B) = \frac{p(B | A)\, p(A)}{\sum_i p(B | A_i)\, p(A_i)}

Normal Distribution
Many phenomena are described by the Gaussian pdf
p(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)    (1)
The pdf is parameterised by \theta = [\mu, \sigma^2], where \mu is the mean and \sigma^2 the variance. It is a convenient pdf: second-order statistics are sufficient.
Example: heights of pygmies follow a Gaussian pdf with \mu = 4 ft and std-dev \sigma = 1 ft; heights of bushmen follow a Gaussian pdf with \mu = 6 ft and \sigma = 1 ft.
Question: if we arbitrarily pick a person from a population, what is the probability of the height being a particular value?
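As a small illustration (not part of the original slides), the following Python sketch evaluates the Gaussian pdf of Eq. (1) for the pygmy/bushman example; the function name and the exact numeric values are assumptions for illustration only.

    # A minimal sketch of the pygmy/bushman example: evaluate N(x; mu, sigma^2)
    # for an observed height under the two assumed population models.
    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """N(x; mu, sigma^2) for scalar or array x."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

    height = 4 + 1 / 12.0                                 # 4'1" expressed in feet
    print(gaussian_pdf(height, mu=4.0, sigma=1.0))        # density under the pygmy model
    print(gaussian_pdf(height, mu=6.0, sigma=1.0))        # density under the bushman model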

If I arbitrarily pick a pygmy, say x, then
Pr(\text{height of } x = 4'1'') = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (4'1'' - 4')^2 \right)    (2)
Note: here the mean and variance are fixed; only the observation x changes.
Also note: Pr(x = 4'1'') > Pr(x = 5') and Pr(x = 4'1'') > Pr(x = 3').

Conversely: given that a person's height is 4'1'', the person is more likely to be a pygmy than a bushman.
If we observe the heights of many persons, say 3'6'', 4'1'', 3'8'', 4'5'', 4'7'', 4', 6'5'', and all are from the same population (i.e. either pygmies or bushmen), then we become more certain that the population is pygmy.
The more the observations, the better our decision will be.

Likelihood Function
x[0], x[1], ..., x[N-1]: a set of independent observations from a pdf parameterised by \theta.
Previous example: x[0], ..., x[N-1] are the observed heights and \theta is the mean of the density, which is unknown (\sigma^2 assumed known).
L(X; \theta) = p(x_0, ..., x_{N-1}; \theta) = \prod_{i=0}^{N-1} p(x_i; \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=0}^{N-1} (x_i - \theta)^2 \right)    (3)
L(X; \theta) is a function of \theta and is called the likelihood function.
Given x_0, ..., x_{N-1}, what can we say about the value of \theta, i.e. what is the best estimate of \theta?

Maximum Likelihood Estimation
Example: we know the height of a person, x[0] = 4'4''. It is most likely to have come from which pdf: \theta = 3', 4'6'' or 6'?
Take the maximum of L(x[0]; \theta = 3'), L(x[0]; \theta = 4'6'') and L(x[0]; \theta = 6'): choose \theta = 4'6''.
If \theta is just a parameter, we choose \arg\max_\theta L(x[0]; \theta).

Maximum Likelihood Estimator
Given x[0], x[1], ..., x[N-1] and a pdf parameterised by \theta = [\theta_1, \theta_2, ..., \theta_m]^T, we form the likelihood function
L(X; \theta) = \prod_{i=0}^{N-1} p(x_i; \theta), \qquad \hat{\theta}_{MLE} = \arg\max_\theta L(X; \theta)
For the height problem one can show \hat{\theta}_{MLE} = \frac{1}{N} \sum_i x_i: the estimate of the mean of the Gaussian is the sample mean of the measured heights.
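A small numerical sketch (not from the slides) of this claim: maximising the log-likelihood over a grid of candidate means recovers the sample mean. The data are synthetic and the names are illustrative assumptions.

    # Check numerically that the ML estimate of a Gaussian mean equals the sample mean.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=4.0, scale=1.0, size=50)       # synthetic heights, true mean 4 ft

    def log_likelihood(theta, x, sigma=1.0):
        return np.sum(-0.5 * ((x - theta) / sigma) ** 2 - 0.5 * np.log(2 * np.pi * sigma ** 2))

    grid = np.linspace(2.0, 6.0, 4001)
    theta_hat = grid[np.argmax([log_likelihood(t, x) for t in grid])]
    print(theta_hat, x.mean())                        # both are close to 4.0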

Bayesian Estimation
MLE: \theta is assumed unknown but deterministic.
Bayesian approach: \theta is assumed random with pdf p(\theta), the prior knowledge.
\underbrace{p(\theta | x)}_{\text{posterior}} = \frac{p(x | \theta)\, p(\theta)}{p(x)} \propto p(x | \theta) \underbrace{p(\theta)}_{\text{prior}}
Height problem: the unknown mean is random with a Gaussian pdf N(\gamma, \nu^2):
p(\mu) = \frac{1}{\sqrt{2\pi\nu^2}} \exp\left( -\frac{1}{2\nu^2} (\mu - \gamma)^2 \right)
Then
\hat{\mu}_{Bayesian} = \frac{\sigma^2 \gamma + n \nu^2 \bar{x}}{\sigma^2 + n \nu^2}
a weighted average of the sample mean and the prior mean.
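A minimal sketch of this weighted-average estimate, assuming known observation variance sigma^2 and a Gaussian prior N(gamma, nu^2); the data values and names are illustrative, not from the slides.

    # Bayesian estimate of a Gaussian mean: weighted average of prior mean and sample mean.
    import numpy as np

    def bayes_mean(x, sigma2, gamma, nu2):
        n = len(x)
        return (sigma2 * gamma + n * nu2 * np.mean(x)) / (sigma2 + n * nu2)

    x = np.array([3.5, 4.1, 3.8, 4.5, 4.7, 4.0])      # illustrative observed heights (ft)
    print(bayes_mean(x, sigma2=1.0, gamma=5.0, nu2=0.25))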

Gaussian Mixture Model
p(x) = \alpha\, N(x; \mu_1, \sigma_1) + (1 - \alpha)\, N(x; \mu_2, \sigma_2)
p(x) = \sum_{m=1}^{M} w_m\, N(x; \mu_m, \sigma_m), \qquad \sum_m w_m = 1
Characteristics of a GMM: just as ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
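A minimal sketch of evaluating a 1-D mixture density of this form; the weights, means and standard deviations below are illustrative assumptions, not values from the slides.

    # Evaluate p(x) = sum_m w_m N(x; mu_m, sigma_m^2) for a 1-D GMM.
    import numpy as np

    def gmm_pdf(x, weights, means, sigmas):
        x = np.atleast_1d(x)[:, None]                 # shape (n, 1)
        comp = np.exp(-0.5 * ((x - means) / sigmas) ** 2) / np.sqrt(2 * np.pi * sigmas ** 2)
        return comp @ weights                         # shape (n,)

    w = np.array([0.3, 0.5, 0.2])                     # must sum to 1
    mu = np.array([-2.0, 0.0, 3.0])
    sd = np.array([0.5, 1.0, 0.8])
    print(gmm_pdf([0.0, 1.5], w, mu, sd))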

General Assumption in GMM
Assume that there are M components. Each component generates data from a Gaussian with mean \mu_m and covariance matrix \Sigma_m.

GMM
Consider the probability density function shown in solid blue (figure). It is useful to parameterise or model this seemingly arbitrary pdf.

Gaussian Mixture Model (contd.)
The pdf is actually a mixture of 3 Gaussians, i.e.
p(x) = c_1\, N(x; \mu_1, \sigma_1) + c_2\, N(x; \mu_2, \sigma_2) + c_3\, N(x; \mu_3, \sigma_3), \qquad \sum_i c_i = 1    (4)
pdf parameters: c_1, c_2, c_3, \mu_1, \mu_2, \mu_3, \sigma_1, \sigma_2, \sigma_3.

Observations from a GMM
Experiment: an urn contains balls of 3 different colours: red, blue and green. Behind a curtain, a person picks a ball from the urn.
If a red ball: generate x[i] from N(x; \mu_1, \sigma_1).
If a blue ball: generate x[i] from N(x; \mu_2, \sigma_2).
If a green ball: generate x[i] from N(x; \mu_3, \sigma_3).
We have access only to the observations x[0], x[1], ..., x[N-1]. Therefore
p(x[i]; \theta) = c_1\, N(x; \mu_1, \sigma_1) + c_2\, N(x; \mu_2, \sigma_2) + c_3\, N(x; \mu_3, \sigma_3)
but we do not know which component x[i] comes from!
Can we estimate \theta = [c_1\ c_2\ c_3\ \mu_1\ \mu_2\ \mu_3\ \sigma_1\ \sigma_2\ \sigma_3]^T from the observations?
\arg\max_\theta p(X; \theta) = \arg\max_\theta \prod_{i=1}^{N} p(x_i; \theta)    (5)

Estimation of the Parameters of a GMM
Easier problem: we know the component for each of the observations x[0], x[1], ..., x[12].
X_1 = \{x[0], x[3], x[5]\} belongs to p_1(x; \mu_1, \sigma_1)
X_2 = \{x[1], x[2], x[8], x[9]\} belongs to p_2(x; \mu_2, \sigma_2)
X_3 = \{x[4], x[6], x[7], x[10], x[11], x[12]\} belongs to p_3(x; \mu_3, \sigma_3)
From X_1 = \{x[0], x[3], x[5]\}:
\hat{c}_1 = \frac{3}{13}, \qquad \hat{\mu}_1 = \frac{1}{3}\{x[0] + x[3] + x[5]\}, \qquad \hat{\sigma}_1^2 = \frac{1}{3}\{ (x[0] - \hat{\mu}_1)^2 + (x[3] - \hat{\mu}_1)^2 + (x[5] - \hat{\mu}_1)^2 \}
In practice we do not know which observation comes from which pdf. How do we solve for \arg\max_\theta p(X; \theta)?

Incomplete and Complete Data
x[0], x[1], ..., x[N-1]: incomplete data.
Introduce another set of variables y[0], y[1], ..., y[N-1] such that y[i] = 1 if x[i] comes from p_1, y[i] = 2 if x[i] comes from p_2, and y[i] = 3 if x[i] comes from p_3.
Observations: x[0], ..., x[12]; missing component labels: y[0], ..., y[12].
y[i] is the missing (unobserved) data: the information about the component.
z = (x, y) is the complete data: the observations and which density they come from.
p(z; \theta) = p(x, y; \theta), \qquad \hat{\theta}_{MLE} = \arg\max_\theta p(z; \theta)
Question: but how do we find which observation belongs to which density?

Given observation x[0] and a parameter guess \theta^g, what is the probability of x[0] coming from the first component?
p(y[0] = 1 | x[0]; \theta^g) = \frac{p(y[0] = 1, x[0]; \theta^g)}{p(x[0]; \theta^g)}
= \frac{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, p(y[0] = 1)}{\sum_{j=1}^{3} p(x[0] | y[0] = j; \theta^g)\, p(y[0] = j)}
= \frac{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, c_1^g}{p(x[0] | y[0] = 1; \mu_1^g, \sigma_1^g)\, c_1^g + p(x[0] | y[0] = 2; \mu_2^g, \sigma_2^g)\, c_2^g + p(x[0] | y[0] = 3; \mu_3^g, \sigma_3^g)\, c_3^g}
All the parameters are known, so we can calculate p(y[0] = 1 | x[0]; \theta^g); similarly we calculate p(y[0] = 2 | x[0]; \theta^g) and p(y[0] = 3 | x[0]; \theta^g).
Which density? y[0] = \arg\max_i p(y[0] = i | x[0]; \theta^g): hard allocation.
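A minimal sketch of this posterior (responsibility) computation for one scalar observation, assuming an initial guess of weights, means and standard deviations; all numeric values are illustrative.

    # p(y = j | x; theta^g): posterior probability of each component given one observation.
    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

    def responsibilities(x, c, mu, sigma):
        joint = c * gaussian_pdf(x, mu, sigma)        # c_j * N(x; mu_j, sigma_j^2)
        return joint / joint.sum()                    # normalise over components

    c = np.array([1 / 3, 1 / 3, 1 / 3])               # illustrative initial guess
    mu = np.array([3.0, 5.0, 7.0])
    sigma = np.array([1.0, 1.0, 1.0])
    print(responsibilities(4.1, c, mu, sigma))        # soft assignment of x[0]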

Parameter Estimation for Hard Allocation
A table of posteriors p(y[j] = i | x[j]; \theta^g), i = 1, 2, 3, is computed for the samples x[0], ..., x[6]; each sample is then assigned to the component with the largest posterior.
Hard assignment: y[0] = 1, y[1] = 1, y[2] = 2, y[3] = 3, y[4] = 2, y[5] = 2, y[6] = 2.
Updated parameters: \hat{c}_1 = \frac{2}{7}, \hat{c}_2 = \frac{4}{7}, \hat{c}_3 = \frac{1}{7} (different from the initial guess!).
Similarly (for a Gaussian) find \hat{\mu}_i and \hat{\sigma}_i^2 for the i-th pdf.

Parameter Estimation for Soft Assignment
For each sample x[0], ..., x[6], compute the posteriors p(y[j] = 1 | x[j]; \theta^g), p(y[j] = 2 | x[j]; \theta^g), p(y[j] = 3 | x[j]; \theta^g).
Example: the probability of each sample belonging to component 1 is p(y[0] = 1 | x[0]; \theta^g), p(y[1] = 1 | x[1]; \theta^g), p(y[2] = 1 | x[2]; \theta^g), ...
The average probability that a sample belongs to component 1 is
\hat{c}_1^{new} = \frac{1}{N} \sum_{i} p(y[i] = 1 | x[i]; \theta^g) = \frac{2.2}{7} \ \text{(in the numerical example)}

Soft Assignment: Estimation of Means and Variances
Recall: the probability of sample j belonging to component i is p(y[j] = i | x[j]; \theta^g).
Soft assignment: the parameters are estimated by taking weighted averages!
\mu_1^{new} = \frac{\sum_{i=1}^{N} x_i\, p(y[i] = 1 | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = 1 | x[i]; \theta^g)}
(\sigma_1^2)^{new} = \frac{\sum_{i=1}^{N} (x_i - \mu_1^{new})^2\, p(y[i] = 1 | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = 1 | x[i]; \theta^g)}
These are the updated parameters, starting from the initial guess \theta^g.

Maximum Likelihood Estimation of the Parameters of a GMM
1. Make an initial guess of the parameters: \theta^g = (c_1^g, c_2^g, c_3^g, \mu_1^g, \mu_2^g, \mu_3^g, \sigma_1^g, \sigma_2^g, \sigma_3^g).
2. Knowing the parameters \theta^g, find the probability of sample x_i belonging to the j-th component, p(y[i] = j | x[i]; \theta^g), for i = 1, 2, ..., N (observations) and j = 1, 2, ..., M (components).
3. \hat{c}_j^{new} = \frac{1}{N} \sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)
4. \hat{\mu}_j^{new} = \frac{\sum_{i=1}^{N} x_i\, p(y[i] = j | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)}
5. (\hat{\sigma}_j^2)^{new} = \frac{\sum_{i=1}^{N} (x_i - \hat{\mu}_j^{new})^2\, p(y[i] = j | x[i]; \theta^g)}{\sum_{i=1}^{N} p(y[i] = j | x[i]; \theta^g)}
6. Go back to step 2 and repeat until convergence.
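A compact EM sketch for a 1-D GMM following steps 1-6 above. This is an illustrative single-script version (function names and initialisation strategy are assumptions; no safeguards for degenerate components).

    # EM for a 1-D Gaussian mixture: E-step computes responsibilities, M-step re-estimates
    # weights, means and variances; repeat until (approximate) convergence.
    import numpy as np

    def em_gmm_1d(x, M, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        N = len(x)
        c = np.full(M, 1.0 / M)                       # mixture weights (initial guess)
        mu = rng.choice(x, size=M, replace=False)     # initial means drawn from the data
        var = np.full(M, np.var(x))                   # initial variances
        for _ in range(n_iter):
            # E-step: p(y[i] = j | x[i]; theta^g)
            comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
            joint = comp * c
            resp = joint / joint.sum(axis=1, keepdims=True)   # shape (N, M)
            # M-step: re-estimate parameters from the soft assignments
            Nj = resp.sum(axis=0)
            c = Nj / N
            mu = (resp * x[:, None]).sum(axis=0) / Nj
            var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nj
        return c, mu, var

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 0.5, 200)])
    print(em_gmm_1d(data, M=2))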


Practical Issues
Live demonstration: akaho/mixtureem.html
EM can get stuck in local optima of the likelihood.
EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is typically used prior to applying the EM algorithm.

Size of a GMM
The Bayesian Information Criterion (BIC) value of a GMM G can be defined as follows:
BIC(G | X) = \log p(X | \hat{G}) - \frac{d}{2} \log N
where \hat{G} represents the GMM with the ML parameter configuration, d is the number of parameters in G, and N is the size of the dataset.
The first term is the log-likelihood term; the second term is the model-complexity penalty term. BIC selects the best GMM, corresponding to the largest BIC value, by trading off these two terms.
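A minimal sketch of this criterion for a 1-D GMM, assuming d = 3M - 1 free parameters (M - 1 independent weights, M means, M variances); fit models of several sizes (e.g. with the EM sketch above) and keep the one with the largest BIC value.

    # BIC(G | X) = log p(X | G_hat) - (d / 2) log N for a fitted 1-D GMM.
    import numpy as np

    def gmm_loglik(x, c, mu, var):
        comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        return np.log((comp * c).sum(axis=1)).sum()

    def bic(x, c, mu, var):
        d = 3 * len(c) - 1                            # free parameters of a 1-D GMM
        return gmm_loglik(x, c, mu, var) - 0.5 * d * np.log(len(x))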

Source: Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N. C. Oza et al. (Eds.), LNCS 3541, 2005.
The BIC criterion can discover the true GMM size effectively, as shown in the figure (from the above source).

Maximum A Posteriori (MAP) Estimation
Sometimes it is difficult to get a sufficient number of examples for robust estimation of parameters. However, one may have access to a large number of similar examples, which can be utilised: adapt the target distribution from such a distribution.
For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.

MAP Adaptation
ML estimation: \hat{\theta}_{MLE} = \arg\max_\theta p(X | \theta)
MAP estimation:
\hat{\theta}_{MAP} = \arg\max_\theta p(\theta | X) = \arg\max_\theta \frac{p(X | \theta)\, p(\theta)}{p(X)} = \arg\max_\theta p(X | \theta)\, p(\theta)
p(\theta) is the a priori distribution of the parameters \theta. A conjugate prior is chosen such that the corresponding posterior belongs to the same functional family as the prior.

Simple Implementation of MAP-GMMs
Source: Statistical Machine Learning from Data: GMM; Samy Bengio.

Simple Implementation
Train a prior model p with a large amount of available data (say, from multiple speakers). Adapt the parameters to a new speaker using some adaptation data X.
Let \alpha \in [0, 1] be a parameter that describes the faith in the prior model.
Adapted weight of the j-th mixture:
\hat{w}_j = \gamma \left[ \alpha\, w_j^p + (1 - \alpha) \sum_i p(j | x_i) \right]
Here \gamma is a normalisation factor such that \sum_j \hat{w}_j = 1.

Simple Implementation (contd.)
Means:
\hat{\mu}_j = \alpha\, \mu_j^p + (1 - \alpha) \frac{\sum_i p(j | x_i)\, x_i}{\sum_i p(j | x_i)}
a weighted average of the sample mean and the prior mean.
Variances:
\hat{\sigma}_j = \alpha \left( \sigma_j^p + \mu_j^p {\mu_j^p}^T \right) + (1 - \alpha) \frac{\sum_i p(j | x_i)\, x_i x_i^T}{\sum_i p(j | x_i)} - \hat{\mu}_j \hat{\mu}_j^T
(here \sigma_j denotes the covariance of mixture j).
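A minimal sketch of this simple MAP adaptation for a 1-D GMM, following the three update rules above. The responsibilities p(j | x_i) are computed under the prior model; the function name and shapes are assumptions for illustration.

    # Simple MAP adaptation of a 1-D GMM: interpolate between prior parameters and
    # the statistics of the adaptation data x, with faith parameter alpha in [0, 1].
    import numpy as np

    def map_adapt_1d(x, alpha, w_p, mu_p, var_p):
        comp = np.exp(-0.5 * (x[:, None] - mu_p) ** 2 / var_p) / np.sqrt(2 * np.pi * var_p)
        resp = comp * w_p
        resp /= resp.sum(axis=1, keepdims=True)       # p(j | x_i) under the prior model
        Nj = resp.sum(axis=0)                         # sum_i p(j | x_i)
        w_new = alpha * w_p + (1 - alpha) * Nj
        w_new /= w_new.sum()                          # gamma: renormalise the weights
        mu_new = alpha * mu_p + (1 - alpha) * (resp * x[:, None]).sum(axis=0) / Nj
        second = (resp * x[:, None] ** 2).sum(axis=0) / Nj
        var_new = alpha * (var_p + mu_p ** 2) + (1 - alpha) * second - mu_new ** 2
        return w_new, mu_new, var_new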

HMM
The primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.
The acoustic manifestation of a phoneme is mostly determined by: the configuration of the articulators (jaw, tongue, lips); the physiology and emotional state of the speaker; the phonetic context.
An HMM models sequential patterns, and speech is a sequential pattern. Most text-dependent speaker recognition systems use HMMs; text verification involves verification/recognition of phonemes.

Phoneme Recognition
Consider two phoneme classes, /aa/ and /iy/. Problem: determine to which class a given sound belongs.
Processing of the speech signal results in a sequence of feature (observation) vectors o_1, ..., o_T (say, MFCC vectors).
We say the speech is /aa/ if p(aa | O) > p(iy | O). Using Bayes rule, compare
\underbrace{p(O | aa)}_{\text{acoustic model}} \underbrace{p(aa)}_{\text{prior prob.}} / p(O) \quad \text{vs.} \quad p(O | iy)\, p(iy) / p(O)
Given p(O | aa), p(aa), p(O | iy) and p(iy), which is more probable?

Parameter Estimation of the Acoustic Model
How do we find the density functions p_{aa}(\cdot) and p_{iy}(\cdot)? We assume a parametric model: p_{aa}(\cdot) parameterised by \theta_{aa}, p_{iy}(\cdot) parameterised by \theta_{iy}.
Training phase: collect many examples of /aa/ being said, compute the corresponding observations o_1, ..., o_{T_{aa}}, and use the maximum likelihood principle:
\hat{\theta}_{aa} = \arg\max_{\theta_{aa}} p(O; \theta_{aa})
Recall: if the pdf is modelled as a Gaussian mixture model, then we use the EM algorithm.

Modelling of a Phoneme
To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to the configuration of the next phoneme.
We can think of 3 distinct time periods: transition from the previous phoneme; steady state; transition to the next phoneme.
The features in the 3 time intervals are quite different, so we use different density functions to model the three intervals: p_{aa}^1(\cdot; \theta_{aa}^1), p_{aa}^2(\cdot; \theta_{aa}^2), p_{aa}^3(\cdot; \theta_{aa}^3).
We also need to model the time durations of these intervals: transition probabilities.

Stochastic Model (HMM)
(Figure: a 3-state HMM with self-transition probabilities such as a_{11}, and a spectral output density p(f) versus frequency f (Hz) attached to each state.)

HMM Model of a Phoneme
We use the term "state" for each of the three time periods. The probability of o_t from the j-th state, i.e. p_{aa}^j(o_t; \theta_{aa}^j), is denoted b_j(o_t).
(Figure: observations o_1, o_2, ..., o_{10} emitted by states 1, 2, 3 with densities p_1(\cdot), p_2(\cdot), p_3(\cdot).)
Which state density generated observation o_t? Only the observations are seen; the state sequence is hidden.
Recall: in a GMM, the mixture component is hidden.

Probability of the Observation
Recall: to classify, we evaluate Pr(O | \Lambda), where \Lambda are the parameters of the model. In the /aa/ vs. /iy/ calculation: Pr(O | \Lambda_{aa}) vs. Pr(O | \Lambda_{iy}).
Example: a 2-state HMM and 3 observations o_1, o_2, o_3. The model parameters are assumed known:
transition probabilities a_{11}, a_{12}, a_{21} and a_{22}, which model the time durations;
state densities b_1(o_t) and b_2(o_t); the b_j(o_t) are usually modelled as a single Gaussian with parameters \mu_j, \sigma_j^2, or by GMMs.

Probability of the Observation along one Path
With T = 3 observations and N = 2 states, there are 8 paths through the 2 states for the 3 observations.
Example: path P_1 through states 1, 1, 1.
Pr\{O | P_1, \Lambda\} = b_1(o_1)\, b_1(o_2)\, b_1(o_3)
Probability of path P_1: Pr\{P_1 | \Lambda\} = a_{01}\, a_{11}\, a_{11}
Pr\{O, P_1 | \Lambda\} = Pr\{O | P_1, \Lambda\}\, Pr\{P_1 | \Lambda\} = a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{11} b_1(o_3)

Probability of the Observation
Path (states for o_1, o_2, o_3) and the corresponding p(O, P_i | \Lambda):
P_1 (1,1,1): a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{11} b_1(o_3)
P_2 (1,1,2): a_{01} b_1(o_1) \cdot a_{11} b_1(o_2) \cdot a_{12} b_2(o_3)
P_3 (1,2,1): a_{01} b_1(o_1) \cdot a_{12} b_2(o_2) \cdot a_{21} b_1(o_3)
P_4 (1,2,2): a_{01} b_1(o_1) \cdot a_{12} b_2(o_2) \cdot a_{22} b_2(o_3)
P_5 (2,1,1): a_{02} b_2(o_1) \cdot a_{21} b_1(o_2) \cdot a_{11} b_1(o_3)
P_6 (2,1,2): a_{02} b_2(o_1) \cdot a_{21} b_1(o_2) \cdot a_{12} b_2(o_3)
P_7 (2,2,1): a_{02} b_2(o_1) \cdot a_{22} b_2(o_2) \cdot a_{21} b_1(o_3)
P_8 (2,2,2): a_{02} b_2(o_1) \cdot a_{22} b_2(o_2) \cdot a_{22} b_2(o_3)
p(O | \Lambda) = \sum_i P\{O, P_i | \Lambda\} = \sum_i P\{O | P_i, \Lambda\}\, P\{P_i | \Lambda\}
Forward algorithm: avoid repeated calculations. For example,
a_{01} b_1(o_1)\, a_{11} b_1(o_2) \cdot a_{11} b_1(o_3) + a_{02} b_2(o_1)\, a_{21} b_1(o_2) \cdot a_{11} b_1(o_3) = [a_{01} b_1(o_1)\, a_{11} b_1(o_2) + a_{02} b_2(o_1)\, a_{21} b_1(o_2)]\, a_{11} b_1(o_3)
(two multiplications by a_{11} b_1(o_3) are reduced to one).

Forward Algorithm Recursion
Let \alpha_1(t=1) = a_{01}\, b_1(o_1) and \alpha_2(t=1) = a_{02}\, b_2(o_1).
Recursion:
\alpha_1(t=2) = [a_{01} b_1(o_1) \cdot a_{11} + a_{02} b_2(o_1) \cdot a_{21}]\, b_1(o_2) = [\alpha_1(t=1)\, a_{11} + \alpha_2(t=1)\, a_{21}]\, b_1(o_2)

General Recursion in the Forward Algorithm
\alpha_j(t) = \left[ \sum_i \alpha_i(t-1)\, a_{ij} \right] b_j(o_t) = P\{o_1, o_2, ..., o_t, s_t = j | \Lambda\}
Note: \alpha_j(t) is the sum of the probabilities of all paths ending at state j at time t with partial observation sequence o_1, o_2, ..., o_t.
The probability of the entire observation sequence (o_1, o_2, ..., o_T) is therefore
p(O | \Lambda) = \sum_{j=1}^{N} P\{o_1, o_2, ..., o_T, s_T = j | \Lambda\} = \sum_{j=1}^{N} \alpha_j(T)
where N is the number of states.
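A minimal sketch of this forward recursion. It assumes entry probabilities a0[j], a transition matrix A[i, j] and precomputed state likelihoods B[t, j] = b_j(o_t); the numbers below are illustrative and not tied to any particular toolkit.

    # Forward algorithm: alpha[t, j] = (sum_i alpha[t-1, i] a_ij) * b_j(o_t).
    import numpy as np

    def forward(a0, A, B):
        T, N = B.shape
        alpha = np.zeros((T, N))
        alpha[0] = a0 * B[0]                          # alpha_j(1) = a_{0j} b_j(o_1)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[t]      # recursion over states
        return alpha, alpha[-1].sum()                 # p(O | Lambda) = sum_j alpha_j(T)

    a0 = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
    B = np.array([[0.5, 0.1],                         # b_j(o_t) for t = 1..3, j = 1..2
                  [0.3, 0.4],
                  [0.2, 0.5]])
    alpha, pO = forward(a0, A, B)
    print(pO)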

Backward Algorithm
Analogous to the forward algorithm, but starting from the last time instant T.
Example:
a_{01} b_1(o_1)\, a_{11} b_1(o_2)\, a_{11} b_1(o_3) + a_{01} b_1(o_1)\, a_{11} b_1(o_2)\, a_{12} b_2(o_3) + ... = [a_{01} b_1(o_1)\, a_{11} b_1(o_2)]\,(a_{11} b_1(o_3) + a_{12} b_2(o_3))
\beta_1(t=2) = p\{o_3 | s_{t=2} = 1, \Lambda\}
= p\{o_3, s_{t=3} = 1 | s_{t=2} = 1, \Lambda\} + p\{o_3, s_{t=3} = 2 | s_{t=2} = 1, \Lambda\}
= p\{o_3 | s_{t=3} = 1, s_{t=2} = 1, \Lambda\}\, p\{s_{t=3} = 1 | s_{t=2} = 1, \Lambda\} + p\{o_3 | s_{t=3} = 2, s_{t=2} = 1, \Lambda\}\, p\{s_{t=3} = 2 | s_{t=2} = 1, \Lambda\}
= b_1(o_3)\, a_{11} + b_2(o_3)\, a_{12}

General Recursion in the Backward Algorithm
\beta_j(t): given that we are in state j at time t, the sum of the probabilities of all paths such that the partial sequence o_{t+1}, ..., o_T is observed.
\beta_i(t) = \sum_{j=1}^{N} \underbrace{a_{ij}\, b_j(o_{t+1})}_{\text{going from state } i \text{ to each state } j} \underbrace{\beta_j(t+1)}_{\text{prob. of observing } o_{t+2}, ..., o_T \text{ given state } j \text{ at } t+1} = p\{o_{t+1}, ..., o_T | s_t = i, \Lambda\}
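A companion sketch of the backward recursion, using the same assumed conventions (A, B) as the forward sketch above; beta_j(T) is initialised to 1.

    # Backward algorithm: beta[t, i] = sum_j a_ij b_j(o_{t+1}) beta[t+1, j].
    import numpy as np

    def backward(A, B):
        T, N = B.shape
        beta = np.zeros((T, N))
        beta[-1] = 1.0                                # beta_j(T) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[t + 1] * beta[t + 1])    # recursion backwards in time
        return beta
        # p(O | Lambda) can also be recovered as (a0 * B[0] * beta[0]).sum()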

Estimation of the Parameters of the HMM Model
Given known model parameters \Lambda, we evaluated p(O | \Lambda), which is useful for classification; the forward or backward algorithm provides an efficient implementation.
Given a set of observation vectors o_t, how do we estimate the parameters of the HMM? We do not know which states the o_t come from, analogous to the GMM, where we do not know which component. We use a special case of EM, the Baum-Welch algorithm, together with the following relation from the forward/backward algorithms:
p(O | \Lambda) = \sum_{j=1}^{N} \alpha_j(T) = \sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t) \quad \text{for any } t

Parameter Estimation for a Known State Sequence
Assume each state is modelled as a single Gaussian:
\hat{\mu}_j = sample mean of the observations assigned to state j
\hat{\sigma}_j^2 = variance of the observations assigned to state j
Transition probability from state i to j = (number of times a transition was made from i to j) / (total number of times we made a transition from i)
In practice we do not know which state generated each observation, so we will do a probabilistic (soft) assignment.

Review of GMM Parameter Estimation
We do not know which component of the GMM generated an output observation. Given initial model parameters \Lambda^g and the observation sequence x_1, ..., x_T, find the probability that x_i comes from component j (soft assignment):
p[\text{comp} = j | x_i; \Lambda^g] = \frac{p[\text{comp} = j, x_i | \Lambda^g]}{p[x_i | \Lambda^g]}
So the re-estimation equations are:
\hat{c}_j = \frac{1}{T} \sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)
\hat{\mu}_j^{new} = \frac{\sum_{i=1}^{T} x_i\, p(\text{comp} = j | x_i; \Lambda^g)}{\sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)}
\hat{\sigma}_j^2 = \frac{\sum_{i=1}^{T} (x_i - \hat{\mu}_j)^2\, p(\text{comp} = j | x_i; \Lambda^g)}{\sum_{i=1}^{T} p(\text{comp} = j | x_i; \Lambda^g)}
A similar analogy holds for hidden Markov models.

Baum-Welch Algorithm
Here we do not know which observation o_t comes from which state s_i. Again, as with the GMM, we assume an initial guess of the parameters, \Lambda^g.
Then the probability of being in state i at time t and state j at time t+1 is
\tau_t(i, j) = p\{q_t = i, q_{t+1} = j | O, \Lambda^g\} = \frac{p\{q_t = i, q_{t+1} = j, O | \Lambda^g\}}{p\{O | \Lambda^g\}}
where p\{O | \Lambda^g\} = \sum_i \alpha_i(T).

Baum-Welch Algorithm (contd.)
From the ideas of the forward-backward algorithm, the numerator is
p\{q_t = i, q_{t+1} = j, O | \Lambda^g\} = \alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)
So
\tau_t(i, j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{p\{O | \Lambda^g\}}

Estimating the Transition Probability
Transition probability from state i to j = (number of times a transition was made from i to j) / (total number of times we made a transition from i)
\tau_t(i, j) is the probability of being in state i at time t and state j at time t+1. If we sum \tau_t(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So a revised estimate of the transition probability is
\hat{a}_{ij}^{new} = \frac{\sum_{t=1}^{T-1} \tau_t(i, j)}{\sum_{t=1}^{T-1} \underbrace{\sum_{j=1}^{N} \tau_t(i, j)}_{\text{all transitions out of } i \text{ at time } t}}

Estimating the State-Density Parameters
Analogous to the GMM question of which observation belonged to which component. Let \gamma_i(t) = \sum_j \tau_t(i, j) = \alpha_i(t)\, \beta_i(t) / p\{O | \Lambda^g\} be the probability of being in state i at time t. The new estimates of the state pdf parameters are (assuming a single Gaussian per state)
\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, o_t}{\sum_{t=1}^{T} \gamma_i(t)}
\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_i(t)\, (o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^T}{\sum_{t=1}^{T} \gamma_i(t)}
These are weighted averages, weighted by the probability of being in state i at time t.
Given the observations, the HMM model parameters are estimated iteratively, and p(O | \Lambda) is evaluated efficiently by the forward/backward algorithm.
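A sketch of one Baum-Welch re-estimation pass built from these quantities. It assumes alpha and beta come from the forward/backward sketches above, single-Gaussian state densities with diagonal covariance, and an observation matrix obs of shape (T, d); all names are illustrative, and the state likelihoods B would be recomputed from the new means and variances before the next iteration.

    # One Baum-Welch re-estimation pass: tau and gamma from alpha/beta, then updates.
    import numpy as np

    def reestimate(A, B, obs, alpha, beta):
        T, N = B.shape
        pO = alpha[-1].sum()
        # tau_t(i, j) = alpha_i(t) a_ij b_j(o_{t+1}) beta_j(t+1) / p(O | Lambda)
        tau = alpha[:-1, :, None] * A[None] * (B[1:] * beta[1:])[:, None, :] / pO
        gamma = alpha * beta / pO                     # prob. of being in state i at time t
        A_new = tau.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        mu_new = (gamma[:, :, None] * obs[:, None, :]).sum(axis=0) / gamma.sum(axis=0)[:, None]
        diff = obs[:, None, :] - mu_new[None]         # shape (T, N, d)
        var_new = (gamma[:, :, None] * diff ** 2).sum(axis=0) / gamma.sum(axis=0)[:, None]
        return A_new, mu_new, var_new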

Viterbi Algorithm
Given the observation sequence, the goal is to find the corresponding state sequence that generated it; there are many possible combinations (N^T) of state sequences, i.e. many paths.
One possible criterion: choose the state sequence corresponding to the path with maximum probability, \max_i P\{O, P_i | \Lambda\}.
A word is represented as a sequence of phones; a phone is represented as a sequence of states. Optimal state sequence -> optimal phone sequence -> word sequence.

Viterbi Algorithm and Forward Algorithm
Recall the forward algorithm: we found the probability along each path and summed over all N^T possible paths, \sum_i p\{O, P_i | \Lambda\}.
Viterbi is just a special case of the forward algorithm: at each node, instead of the sum of the probabilities of all paths, choose the path with maximum probability.
In practice, p(O | \Lambda) is often approximated by the Viterbi score (instead of the forward algorithm).
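A minimal sketch of the Viterbi recursion under the same assumed conventions (a0, A, B) as the forward sketch above: the same recursion with max instead of sum, plus back-pointers to recover the best state sequence.

    # Viterbi: delta[t, j] = max_i delta[t-1, i] a_ij * b_j(o_t), with backtracking.
    import numpy as np

    def viterbi(a0, A, B):
        T, N = B.shape
        delta = np.zeros((T, N))
        psi = np.zeros((T, N), dtype=int)             # back-pointers
        delta[0] = a0 * B[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A        # delta_i(t-1) a_ij for all i, j
            psi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[t]
        path = [int(delta[-1].argmax())]              # best final state
        for t in range(T - 1, 0, -1):
            path.append(int(psi[t, path[-1]]))        # follow back-pointers
        return path[::-1], delta[-1].max()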

Viterbi Algorithm (figure)

Decoding
Recall: the desired transcription \hat{W} is obtained by maximising
\hat{W} = \arg\max_W p(W | O)
A search over all possible W is astronomically large! Viterbi search finds the most likely path through an HMM: the sequence of phones (states) which is most probable. Mostly, the most probable sequence of phones corresponds to the most probable sequence of words.

Training of a Speech Recognition System
HMM parameters are estimated using large databases (of the order of 100 hours of speech). The parameters are estimated using the maximum likelihood criterion.

Recognition (figure)

Speaker Recognition
The spectra (formants) of a given sound are different for different speakers. (Figure: spectra of 2 speakers for one frame of /iy/.)
Derive a speaker-dependent model of a new speaker by MAP adaptation of the speaker-independent (SI) model using a small amount of adaptation data; use it for speaker recognition.

References
Pattern Classification, R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.
Introduction to Statistical Pattern Recognition, K. Fukunaga, Academic Press.
The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience, 2nd edition.
Fundamentals of Speech Recognition, Lawrence Rabiner and Biing-Hwang Juang, Englewood Cliffs, NJ: PTR Prentice Hall (Signal Processing Series), 1993.

Hidden Markov Models for Speech Recognition, X. D. Huang, Y. Ariki and M. A. Jack, Edinburgh: Edinburgh University Press, 1990.
Statistical Methods for Speech Recognition, F. Jelinek, The MIT Press, Cambridge, MA.
Maximum Likelihood from Incomplete Data via the EM Algorithm, A. P. Dempster, N. M. Laird and D. B. Rubin, J. Royal Statistical Society B, 39(1), pp. 1-38, 1977.
Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. Speech and Audio Processing, 2(2), 1994.
A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J. A. Bilmes, ICSI Technical Report.
Boosting GMM and Its Two Applications, F. Wang, C. Zhang and N. Lu, in N. C. Oza et al. (Eds.), LNCS 3541, 2005.
