Outline of Today's Lecture


University of Washington, Department of Electrical Engineering
Computer Speech Processing, EE516, Winter 2005
Jeff A. Bilmes
Lecture 12 Slides, Feb 23rd, 2005

Outline of Today's Lecture
- Hidden Markov Model
- Maximum Likelihood training
- The EM algorithm
- Discriminative Training via MMIE

Books & sources
- Bilmes, "A gentle tutorial on the EM algorithm", ICSI Technical Report
- Bahl, "Maximum Mutual Information Estimation of HMM parameters for speech recognition", IEEE ICASSP 86
- Gopalakrishnan, "An Inequality for Rational Functions with applications to some statistical estimation problems", Transactions on Information Theory, Vol 37, No 1, 1991
- Brown, "The Acoustic-Modeling Problem in Automatic Speech Recognition", 1987
- Normandin, "An improved MMIE training algorithm for speaker-independent small-vocabulary continuous speech recognition", ICASSP 1991
- Baum, "An inequality with applications to statistical prediction for functions of Markov processes and to a model of ecology", Bulletin of the American Mathematical Society, Vol 73, 1967

Scribe Assignments
- 1/3: Day 1: Jeff Bilmes
- 1/5: Day 2: Xin Lie
- 1/10: Day 3: Tom Mackel
- 1/12: Day 4: Scott Philips
- 1/19: Day 5: Kevin Duh
- 1/24: Day 6: Andrei Alexandres
- 1/26: Day 7: Travis Wilkins
- 1/31: Day 8: Amar Subramanya
- 2/2: Day 9: Andrei Alexandres
- 2/7: Day 10: Dustin Lennon
- 2/9: Day 11: Jeremy Holleman
- 2/14: Day 12: Mei-yang
- 2/16: Day 13: Amar Subramanya
- 2/23: Day 14: Kevin Duh
- 2/28: Day 15: Travis Wilkins
- 3/2: Day 16: Tom Mackel
- 3/7: Day 17: Jeremy Holleman
- 3/9: Day 18: John Cutter

Final Projects
There will be a deadline every week for the rest of the quarter. S/NS people must do a final project.
Due dates:
- Friday, Feb 18th: 1-page abstract due
- Friday, Feb 25th: 2-page progress report due
- Friday, March 4th: 2-page progress report due
- Friday, March 11th: Final 4-page writeup due
- Tuesday, March 15th: Final Projects
Stop by office hours (Tuesdays, 2:30-4:00pm) if you are unsure of what to do, or send me email (I am available most early evenings to meet).

Hidden Markov Models
Definition: an HMM is a collection of T discrete RVs and T other RVs such that two conditional independence (CI) properties hold true for all t. No other CI statements are true in general, unless they follow from these two. T is itself a random variable whose distribution is a mixture of sums of negative binomials.
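The two conditional independence properties referred to in the definition are commonly stated as below. The notation Q_{1:T} for the discrete (hidden) variables and X_{1:T} for the other (observed) variables is an assumption, and the exact form follows a common statement of the HMM definition rather than the slide itself.

```latex
\begin{align*}
\{Q_{t:T}, X_{t:T}\} &\perp\!\!\!\perp \{Q_{1:t-2}, X_{1:t-1}\} \mid Q_{t-1} \\
X_t &\perp\!\!\!\perp \{Q_{\neg t}, X_{\neg t}\} \mid Q_t
\end{align*}
```

Here Q_{\neg t} denotes all hidden variables except Q_t, and similarly for X_{\neg t}.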

Hidden Markov Models
Main problems to solve associated with HMMs:
1) Compute the probability of evidence p(x)
2) Compute the most likely path (Viterbi path) and, generalizing, find the N most likely paths
3) Train in a maximum likelihood fashion
4) Train in a discriminative fashion

HMM: EM algorithm
Auxiliary function:
Iterative procedure to update parameters:
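A standard form of the auxiliary function and of the update it drives, as in the EM tutorial listed under the sources, is sketched here. The symbols q (a hidden state sequence) and \lambda^{(i)} (the parameters at iteration i) are notational assumptions and may differ from the slide's own.

```latex
\begin{align*}
Q(\lambda, \lambda^{(i)}) &= \sum_{q} p(q \mid x, \lambda^{(i)}) \, \log p(x, q \mid \lambda) \\
\lambda^{(i+1)} &= \arg\max_{\lambda} \; Q(\lambda, \lambda^{(i)})
\end{align*}
```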


HMM: EM algorithm
Optimizing the three parameters of an HMM (summary)

HMM: EM algorithm
Why does this work? Consider aux function:
Algorithm starts with λ_0 and iterates:
repeat until convergence:
Goal is to increase likelihood, i.e.:
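The iterate-until-convergence loop can be sketched for a small discrete-observation HMM. This is a minimal pure-Python illustration, not the lecture's implementation. It assumes the three parameters are an initial distribution `pi`, a transition matrix `A`, and a discrete emission matrix `B`, and the function names `forward_backward` and `em_step` are hypothetical.

```python
import math

def forward_backward(pi, A, B, obs):
    """Scaled forward-backward pass; returns (log p(x), gamma, xi)."""
    n, T = len(pi), len(obs)
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[j] * B[j][obs[0]] for j in range(n)]
        else:
            row = [sum(alpha[t - 1][i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                   for j in range(n)]
        c = sum(row)                      # scaling constant p(x_t | x_1..x_{t-1})
        scale.append(c)
        alpha.append([v / c for v in row])
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n)) / scale[t + 1]
    # gamma[t][i]: posterior of state i at time t
    gamma = [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]
    # xi[t][i][j]: posterior of the transition i -> j between t and t+1
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / scale[t + 1]
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    loglik = sum(math.log(c) for c in scale)
    return loglik, gamma, xi

def em_step(pi, A, B, obs):
    """One EM update; also returns the log-likelihood of the OLD parameters."""
    n, m = len(pi), len(B[0])
    loglik, gamma, xi = forward_backward(pi, A, B, obs)
    new_pi = gamma[0][:]
    new_A = [[sum(x[i][j] for x in xi) / sum(g[i] for g in gamma[:-1])
              for j in range(n)] for i in range(n)]
    new_B = [[sum(g[i] for g, o in zip(gamma, obs) if o == k) / sum(g[i] for g in gamma)
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, loglik
```

Calling `em_step` repeatedly and watching the returned log-likelihood gives a direct check of the stated goal: the value should never decrease from one iteration to the next.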

HMM: EM algorithm
Why EM works to maximize likelihood:

HMM: EM algorithm
So EM maximizes likelihood. Is this the ideal criterion?
Bayes decision rule: for different word models M, the best choice is the model with the largest posterior, m* = argmax_m p(m|x) = argmax_m p(x|m) p(m).
If:
1. the HMM was the correct model for speech
2. we had an infinite amount of training data
3. we found the true global maximum of the likelihood
then ML parameter estimation is optimal (unbiased with minimum variance). The ML estimate will then give an accurate posterior probability for the Bayes rule.

HMM: EM algorithm
The problem is that none of these three things is true in general for speech. A better parameter estimate maximizes the information flow between acoustics and words, i.e., it minimizes the entropy of the words given the acoustics.
Since the posterior probability is key, why not attempt to optimize something related to the posterior directly? The Bayes decision rule picks the model with the largest posterior, so we need an accurate posterior p(m|x).

HMM: Discriminative Training
KL-divergence can help here:

HMM: Discriminative Training
Therefore, this turns into the maximization of a mutual-information-like quantity, and leads to Maximum Mutual Information Estimation (MMIE).
We relate this back to the training data using the weak law of large numbers, which says that if we have enough training data, the above will be approximated.

HMM: Discriminative Training
Given training data D = {(x_i, m_i)}_{i=1}^K, we assume that K is large enough so that we are close. This converges in various ways. The empirical criterion function is:
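A small numeric sketch of such an empirical criterion follows. It assumes the criterion is the average over the training pairs of log p(x_i|m_i) - log p(x_i), with p(x_i) obtained by summing over all word models, which is the usual MMIE form. The function names and the input layout (one row of per-model log-likelihoods per utterance) are hypothetical.

```python
import math

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    mx = max(vals)
    return mx + math.log(sum(math.exp(v - mx) for v in vals))

def mmie_criterion(log_liks, log_priors, labels):
    """Average of log p(x_i | m_i) - log p(x_i) over the training set.

    log_liks[i][m] : log p(x_i | m) for every word model m
    log_priors[m]  : log p(m)
    labels[i]      : index of the correct model m_i for utterance i
    """
    total = 0.0
    for row, m in zip(log_liks, labels):
        # log p(x_i) = log sum_m' p(m') p(x_i | m')
        log_px = log_sum_exp([ll + lp for ll, lp in zip(row, log_priors)])
        total += row[m] - log_px
    return total / len(labels)
```

When every model assigns the same likelihood to an utterance, the criterion is zero: the acoustics carry no information about the word. It grows as the correct model's likelihood pulls ahead of the competitors.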

HMM: Discriminative Training
How to train I_λ(M;X)
Notation: assume either K=1, or all data i.i.d. within vars (m,x).
Method number one: gradient descent. We compare the Maximum Likelihood (ML) gradient with the MMIE gradient:

HMM: Discriminative Training
MMIE gradient in terms of MLE gradient:
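One way to write this relationship, for a single pair (x, m) and assuming the word priors p(m') do not depend on λ, is sketched below. It follows from differentiating log p_λ(x|m) - log Σ_{m'} p(m') p_λ(x|m'); the notation is an assumption and may differ from the slide's.

```latex
\begin{align*}
\nabla_\lambda I_\lambda
  &= \nabla_\lambda \log p_\lambda(x \mid m)
     - \sum_{m'} p_\lambda(m' \mid x) \, \nabla_\lambda \log p_\lambda(x \mid m') \\
  &= \bigl(1 - p_\lambda(m \mid x)\bigr) \, \nabla_\lambda \log p_\lambda(x \mid m)
     - \sum_{m' \neq m} p_\lambda(m' \mid x) \, \nabla_\lambda \log p_\lambda(x \mid m')
\end{align*}
```

The first term is the ML gradient for the correct word, scaled down as its posterior approaches one; the remaining terms push against each competing word in proportion to its posterior.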

HMM: Discriminative Training
The MMIE gradient has terms that want to go in the direction of MLE (based on the true word), but also a term that wants to go in a different direction based on each incorrect word sequence. We don't want the model to score incorrect word sequences highly, so we move the model away from doing so: we decrease the score that p(x|m') gives to words m' that are not m.
So, gradient descent is a reasonable way to optimize parameters, but:
- no convergence guarantees
- expensive
- only first order (higher order methods, e.g. Newton, might be better)

HMM: Discriminative Training
Alternative strategy: extend the Baum-Welch update equations (using Gopalakrishnan's method, 1991).
Extended Baum-Welch: maximizes rational functions of polynomials of probabilities with non-negative coefficients (like an MMIE criterion).
Rational functions of polynomials of probabilities with non-negative coefficients, examples:

General rational functions of polynomials of probabilities with non-negative coefficients, examples:
Polynomials defined over multiple simplexes, D:
We have ratios of such polynomials S_1 and S_2:

How is it that our problem is in this form?
Therefore, for discrete observation parameters, both numerator and denominator are of the right form (polynomials with non-negative coefficients). For Gaussian densities, this has been extended (see Normandin 91); we don't cover this today.

Definition: a growth transformation of D for R(λ) is a map T from D to D such that moving from λ to T(λ) will either increase R() or leave it alone.
Our goal: find a growth transformation for R().
- show that for a particular polynomial, P, a growth transformation for P is also a growth transformation for R
- transform the polynomial so that it is applicable to a theorem that gives us a growth transformation for a suitable polynomial
- apply it to MMIE estimation with discrete parameters

Homogeneous polynomial means that the degree of each monomial in the sum is the same. Ex: x^3 + x^2 y + x y^2 + y^3
This gives us a growth transformation, but for the wrong type of object:
- The theorem applies to homogeneous polys of degree d with non-negative coefficients
- We've got R(), a ratio of polynomials
- We might have negative coefficients (as we will see)
- The polynomials might be non-homogeneous
We define a 3-step procedure to go from R() to a P() that satisfies the theorem, and such that if T() is a growth transformation for P(), it is also one for R().
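The theorem referred to here is presumably the Baum-Eagon inequality from the Baum 1967 entry in the sources. Under that assumption, the growth transformation on a single simplex maps each λ_i to λ_i ∂P/∂λ_i, renormalized to sum to one. The sketch below applies it to the example polynomial above; the dictionary encoding of the polynomial and all function names are hypothetical.

```python
import math

def poly_value(poly, lam):
    """Evaluate a polynomial given as {exponent_tuple: coefficient}."""
    return sum(c * math.prod(l ** e for l, e in zip(lam, exps))
               for exps, c in poly.items())

def poly_grad(poly, lam):
    """Partial derivatives of the polynomial with respect to each variable."""
    grad = [0.0] * len(lam)
    for exps, c in poly.items():
        for i, e in enumerate(exps):
            if e == 0:
                continue  # this monomial does not depend on variable i
            term = c * e
            for j, (l, ej) in enumerate(zip(lam, exps)):
                term *= l ** (ej - 1 if j == i else ej)
            grad[i] += term
    return grad

def growth_transform(poly, lam):
    """Assumed Baum-Eagon update: lam_i * dP/dlam_i, renormalized."""
    weights = [l * g for l, g in zip(lam, poly_grad(poly, lam))]
    z = sum(weights)
    return [w / z for w in weights]

# x^3 + x^2 y + x y^2 + y^3: homogeneous of degree 3, non-negative coefficients
EXAMPLE_POLY = {(3, 0): 1.0, (2, 1): 1.0, (1, 2): 1.0, (0, 3): 1.0}
```

Starting from λ = (0.3, 0.7), one application of `growth_transform(EXAMPLE_POLY, [0.3, 0.7])` moves mass toward the second coordinate and the polynomial's value rises, with no step size to tune.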

Step 1: Move away from a ratio of polynomials to just a polynomial. With R(λ) = S_1(λ)/S_2(λ), define
P_λ(ξ) = S_1(ξ) - R(λ) S_2(ξ)
Therefore, if P_λ(ξ) > P_λ(λ) = 0, then R(ξ) > R(λ), since S_2(ξ) > 0 and we can solve for R(ξ) = S_1(ξ)/S_2(ξ) in S_1(ξ) - R(λ) S_2(ξ) > 0.
So if we've got a growth transform for P_λ, then we've got one for R(λ).
We still need to deal with the non-negative-coefficient requirement and homogeneity (negativity can arise due to the subtraction).

Step 2: Dealing with negative coefficients
Define new polynomial P':
where a_λ is P's minimal negative coefficient (or a_λ = 0 if none), and d is P's degree.
Therefore, we've added a constant to P to get P', gotten a polynomial with non-negative coefficients as a result, and have not changed the effect of any growth transformations.


Step 3: Dealing with non-homogeneous polys while simultaneously preserving growth transforms
Form new polynomial P'':
This is variable substitution with:
On new set of constrained simplexes D':

Step 3: Therefore, D and D' are isomorphic, and there is a bijection between λ ∈ D and ψ ∈ D'. Any growth function in D' for P'' will thus be a growth function in D for P' (and by step 2 a growth function for P, and by step 1 a growth function for R). But P'' satisfies the criterion for Baum's theorem, so we construct a growth function for P'' and use it for R (undoing steps 1-3 when necessary).

We can combine steps 1-3 and Baum's theorem into a new theorem that gives us growth functions for rational functions R(), as follows:

We can apply this to MMIE training, where in this case we get (for uniform word priors)
In the discrete case we get update equations:
and similarly for the other HMM parameters.
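A minimal sketch of an extended Baum-Welch style update on one simplex follows. It assumes the update takes the commonly cited form λ_i (∂ log R/∂λ_i + C), renormalized, as in the Gopalakrishnan reference, rather than reproducing the slide's exact equations. The toy rational function R(x, y) = x^2 y / (x^3 + y^3) and all function names are hypothetical and stand in for an MMIE criterion.

```python
def ebw_update(lam, grad_log_r, C):
    """Assumed EBW form: lam_i * (d log R / d lam_i + C), renormalized."""
    if any(g + C <= 0 for g in grad_log_r):
        # C must exceed a lower bound so that every new parameter stays positive
        raise ValueError("C is too small: every (gradient + C) term must be positive")
    weights = [l * (g + C) for l, g in zip(lam, grad_log_r)]
    z = sum(weights)
    return [w / z for w in weights]

def r_example(lam):
    """Toy ratio of polynomials with non-negative coefficients."""
    x, y = lam
    return (x * x * y) / (x ** 3 + y ** 3)

def grad_log_r_example(lam):
    """Analytic gradient of log r_example."""
    x, y = lam
    s2 = x ** 3 + y ** 3
    return [2.0 / x - 3.0 * x * x / s2, 1.0 / y - 3.0 * y * y / s2]
```

From λ = (0.5, 0.5) with C = 10 the update lands near (0.55, 0.45) and the toy R rises from 0.5 to roughly 0.529. A larger C gives a smaller, safer step, and too small a C is rejected outright.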

Note that we have a lower bound on C. We can prove convergence if C is large enough, but as C gets larger, convergence takes a long time: a tradeoff. Heuristic: choose the least possible value and double it.
Next time: practical aspects of MMIE training.

On the lighter side


HMMs
Before getting into discriminative training, we'll rest by covering a few lighter issues regarding HMMs. There are various types of observation densities:
- Continuous observations (often modeled by mixtures)
- Discrete (vector quantization)
- Semi-continuous HMMs, which use densities of the form:

HMMs
Some states might be non-emitting (i.e., have no associated observation), for example the start and stop states. These are indicated by concentric circles. We can see this as augmenting the output alphabet with a special symbol indicating a non-emission.

HMMs
Two different ways to draw (and think of) the SFSA view of HMMs:
- Rabiner (Moore) HMMs: states are circles, transitions are arrows between circles. Observations are associated with the states.
- Jelinek (Mealy) HMMs: observations are associated with the edges (transitions) between state circles.

HMMs
- The representations have the same representational capacity: Mealy with n states == Moore with n^2 states.
- The Mealy approach is good for representing finite state transducers (FSTs): given that you are in a current state, a particular input will cause a probabilistic transition to another state and will cause an output symbol to be emitted (also probabilistically).
- FSTs are amenable to manipulation (composition, hierarchies, etc.)
- Word-graphs and lattices (we'll see these in a few weeks) use this form of representation.


HMMs for ASR
The A matrix Markov chain topology (where the zeros are) is crucial for good performance.
Most common: strictly left-to-right (used for sub-word sequences): /c/ /a/ /a/ /t/
Generally: upper triangular. L2R helps discriminability (ability to distinguish one sound from another). More on this when we cover discriminative training.

HMMs for ASR
Duration model is rich: minimum duration modeling, achieved using multiple states sharing the same output distribution. This also helps discriminability. The duration during which p(x) is active cannot be less than 4!
This moreover gives us a new distribution. Sum of geometric distributions: if D_i is geometric

HMMs for ASR
Duration model is rich: [figure: exponential and negative-binomial duration distributions]
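The sum-of-geometrics claim can be checked numerically. The sketch below assumes a left-to-right chain of states that all share the same self-loop probability `a`, so each state's occupancy is geometric on {1, 2, ...} and the total duration is their sum. The function names are hypothetical.

```python
import math

def geometric_pmf(a, max_d):
    """P(D = d) = a^(d-1) * (1 - a) for d >= 1; index 0 holds P(D = 0) = 0."""
    return [0.0] + [(a ** (d - 1)) * (1 - a) for d in range(1, max_d + 1)]

def convolve(p, q, max_d):
    """PMF of the sum of two independent durations, truncated at max_d."""
    out = [0.0] * (max_d + 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j <= max_d:
                out[i + j] += pi * qj
    return out

def chain_duration_pmf(a, n_states, max_d):
    """Duration PMF of n_states chained states with identical self-loop a."""
    pmf = geometric_pmf(a, max_d)
    for _ in range(n_states - 1):
        pmf = convolve(pmf, geometric_pmf(a, max_d), max_d)
    return pmf

def negative_binomial_pmf(a, n_states, d):
    """Closed form for the same sum: zero below the minimum duration."""
    if d < n_states:
        return 0.0
    return math.comb(d - 1, n_states - 1) * (1 - a) ** n_states * a ** (d - n_states)
```

With four chained states, `chain_duration_pmf(0.8, 4, 30)` assigns zero probability to durations below 4 and matches the negative-binomial closed form elsewhere, whereas a single state gives the monotonically decaying geometric (discrete exponential) shape.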
