Hidden Markov Models, Discriminative Training, and Modern Speech Recognition. Li Deng, Microsoft Research. UW-EE516; Lecture, May 21, 2009

1 Hidden Markov Models, Discriminative Training, and Modern Speech Recognition. Li Deng, Microsoft Research. UW-EE516; Lecture, May 21, 2009

2 Architecture of a speech recognition system. [Figure: block diagram with the components Voice, Signal Processing, Decoder, Acoustic Models, Language Models, Adaptation, and Application.]

3 Fundamental Equation of Speech Recognition: $\hat{W} = \arg\max_{W} P(W \mid x) = \arg\max_{W} P(x \mid W)\, P(W)$. Roles of acoustic modeling and Hidden Markov Model (HMM).
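The decision rule above can be illustrated with a minimal sketch. It assumes a small, explicitly enumerated set of candidate word sequences with precomputed log scores; the function name `decode` and the toy scores are hypothetical, and a real decoder searches a far larger hypothesis space.

```python
def decode(acoustic_loglik, lm_logprob):
    """Return the candidate W maximising log P(x | W) + log P(W).

    acoustic_loglik: dict mapping each candidate W to log P(x | W)
    lm_logprob:      dict mapping each candidate W to log P(W)
    """
    return max(acoustic_loglik,
               key=lambda w: acoustic_loglik[w] + lm_logprob[w])


# Toy scores (illustrative assumptions, not measured values).
ACOUSTIC = {"recognize speech": -10.0, "wreck a nice beach": -9.5}
LANGUAGE = {"recognize speech": -2.0, "wreck a nice beach": -6.0}
BEST = decode(ACOUSTIC, LANGUAGE)
```

In this toy case the acoustic model slightly prefers the second candidate, but the language model prior reverses the decision.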

4 Outline. Introduction to HMMs; Basic ML Training Techniques. HMM Parameter Learning: EM (maximum likelihood); discriminative learning (intro). Engineering Intricacy in Using HMMs in Speech Recognition: left-to-right structure; use of lexicon and pronunciation modeling; use of context dependency (variance reduction); complexity control (decision tree and automatic parameter tying). Beyond HMM for Speech Modeling/Recognition: better generative models for speech dynamics; discriminative models, e.g., CRF and feature engineering. Discriminative Training in Speech Recognition: a unifying framework.

5 HMM Introduction: Description. The 1-coin model is observable because the output sequence can be mapped to a specific sequence of state transitions. The remaining models are hidden because the underlying state sequence cannot be directly inferred from the output sequence.

6 Description: Specification of an HMM. A: the state transition probability matrix, $a_{ij} = P(q_{t+1} = j \mid q_t = i)$. B: the observation probability distribution, $b_j(k) = P(o_t = k \mid q_t = j)$, $1 \le k \le M$. π: the initial state distribution.

7 Central problems for HMM. Problem 1: Evaluation. Probability of occurrence of a particular observation sequence, $O = \{o_1, \ldots, o_k\}$, given the model: $P(O \mid \lambda)$. Not straightforward, because the states are hidden. Useful in sequence classification (e.g., isolated-word recognition).

8 Problem 1: Evaluation. Naïve computation: $P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\, P(q \mid \lambda)$, where $P(q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$. The sum is over all state paths. There are $N^T$ state paths, each costing $O(T)$ calculations, leading to $O(T N^T)$ time complexity.

9 Problem 1: Need an efficient solution. Define the auxiliary forward variable α: $\alpha_t(i) = P(o_1, \ldots, o_t, q_t = i \mid \lambda)$. $\alpha_t(i)$ is the probability of observing the partial sequence of observables $o_1, \ldots, o_t$ such that at time t the state is $q_t = i$. Recursion on α then provides the most efficient solution to Problem 1 (the forward algorithm).
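The slide names the forward recursion without writing it out. The sketch below uses the standard textbook form (initialisation $\alpha_1(i) = \pi_i b_i(o_1)$, induction $\alpha_{t+1}(j) = [\sum_i \alpha_t(i) a_{ij}]\, b_j(o_{t+1})$, termination $P(O \mid \lambda) = \sum_i \alpha_T(i)$) for a discrete-observation HMM and checks it against the naïve sum over all $N^T$ paths. The toy model values and function names are assumptions made for illustration.

```python
from itertools import product


def forward_likelihood(pi, A, B, obs):
    """Return P(O | lambda) with the forward recursion, O(T * N^2) time."""
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)


def brute_force_likelihood(pi, A, B, obs):
    """Return P(O | lambda) by enumerating all N^T state paths (check only)."""
    n, total = len(pi), 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Toy two-state, two-symbol model (illustrative values).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1, 0]
FAST = forward_likelihood(PI, A, B, OBS)
SLOW = brute_force_likelihood(PI, A, B, OBS)
```

Both functions return the same probability; only the forward version remains feasible as T grows.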

10 Central problems for HMM. Problem 2: Decoding. Optimal state sequence to produce the given observations, $O = \{o_1, \ldots, o_k\}$, given the model. Optimality criterion. Useful in recognition problems (e.g., continuous speech recognition).

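11 Problem 2: Decoding. Recursion: $\delta_t(j) = \max_{1 \le i \le N} \big(\delta_{t-1}(i)\, a_{ij}\big)\, b_j(o_t)$, $\psi_t(j) = \arg\max_{1 \le i \le N} \big(\delta_{t-1}(i)\, a_{ij}\big)$, for $2 \le t \le T$, $1 \le j \le N$. Termination: $P^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$. Backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, for $t = T-1, \ldots, 1$. $P^*$ gives the state-optimised probability. $Q^*$ is the optimal state sequence ($Q^* = \{q_1^*, q_2^*, \ldots, q_T^*\}$).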
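The recursion, termination, and backtracking steps can be sketched as follows for a discrete-observation HMM. The initialisation $\delta_1(i) = \pi_i b_i(o_1)$ is the standard one and is not shown on the slide; the toy model and the function name `viterbi` are assumptions for illustration. Probabilities are multiplied directly, so this is only suitable for short sequences (a practical version would work in the log domain).

```python
def viterbi(pi, A, B, obs):
    """Return (P*, Q*): best-path probability and optimal state sequence."""
    n = len(pi)
    # Initialisation: delta_1(i) = pi_i * b_i(o_1)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    psi = []  # one list of back-pointers per time step t >= 2
    for o in obs[1:]:
        back, new_delta = [], []
        for j in range(n):
            # psi_t(j) = argmax_i delta_{t-1}(i) * a_ij
            best_i = max(range(n), key=lambda i: delta[i] * A[i][j])
            back.append(best_i)
            # delta_t(j) = max_i (delta_{t-1}(i) * a_ij) * b_j(o_t)
            new_delta.append(delta[best_i] * A[best_i][j] * B[j][o])
        psi.append(back)
        delta = new_delta
    # Termination: P* = max_i delta_T(i), q_T* = argmax_i delta_T(i)
    last = max(range(n), key=lambda i: delta[i])
    p_star = delta[last]
    # Backtracking: q_t* = psi_{t+1}(q_{t+1}*)
    path = [last]
    for back in reversed(psi):
        path.append(back[path[-1]])
    path.reverse()
    return p_star, path


# Toy two-state, two-symbol model (illustrative values).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
P_STAR, Q_STAR = viterbi(PI, A, B, [0, 1, 0])
```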

12 Central problems for HMM. Problem 3: Learning of HMM parameters. Determine the optimum model, given a training set of observations. Find λ such that $P(O \mid \lambda)$ is maximal (ML), or such that class discrimination is maximal (with greater separation margins).

13 Outline. Introduction to HMMs; Basic ML Techniques. HMM Parameter Learning: EM (maximum likelihood); discriminative learning. Engineering Intricacy in Using HMMs (Speech): left-to-right structure; use of lexicon and pronunciation modeling; use of context dependency (variance reduction); complexity control (decision tree and automatic parameter tying). Beyond HMM for Speech Modeling/Recognition: better generative models for speech dynamics; discriminative models, e.g., CRF and feature engineering. Discriminative Training in Speech Recognition: a unifying framework.

14 Problem 3: Learning. Maximum likelihood: the Baum-Welch algorithm uses the forward and backward algorithms to calculate the auxiliary variables α, β. The B-W algorithm is a special case of the EM algorithm. E-step: calculation of posterior probabilities from α, β. M-step: HMM parameter updates $\hat{\pi}_i$, $\hat{a}_{ij}$, $\hat{b}_j(k)$. Practical issues: can get stuck in local maxima; numerical problems (handled with logarithms and scaling).
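One Baum-Welch iteration for a discrete-observation HMM can be sketched as below, using the standard textbook re-estimation formulas (state posteriors γ and transition posteriors ξ computed from α and β). This is a minimal single-sequence sketch: it deliberately omits the log/scaling safeguards mentioned above, so it is only suitable for short sequences, and the toy model values and function names are assumptions.

```python
def forward(pi, A, B, obs):
    """Return the list of forward vectors alpha_t, t = 1..T."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    """Return the list of backward vectors beta_t, t = 1..T (beta_T = 1)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, A, B, obs):
    """One EM iteration; returns (new_pi, new_A, new_B, old_likelihood)."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    lik = sum(alpha[-1])
    # E-step: posteriors gamma_t(i) and xi_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / lik for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / lik
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: re-estimate pi, a_ij, b_j(k) from expected counts
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_A, new_B, lik


# Toy two-state, two-symbol model and a short sequence (illustrative values).
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
OBS = [0, 1, 0, 0, 1, 1, 0]
NEW_PI, NEW_A, NEW_B, OLD_LIK = baum_welch_step(PI, A, B, OBS)
NEW_LIK = sum(forward(NEW_PI, NEW_A, NEW_B, OBS)[-1])
```

Because Baum-Welch is an instance of EM, the likelihood after the update is never lower than before it, which is the property checked when comparing `NEW_LIK` with `OLD_LIK`.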

15 Discriminative Learning (intro). Maximum Mutual Information (MMI); Minimum Classification Error (MCE); Minimum Phone/Word Errors (MPE, MWE); Large-Margin Discriminative Training. More coverage later.

16 Outline. Introduction to HMMs; Basic ML Techniques. HMM Parameter Learning (Advanced ML Techniques): EM (maximum likelihood); discriminative learning. Engineering Intricacy in Using HMMs (Speech): left-to-right structure; use of lexicon and pronunciation modeling; use of context dependency (variance reduction); complexity control (decision tree and automatic parameter tying). Beyond HMM for Speech Modeling/Recognition: better generative models for speech dynamics; discriminative models, e.g., CRF and feature engineering. Discriminative Training in Speech Recognition: a unifying framework.

17 Left-to-right Topology in HMM for Speech. The lower-triangle parts of the transition probability matrices are forced to zero! The $a_{ij}$'s then play very minor roles relative to the output probabilities.
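A left-to-right transition matrix can be sketched as below. The slide only requires the lower triangle to be zero; this example assumes the stricter and common variant with a self-loop and a single forward transition per state (no skips), and the function name and the `self_loop` value are hypothetical.

```python
def left_to_right_transitions(n_states, self_loop=0.6):
    """Build an N x N matrix that allows only i -> i and i -> i + 1."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = self_loop            # stay in the current state
        A[i][i + 1] = 1.0 - self_loop  # advance to the next state
    A[-1][-1] = 1.0                    # the final state only loops on itself
    return A


TOPOLOGY = left_to_right_transitions(3)
```

Entries that start at zero remain zero under Baum-Welch re-estimation, since the corresponding expected transition counts are zero, so the topology is preserved during training.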

18 Word/phone as HMM States, Lexicon, Context Dependency, & Decision Tree

19 Outline. Introduction to HMMs; Basic ML Techniques. HMM Parameter Learning (Advanced ML Techniques): EM (maximum likelihood); discriminative learning; large-margin discriminative learning. Engineering Intricacy in Using HMMs (Speech): left-to-right structure; use of lexicon and pronunciation modeling; use of context dependency (variance reduction); complexity control (decision tree and automatic parameter tying). Beyond HMM for Speech Modeling/Recognition: better generative models for speech dynamics; discriminative models, e.g., CRF and feature engineering. Discriminative Training in Speech Recognition: a unifying framework.

20 HMM is a Poor Model for Speech (not just IID)

21 Major Research Directions in the Field. Direction 1: Beyond HMM Generative Modeling. Build more accurate statistical speech models than the HMM; incorporate key insights from human speech processing.

22 [Figure: speech chain from SPEAKER to LISTENER. Speaker side: message, brain, motor/articulators. Channel: speech acoustics. Listener side: ear, auditory reception, brain, internal model, decoded message.]

23 Hidden Dynamic Model: Graphical Model Representation. [Figure: graphical model with L layers of discrete state sequences $S_1^{(l)}, S_2^{(l)}, \ldots, S_T^{(l)}$, $l = 1, \ldots, L$, above continuous-variable nodes $t_1, \ldots, t_K$ and $z_1, \ldots, z_K$.]

24 Result Comparisons (TIMIT phone recognition accuracy, %)
ASR System | Generative | Discriminative
Early (Discrete) HMM (Lee & Hon) | 66.08 | ?
Early HMM (Cambridge U) | | (MMI)
Early HMM (ATR-Japan) | | (MCE)
Conditional Random Field (Ohio State U., 2005) | NA |
Recurrent Neural Nets (Cambridge U.) | NA |
Large-margin HMM (U. Penn, 2006) | | (NIPS 06)
Modern HMM (MSR) | | (MCE)
Hidden Dynamic Model (MSR) | 75.18 | ?

25 Major Research Directions in the Field. Direction 2: Discriminative Modeling/Learning. Acknowledge that accurate speech modeling is too hard; view the HMM not as a generative model but as a discriminant function; large-margin discriminative parameter learning; direct models for speech-class discrimination (e.g., MEMM, CRF), feature engineering, deep learning and structure; discriminative training for generative models.

26 Outline. Introduction to HMMs; Basic ML Training Techniques. HMM Parameter Learning: EM (maximum likelihood); discriminative learning (intro). Engineering Intricacy in Using HMMs in Speech Recognition: left-to-right structure; use of lexicon and pronunciation modeling; use of context dependency (variance reduction); complexity control (decision tree and automatic parameter tying). Beyond HMM for Speech Modeling/Recognition: better generative models for speech dynamics; discriminative models, e.g., CRF and feature engineering. Discriminative Training in Speech Recognition: a unifying framework (He, Deng, Wu, 2008, IEEE SPM).

28 MMIE Criterion. Maximizing the mutual information is equivalent to choosing the parameter set λ to maximize: $F_{MMIE}(\lambda) = \sum_{t=1}^{R} \log \frac{P_\lambda(O_t \mid M_{w_t})\, P(w_t)}{\sum_{w} P_\lambda(O_t \mid M_w)\, P(w)}$. Maximization implies increasing the numerator term (maximum likelihood estimation, MLE) and/or decreasing the denominator term. The latter is accomplished by reducing the probabilities of incorrect, or competing, hypotheses. Clever optimization methods.
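The MMIE criterion can be evaluated with a short sketch once per-hypothesis log scores are available. It assumes the competing hypotheses for each utterance are given explicitly as a dictionary of combined log scores, $\log P_\lambda(O_t \mid M_w) + \log P(w)$; the function names and toy scores are hypothetical, and a real system would sum over a lattice instead.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def mmie_objective(utterances):
    """Sum over utterances of log [numerator / denominator].

    utterances: list of (log_scores, correct) pairs, where log_scores maps
    each hypothesis w to log P(O_t | M_w) + log P(w) and correct is w_t.
    """
    total = 0.0
    for log_scores, correct in utterances:
        numerator = log_scores[correct]
        denominator = log_sum_exp(list(log_scores.values()))
        total += numerator - denominator
    return total


# Two equally scored hypotheses give a posterior of 0.5 for the correct one.
F_TIE = mmie_objective([({"yes": -3.0, "no": -3.0}, "yes")])
# Lowering the competitor's score raises the objective toward 0.
F_BETTER = mmie_objective([({"yes": -3.0, "no": -9.0}, "yes")])
```

The second call illustrates the point made on the slide: the objective improves when the competing hypothesis becomes less probable, even though the numerator is unchanged.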

29 MPE/MWE Criteria. $O_{MWE}(\Lambda) = \sum_{r=1}^{R} \frac{\sum_{s_r} p(X_r \mid s_r, \Lambda)\, P(s_r)\, A(s_r, S_r)}{\sum_{s_r} p(X_r \mid s_r, \Lambda)\, P(s_r)}$. $A(s_r, S_r)$ is the raw phone or word accuracy (Povey et al., 2002). Requires the use of lattices to determine $A(s_r, S_r)$. Very difficult to optimize with respect to HMM parameters; engineering tricks. Unifying MPE, MCE, MMI (He & Deng, 2007) in both criterion and optimization method.
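The MWE objective is a posterior-weighted average accuracy per utterance, summed over utterances. The sketch below assumes each utterance comes with an explicit list of hypotheses rather than a lattice, each carrying a combined log score $\log p(X_r \mid s_r, \Lambda) + \log P(s_r)$ and a raw accuracy $A(s_r, S_r)$; the function name and toy values are hypothetical.

```python
import math


def mwe_objective(utterances):
    """Sum over utterances of the expected accuracy under the posterior.

    utterances: list of hypothesis lists; each hypothesis is a pair
    (log_score, accuracy).
    """
    total = 0.0
    for hyps in utterances:
        top = max(log_score for log_score, _ in hyps)
        # Unnormalised posterior weights, shifted by the max for stability.
        weights = [math.exp(log_score - top) for log_score, _ in hyps]
        weighted = sum(w * acc for w, (_, acc) in zip(weights, hyps))
        total += weighted / sum(weights)
    return total


# Two equally likely hypotheses with 3 and 1 correct words.
O_TIE = mwe_objective([[(-1.0, 3.0), (-1.0, 1.0)]])
# Making the more accurate hypothesis more likely raises the objective.
O_BETTER = mwe_objective([[(-1.0, 3.0), (-4.0, 1.0)]])
```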

30 MCE Criterion. Discriminant functions: $g_i(X, \Lambda)$, $i = 1, \ldots, M$. Misclassification measure: $d_i(X, \Lambda) = -g_i(X, \Lambda) + \max_{j \ne i,\ 1 \le j \le M} g_j(X, \Lambda)$. Smoothed classification error count: $l(d(X, \Lambda)) = \sum_{i=1}^{M} l(d_i(X, \Lambda))\, 1(X \in C_i)$, with $L(\Lambda) = E_X\big[\, l(d(X, \Lambda)) \,\big] = \sum_{X} P(X)\, l(d(X, \Lambda))$. Optimization (gradient): $\Lambda_{opt} = \arg\min_{\Lambda} L(\Lambda)$. Here the error smoothing function is sigmoidal: $l(d_i(X, \Lambda)) = \frac{1}{1 + \exp(-\gamma\, d_i(X, \Lambda))}$, $\gamma > 0$.

31 Large-margin Estimates. MMI, MPE/MWE, and common MCE do not embrace margins. Margin parameter $m > 0$ in MCE's error smoothing function: $l(d_i(X, \Lambda)) = \frac{1}{1 + \exp\big(-\gamma\,(d_i(X, \Lambda) + m(I))\big)}$, $\gamma > 0$. Careful exploitation of the margin parameter in LM-MCE training of HMM parameters leads to significant ASR performance improvement (Yu, Deng, He, Acero, 2006, 2007).

32 Large-margin Estimates. Paper: Sha & Saul (NIPS 06), "Large Margin Training of Continuous-density Hidden Markov Models". Re-formulate the Gaussians in the HMM, turning log(variances) into independent parameters to optimize. SVM-type formulation of the objective function, with slack variables. Make the margin a function of the training token according to Hamming distance. Convex optimization (vs. LM-MCE with local optima). Good relative performance improvement on TIMIT: phone accuracy from 67.3% (EM) to 71.8% (LM). Several other groups explore different ways of incorporating margins in HMM parameter training.

33 MCE in more detail. Define the misclassification measure: $d_r(X_r, \Lambda) = \log p_\Lambda(X_r, s_{r,1}) - \log p_\Lambda(X_r, S_r)$ (the case of using the correct and the top one incorrect competing tokens). Observation sequence: $X_r = x_1, x_2, x_3, x_4, \ldots, x_t, \ldots, x_T$. Correct label: $S_r$ = OH THREE EIGHT. Competitor: $s_{r,1}$ = OH SIX EIGHT. $s_{r,1}$ is the top one incorrect (not equal to $S_r$) competing string.

34 MCE: Loss function. Classification: $s_r^* = \arg\max_{s_r} \{\log p_\Lambda(X_r, s_r)\}$. Classification error: $d_r(X_r, \Lambda) > 0$ means 1 classification error; $d_r(X_r, \Lambda) < 0$ means 0 classification errors. Loss function (smoothed error count function): $l_r(d_r(X_r, \Lambda)) = \frac{1}{1 + e^{-d_r(X_r, \Lambda)}}$.

35 MCE: Objective function. MCE objective function: $L_{MCE}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} l_r(d_r(X_r, \Lambda))$. $L_{MCE}(\Lambda)$ is the smoothed recognition error rate on the string (token) level. The (acoustic) model is trained to minimize $L_{MCE}(\Lambda)$, i.e., $\Lambda^* = \arg\min_{\Lambda} L_{MCE}(\Lambda)$.
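The misclassification measure, sigmoid loss, and MCE objective from the slides above can be computed with a few lines. The sketch assumes that, for each training token, the joint log-likelihoods of the correct string $S_r$ and of the top competitor $s_{r,1}$ are already available; the function names and toy scores are hypothetical.

```python
import math


def misclassification_measure(logp_correct, logp_competitor):
    """d_r = log p(X_r, s_{r,1}) - log p(X_r, S_r); positive means an error."""
    return logp_competitor - logp_correct


def sigmoid_loss(d):
    """Smoothed error count 1 / (1 + exp(-d)), written to avoid overflow."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def mce_objective(tokens):
    """L_MCE = (1 / R) * sum_r l(d_r) over (logp_correct, logp_competitor)."""
    losses = [sigmoid_loss(misclassification_measure(correct, competitor))
              for correct, competitor in tokens]
    return sum(losses) / len(losses)


# Toy log-likelihoods (illustrative): one clear hit, one tie, one clear miss.
TOKENS = [(-100.0, -130.0), (-50.0, -50.0), (-120.0, -90.0)]
L_MCE = mce_objective(TOKENS)
```

With one confidently correct token, one tie, and one confident error, the smoothed error rate is close to (0 + 0.5 + 1) / 3 = 0.5.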

36 MCE: Optimization. Traditional stochastic GD: gradient descent based online optimization; convergence is unstable; the training process is difficult to parallelize. New growth transformation: extends the Baum-Welch based batch-mode method; stable convergence; ready for parallelized processing.

37 Introduction to EBW or GT (Part I). Applied to a rational function as the objective function: $P(\Lambda) = \frac{G(\Lambda)}{H(\Lambda)}$. Equivalent auxiliary objective function (often easier to optimize): $F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')\, H(\Lambda) + D$, where D is a Λ-independent constant (see proof in SPM paper).

38 Introduction to EBW or GT (Part II). Applied to the objective function of the form: Equivalent auxiliary objective function (easier to optimize): (see proof in SPM paper & tech report)

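39 MCE: Optimization. Growth transformation based MCE: if $\Lambda = T(\Lambda')$ ensures $P(\Lambda) > P(\Lambda')$, i.e., $P(\Lambda)$ grows, then $T(\cdot)$ is called a growth transformation of Λ for $P(\Lambda)$. Chain of equivalent problems: minimizing $L_{MCE}(\Lambda) = \sum l(d(\cdot))$; maximizing $P(\Lambda) = G(\Lambda)/H(\Lambda)$; maximizing $F(\Lambda; \Lambda') = G - P'H + D$; maximizing $F(\Lambda; \Lambda') = \sum f(\cdot)$; maximizing $U(\Lambda; \Lambda') = \sum f'(\cdot) \log f(\cdot)$; GT formula: $\partial U(\cdot)/\partial \Lambda = 0 \Rightarrow \Lambda = T(\Lambda')$.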

40 Rational Function Form for MCE Objective Function. $d_r(X_r, \Lambda) = \log p_\Lambda(X_r, s_{r,1}) - \log p_\Lambda(X_r, S_r)$; $l_r(d_r(X_r, \Lambda)) = \frac{1}{1 + e^{-d_r(X_r, \Lambda)}}$; $L_{MCE}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} l_r(d_r(X_r, \Lambda))$. Not an obvious rational function (ratio) due to the summation.

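41 Rational Function Form for MCE Objective Function. Re-write the MCE loss function as $l_r(d_r(X_r, \Lambda)) = \frac{p(X_r, s_{r,1} \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)}$. Then minimizing $L_{MCE}(\Lambda)$ is equivalent to maximizing $Q(\Lambda)$, where $Q(\Lambda) = R\,\big(1 - L_{MCE}(\Lambda)\big) = \sum_{r=1}^{R} \frac{p(X_r, S_r \mid \Lambda)}{p(X_r, s_{r,1} \mid \Lambda) + p(X_r, S_r \mid \Lambda)} = \sum_{r=1}^{R} \frac{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)\, \delta(s_r, S_r)}{\sum_{s_r \in \{s_{r,1}, S_r\}} p(X_r, s_r \mid \Lambda)}$.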
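The rewrite on this slide follows in one step from the sigmoid loss and the log-ratio form of the misclassification measure. A short derivation, using only symbols already defined on the MCE slides, is sketched here:

```latex
\begin{align*}
l_r\big(d_r(X_r,\Lambda)\big)
  &= \frac{1}{1 + e^{-d_r(X_r,\Lambda)}},
  \qquad
  d_r(X_r,\Lambda) = \log\frac{p(X_r, s_{r,1}\mid\Lambda)}{p(X_r, S_r\mid\Lambda)} \\
  &= \frac{1}{1 + \dfrac{p(X_r, S_r\mid\Lambda)}{p(X_r, s_{r,1}\mid\Lambda)}}
   = \frac{p(X_r, s_{r,1}\mid\Lambda)}
          {p(X_r, s_{r,1}\mid\Lambda) + p(X_r, S_r\mid\Lambda)}, \\
1 - l_r\big(d_r(X_r,\Lambda)\big)
  &= \frac{p(X_r, S_r\mid\Lambda)}
          {p(X_r, s_{r,1}\mid\Lambda) + p(X_r, S_r\mid\Lambda)} .
\end{align*}
```

Summing the last line over $r = 1, \ldots, R$ gives $Q(\Lambda) = R\,(1 - L_{MCE}(\Lambda))$, so each token contributes one ratio of likelihoods.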

43 Unified Objective Function Can do the same trick for MPE/MWE MMI objective function is naturally a rational function (due to logarithm in the MMI definition) Unified form:

45 MCE: Optimization (cont'd). Coming back to MCE: its objective function can be reformulated as a single fractional function $P(\Lambda) = \frac{G(\Lambda)}{H(\Lambda)}$, where $G(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda) \sum_{r=1}^{R} \delta(s_r, S_r)$ and $H(\Lambda) = \sum_{s_1} \cdots \sum_{s_R} p(X_1, \ldots, X_R, s_1, \ldots, s_R \mid \Lambda)$.

46 MCE: Optimization (cont'd). Increasing $P(\Lambda)$ can be achieved by maximizing $F(\Lambda; \Lambda') = G(\Lambda) - P(\Lambda')\, H(\Lambda) + D$, as long as D is a Λ-independent constant, i.e., $P(\Lambda) - P(\Lambda') = \frac{1}{H(\Lambda)} \big[ F(\Lambda; \Lambda') - F(\Lambda'; \Lambda') \big]$ (Λ' is the parameter set obtained from the last iteration). Substituting $G(\cdot)$ and $H(\cdot)$ into $F(\cdot)$: $F(\Lambda; \Lambda') = \sum_{s} \sum_{q} p(X, q, s \mid \Lambda)\, [C(s) - P(\Lambda')] + D$.
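The identity linking $P$ and $F$ on this slide can be verified directly from the definitions $P = G/H$ and $F(\Lambda;\Lambda') = G(\Lambda) - P(\Lambda')H(\Lambda) + D$. A brief worked derivation:

```latex
\begin{align*}
F(\Lambda';\Lambda')
  &= G(\Lambda') - P(\Lambda')\,H(\Lambda') + D = D
  \qquad \text{since } G(\Lambda') = P(\Lambda')\,H(\Lambda'), \\
F(\Lambda;\Lambda') - F(\Lambda';\Lambda')
  &= G(\Lambda) - P(\Lambda')\,H(\Lambda)
   = H(\Lambda)\,\big[P(\Lambda) - P(\Lambda')\big].
\end{align*}
```

Because $H(\Lambda)$ is a sum of likelihoods and therefore positive, any Λ that increases $F(\Lambda;\Lambda')$ above $F(\Lambda';\Lambda')$ also increases $P(\Lambda)$ above $P(\Lambda')$.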

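47 Reformulate $F(\Lambda; \Lambda')$ as $F(\Lambda; \Lambda') = \sum_{s} \sum_{q} \int_{\chi} f(\chi, q, s, \Lambda; \Lambda')\, d\chi$, where $f(\chi, q, s, \Lambda; \Lambda') = [\Gamma(\Lambda') + d(s)]\, p(\chi, q \mid s, \Lambda)$, $\Gamma(\Lambda') = \delta(\chi, X)\, p(s)\, [C(s) - P(\Lambda')]$, $C(s) = \sum_{r=1}^{R} \delta(s_r, S_r)$, and $D = \sum_{s} d(s)$. $F(\Lambda; \Lambda')$ is ready for EM-style optimization after applying GT theory (Part II). Note: $\Gamma(\Lambda')$ is a constant, and $\log p(\chi, q \mid s, \Lambda)$ is easy to ...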

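48 Increasing $F(\Lambda; \Lambda')$ can be achieved by maximizing $U(\Lambda; \Lambda') = \sum_{s} \sum_{q} \int_{\chi} f(\chi, q, s, \Lambda'; \Lambda') \log f(\chi, q, s, \Lambda; \Lambda')\, d\chi$. Use EBW/GT (Part II) for the E step. $\log f(\chi, q, s, \Lambda; \Lambda')$ is decomposable with respect to Λ, so the M step is easy to compute. So the growth transformation of Λ for CDHMM is: $\frac{\partial U(\Lambda; \Lambda')}{\partial \Lambda} = 0 \Rightarrow \Lambda = T(\Lambda')$.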

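49 MCE: HMM estimation formulas. For Gaussian mixture CDHMM, $p(x \mid \mu_m, \Sigma_m) \propto |\Sigma_m|^{-1/2} \exp\big[-\tfrac{1}{2}(x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m)\big]$. The GT of the mean and covariance of Gaussian m is
$\mu_m = \frac{\sum_{r} \sum_{t} \Delta\gamma_{m,r}(t)\, x_{r,t} + D_m \mu'_m}{\sum_{r} \sum_{t} \Delta\gamma_{m,r}(t) + D_m}$,
$\Sigma_m = \frac{\sum_{r} \sum_{t} \Delta\gamma_{m,r}(t)\,(x_{r,t} - \mu_m)(x_{r,t} - \mu_m)^T + D_m \Sigma'_m + D_m (\mu_m - \mu'_m)(\mu_m - \mu'_m)^T}{\sum_{r} \sum_{t} \Delta\gamma_{m,r}(t) + D_m}$,
where $\Delta\gamma_{m,r}(t) = p(S_r \mid X_r, \Lambda')\, p(s_{r,1} \mid X_r, \Lambda')\, \big[\gamma_{m,r,S_r}(t) - \gamma_{m,r,s_{r,1}}(t)\big]$.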
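The growth-transformation update for one Gaussian can be sketched numerically as below. The sketch assumes scalar (one-dimensional) features, so the covariance reduces to a variance, and it takes the frames $x_{r,t}$ and the weights $\Delta\gamma_{m,r}(t)$ as flat, precomputed lists over all (r, t) pairs; the function name and toy numbers are hypothetical.

```python
def gt_update_gaussian(frames, delta_gamma, mu_old, var_old, d_m):
    """One GT update of the mean and variance of a scalar Gaussian m.

    frames:      feature values x_{r,t}, flattened over utterances and time
    delta_gamma: matching weights Delta-gamma_{m,r}(t) (may be negative)
    mu_old, var_old: parameters from the previous iteration
    d_m:         the smoothing constant D_m
    """
    denom = sum(delta_gamma) + d_m
    mu_new = (sum(g * x for g, x in zip(delta_gamma, frames))
              + d_m * mu_old) / denom
    var_new = (sum(g * (x - mu_new) ** 2 for g, x in zip(delta_gamma, frames))
               + d_m * var_old
               + d_m * (mu_new - mu_old) ** 2) / denom
    return mu_new, var_new


# With D_m = 0 and unit weights the update reduces to sample mean/variance.
MU_ML, VAR_ML = gt_update_gaussian([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.0, 1.0, 0.0)
# A large D_m keeps the new parameters close to the old ones.
MU_SMOOTH, VAR_SMOOTH = gt_update_gaussian([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 0.0, 1.0, 3.0)
```

The two calls show the role of $D_m$ as an interpolation weight between the data-driven statistics and the previous iteration's parameters.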

50 MCE: HMM estimation formulas. Setting of $D_m$. Theoretically, set $D_m$ so that $f(\chi, q, s, \Lambda; \Lambda') > 0$. Empirically, $D_m = E \cdot \sum_{r=1}^{R} p(S_r \mid X_r, \Lambda') \Big[ p(S_r \mid X_r, \Lambda') \sum_{t} \gamma_{m,r,S_r}(t) + p(s_{r,1} \mid X_r, \Lambda') \sum_{t} \gamma_{m,r,s_{r,1}}(t) \Big]$.

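51 MCE: Workflow. [Figure: training loop. The training utterances and the last-iteration model Λ' are fed to recognition, which produces competing strings; the competing strings and the training transcripts are fed to GT-MCE, which produces the new model Λ used in the next iteration.]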

52 Experiment: TI-DIGITS. Vocabulary: 1 to 9, plus "oh" and "zero". Training set: 8623 utterances / words. Test set: 8700 utterances / words. 33-dimensional spectrum feature: energy + 10 MFCCs, plus Δ and ΔΔ features. Model: continuous-density HMMs. Total number of Gaussian components: 3284.

53 Experiment: TI-DIGITS. GT-MCE vs. ML (maximum likelihood) baseline. [Figure: two panels, both titled "MCE training - TIdigits": WER (%) vs. MCE iteration, and loss function (sigmoid error count) vs. MCE iteration, each with curves for E = 1.0, E = 2.0, and E = 2.5.] Obtains the lowest error rate on this task. Reduces the recognition word error rate (WER) by 23%. Fast and stable convergence.

54 HMM is an underlying model for modern speech recognizers. Fundamental properties of HMM. Strengths and weaknesses of HMM for speech modeling and recognition. Discriminative training for HMM as the generative model. Reasonably good theory and practice for HMM discriminative training.
