Large-Margin HMM Estimation for Speech Recognition

1 Large-Margin HMM Estimation for Speech Recognition
Prof. Hui Jiang, Department of Computer Science and Engineering, York University, Toronto, Ont. M3J 1P3, CANADA
Email:
This is joint work with Chao-Jun Liu and Xinwei Li.

2 Research Projects
Hierarchical covariance modeling in CDHMM: joint with Y. Tian, J.-L. Zhou, MSRA, Beijing, China
Large-scale discriminative training based on MCE/GPD: joint with B. Liu, Univ. of Sci. & Tech. of China, and J.-L. Zhou, MSRA
Large-margin HMM estimation for speech recognition: joint with C. Liu, X. Li, York Univ.

3 Hierarchical Covariance Modeling (HCM) in CDHMM
[Diagram and covariance interpolation formula, with weights $\lambda$ over covariance matrices $\Sigma$; not recoverable from the extracted text.]

4 Hierarchical Covariance Modeling Schemes
Schemes: HCM, HPM, HCM+DIAG, HPM+DIAG, HCC
[Per-scheme formulas for $\Psi_m$ in terms of the weights $\lambda$ and diagonal terms; not recoverable from the extracted text.]

5 Performance Comparison: RM database

System     Word Err. Rate   Err. Rate Reduction
Baseline   4.09%            n/a
HLDA       3.50%            14.4%
MLLT       3.61%            11.7%
STC        3.33%            18.6%
MIC        3.49%            14.7%
HCC        3.16%            22.7%
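The error-rate reductions in the RM comparison are relative reductions against the baseline WER. As a minimal sketch (the function and variable names are illustrative, not from the slides), the column can be re-derived like this:

```python
def relative_error_reduction(baseline_wer: float, system_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - system_wer) / baseline_wer

# WERs (%) taken from the RM comparison slide.
baseline = 4.09
rm_wer = {"HLDA": 3.50, "MLLT": 3.61, "STC": 3.33, "MIC": 3.49, "HCC": 3.16}

# One decimal place, matching the precision used on the slide.
reductions = {name: round(relative_error_reduction(baseline, wer), 1)
              for name, wer in rm_wer.items()}
```

For example, HCC gives 100 * (4.09 - 3.16) / 4.09, which rounds to 22.7%.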

6 Performance Comparison: Switchboard minitrain

System                 Word Err. Rate   Err. Rate Reduction
Baseline               38.0%            n/a
HLDA                   37.2%            2.10%
STC                    36.6%            3.68%
MIC (39 prototypes)    36.5%            3.94%
HCC                    35.9%            5.53%

7 Large-Scale GPD/MCE: In-Search Data Selection
[Figure: token comparison during Viterbi search. Labels: reference segmentation (triphone sequence), state search beam, a word-ending active path, true token sets, competing token sets; horizontal axis: time.]

8 Large-Scale Discriminative Training based on GPD/MCE
Discriminative training: refine the original model set discriminatively based on the collected token sets.
[Diagram labels: true tokens, competing tokens, HMM model.]

9 Implementation Model for State-Tied HMMs
[Diagram labels: frames (feature vectors), optimal Viterbi path (state sequence), competing states.]

10 Criteria in Discriminative Training
Least Imposter Words (LIW): minimizing the total number of imposter words during the decoding of all training data. Imposter words are defined as incorrect words that appear within the beam width during Viterbi decoding with a higher likelihood than the reference model. (Jiang et al.)
Least Phone Competing Tokens (LPCT): minimizing the total number of phone competing tokens during the decoding of all training data. (Liu, Jiang, et al.)
Least Incorrect Frames (LIF): minimizing the total number of incorrectly decoded frames during the decoding of all training data.

11 Discriminative Training: RM task
Columns: iteration; training set (WER %, Err. Red.); test set (WER %, Err. Red.)
0 (ML): N/A 4.30 N/A
Later iterations (most values lost in extraction): Err. Red. 8%, 16%, 18%

12 Discriminative Training: Switchboard
Columns: iteration; training set (WER %, Err. Red.); test set (WER %, Err. Red.)
0 (ML): training set 33.2, N/A; test set 48.1, N/A
Later iterations: values lost in extraction.

13 Large-Margin HMM Estimation for Speech Recognition
Prof. Hui Jiang, Department of Computer Science and Engineering, York University, Toronto, Ont. M3J 1P3, CANADA
Email:
This is joint work with Chao-Jun Liu and Xin-wei Li.

14 Outline
Background: Automatic Speech Recognition (ASR)
Large Margin Classifier
Large Margin for HMM-based classifiers
A Gradient Ascent Optimization for Continuous Density HMM (CDHMM) in speech recognition
Preliminary Experiments
Final Remarks and ongoing work

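15 ASR Solution: MAP decision rule
$$\hat{W} = \arg\max_{W \in \Omega} p(W \mid X) = \arg\max_{W \in \Omega} p(X \mid W)\, P(W) = \arg\max_{W \in \Omega} P_\Gamma(W)\, p_\Lambda(X \mid W) = \arg\max_{W \in \Omega} F(X \mid W)$$
Acoustic Model (AM) $p_\Lambda(X \mid W)$: the probability of generating feature $X$ when $W$ is uttered.
Language Model (LM) $P_\Gamma(W)$: the probability of $W$ (word, phrase, sentence) being chosen to say.
Discriminant Function $F(X \mid W)$: $F(X \mid W) = P_\Gamma(W)\, p_\Lambda(X \mid W)$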
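A minimal sketch of the MAP decision rule in the log domain, assuming the acoustic log-likelihoods log p(X|W) and language-model probabilities P(W) are already available for a small candidate set. The function name, the candidate words and all scores below are hypothetical, chosen only to show how the prior can change the decision.

```python
import math

def map_decode(acoustic_loglik: dict[str, float], lm_prob: dict[str, float]) -> str:
    """Return argmax_W [log p(X|W) + log P(W)] over the candidate set."""
    def log_discriminant(w: str) -> float:
        # log F(X|W) = log P(W) + log p(X|W)
        return acoustic_loglik[w] + math.log(lm_prob[w])
    return max(acoustic_loglik, key=log_discriminant)

# Hypothetical scores for one utterance X.
acoustic = {"bee": -120.0, "dee": -119.5, "pee": -125.0}
prior = {"bee": 0.6, "dee": 0.2, "pee": 0.2}
uniform = {w: 1.0 / 3.0 for w in acoustic}

best_with_prior = map_decode(acoustic, prior)      # language model favours "bee"
best_uniform = map_decode(acoustic, uniform)       # acoustics alone favour "dee"
```

Working with log scores avoids numerical underflow, since HMM likelihoods of whole utterances are extremely small numbers.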

16 Existing HMM Estimation Methods
Maximum Likelihood Estimation (MLE)
  The Baum-Welch algorithm: the EM algorithm for HMM
Discriminative Training
  Maximum Mutual Information Estimation (MMIE)
  Minimum Classification Error (MCE)
Discriminative training can improve, to a greater or lesser degree, over standard ML training.
All discriminative training methods suffer from poor generalization.

17 Large-Margin Classifier: Support Vector Machine (SVM)
[Figure label: larger margin.]

18 Large-Margin Classifiers
Why do larger-margin classifiers yield better generalization performance?
Conceptually, a large margin gives:
  Robustness w.r.t. data patterns
  Robustness w.r.t. classifier parameters
The theory in machine learning: an upper bound on the generalization error rate
[Bound formula involving $R$, $R_d$, $C$, $M$, $V$, $d$ and $\delta$; not recoverable from the extracted text.]

19 How about using SVM for Speech Recognition?
Done in some simple ASR tasks:
  phoneme recognition
  speaker recognition
  small-vocabulary isolated speech recognition
No significant improvement is reported; still not a mainstream method.
Why? Lack of a proper kernel function to map speech samples from one dynamic high-dimension space to another high-dimension space that is suitable for linear classifiers.

20 Large-Margin HMM-based Classifier
[Figure labels: separation boundary $F_1 - F_2 = 0$; model 1; model 2.]

21 Large-Margin HMM-based Classifier
[Figure labels: original separation boundary $F_1 - F_2 = 0$; new separation boundary $F_1 - F_2 = 0$.]

22 How to define separation margin? (1)
In a 2-class separable problem:
For a data token $x_1$ of class 1: $d(x_1) = F(x_1 \mid \Lambda_1) - F(x_1 \mid \Lambda_2) > 0$
For a data token $x_2$ of class 2: $d(x_2) = F(x_2 \mid \Lambda_2) - F(x_2 \mid \Lambda_1) > 0$

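23 How to define separation margin? (2)
Extend to the multiple-class problem with $N$ classes $1, 2, \ldots, N$.
For a data token $x$ of class $i$:
$$d(x) = F(x \mid \Lambda_i) - \max_{j \ne i} F(x \mid \Lambda_j) = \min_{j \ne i} \left[ F(x \mid \Lambda_i) - F(x \mid \Lambda_j) \right]$$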

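24 Large-Margin Estimation of HMMs
An $N$-class problem: each class is represented by an HMM, $\Lambda = \{\Lambda_1, \Lambda_2, \ldots, \Lambda_N\}$.
Given a training set $D$, define a subset, called the support token set $S$, as:
$$S = \{X_i \mid X_i \in D \text{ and } 0 \le d(X_i) \le \epsilon\}$$
Large-Margin Estimation (LME) of HMMs:
$$\hat{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S} d(X_i) \quad \text{subject to } d(X_i) > 0 \text{ for all } X_i \in S$$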
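A minimal sketch of the multi-class margin and the support token set, assuming the discriminant scores F(X_i | Λ_j) have already been computed and stored in a matrix with one row per token and one column per class. The function names, the threshold value and the score matrix are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def margins(scores: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """d(X_i) = F(X_i|Lambda_true) - max over competing classes of F(X_i|Lambda_j)."""
    n = scores.shape[0]
    true_scores = scores[np.arange(n), labels]
    rivals = scores.astype(float).copy()
    rivals[np.arange(n), labels] = -np.inf   # exclude the true class from the max
    return true_scores - rivals.max(axis=1)

def support_set(d: np.ndarray, eps: float) -> np.ndarray:
    """Indices of tokens with 0 <= d(X_i) <= eps (correct, but close to the boundary)."""
    return np.flatnonzero((d >= 0) & (d <= eps))

# Hypothetical log-likelihood scores: 4 tokens, 3 classes.
scores = np.array([[-10.0, -12.0, -15.0],
                   [-20.0, -19.0, -25.0],
                   [-30.0, -31.0, -29.5],
                   [-8.0, -7.0, -9.0]])
labels = np.array([0, 1, 2, 0])

d = margins(scores, labels)          # token 3 has a negative margin (misrecognized)
S = support_set(d, eps=1.0)          # tokens 1 and 2 lie within the threshold
min_margin = d[S].min()              # the quantity LME tries to maximize
```

Token 0 is correctly classified with a margin above the threshold, so it does not enter the support set and does not influence the objective.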

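25 Large-Margin Estimation of HMMs
Convert it into an equivalent minimax optimization problem.
Assume $X_i$ belongs to class $i$:
$$\hat{\Lambda} = \arg\min_{\Lambda} \max_{X_i \in S,\, j \ne i} \left[ F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \right]$$
subject to constraints: $F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) < 0$ for all $X_i \in S$ and $j \ne i$.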

26 Two difficulties
No. 1: without additional constraints on $\Lambda$ during the optimization, a maximum margin does not exist; e.g. scaling up both $F(X_i \mid \Lambda_i)$ and $F(X_i \mid \Lambda_j)$ increases the margin without limit.
No. 2: how to do the optimization? Standard optimization tools, such as the Matlab optimization toolbox, can be used, but they are too slow.

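27 How to guarantee existence of Maximum Margin? (1)
Solution one: maximize the relative margin instead; its maximum always exists.
$$d'(X_i) = \frac{F(X_i \mid \Lambda_i) - \max_{j \ne i} F(X_i \mid \Lambda_j)}{F(X_i \mid \Lambda_i)} = \min_{j \ne i} \left[ 1 - \frac{F(X_i \mid \Lambda_j)}{F(X_i \mid \Lambda_i)} \right], \qquad d'(X_i) < 1$$
Called Large Relative Margin Estimation (LRME).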

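28 How to guarantee existence of Maximum Margin? (2)
Solution two: optimize one HMM each time.
Do
  for each $\Lambda_i$, solve a sub-optimization problem:
  $$\hat{\Lambda}_i = \arg\min_{\Lambda_i} \max_{X_i \in S,\, j \ne i} \left[ F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) \right]$$
  subject to constraints: $F(X_i \mid \Lambda_j) - F(X_i \mid \Lambda_i) < 0$ for all $X_i \in S$ and $j \ne i$,
  where the other HMMs are kept constant in the above optimization.
Until convergence.
Called iterative localized optimization (ILO).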

29 Iterative Localized Optimization

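30 How to do optimization? (1)
Use the gradient ascent method to maximize a lower bound of the minimum margin.
Use a continuous and differentiable function to approximate the minimum margin:
$$\hat{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S} d(X_i) = \arg\max_{\Lambda} Q(\Lambda), \qquad Q(\Lambda) = \min_{X_i \in S} d(X_i)$$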

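31 How to do optimization? (2)
Approximate $Q(\Lambda)$ with a summation of exponentials:
$$Q(\Lambda) \approx Q_\eta(\Lambda) = \frac{1}{\eta} \log \sum_{X_i \in S,\, j \ne i} \exp\left[ \eta\, d_{ij}(X_i) \right], \qquad d_{ij}(X_i) = F(X_i \mid \Lambda_i) - F(X_i \mid \Lambda_j)$$
$Q(\Lambda) > Q_\eta(\Lambda)$ for $\eta < 0$, and $\lim_{\eta \to -\infty} Q_\eta(\Lambda) = Q(\Lambda)$.
Optimize $Q_\eta(\Lambda)$ instead: $\hat{\Lambda}' = \arg\max_{\Lambda} Q_\eta(\Lambda)$
The gradient ascent method:
$$\hat{\Lambda}'_{n+1} = \hat{\Lambda}'_n + \epsilon\, \nabla Q_\eta(\Lambda) \big|_{\Lambda = \hat{\Lambda}'_n}$$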
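A minimal numerical sketch of the smooth approximation, reading the slide as Q_η = (1/η) log Σ exp(η d) with η < 0 over the pairwise margins d. The function name and the margin values are hypothetical. It illustrates that Q_η never exceeds the true minimum margin and approaches it as η becomes more negative.

```python
import numpy as np

def smooth_min(d: np.ndarray, eta: float) -> float:
    """Q_eta = (1/eta) * log(sum(exp(eta * d))), a lower bound on min(d) for eta < 0."""
    if eta >= 0:
        raise ValueError("eta must be negative for a lower bound on the minimum")
    z = eta * d
    zmax = z.max()                       # shift to stabilise the log-sum-exp
    return float((zmax + np.log(np.exp(z - zmax).sum())) / eta)

d = np.array([0.5, 1.0, 2.0])            # hypothetical pairwise margins
q = float(d.min())                       # the actual minimum margin Q
q_soft = [smooth_min(d, eta) for eta in (-1.0, -10.0, -100.0)]
```

With η = -1 the approximation is loose (it is even negative here), while η = -10 already gives a value within about 0.001 of the true minimum. A larger |η| tightens the bound but concentrates the gradient on the single worst token.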

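32 How to calculate the gradient for continuous density HMM? (1)
$$\frac{\partial Q_\eta(\Lambda)}{\partial \Lambda} = \frac{\sum_{X_i \in S,\, j \ne i} \exp\left[ \eta\, d_{ij}(X_i) \right] \dfrac{\partial d_{ij}(X_i)}{\partial \Lambda}}{\sum_{X_i \in S,\, j \ne i} \exp\left[ \eta\, d_{ij}(X_i) \right]}$$
where $d_{ij}(X_i) = F(X_i \mid \Lambda_i) - F(X_i \mid \Lambda_j)$, so that
$$\frac{\partial d_{ij}(X_i)}{\partial \Lambda_i} = \frac{\partial F(X_i \mid \Lambda_i)}{\partial \Lambda_i}, \qquad \frac{\partial d_{ij}(X_i)}{\partial \Lambda_j} = -\frac{\partial F(X_i \mid \Lambda_j)}{\partial \Lambda_j}$$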
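The gradient of Q_η is a weighted average of the individual margin gradients, with normalized exponential weights. A minimal sketch, assuming Q_η = (1/η) log Σ exp(η d) and treating the margins themselves as the free variables (function names and values are hypothetical); by the chain rule, the gradient with respect to any model parameter is the same weights applied to the derivative of each margin.

```python
import numpy as np

def q_eta(d: np.ndarray, eta: float) -> float:
    """Q_eta = (1/eta) * log(sum_k exp(eta * d_k))."""
    return float(np.log(np.exp(eta * d).sum()) / eta)

def q_eta_weights(d: np.ndarray, eta: float) -> np.ndarray:
    """dQ_eta/dd_k = exp(eta * d_k) / sum_k' exp(eta * d_k')."""
    z = eta * d
    w = np.exp(z - z.max())              # shift for numerical stability
    return w / w.sum()

d = np.array([0.5, 1.0, 2.0])            # hypothetical pairwise margins
weights = q_eta_weights(d, eta=-2.0)

# Central finite differences as an independent check of the analytic weights.
h = 1e-6
numeric = np.array([(q_eta(d + h * e, -2.0) - q_eta(d - h * e, -2.0)) / (2 * h)
                    for e in np.eye(len(d))])
```

For negative η the largest weight falls on the smallest margin, so each gradient step is dominated by the tokens closest to the decision boundary.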

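33 How to calculate the gradient for continuous density HMM? (2)
Assumption 1: adjust CDHMM mean vectors only
Assumption 2: diagonal precision matrices
Assumption 3: use the Viterbi approximation
$$F(X_i \mid \Lambda_i) \approx C' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r_{s_t l_t d} \left( x_{td} - m_{s_t l_t d} \right)^2$$
$$F(X_i \mid \Lambda_j) \approx C'' - \frac{1}{2} \sum_{t=1}^{T} \sum_{d=1}^{D} r_{s'_t l'_t d} \left( x_{td} - m_{s'_t l'_t d} \right)^2$$
Here $s_t, l_t$ (and $s'_t, l'_t$ for the competing model $\Lambda_j$) are the state and mixture component on the Viterbi path at frame $t$, $m$ is a mean, and $r$ is a diagonal precision.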
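A minimal sketch of the Viterbi-approximated discriminant and its gradient with respect to the Gaussian means, under the three assumptions above. As a simplification that is not in the slides, each (state, mixture) pair is flattened into a single Gaussian index k, and the Viterbi path is given as one index per frame. All names and data are hypothetical.

```python
import numpy as np

def viterbi_score(x, path, means, precisions, const=0.0):
    """F ~= const - 0.5 * sum_t sum_d r[k_t, d] * (x[t, d] - m[k_t, d])**2,
    where k_t is the Gaussian on the Viterbi path at frame t."""
    diff = x - means[path]
    return const - 0.5 * float((precisions[path] * diff ** 2).sum())

def mean_gradient(x, path, means, precisions):
    """dF/dm[k, d] = sum over frames t with k_t == k of r[k, d] * (x[t, d] - m[k, d])."""
    grad = np.zeros_like(means)
    # np.add.at accumulates correctly when the same Gaussian occurs in several frames.
    np.add.at(grad, path, precisions[path] * (x - means[path]))
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                      # T=5 frames, D=3 dimensions
means = rng.normal(size=(2, 3))                  # K=2 Gaussians
precisions = rng.uniform(0.5, 2.0, size=(2, 3))  # diagonal precisions
path = np.array([0, 0, 1, 1, 0])                 # Gaussian index per frame

grad = mean_gradient(x, path, means, precisions)
```

For the margin d_ij, this gradient enters with a plus sign for means of the true model and a minus sign for means of the competing model, since d_ij = F(X_i | Λ_i) - F(X_i | Λ_j).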

34 How to handle Recognition Errors in the training set?
Given the training set $D$, based on the current model, define the error set: $\Psi = \{X_i \mid X_i \in D \text{ and } d(X_i) \le 0\}$
Use the MCE (minimum classification error)/GPD algorithm to update the model based on the error set so as to reduce it. Intuitively, the MCE algorithm will move the separation boundary to correctly classify as many error tokens as possible.
Use MCE-trained models as initial models to start large margin estimation (LME).

35 Preliminary Experiments
English alphabet E-set recognition
Use the OGI ISOLET database
Speaker-independent, small-vocabulary, isolated-word recognition
Feature vector (39-d): 12 MFCC + E + Δ + ΔΔ
Our best MLE system:
  16-state whole-word CDHMM for each letter
  4 Gaussian mixtures per state
  Achieves 96.15% accuracy on a standard test set (26 letters), comparable to other reported systems: OGI 96%, Cambridge 96.73%.
Test our best system on the E-set only: 91.5%

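36 Preliminary Results: E-Set ASR Performance Comparison
Word accuracy comparison among various HMM training approaches (columns: ML, MCE, LME-ILO, LRME; rows by number of mixtures per state, starting at 1-mix). Surviving entries: n/a, 95.2; the remaining values were lost in extraction.
ML: Maximum Likelihood Estimation
MCE: Minimum Classification Error
LME-ILO: Large Margin Estimation via Iterative Localized Optimization
LRME: Large Relative Margin Estimation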

37 LME learning curves (1-mix)
[Plot panels: accuracy in test set; actual margin Q; objective function Q.]

38 LME learning curves (2-mix)
[Plot panels: accuracy in test set; actual margin Q; objective function Q.]

39 Final Remarks
Based on preliminary experimental results only:
LME can yield better performance than MLE and MCE.
Margin is a good indicator of the generalization capability of an HMM-based speech recognizer.
Maximizing the objective function Q (the lower bound) effectively increases the actual separation margin.

40 Ongoing and Future Work
More theoretical explorations:
  How to formulate the constraints in a theoretically sound way?
  How to re-formulate LME as another type of optimization problem that has more efficient solutions? Semi-definite programming (SDP)?
Practically, extend to large-scale continuous ASR tasks:
  TIDIGITS: experiments under way
  SPINE: very soon
  Switchboard
