Non-Linear Maximum Likelihood Feature Transformation For Speech Recognition


Mohamed Kamal Omar, Mark Hasegawa-Johnson
Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL

Abstract

Most automatic speech recognition (ASR) systems use hidden Markov models (HMMs) with a diagonal-covariance Gaussian mixture model for the state-conditional probability density function. The diagonal-covariance Gaussian mixture can model discrete sources of variability like speaker variations, gender variations, or local dialect, but it cannot model continuous types of variability that account for correlation between the elements of the feature vector. In this paper, we present a transformation of the acoustic feature vector that minimizes an empirical estimate of the relative entropy between the likelihood based on the diagonal-covariance Gaussian mixture HMM and the true likelihood. We show that this minimization is equivalent to maximizing the likelihood in the original feature space. Based on this formulation, we provide a computationally efficient solution to the problem based on volume-preserving maps; existing linear feature transform designs are shown to be special cases of the proposed solution. Since most of the acoustic features used in ASR are not linear functions of the sources of correlation in the speech signal, we use a non-linear transformation of the features to minimize this objective function. We describe an iterative algorithm that jointly estimates the parameters of the volume-preserving feature transformation and of the hidden Markov models so as to optimize the objective function for an HMM-based speech recognizer. Using this algorithm, we achieved a 2% improvement in phoneme recognition accuracy compared to a system that uses the original Mel-frequency cepstral coefficient (MFCC) acoustic features. Our approach is also compared to similar previous linear approaches like MLLT and ICA.

1. Introduction

An important goal for designers of ASR systems is to achieve a high level of performance while minimizing the number of parameters used by the system: a large parameter count not only increases the computational load and the storage requirements, but also increases the amount of training data required to estimate the parameters reliably. One way of controlling the number of parameters is to adjust the structure of the conditional joint PDF used by the recognizer. For example, the dimensionality of the acoustic feature vectors in a Gaussian mixture HMM is too large for their conditional joint PDFs to have full covariance matrices. On the other hand, approximating the conditional PDF by a diagonal-covariance Gaussian PDF degrades the performance of the recognizer [?], as the acoustic features used in ASR systems are neither decorrelated nor independent given the Gaussian component index. The mixture of Gaussian components can model discrete sources of variability like speaker variations, gender variations, or local dialect, but it cannot model continuous types of variability that account for correlation between the elements of the feature vector, like coarticulation effects and background noise. Recent approaches to this problem can be classified into two major categories. The first category tries to decrease the number of parameters required for full covariance matrices. This category includes a variety of choices for covariance structure other than diagonal or full; two examples that can be used in ASR systems are block-diagonal [?] and banded-diagonal matrices. Another method often used by ASR systems is tying, where certain parameters are shared among a number of different models.
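To make the parameter-count trade-off concrete, here is a small illustration (my own, not from the paper) of how the number of free covariance parameters per Gaussian component varies with the structures just mentioned, for the 26-dimensional feature vector used later in the experiments:

```python
# Illustration (not from the paper): free covariance parameters per Gaussian
# component for a d-dimensional feature vector under the structures above.
def covariance_params(d: int, structure: str, block: int = 13) -> int:
    if structure == "full":        # symmetric d x d matrix
        return d * (d + 1) // 2
    if structure == "diagonal":    # variances only
        return d
    if structure == "block":       # d/block full blocks on the diagonal
        return (d // block) * (block * (block + 1) // 2)
    raise ValueError(f"unknown structure: {structure}")

for s in ("full", "block", "diagonal"):
    print(s, covariance_params(26, s))   # 351, 182, 26 for d = 26
```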
An example of tying is the semi-tied covariance matrices approach, which estimates a transform in a maximum likelihood fashion given the current model parameters, described in [?]. Factor analysis was also used in [?] to model the covariance matrix of each Gaussian component of the Gaussian mixture used within each state of the HMM recognizer. The second category chooses to transform the original feature space to a new feature space that satisfies the diagonal-covariance models better. This is achieved by optimizing the transform based on a criterion that measures the validity of the assumption. One example is the state-specific principal component analysis (PCA) approach introduced in [?]. Another example is independent component analysis (ICA), which was used to develop features for speaker recognition [?] and speech recognition [?], [?], [?]. The maximum likelihood linear transform (MLLT) introduced in [?] is also an example of feature-based solutions.

All previous approaches assume that independent or decorrelated components are mixed linearly to generate the observation data. However, for most acoustic features used in ASR, this assumption is unjustified or unacceptable. An example is cepstral features like MFCC and PLPCC: in the cepstral domain, coarticulation effects and additive noise are examples of independent sources in the speech signal that are nonlinearly combined with the information about the vocal tract shape that is important for recognition. The source-filter model proposes that the excitation signal and the vocal tract filter are linearly combined in the cepstral domain, but the source-filter model is unrealistic in many cases, especially for consonants; time-varying filters and filter-dependent sources result in nonlinear source-filter combination in the cepstral domain [?]. In [?], we formulated the problem as a non-linear independent component analysis (NICA) problem. We showed that using the features generated by NICA in speech recognition increased phoneme recognition accuracy compared to linear feature transforms like ICA [?], linear discriminant analysis (LDA) [?], and MLLT.

However, using PCA or ICA approaches is justified only if a different feature transform is designed for each Gaussian component in the model, as each assumes that the probabilistic model imposes independence or decorrelation on the features.

In this work, we introduce a unified information-theoretic approach to feature transformation that makes no assumptions about the true probability density function of the original features and can be applied to any probabilistic model with arbitrary constraints. It estimates a nonlinear transform and the parameters of the probabilistic model that jointly minimize the relative entropy between the true likelihood and its estimate based on the model. Unlike previous approaches, this formulation justifies using a single transform for observations generated by different classes. In the next section, an information-theoretic formulation of the problem is described and a solution based on volume-preserving maps is introduced. An iterative algorithm to jointly estimate the parameters of the feature transform and the parameters of the model is described in section 3. Then, experiments based on an efficient implementation of this algorithm are described in section 4. Finally, section 5 provides a discussion of the results and a summary of this work.

2. Problem Formulation

We take here a different approach to the problem, motivated by the discussion of the previous section. Instead of focusing on specific model assumptions, we choose any hypothesized parametric family of distributions to be used in our probabilistic model, and search for a map of the features that improves the validity of our model. To do that, we need the following proposition.

Proposition: Let \( y = f(x) \) be an arbitrary one-to-one map of the feature random vector \( X \in \mathbb{R}^n \) to \( Y \in \mathbb{R}^n \), and let \( \hat{P}_\Lambda(y) \) be the likelihood of the new features using the HMM. The map \( f^*(\cdot) \) and the set of parameters \( \Lambda^* \) minimize the relative entropy between the hypothesized and the true likelihoods of \( Y \) if and only if they also maximize the objective function

\[ L = E_{P(Y)}\left[ \log \left| \det \frac{\partial f}{\partial x} \right| + \log \hat{P}_\Lambda(Y) \right], \tag{1} \]

where \( \frac{\partial f}{\partial x} \) is the Jacobian matrix of the map \( f(\cdot) \).

This can be shown by writing the expression for the relative entropy after an arbitrary transformation, \( y = f(x) \), of the input random vector \( X \in \mathbb{R}^n \), as

\[ R(P(Y), \hat{P}(Y)) = -H(P(Y)) - E_{P(Y)}\left[ \log \hat{P}(Y) \right], \tag{2} \]

where \( H(P(Y)) \) is the differential entropy of the random vector \( Y \) based on its true PDF \( P(Y) \). The relation between the output differential entropy and the input differential entropy is in general [?]

\[ H(P(Y)) \le H(P(X)) + \int_{\mathbb{R}^n} P(x) \log \left| \det \frac{\partial f}{\partial x} \right| dx, \tag{3} \]

where \( P(x) \) is the probability density function of the random vector \( X \), for an arbitrary transformation \( y = f(x) \) of the random vector \( X \in \mathbb{R}^n \), with equality if \( f(x) \) is invertible. Therefore the relative entropy can be written as

\[ R(P(Y), \hat{P}(Y)) = -H(P(X)) - E_{P(X)}\left[ \log \left| \det \frac{\partial f}{\partial x} \right| + \log \hat{P}(Y) \right], \tag{4} \]

for an invertible map \( y = f(x) \). The expectation of a function \( g(x) \) under an arbitrary one-to-one map \( y = f(x) \) can be written as [?]

\[ E_{P(X)}\left[ g(X) \right] = E_{P(Y)}\left[ g(f^{-1}(Y)) \right], \tag{5} \]

where \( f^{-1}(\cdot) \) is the inverse map. Therefore

\[ R(P(Y), \hat{P}(Y)) = -H(P(X)) - E_{P(Y)}\left[ \log \left| \det \frac{\partial f}{\partial x} \right| + \log \hat{P}(Y) \right]. \tag{6} \]

Since \( H(P(X)) \) depends on neither the map nor the model parameters, Equation (6) proves the proposition.
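To make the objective concrete, the following is a minimal Monte Carlo sketch (my own illustration, not from the paper; it assumes NumPy and SciPy) that estimates \( L \) in Equation (1) from samples of \( X \), for a toy invertible map and a diagonal-Gaussian model \( \hat{P} \):

```python
# A minimal Monte Carlo sketch (illustrative, assuming NumPy and SciPy) of
# the objective in Eq. (1): L = E[ log|det(dy/dx)| + log P_hat(Y) ],
# estimated from samples of X.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=10_000)

def f(x):
    """Invertible shear: y1 = x1 + tanh(x2), y2 = x2 (Jacobian det = 1)."""
    return np.stack([x[:, 0] + np.tanh(x[:, 1]), x[:, 1]], axis=1)

y = f(x)
log_det_jac = np.zeros(len(y))            # volume-preserving: log|det J| = 0
mu, sd = y.mean(axis=0), y.std(axis=0)    # fit a diagonal Gaussian to Y
log_p_hat = norm.logpdf(y, mu, sd).sum(axis=1)
L = np.mean(log_det_jac + log_p_hat)
print(f"empirical objective L = {L:.3f}")
```

Because this toy map is a shear with unit Jacobian determinant, the log-determinant term vanishes and \( L \) reduces to the average model log-likelihood of the mapped features; maximizing \( L \) over the map and model parameters is then equivalent to minimizing the relative entropy.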
The proposition states that minimizing the relative entropy is equivalent to maximizing the likelihood in the original feature space, but with the new features, instead of the original features, modeled by the HMM.

2.1. A Maximum Likelihood Approach

An important special case that reduces the problem to maximum likelihood estimation (MLE) of the model and map parameters is given in the following lemma; first we need to define volume-preserving maps in \( \mathbb{R}^n \), where \( n \) is an arbitrary positive integer.

Definition: A \( C^1 \) map \( f: S_x \to S_y \), where \( S_x \subset \mathbb{R}^n \) and \( S_y \subset \mathbb{R}^n \), is said to be volume-preserving if and only if \( \left| \det \frac{\partial f}{\partial x} \right| = 1 \ \forall x \in S_x \).

Lemma: Let \( y = f(x) \) be an arbitrary one-to-one \( C^1 \) volume-preserving map of the random vector \( X \in \mathbb{R}^n \) to \( Y \in \mathbb{R}^n \), and let \( \hat{P}_\Lambda(y) \) be the estimated likelihood using the HMM. The map \( f^*(\cdot) \) and the set of parameters \( \Lambda^* \) jointly minimize the relative entropy between the hypothesized and the true likelihoods of \( Y \) if and only if they also maximize the expected log-likelihood based on the hypothesized PDF.

Using the definition of volume-preserving maps, the proof of the lemma is straightforward: the log-Jacobian term in Equation (1) vanishes. By reducing the problem to an MLE problem, efficient algorithms based on the incremental EM algorithm can be designed [?].
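As a quick illustration of the definition (my own sketch, assuming NumPy; not part of the paper), the volume-preserving property \( |\det(\partial f / \partial x)| = 1 \) can be checked numerically at random test points:

```python
# A numerical check that a map is volume-preserving in the sense of the
# definition above: |det(df/dx)| = 1 everywhere. The Jacobian is estimated
# by central finite differences.
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    n = x.size
    J = np.empty((n, n))
    for j in range(n):
        dx = np.zeros(n)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

def shear(x):
    """y1 = x1 + tanh(x2), y2 = x2: triangular Jacobian with unit diagonal."""
    return np.array([x[0] + np.tanh(x[1]), x[1]])

rng = np.random.default_rng(1)
for _ in range(3):
    x = rng.standard_normal(2)
    print(abs(np.linalg.det(numerical_jacobian(shear, x))))  # ~1.0 each time
```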

2.2. Generality of the Approach

Our approach generalizes previous approaches to feature transforms for speech recognition in two ways. First, transforms can be designed to satisfy arbitrary constraints on the model, not necessarily those that impose an independence or decorrelation constraint on the features. Second, it can be applied to any parameterized probabilistic model, not necessarily a Gaussian one. Therefore, it can be used to design a single transform of the observations, if the whole HMM recognizer is taken as our probabilistic model, and it can be used to design state-dependent or phoneme-dependent transforms, if the state or the phoneme probabilistic models in the recognizer are used, respectively.

To show the generality of our approach and its wide range of applications, we relate it to previous methods. PCA may be viewed as a special case of the proposition under two equivalent constraints. First, if the transform is constrained to be linear and the model PDF is constrained to be a diagonal-covariance Gaussian, then the proposition reduces to PCA. Equivalently, if the true feature PDF is assumed to be Gaussian, and the model PDF is constrained to be a diagonal-covariance Gaussian, the proposition reduces to PCA. ICA can also be shown to be a special case of the proposition when the hypothesized model assumes statistical independence of the transformed features and the transform is constrained to be linear; nonlinear ICA removes the constraint that the transform must be linear. Factor analysis is also a special case of the proposition, obtained by assuming that the hypothesized joint PDF is Gaussian with a special covariance structure. MLLT is a special case of the proposition obtained by using a linear volume-preserving map of the features and assuming the hypothesized joint PDF is Gaussian or a mixture of Gaussians; the two assumptions of linearity and Gaussianity together are equivalent to the assumption that the original features are Gaussian. It should be noted that all linear maps designed to improve how well the features satisfy a given model are special cases of the lemma, as any linear map is equivalent to a linear volume-preserving map multiplied by a scalar.

3. Implementation of the Maximum Likelihood Approach

In the previous section, we showed that by using a volume-preserving map, the problem is reduced to maximizing the likelihood of the output components. In this section, we use a symplectic map to generate the new set of features.

3.1. Symplectic Maps

Symplectic maps are volume-preserving maps that can be represented by scalar functions. This very interesting result allows us to jointly optimize the parameters of the symplectic map and the model parameters using the EM algorithm or one of its incremental forms [?]. Let \( x = (x_1, x_2) \) and \( y = (y_1, y_2) \), with \( x_1, x_2, y_1, y_2 \in \mathbb{R}^{n/2} \); then any reflecting symplectic map can be represented by

\[ y_1 = x_1 + \frac{\partial V(x_2)}{\partial x_2}, \tag{7} \]

\[ y_2 = x_2 + \frac{\partial T(y_1)}{\partial y_1}, \tag{8} \]

where \( V(\cdot) \) and \( T(\cdot) \) are two arbitrary scalar functions [?]. We use two multi-layer feed-forward neural networks to get a good approximation of these scalar functions [?]:

\[ V(u; A, C) = \sum_{j=1}^{M} c_j S(a_j u), \tag{9} \]

\[ T(u; B, D) = \sum_{j=1}^{M} d_j S(b_j u), \tag{10} \]

where \( S(\cdot) \) is a nonlinear function like the sigmoid or hyperbolic tangent, \( a_j \) is the \( j \)th row of the \( M \times \frac{n}{2} \) matrix \( A \), \( c_j \) is the \( j \)th element of the \( M \times 1 \) vector \( C \), \( b_j \) is the \( j \)th row of the \( M \times \frac{n}{2} \) matrix \( B \), and \( d_j \) is the \( j \)th element of the \( M \times 1 \) vector \( D \). The parameters of these two neural networks and the parameters of the model are jointly optimized to maximize the likelihood of the training data.
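The following sketch (my reconstruction of Equations (7)-(10) as restored above, assuming NumPy and \( S = \tanh \)) implements the map: each half of the feature vector is shifted by the gradient of a scalar function of the other half, so each step is a shear and the Jacobian determinant is exactly 1.

```python
# A sketch of Eqs. (7)-(10): y1 = x1 + dV/dx2, y2 = x2 + dT/dy1, with
# V(u) = sum_j c_j tanh(a_j . u) and T(u) = sum_j d_j tanh(b_j . u).
import numpy as np

def grad_scalar_net(u, W, coef):
    """Gradient w.r.t. u of V(u) = sum_j coef_j * tanh(w_j . u)."""
    s = np.tanh(W @ u)                      # (M,)
    return W.T @ (coef * (1.0 - s ** 2))    # (n/2,); d tanh(z)/dz = 1 - tanh^2

def symplectic_map(x, A, c, B, d):
    half = x.size // 2
    x1, x2 = x[:half], x[half:]
    y1 = x1 + grad_scalar_net(x2, A, c)     # Eq. (7)
    y2 = x2 + grad_scalar_net(y1, B, d)     # Eq. (8)
    return np.concatenate([y1, y2])

rng = np.random.default_rng(2)
n, M = 26, 16                               # feature dimension, hidden units
A = rng.standard_normal((M, n // 2)) * 0.1
B = rng.standard_normal((M, n // 2)) * 0.1
c, d = rng.standard_normal(M), rng.standard_normal(M)
y = symplectic_map(rng.standard_normal(n), A, c, B, d)
```

The map is also easy to invert: given \( y \), recover \( x_2 = y_2 - \partial T(y_1)/\partial y_1 \) and then \( x_1 = y_1 - \partial V(x_2)/\partial x_2 \), which is what makes the invertibility assumption of section 2 applicable.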
3.2. Joint Optimization of the Map and Model Parameters

In this section we explain how the parameters of the volume-preserving map and of the probabilistic model can be jointly optimized to maximize the likelihood of the estimated features. We assume that the system is an HMM-based recognizer [?]; however, this approach can be applied to any statistical classification, detection, or recognition system. We also assume that the scalar functions in the symplectic map are represented by three-layer feed-forward neural networks (NNs) with the nonlinearity in the NNs represented by hyperbolic tangent functions; the derivation for any other non-linear function is a straightforward replication of the derivation provided here. Using the EM algorithm, the auxiliary function [?] to be maximized is

\[ Q(\Phi_k, \Phi_{k+1}) = E_\xi\left[ \log P(y, \xi \mid \Phi_{k+1}) \mid y, \Phi_k \right], \tag{11} \]

where \( \xi \) is the state sequence corresponding to the sequence of observations \( x \in \mathbb{R}^{n \times T} \) that are transformed to the sequence \( y \in \mathbb{R}^{n \times T} \), \( T \) is the sequence length in frames, and \( \Phi_k = (\Lambda_k, W_k) \) is the set of recognizer parameters and symplectic parameters at iteration \( k \) of the algorithm. The updating equations for the HMM parameters are the same as those mentioned in [?], and therefore will not be given here. We assume that the recognizer models the conditional PDF of the observations as a mixture of diagonal-covariance Gaussians, and therefore, for each frame \( y^i \),

\[ \frac{\partial Q}{\partial y^i_j} = \sum_{m=1}^{K} \frac{P(y^i, m \mid \Phi_k)}{P(y^i \mid \Phi_k)} \, \frac{\mu_{mj} - y^i_j}{\sigma^2_{mj}}, \tag{12} \]

where \( \mu_{mj} \) and \( \sigma^2_{mj} \) are the mean and the variance of the \( j \)th element of the \( m \)th PDF, respectively, and the parameter gradients below are accumulated over all frames \( i = 1, 2, \ldots, T \).
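As an illustration of Equation (12), here is a sketch (my own, assuming NumPy) of the per-frame gradient for a \( K \)-component diagonal-covariance Gaussian mixture; the component posterior \( P(y, m \mid \Phi)/P(y \mid \Phi) \) is computed in the log domain for numerical stability:

```python
# Illustration of Eq. (12): gradient of the auxiliary function with respect
# to one transformed frame y under a diagonal Gaussian mixture.
import numpy as np

def dQ_dy(y, w, mu, var):
    """y: (n,), w: (K,), mu: (K, n), var: (K, n) -> gradient, shape (n,)."""
    log_joint = (np.log(w)
                 - 0.5 * np.sum(np.log(2 * np.pi * var), axis=1)
                 - 0.5 * np.sum((y - mu) ** 2 / var, axis=1))   # log w_m P(y|m)
    gamma = np.exp(log_joint - np.logaddexp.reduce(log_joint))  # posteriors
    return np.sum(gamma[:, None] * (mu - y) / var, axis=0)      # Eq. (12)

# Tiny usage example with two components in two dimensions:
print(dQ_dy(np.zeros(2), np.array([0.7, 0.3]),
            np.array([[1.0, 0.0], [-1.0, 0.0]]), np.ones((2, 2))))
```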

Starting with \( A \) and \( B \): to update the symplectic parameters \( a_{qr} \) and \( b_{qr} \), for \( q = 1, 2, \ldots, M \) and \( r = 1, 2, \ldots, n/2 \), we have to calculate the partial derivatives of the auxiliary function with respect to these parameters. These partial derivatives are related to the partial derivatives of the auxiliary function with respect to the features by the relations

\[ \frac{\partial Q}{\partial a_{qr}} = \sum_{j=1}^{n/2} \frac{dQ}{dy_{1j}} \frac{\partial y_{1j}}{\partial a_{qr}}, \tag{13} \]

and

\[ \frac{\partial Q}{\partial b_{qr}} = \sum_{j=1}^{n/2} \frac{\partial Q}{\partial y_{2j}} \frac{\partial y_{2j}}{\partial b_{qr}}, \tag{14} \]

where

\[ \frac{\partial y_{1j}}{\partial a_{qr}} = \begin{cases} -2 x_{2r}\, c_q a_{qj}\, S(a_q x_2)\left[1 - S^2(a_q x_2)\right], & r \ne j, \\ -2 x_{2r}\, c_q a_{qj}\, S(a_q x_2)\left[1 - S^2(a_q x_2)\right] + c_q\left[1 - S^2(a_q x_2)\right], & r = j, \end{cases} \tag{15} \]

\[ \frac{dQ}{dy_{1j}} = \frac{\partial Q}{\partial y_{1j}} + \sum_{k=1}^{n/2} \frac{\partial Q}{\partial y_{2k}} \frac{\partial y_{2k}}{\partial y_{1j}}, \tag{16} \]

\[ \frac{\partial y_{2k}}{\partial y_{1j}} = -2 \sum_{h=1}^{M} d_h b_{hj} b_{hk}\, S(b_h y_1)\left[1 - S^2(b_h y_1)\right], \tag{17} \]

and

\[ \frac{\partial y_{2j}}{\partial b_{qr}} = \begin{cases} -2 y_{1r}\, d_q b_{qj}\, S(b_q y_1)\left[1 - S^2(b_q y_1)\right], & r \ne j, \\ -2 y_{1r}\, d_q b_{qj}\, S(b_q y_1)\left[1 - S^2(b_q y_1)\right] + d_q\left[1 - S^2(b_q y_1)\right], & r = j. \end{cases} \tag{18} \]

For \( C \) and \( D \): to update the symplectic parameters \( c_q \) and \( d_q \) for \( q = 1, 2, \ldots, M \), we have to calculate the partial derivatives of the auxiliary function with respect to these parameters. These partial derivatives are related to the partial derivatives of the auxiliary function with respect to the features by the relations

\[ \frac{\partial Q}{\partial c_q} = \sum_{j=1}^{n/2} \frac{dQ}{dy_{1j}} \frac{\partial y_{1j}}{\partial c_q}, \tag{19} \]

and

\[ \frac{\partial Q}{\partial d_q} = \sum_{j=1}^{n/2} \frac{\partial Q}{\partial y_{2j}} \frac{\partial y_{2j}}{\partial d_q}, \tag{20} \]

where

\[ \frac{\partial y_{1j}}{\partial c_q} = a_{qj}\left[1 - S^2(a_q x_2)\right], \tag{21} \]

\[ \frac{dQ}{dy_{1j}} = \frac{\partial Q}{\partial y_{1j}} - 2 \sum_{k=1}^{n/2} \frac{\partial Q}{\partial y_{2k}} \sum_{h=1}^{M} d_h b_{hj} b_{hk}\, S(b_h y_1)\left[1 - S^2(b_h y_1)\right], \tag{22} \]

and

\[ \frac{\partial y_{2j}}{\partial d_q} = b_{qj}\left[1 - S^2(b_q y_1)\right]. \tag{23} \]

Using Equations (12) to (23), the values of the symplectic map parameters can be updated in each iteration using any gradient-based optimization algorithm.
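The following toy end-to-end sketch (my own illustration, assuming NumPy) shows the alternating optimization just described: transform the data, refit the diagonal-Gaussian model (M-step), then take a gradient step on the map parameters. For brevity, the gradient here is approximated by central finite differences rather than Equations (13)-(23), and a single Gaussian stands in for the HMM.

```python
# A toy sketch of the alternating optimization, not the paper's system.
import numpy as np

def loglik(y, mu, var):
    """Mean diagonal-Gaussian log-likelihood of the transformed frames."""
    return np.mean(-0.5 * (np.log(2 * np.pi * var) + (y - mu) ** 2 / var))

def vp_map(p, x):
    """Toy 2-D volume-preserving map: y1 = x1 + p0*tanh(p1*x2), y2 = x2."""
    return np.stack([x[:, 0] + p[0] * np.tanh(p[1] * x[:, 1]), x[:, 1]], axis=1)

rng = np.random.default_rng(3)
x = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=2_000)
p = np.array([0.1, 0.1])
for _ in range(100):
    y = vp_map(p, x)
    mu, var = y.mean(axis=0), y.var(axis=0)        # M-step (single Gaussian)
    grad = np.zeros_like(p)
    for i in range(p.size):                        # finite-difference gradient
        e = np.zeros_like(p)
        e[i] = 1e-5
        grad[i] = (loglik(vp_map(p + e, x), mu, var)
                   - loglik(vp_map(p - e, x), mu, var)) / 2e-5
    p += 0.5 * grad                                # gradient ascent on the map
y = vp_map(p, x)
print("learned p:", p, " residual correlation:", np.corrcoef(y.T)[0, 1])
```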
4. EXPERIMENTS AND RESULTS

The symplectic maximum likelihood algorithm described in section 3 is used to study the optimal feature space for diagonal-covariance Gaussian mixture HMM modeling of the TIMIT database. We used the conjugate-gradient algorithm to update the values of the symplectic map parameters in each iteration. The Mel-frequency cepstral coefficients are calculated for 4500 utterances from the TIMIT database. The overall 26-dimensional feature vector consists of 12 MFCC coefficients, energy, and their deltas. In each iteration, the new feature vector is calculated from the current symplectic transformation parameters using the symplectic mapping equations; the maximum likelihood estimates of the HMM parameters are then calculated, and finally the maximum likelihood estimates of the symplectic map parameters are obtained using the conjugate-gradient algorithm. After the iterative algorithm converges to a set of locally optimal HMM and symplectic parameters, the training data are transformed by the symplectic map, yielding the final symplectic maximum likelihood transform (SMLT) feature vector. The new features are compared to LDA, linear ICA, and MLLT in their phoneme recognition accuracy.

In our experiments, the 61 phonemes defined in the TIMIT database are mapped to 48 phoneme labels for each frame of speech as described in [?]. These 48 phonemes are collapsed to 39 phonemes for testing purposes as in [?]. A three-state left-to-right model for each triphone is trained using the EM algorithm; the number of mixtures per state was fixed to four. After training the overall system and obtaining the symplectic map parameters, the approximately independent output coefficients of the symplectic map are used as the input acoustic features to a Gaussian mixture hidden Markov model speech recognizer [?]. The parameters of the recognizer are trained using the training portion of the TIMIT database. The parameters of the triphone models are then tied together using the same approach as in [?].

To compare the performance of the proposed algorithm with other approaches, we generated acoustic features using LDA, linear ICA, and MLLT. We used the maximum likelihood approach to LDA [?] and kept the dimensions of the output of LDA the same as the input. We also used the maximum likelihood approach to linear ICA as described in [?] and briefly overviewed in section 2. Finally, we implemented MLLT as described in [?] and briefly overviewed in section 2. All these techniques used a feature vector that consists of twelve MFCC coefficients, the energy, and their deltas as their input.

Testing this recognizer, using the test data in the TIMIT database, we get the phoneme recognition results in Table 1. These results are obtained by using a bigram phoneme language model and by keeping the insertion error around 10% as in [?]. The table compares these recognition results to the ones obtained by MFCC, LDA, linear ICA, and MLLT.

Table 1: Phoneme Recognition Accuracy

Acoustic Features    Recognition Accuracy
MFCC                 73.7%
Linear ICA           73.5%
LDA                  73.8%
MLLT                 74.6%
SMLT                 75.5%
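For reference, the 26-dimensional front end described in this section can be roughly approximated as follows (my sketch, assuming librosa; the paper's exact front end is not specified beyond its description above, and here the zeroth cepstral coefficient stands in for the energy term):

```python
# Rough approximation of the 26-dim feature vector: 12 cepstral coefficients
# plus an energy term, and their deltas. "utterance.wav" is a hypothetical file.
import numpy as np
import librosa

wav, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)    # c0 ~ energy term
feats = np.vstack([mfcc, librosa.feature.delta(mfcc)])  # shape (26, T)
```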

5. DISCUSSION

In this work, we described a framework for feature transformation for speech recognition and introduced a nonlinear symplectic maximum likelihood feature transform algorithm. The SMLT features improved phoneme recognition accuracy by about 2% over the baseline MFCC features. This improvement can be attributed to the ability of the algorithm to find a better representation of the acoustic cues of the different phonemes. Since the transform is invertible, the new features carry the same amount of information about the phonemes as the input MFCC features; the improvement is therefore due to the approximate independence property of the new features, which allows more efficient probabilistic modeling of the conditional probabilities with the same model complexity.

6. ACKNOWLEDGMENT

This work was supported by NSF award number
