REPORT DOCUMENTATION PAGE
Form Approved OMB No. 0704-0188

Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Service, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503. PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.

1. REPORT DATE (DD-MM-YYYY): March 2012
2. REPORT TYPE: Conference Paper (Preprint)
3. DATES COVERED (From - To): -
4. TITLE AND SUBTITLE: Integrated Feature Normalization and Enhancement for Robust Speaker Recognition Using Acoustic Factor Analysis (Preprint)
5a. CONTRACT NUMBER: FA8750-09-C-0067
5b. GRANT NUMBER: -
5c. PROGRAM ELEMENT NUMBER: 35885G
5d. PROJECT NUMBER: 388
5e. TASK NUMBER: BA
5f. WORK UNIT NUMBER: AE
6. AUTHOR(S): Taufiq Hasan and John H. L. Hansen
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES): Research Associates for Defense Conversion, Inc., Hillside Terrace, Marcy, NY. Subcontractor: University of Texas at Dallas, 800 W. Campbell Rd., Richardson, TX 75080
8. PERFORMING ORGANIZATION REPORT NUMBER: -
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES): Air Force Research Laboratory/Information Directorate, Rome Research Site/RIGC, 525 Brooks Road, Rome, NY
10. SPONSOR/MONITOR'S ACRONYM(S): -
11. SPONSORING/MONITORING AGENCY REPORT NUMBER: AFRL-RI-RS-TP--4
12. DISTRIBUTION AVAILABILITY STATEMENT: Approved for public release; distribution unlimited. PA# 88ABW--89, Cleared Date: 9 March
13. SUPPLEMENTARY NOTES: This paper was accepted for publication in the Proceedings of the Interspeech Conference, Portland, Oregon, 9-13 September 2012. This work was funded in whole or in part by Department of the Air Force contract number FA8750-09-C-0067. The U.S. Government has for itself and others acting on its behalf an unlimited, paid-up, nonexclusive, irrevocable worldwide license to use, modify, reproduce, release, perform, display, or disclose the work by or on behalf of the Government. All other rights are reserved by the copyright owner.
14. ABSTRACT: State-of-the-art factor analysis based channel compensation methods for speaker recognition are based on the assumption that speaker/utterance-dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to lie in a lower dimensional subspace, which does not consider the fact that conventional acoustic features may also be constrained in a similar way in the feature space. In this study, motivated by the low-rank covariance structure of cepstral features, we propose a factor analysis model in the acoustic feature space instead of the super-vector domain and derive a mixture-dependent feature transformation. We demonstrate that the proposed Acoustic Factor Analysis (AFA) transformation performs feature dimensionality reduction, de-correlation, variance normalization, and enhancement at the same time. The transform applies a square-root Wiener gain along the acoustic feature eigenvector directions, and is similar to signal subspace based speech enhancement schemes. We also propose several methods of adaptively selecting the AFA parameter for each mixture. The proposed feature transformation is applied using a probabilistic mixture alignment, and is integrated with a conventional i-vector system. Experimental results on the telephone trials of the NIST SRE 2010 demonstrate the effectiveness of the proposed scheme.
15. SUBJECT TERMS: Channel Normalization, Factor Analysis, Tactical SIGINT Technology, Gaussian Modeling
16. SECURITY CLASSIFICATION OF: a. REPORT: U; b. ABSTRACT: U; c. THIS PAGE: U
17. LIMITATION OF ABSTRACT: UU
18. NUMBER OF PAGES: 5
19a. NAME OF RESPONSIBLE PERSON: John G. Parker
19b. TELEPHONE NUMBER (Include area code): -

Standard Form 298 (Rev. 8-98), prescribed by ANSI Std. Z39.18
Integrated Feature Normalization and Enhancement for Robust Speaker Recognition Using Acoustic Factor Analysis

Taufiq Hasan and John H. L. Hansen

Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering, University of Texas at Dallas, Richardson, Texas, U.S.A.
{Taufiq.Hasan, ...}

Abstract

State-of-the-art factor analysis based channel compensation methods for speaker recognition are based on the assumption that speaker/utterance-dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to lie in a lower dimensional subspace, which does not consider the fact that conventional acoustic features may also be constrained in a similar way in the feature space. In this study, motivated by the low-rank covariance structure of cepstral features, we propose a factor analysis model in the acoustic feature space instead of the super-vector domain and derive a mixture-dependent feature transformation. We demonstrate that the proposed Acoustic Factor Analysis (AFA) transformation performs feature dimensionality reduction, de-correlation, variance normalization, and enhancement at the same time. The transform applies a square-root Wiener gain along the acoustic feature eigenvector directions, and is similar to signal subspace based speech enhancement schemes. We also propose several methods of adaptively selecting the AFA parameter for each mixture. The proposed feature transform is applied using a probabilistic mixture alignment, and is integrated with a conventional i-vector system. Experimental results on the telephone trials of the NIST SRE 2010 demonstrate the effectiveness of the proposed scheme.

1. Introduction

Factor analysis based channel compensation methods for speaker recognition are based on the assumption that speaker/utterance-dependent adapted GMM [1] mean super-vectors can be constrained to lie in a lower dimensional subspace [1-4].
The assumption of lower dimensional speaker- and channel-dependent subspaces inspired various channel compensation schemes such as Eigenvoice [2], Eigenchannel [3], and Joint Factor Analysis (JFA) [3]. With the introduction of i-vectors, which are the latent factors of the so-called total variability space [4], the research trend shifted towards directly applying compensation techniques to these lower dimensional utterance-level features, enabling the development of fully Bayesian techniques [5, 6]. While super-vector domain factor analysis techniques and their derivatives are effective, they do not consider the fact that acoustic feature vectors can also be constrained to a lower dimensional subspace of the feature space. This is clear, since a low-rank model of the super-covariance matrix obtained from different utterance GMMs is not equivalent to a low-rank assumption on the acoustic feature covariance matrix. A lower dimensional representation of the speech short-time spectrum is a well-established phenomenon that motivated the family of speech enhancement methods known as the signal subspace approach [7]. This phenomenon is also found to hold for popular acoustic features such as Mel-frequency cepstral coefficients (MFCC) [8], even though these features are processed by the Discrete Cosine Transform (DCT) for de-correlation.

To illustrate that acoustic feature covariance matrices have close-to-zero eigenvalues and can be assumed low-rank, we train a 1024-mixture full-covariance GMM Universal Background Model (UBM) on 60-dimensional MFCC features from a large development data set. For a typical mixture of this UBM, the covariance matrix is plotted as an intensity image in Fig. 1(a), revealing that the off-diagonal components are indeed significant. Fig. 1(b) shows the sorted eigenvalues of three different mixtures, illustrating how the energy is compacted in the first few coefficients while the later ones are close to zero. Also, it is known that the first few eigen-directions of the feature covariance matrix are relatively more speaker dependent [8], further justifying the low-rank assumption of features for a speaker recognition task.

Inspired by the above observations, in this study we propose an acoustic factor analysis [9] scheme for speaker recognition and develop a mixture-dependent feature dimensionality reduction transform. The proposed transformation performs dimensionality reduction, de-correlation, feature variance normalization, and enhancement at once. Also, instead of hard-aligning each feature to a specific mixture, applying the transformation, and then retraining the UBM, we use a probabilistic frame alignment and transform the UBM parameters within the system. Integrating the proposed method within a standard i-vector system provides a significant gain in system performance.

(This project was funded by AFRL under contract FA8750-09-C-0067 (approved for public release, distribution unlimited: 88ABW--8), and partially by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen.)

2. Acoustic Factor Analysis

In this section, we describe the proposed factor analysis model of acoustic features and discuss its formulation, its mixture-wise application for dimensionality reduction, and its advantages and properties.

2.1. Formulation

Let X = {x_n | n = 1, ..., N} be the collection of all acoustic feature vectors from the development set. Using a factor analysis model, a d x 1 feature vector x in X can be represented by

    x = W y + mu + c.    (1)

Here, W is a d x q low-rank factor loading matrix that represents q < d bases spanning the subspace with important variability in the feature space, and mu is the d x 1 mean vector of x. We denote the latent variable vector y ~ N(0, I), of dimension q, as the acoustic factors. We assume the remaining noise component c ~ N(0, sigma^2 I) to be isotropic, and the model is thus equivalent to Probabilistic Principal Component Analysis (PPCA) [10].

The advantage of this model is that the acoustic factors y explain the correlation between the coefficients of x, which we believe is more speaker dependent [8], while the noise component c incorporates the residual variance of the data. It should be emphasized that even though we refer to c as noise, when x represents cepstral features, c actually represents convolutional channel distortion. (Details on feature extraction and UBM data are given in Sec. 4.1.)

Figure 1: (a) Intensity image of a typical covariance matrix of a full-covariance UBM (darker indicates lower values) trained on 60-dimensional MFCC features (static + delta + delta-delta). (b) Sorted eigenvalues of three covariance matrices from the UBM, showing different energy compaction across mixtures; energy compaction is the percentage of top eigenvectors that accounts for 95% of total energy. Eigenvalues of a typical, a low-compaction, and a high-compaction mixture are shown. (c) Distribution of the top UBM mixture posterior probability values for development features, revealing that frame alignment is not always definitive.

Using a mixture of these models [10], we have

    p(x) = sum_g w_g p(x|g),    (2)
    p(x|g) = N(x; mu_g, sigma_g^2 I + W_g W_g^T).    (3)

Here, mu_g, w_g, W_g, and sigma_g^2 denote the mean vector, mixture weight, factor loading matrix, and noise variance of the g-th AFA mixture.

2.2. Mixture dependent transformation

One advantage of using a mixture of PPCA models for acoustic factor analysis is that its parameters can be conveniently extracted from a full-covariance GMM-UBM trained with the Expectation-Maximization (EM) algorithm [10]. The procedure is given below.

2.2.1. Universal Background Model

The UBM Lambda_0, trained on the dataset X, is given by

    p(x | Lambda_0) = sum_{g=1}^{M} w_g N(x; mu_g, Sigma_g)    (4)

where w_g are the mixture weights, M is the total number of mixtures, mu_g are the mean vectors, and Sigma_g are the full covariance matrices. Here, mu_g and w_g are the same as in (2) and (3).

2.2.2. Noise subspace selection

The AFA parameter q in (1) defines the number of principal axes to retain, under the assumption that the lower (d - q) directions span the noise-only subspace [7].
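As a concrete sketch of the mixture model in (2)-(4), the following toy NumPy example builds a small full-covariance GMM and computes the mixture posteriors p(g | x_n) used later for probabilistic frame alignment. This is an illustration under assumed toy parameters, not the authors' implementation; all names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 4, 3  # toy sizes; the paper uses d = 60 features and M = 1024 mixtures

# Toy full-covariance GMM-UBM parameters (eq. 4): weights w_g, means mu_g, covariances Sigma_g.
w = np.array([0.5, 0.3, 0.2])
mu = rng.normal(size=(M, d))
B = rng.normal(size=(M, d, d))
Sigma = np.einsum('gij,gkj->gik', B, B) + 0.1 * np.eye(d)  # SPD covariance per mixture

def log_gauss(x, mu_g, Sigma_g):
    """Log-density of N(x; mu_g, Sigma_g) with a full covariance matrix."""
    dim = len(mu_g)
    diff = x - mu_g
    _, logdet = np.linalg.slogdet(Sigma_g)
    maha = diff @ np.linalg.solve(Sigma_g, diff)
    return -0.5 * (dim * np.log(2 * np.pi) + logdet + maha)

def posteriors(x):
    """Soft alignment gamma_g = p(g | x, Lambda_0) over all mixtures."""
    log_p = np.log(w) + np.array([log_gauss(x, mu[g], Sigma[g]) for g in range(M)])
    log_p -= log_p.max()          # subtract max for numerical stability
    p = np.exp(log_p)
    return p / p.sum()

x = rng.normal(size=d)
gamma = posteriors(x)
print(gamma, gamma.sum())  # posteriors are nonnegative and sum to 1
```

As Fig. 1(c) suggests, the largest entry of such a posterior vector is often well short of 1, which is the motivation for soft rather than hard alignment.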
The maximum likelihood estimate of the noise variance sigma_g^2 for the g-th mixture is given by

    sigma_g^2 = 1/(d - q) * sum_{i=q+1}^{d} lambda_{g,i}    (5)

where lambda_{g,q+1} >= ... >= lambda_{g,d} are the (d - q) smallest eigenvalues of Sigma_g. We note that q can be different for each mixture, and thus in later sections we denote it by q(g).

2.2.3. Computing the factor loading matrix

The maximum likelihood estimate of the factor loading matrix W_g of the g-th mixture of the AFA model in (2) is given by [10]

    W_g = U_q (Lambda_q - sigma_g^2 I)^{1/2} R    (6)

where U_q is a d x q matrix whose columns are the q leading eigenvectors of Sigma_g, Lambda_q is a diagonal matrix containing the corresponding q eigenvalues, and R is an arbitrary q x q orthogonal rotation matrix. In this work, we set R = I.

2.2.4. The AFA transformation

The posterior mean of the acoustic factors y_n can be used as the transformed, dimensionality-reduced version of x_n for the g-th component of the AFA model. This can be shown to be [10]

    E{y_n | x_n, g} = <y_n | x_n, g> = A_g^T (x_n - mu_g) = z_{n,g}    (7)

where

    A_g = W_g M_g^{-1}    (8)
    M_g = sigma_g^2 I + W_g^T W_g.    (9)

The matrix A_g is termed the g-th AFA transform. Here, the original feature vectors x_n are replaced by the mixture-dependent transformed vectors z_{n,g}. It can easily be shown that z_{n,g} is normally distributed with zero mean and diagonal covariance matrix Sigma_{z_g} = I - sigma_g^2 Lambda_q^{-1}, demonstrating that A_g performs mean normalization and de-correlation.

Conventionally, a feature vector x_n is aligned with the mixture that yields the highest posterior probability p(g | x_n, Lambda_0), and the corresponding transformation is applied for dimensionality reduction [10]. However, as we observe from the distribution of max_g p(g | x_n, Lambda_0) for our development data in Fig. 1(c), features are effectively aligned with multiple Gaussians in most cases, the top posterior frequently falling near or below 0.5. Thus, we propose to apply the AFA transform using a probabilistic alignment and to transform the UBM instead of retraining it. The new UBM is given by

    p(z | Lambda_0_hat) = sum_{g=1}^{M} w_g N(z; 0, Sigma_{z_g}).    (10)
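The per-mixture parameter extraction in (5)-(9) can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic covariance matrix, not the authors' code; it also numerically checks the stated covariance of the transformed features, Sigma_z = I - sigma^2 Lambda_q^{-1}.

```python
import numpy as np

rng = np.random.default_rng(1)
d, q = 6, 3  # toy dimensions; the paper uses d = 60 and a per-mixture q(g)

# A synthetic full covariance Sigma_g for one UBM mixture.
B = rng.normal(size=(d, d))
Sigma = B @ B.T + 0.5 * np.eye(d)

# Eigen-decomposition with eigenvalues sorted in descending order.
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

sigma2 = lam[q:].mean()                        # eq. (5): noise variance from trailing eigenvalues
Uq, Lam_q = U[:, :q], np.diag(lam[:q])
W = Uq @ np.sqrt(Lam_q - sigma2 * np.eye(q))   # eq. (6) with R = I
M = sigma2 * np.eye(q) + W.T @ W               # eq. (9); equals Lam_q here since U_q^T U_q = I
A = W @ np.linalg.inv(M)                       # eq. (8): the AFA transform

# Covariance of z = A^T (x - mu) when cov(x) = Sigma, per eq. (7):
Sigma_z = A.T @ Sigma @ A
target = np.eye(q) - sigma2 * np.linalg.inv(Lam_q)
print(np.allclose(Sigma_z, target))  # True: diagonal, near-identity covariance
```

The check confirms the claim above: the transformed features are de-correlated, with per-direction variance 1 - sigma^2/lambda_i.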
2.3. Feature enhancement in AFA

Expanding the transformation matrix A_g in (8) unveils the built-in enhancement operation it performs. Substituting the expressions for W_g and M_g from (6) and (9) into (8), we have

    A_g^T = Lambda_q^{-1} (Lambda_q - sigma_g^2 I)^{1/2} U_q^T = Lambda_q^{-1/2} G_g U_q^T    (11)

where we have utilized the fact that U_q^T U_q = I and introduced a diagonal gain matrix given by

    G_g = Lambda_q^{-1/2} (Lambda_q - sigma_g^2 I)^{1/2}.    (12)
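The diagonal gain in (12) is exactly a square-root Wiener gain, which the following toy NumPy check illustrates (the eigenvalues and noise variance are illustrative values, not from the paper):

```python
import numpy as np

# Toy eigenvalues and noise variance for one mixture (illustrative only).
lam = np.array([5.0, 3.0, 1.5])   # leading eigenvalues lambda_{g,i}
sigma2 = 1.0                      # noise variance sigma_g^2 from eq. (5)

# Diagonal of the gain matrix in eq. (12): G = Lam_q^{-1/2} (Lam_q - sigma2 I)^{1/2}.
G = np.sqrt((lam - sigma2) / lam)

# A priori SNR per direction, xi_i = (lambda_i - sigma2) / sigma2, and the
# square-root Wiener gain (xi / (xi + 1))^{1/2} it corresponds to.
xi = (lam - sigma2) / sigma2
sqrt_wiener = np.sqrt(xi / (xi + 1.0))

print(np.allclose(G, sqrt_wiener))  # True: eq. (12) is a square-root Wiener gain
```

The identity holds because xi + 1 = lambda_i / sigma^2, so xi/(xi + 1) = (lambda_i - sigma^2)/lambda_i.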
Figure 2: Input SNR xi [dB] vs. Wiener gains. The Wiener gain xi/(xi + 1) and the square-root Wiener gain (xi/(xi + 1))^{1/2} are shown with a solid (-) and a dashed (- -) line, respectively.

Keeping aside the term Lambda_q^{-1/2} in (11), we observe that the operation in (7) first computes the inner product of the mean-normalized acoustic feature with the q principal eigenvectors of Sigma_g, and then, for each i-th eigenvector direction, applies the gain specified by the i-th diagonal entry of G_g in (12), which is in fact a square-root Wiener gain function [11]. Defining, in classic speech enhancement terminology, the a priori SNR as [7] xi = (lambda_{g,i} - sigma_g^2)/sigma_g^2, and using it to express the gain equations, we plot the gain functions against xi in Fig. 2. Thus, in addition to de-correlation and dimensionality reduction, the transform A_g also performs feature enhancement, assuming a noise variance sigma_g^2. It may be noted that conventional factor analysis techniques in the super-vector space are also known to be representable using similar Wiener-like gain functions, as discussed in [12].

2.4. Feature variance normalization in AFA

The term Lambda_q^{-1/2} in (11) normalizes the variance of the acoustic feature stream in the i-th eigen-direction, since lambda_{g,i} is the expected feature variance along this direction [13]. The AFA transformation performs this normalization in each mixture, in addition to the enhancement described in the previous section. This process is interestingly similar to the cepstral variance normalization frequently performed in the front-end, except for its domain of operation.

2.5. Adaptive AFA dimension selection

Since the distribution of eigenvalues is different in each mixture, a distinct value of q(g) is suitable in each case. This is illustrated in Fig. 1(b), where the eigenvalues of three different mixture covariance matrices are shown. Typically we see an energy compaction of about 70%, that is, the first 70% of the eigenvalues account for 95% of the total energy; but in other cases the energy compaction can be very low or very high, demanding that the retained AFA dimension be correspondingly low or high. Motivated by this observation, we develop a simple method of selecting the AFA dimension q(g). We first set the energy compaction ratio E close to 0.9-0.97, then compute the sorted eigenvalues lambda_{g,i} of Sigma_g for each mixture. The optimal dimension retaining a fraction E of the energy is then calculated as

    q_E(g) = min q  s.t.  (sum_{i=1}^{q} lambda_{g,i}) / (sum_{i=1}^{d} lambda_{g,i}) > E.    (13)

This method is denoted AFA-Var-En. As an alternative, we use the effective rank estimation algorithm of [14], generally used for noise estimation in matrices, to select the AFA dimension. A threshold delta in [0, 1] is set and the AFA dimension is obtained as

    q_delta(g) = min q  s.t.  ( (sum_{i=1}^{q} lambda_{g,i}^2) / (sum_{i=1}^{d} lambda_{g,i}^2) )^{1/2} > delta.    (14)

We denote this method AFA-Var-Rk. When the AFA dimension is fixed for all mixtures, we denote the system AFA-Fix.

3. AFA integrated i-vector system

In this section, we describe how the proposed method can be incorporated into the current state-of-the-art i-vector framework [4]. First, a full-covariance UBM Lambda_0, given by (4), is trained on the development data vectors. Next, the AFA dimension q(g) for each mixture is set, and the noise variance sigma_g^2 is computed using (5). The factor loading matrix W_g and the transformation matrix A_g are then computed using (6) and (8), respectively. For each development utterance s, the zero-order statistics are given by

    N_s(g) = sum_{n in s} gamma_g(n)  where  gamma_g(n) = p(g | x_n, Lambda_0),    (15)

following the standard procedure [2, 4]. However, for AFA, the first-order statistics F_s(g) are extracted using the transformed features in the corresponding mixtures instead of the original features:

    F_s(g) = sum_{n in s} gamma_g(n) z_{n,g} = A_g^T sum_{n in s} gamma_g(n) (x_n - mu_g).    (16)

For estimating the total variability (TV) matrix, the standard procedure [4] is followed using the new UBM Lambda_0_hat given in (10) and the statistics F_s(g) and N_s(g). It should be noted that in this case the super-vector dimension reduces from Md to K = sum_{g=1}^{M} q(g), and the TV matrix size becomes K x R. We define the super-vector compression ratio alpha = K/(Md), measuring the overall AFA compaction.

4. Experiments and Results

We perform our experiments on the male trials of the NIST SRE 2010 telephone train/test condition (condition 5, normal vocal effort).

4.1. System Description

For voice activity detection, a phoneme recognizer [15] combined with an energy-based scheme is used. 60-dimensional feature vectors (19 MFCC + energy + delta + delta-delta) are extracted using a 25-ms window with a 10-ms shift and Gaussianized using a 3-s sliding window. Gender-dependent full- and diagonal-covariance UBMs with 1024 mixtures are trained on utterances selected from Switchboard II Phases 2 and 3, Switchboard Cellular Parts 1 and 2, and the NIST 2004, 2005, and 2006 SRE enrollment data. The same dataset is utilized for TV matrix training. 400-dimensional i-vectors are extracted, whitened, and then length normalized [5]. For session variability compensation and scoring, we use a Gaussian Probabilistic Linear Discriminant Analysis (PLDA) scheme with a full-covariance noise model [5].

4.2. Results using AFA transformation

We compare several AFA systems with our diagonal- and full-covariance UBM based i-vector systems, denoted baseline diag-cov and baseline full-cov, respectively. The following AFA systems are built: AFA-Fix with q(g) = 40, 48, and 54; AFA-Var-En with E = 0.9, 0.95, and 0.97; and AFA-Var-Rk with delta = 0.99, 0.995, and 0.997. In this experiment, the PLDA eigenvoice size N_EV is fixed across systems. The results are shown in Table 1. From the results, we observe that the AFA-Var-En system in general outperforms the basic AFA-Fix, even at a similar compression ratio alpha. The best results are obtained for AFA-Var-En (E = 0.97), a 5.37% relative EER improvement over baseline full-cov, while the AFA-Var-Rk (delta = 0.997) method provides the best DCF_old. These results show that the proposed AFA transformation successfully suppresses nuisance directions in the feature space, producing i-vectors with better speaker-discriminating ability.
We also observe that the AFA i-vector systems are computationally less expensive than the baseline system, roughly by a factor of alpha.
This is expected, since the computational complexity of an i-vector system is proportional to the super-vector size [16].

Table 1: Comparison between baseline i-vector and AFA systems with respect to %EER, DCF_old, and DCF_new. Percent relative improvement (%r) and super-vector compression ratio (alpha) are also shown. Columns: System | alpha | %EER/%r | DCF_old/%r | DCF_new/%r. Rows: Baseline full-cov; AFA-Fix (q = 40, 48, 54); AFA-Var-En (E = 0.9, 0.95, 0.97); AFA-Var-Rk (delta = 0.99, 0.995, 0.997).

Table 2: Linear score fusion of baseline and AFA systems, reporting %EER, DCF_old, and DCF_new for the individual systems (i) baseline full-cov, (ii) baseline diag-cov, (iii) AFA-Var-En (E = 0.97), and (iv) AFA-Fix (q = 48), and for the fusions of (i) & (iii), (i) & (iv), and (i)-(iii).

4.3. Fusion of multiple systems

We pick four of our systems for fusion: (i) baseline full-cov, (ii) baseline diag-cov, (iii) AFA-Var-En (E = 0.97), and (iv) AFA-Fix (q = 48). The same PLDA N_EV parameter was used for systems (i)-(iii), with a larger value for system (iv). Simple linear fusion was used, with mean and variance normalization of scores to (0, 1) for calibration. From the results presented in Table 2, the fusion performance of (i) and (iii) clearly reveals that the AFA and baseline full-cov systems carry complementary information, as both the EER and DCF values improve. The best result is achieved by fusing systems (i), (ii), and (iii). A performance comparison of systems (i), (iii), and the fusion is shown in Fig. 3 using Detection Error Trade-off (DET) curves. Here again we observe the superiority of the proposed AFA-Var-En system over the baseline, especially in the low false-alarm region, whereas the fusion system performs better over the full range of the DET curve.
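The calibration-and-fusion recipe of Sec. 4.3 can be sketched as follows. This is a hedged illustration: the paper only states that simple linear fusion with mean/variance score normalization was used, so the equal weights, the `zscore` helper, and all score values below are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy detection scores from two systems over the same trial list (illustrative only).
n_trials = 1000
s1 = 3.0 * rng.normal(size=n_trials) + 1.0   # e.g., baseline full-cov scores
s2 = 0.5 * rng.normal(size=n_trials) - 4.0   # e.g., AFA-Var-En scores

def zscore(s):
    """Mean/variance normalization of scores to (0, 1), as used for calibration."""
    return (s - s.mean()) / s.std()

def linear_fusion(score_vectors, weights=None):
    """Simple linear fusion of per-system score vectors after normalization."""
    normalized = [zscore(s) for s in score_vectors]
    if weights is None:  # equal weights assumed; the paper does not specify them
        weights = np.full(len(normalized), 1.0 / len(normalized))
    return sum(w * s for w, s in zip(weights, normalized))

fused = linear_fusion([s1, s2])
print(fused.shape, fused.mean())
```

Normalizing each system's scores first puts them on a common scale, so neither system's raw dynamic range dominates the fused score.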
5. Conclusions

In this study, we have proposed a factor analysis model for acoustic features to compensate for transmission channel mismatch in speaker recognition. Using the model, we have developed a mixture-dependent feature transform that performs dimensionality reduction, de-correlation, variance normalization, and enhancement at once. Instead of operating as a separate front-end processing step, the proposed transform has been integrated within an i-vector speaker recognition framework using a probabilistic feature alignment technique. Experimental results have demonstrated the superiority of the proposed scheme over the baseline i-vector system.

Figure 3: DET curves (miss probability vs. false alarm probability) for the individual subsystems AFA-Var-En (E = 0.97) and baseline full-cov, and for the fusion of systems (i)-(iii) shown in Table 2.

6. References

[1] D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19-41, 2000.
[2] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Trans. Speech and Audio Processing, vol. 13, no. 3, May 2005.
[3] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 4, May 2007.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
[5] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech, Florence, Italy, Aug. 2011.
[6] P. Matejka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Cernocky, "Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification," in Proc. ICASSP, Prague, Czech Republic, May 2011.
[7] Y. Ephraim and H. L. Van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. Speech and Audio Processing, vol. 3, no. 4, pp. 251-266, Jul. 1995.
[8] B. Zhou and J. H. L. Hansen, "Rapid discriminative acoustic model based on eigenspace mapping for fast speaker adaptation," IEEE Trans. Speech and Audio Processing, vol. 13, no. 4, July 2005.
[9] T. Hasan and J. H. L. Hansen, "Factor analysis of acoustic features using a mixture of probabilistic principal component analyzers for robust speaker verification," in Proc. Odyssey, Singapore, June 2012.
[10] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443-482, 1999.
[11] S. Vaseghi, Advanced Signal Processing and Digital Noise Reduction. Wiley, 1996.
[12] A. McCree, D. Sturim, and D. Reynolds, "A new perspective on GMM subspace compensation based on PPCA and Wiener filtering," in Proc. Interspeech, Florence, Italy, Aug. 2011.
[13] T. W. Anderson, "Asymptotic theory for principal component analysis," The Annals of Mathematical Statistics, vol. 34, no. 1, pp. 122-148, 1963.
[14] J. Cadzow, "SVD representation of unitarily invariant matrices," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 32, no. 3, Jun. 1984.
[15] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. ICASSP, vol. 1, May 2006.
[16] O. Glembek, L. Burget, P. Matejka, M. Karafiat, and P. Kenny, "Simplification and optimization of i-vector extraction," in Proc. ICASSP, Prague, Czech Republic, May 2011.
Support Vector Machines using GMM Supervectors for Speaker Verification
1 Support Vector Machines using GMM Supervectors for Speaker Verification W. M. Campbell, D. E. Sturim, D. A. Reynolds MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420 Corresponding author e-mail:
More informationFront-End Factor Analysis For Speaker Verification
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING Front-End Factor Analysis For Speaker Verification Najim Dehak, Patrick Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet, Abstract This
More informationUsually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However,
Odyssey 2012 The Speaker and Language Recognition Workshop 25-28 June 2012, Singapore First attempt of Boltzmann Machines for Speaker Verification Mohammed Senoussaoui 1,2, Najim Dehak 3, Patrick Kenny
More informationImproving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data
Distribution A: Public Release Improving the Effectiveness of Speaker Verification Domain Adaptation With Inadequate In-Domain Data Bengt J. Borgström Elliot Singer Douglas Reynolds and Omid Sadjadi 2
More informationMulticlass Discriminative Training of i-vector Language Recognition
Odyssey 214: The Speaker and Language Recognition Workshop 16-19 June 214, Joensuu, Finland Multiclass Discriminative Training of i-vector Language Recognition Alan McCree Human Language Technology Center
More informationModified-prior PLDA and Score Calibration for Duration Mismatch Compensation in Speaker Recognition System
INERSPEECH 2015 Modified-prior PLDA and Score Calibration for Duration Mismatch Compensation in Speaker Recognition System QingYang Hong 1, Lin Li 1, Ming Li 2, Ling Huang 1, Lihong Wan 1, Jun Zhang 1
More informationSpeaker recognition by means of Deep Belief Networks
Speaker recognition by means of Deep Belief Networks Vasileios Vasilakakis, Sandro Cumani, Pietro Laface, Politecnico di Torino, Italy {first.lastname}@polito.it 1. Abstract Most state of the art speaker
More informationSpeaker Verification Using Accumulative Vectors with Support Vector Machines
Speaker Verification Using Accumulative Vectors with Support Vector Machines Manuel Aguado Martínez, Gabriel Hernández-Sierra, and José Ramón Calvo de Lara Advanced Technologies Application Center, Havana,
More informationNovel Quality Metric for Duration Variability Compensation in Speaker Verification using i-vectors
Published in Ninth International Conference on Advances in Pattern Recognition (ICAPR-2017), Bangalore, India Novel Quality Metric for Duration Variability Compensation in Speaker Verification using i-vectors
More informationA Small Footprint i-vector Extractor
A Small Footprint i-vector Extractor Patrick Kenny Odyssey Speaker and Language Recognition Workshop June 25, 2012 1 / 25 Patrick Kenny A Small Footprint i-vector Extractor Outline Introduction Review
More informationSCORE CALIBRATING FOR SPEAKER RECOGNITION BASED ON SUPPORT VECTOR MACHINES AND GAUSSIAN MIXTURE MODELS
SCORE CALIBRATING FOR SPEAKER RECOGNITION BASED ON SUPPORT VECTOR MACHINES AND GAUSSIAN MIXTURE MODELS Marcel Katz, Martin Schafföner, Sven E. Krüger, Andreas Wendemuth IESK-Cognitive Systems University
More informationAn Integration of Random Subspace Sampling and Fishervoice for Speaker Verification
Odyssey 2014: The Speaker and Language Recognition Workshop 16-19 June 2014, Joensuu, Finland An Integration of Random Subspace Sampling and Fishervoice for Speaker Verification Jinghua Zhong 1, Weiwu
More informationGain Compensation for Fast I-Vector Extraction over Short Duration
INTERSPEECH 27 August 2 24, 27, Stockholm, Sweden Gain Compensation for Fast I-Vector Extraction over Short Duration Kong Aik Lee and Haizhou Li 2 Institute for Infocomm Research I 2 R), A STAR, Singapore
More informationSession Variability Compensation in Automatic Speaker Recognition
Session Variability Compensation in Automatic Speaker Recognition Javier González Domínguez VII Jornadas MAVIR Universidad Autónoma de Madrid November 2012 Outline 1. The Inter-session Variability Problem
More informationUnifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication
Unifying Probabilistic Linear Discriminant Analysis Variants in Biometric Authentication Aleksandr Sizov 1, Kong Aik Lee, Tomi Kinnunen 1 1 School of Computing, University of Eastern Finland, Finland Institute
More informationMaximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems
Maximum Likelihood and Maximum A Posteriori Adaptation for Distributed Speaker Recognition Systems Chin-Hung Sit 1, Man-Wai Mak 1, and Sun-Yuan Kung 2 1 Center for Multimedia Signal Processing Dept. of
More informationAn Invariance Property of the Generalized Likelihood Ratio Test
352 IEEE SIGNAL PROCESSING LETTERS, VOL. 10, NO. 12, DECEMBER 2003 An Invariance Property of the Generalized Likelihood Ratio Test Steven M. Kay, Fellow, IEEE, and Joseph R. Gabriel, Member, IEEE Abstract
More informationHow to Deal with Multiple-Targets in Speaker Identification Systems?
How to Deal with Multiple-Targets in Speaker Identification Systems? Yaniv Zigel and Moshe Wasserblat ICE Systems Ltd., Audio Analysis Group, P.O.B. 690 Ra anana 4307, Israel yanivz@nice.com Abstract In
More informationMinimax i-vector extractor for short duration speaker verification
Minimax i-vector extractor for short duration speaker verification Ville Hautamäki 1,2, You-Chi Cheng 2, Padmanabhan Rajan 1, Chin-Hui Lee 2 1 School of Computing, University of Eastern Finl, Finl 2 ECE,
More informationJoint Factor Analysis for Speaker Verification
Joint Factor Analysis for Speaker Verification Mengke HU ASPITRG Group, ECE Department Drexel University mengke.hu@gmail.com October 12, 2012 1/37 Outline 1 Speaker Verification Baseline System Session
More informationSPEECH recognition systems based on hidden Markov
IEEE SIGNAL PROCESSING LETTERS, VOL. X, NO. X, 2014 1 Probabilistic Linear Discriminant Analysis for Acoustic Modelling Liang Lu, Member, IEEE and Steve Renals, Fellow, IEEE Abstract In this letter, we
EFFECTIVE ACOUSTIC MODELING FOR ROBUST SPEAKER RECOGNITION by Taufiq Hasan Al Banna APPROVED BY SUPERVISORY COMMITTEE: Dr. John H. L. Hansen, Chair Dr. Carlos Busso Dr. Hlaing Minn Dr. P. K. Rajasekaran
Use of Wijsman's Theorem for the Ratio of Maximal Invariant Densities in Signal Detection Applications Joseph R. Gabriel Naval Undersea Warfare Center Newport, RI 02841 Steven M. Kay University of Rhode
Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann Odyssey Speaker and Language Recognition Workshop
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
A Bound on Mean-Square Estimation Error Accounting for System Model Mismatch Wen Xu RD Instruments phone: 858-689-8682 email: wxu@rdinstruments.com Christ D. Richmond MIT Lincoln Laboratory email: christ@ll.mit.edu
Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &
International Journal of Engineering Science Invention Volume 1 Issue 1 December. 2012 PP.18-23 Estimation of Relative Operating Characteristics of Text Independent Speaker Verification Palivela Hema 1,
AL/EQ-TP-1993-0307 REGENERATION OF SPENT ADSORBENTS USING ADVANCED OXIDATION (PREPRINT) John T. Mourand, John C. Crittenden, David W. Hand, David L. Perram, Sawang Notthakun Department of Chemical Engineering
AFRL-DE-PS-JA-2007-1004 AFRL-DE-PS-JA-2007-1004 Noise Reduction in support-constrained multi-frame blind-deconvolution restorations as a function of the number of data frames and the support constraint
Odyssey 2018 The Speaker and Language Recognition Workshop 26-29 June 2018, Les Sables d Olonne, France Domain-invariant I-vector Feature Extraction for PLDA Speaker Verification Md Hafizur Rahman 1, Ivan
INTERSPEECH 2014 Using Deep Belief Networks for Vector-Based Speaker Recognition W. M. Campbell MIT Lincoln Laboratory, Lexington, MA, USA wcampbell@ll.mit.edu Abstract Deep belief networks (DBNs) have
Mixtures of Gaussians with Sparse Regression Matrices Constantinos Boulis, Jeffrey Bilmes {boulis,bilmes}@ee.washington.edu Dept of EE, University of Washington Seattle WA, 98195-2500 UW Electrical Engineering
Project # SERDP-MR-1572 Camp Butner UXO Data Inversion and Classification Using Advanced EMI Models Fridon Shubitidze, Sky Research/Dartmouth College Co-Authors: Irma Shamatava, Sky Research, Inc Alex
A United States Contribution to the International Hydrological Decade HEC-IHD-0600 Hydrologic Engineering Methods For Water Resources Development Volume 6 Water Surface Profiles July 1975 Approved for
Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or
Optimizing Robotic Team Performance with Probabilistic Model Checking Software Engineering Institute Carnegie Mellon University Pittsburgh, PA 15213 Sagar Chaki, Joseph Giampapa, David Kyle (presenting),
Analysis of mutual duration and noise effects in speaker recognition: benefits of condition-matched cohort selection in score normalization Andreas Nautsch, Rahim Saeidi, Christian Rathgeb, and Christoph
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Shallow Water Fluctuations and Communications H.C. Song Marine Physical Laboratory Scripps Institution of oceanography La Jolla, CA 92093-0238 phone: (858) 534-0954 fax: (858) 534-7641 email: hcsong@mpl.ucsd.edu
Stat260: Bayesian Modeling and Inference Lecture Date: March 10, 2010 Bayes Factors, g-priors, and Model Selection for Regression Lecturer: Michael I. Jordan Scribe: Tamara Broderick The reading for this lecture
Odyssey 2014: The Speaker and Language Recognition Workshop 16-19 June 2014, Joensuu, Finland Towards Duration Invariance of i-vector-based Adaptive Score Normalization Andreas Nautsch*, Christian Rathgeb,
Diagonal Representation of Certain Matrices Mark Tygert Research Report YALEU/DCS/RR-33 December 2, 2004 Abstract An explicit expression is provided for the characteristic polynomial of a matrix M of the
ISCA Archive http://www.isca-speech.org/archive ODYSSEY04 - The Speaker and Language Recognition Workshop Toledo, Spain May 3 - June 3, 2004 Analysis of Multitarget Detection for Speaker and Language Recognition*
INTERSPEECH 2017 August 20-24, 2017, Stockholm, Sweden A Generative Model for Score Normalization in Speaker Recognition Albert Swart and Niko Brümmer Nuance Communications, Inc. (South Africa) albert.swart@nuance.com,
MTIE AND TDEV ANALYSIS OF UNEVENLY SPACED TIME SERIES DATA AND ITS APPLICATION TO TELECOMMUNICATIONS SYNCHRONIZATION MEASUREMENTS Mingfu Li, Hsin-Min Peng, and Chia-Shu Liao Telecommunication Labs., Chunghwa
Attribution Concepts for Sub-meter Resolution Ground Physics Models 76 th MORS Symposium US Coast Guard Academy Approved for public release distribution. 2 Report Documentation Page Form Approved OMB No.
NUMERICAL SOLUTIONS FOR OPTIMAL CONTROL PROBLEMS UNDER SPDE CONSTRAINTS AFOSR grant number: FA9550-06-1-0234 Yanzhao Cao Department of Mathematics Florida A & M University Abstract The primary source of
Around the Speaker De-Identification (Speaker diarization for de-identification ++) Itshak Lapidot Moez Ajili Jean-Francois Bonastre The 2 Parts HDM based diarization System The homogeneity measure 2 Outline
System Reliability Simulation and Optimization by Component Reliability Allocation Zissimos P. Mourelatos Professor and Head Mechanical Engineering Department Oakland University Rochester MI 48309 Report
AFRL-OSR-VA-TR-2014-0234 STATISTICAL MACHINE LEARNING FOR STRUCTURED AND HIGH DIMENSIONAL DATA Larry Wasserman CARNEGIE MELLON UNIVERSITY 0 Final Report DISTRIBUTION A: Distribution approved for public
IMPROVED MATRIX PENCIL METHODS Biao Lu, Don Wei, Brian L. Evans, and Alan C. Bovik Dept. of Electrical and Computer Engineering The University of Texas at Austin, Austin, TX 7872-84 {blu,bevans,bovik}@ece.utexas.edu
RC24953 (W1003-002) March 1, 2010 Other IBM Research Report Training Universal Background Models for Speaker Recognition Mohamed Kamal Omar, Jason Pelecanos IBM Research Division Thomas J. Watson Research
Inhibition of blood cholinesterase activity is a poor predictor of acetylcholinesterase inhibition in brain regions of guinea pigs exposed to repeated doses of low levels of soman. Sally M. Anderson Report
Development and Application of Acoustic Metamaterials with Locally Resonant Microstructures AFOSR grant #FA9550-10-1-0061 Program manager: Dr. Les Lee PI: C.T. Sun Purdue University West Lafayette, Indiana
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
Systems and Computers in Japan, Vol. 37, No. 1, 2006 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J87-D-II, No. 6, June 2004, pp. 1199-1207 Adjustment of Sampling Locations in Rail-Geometry
INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Unsupervised Discriminative Training of PLDA for Domain Adaptation in Speaker Verification Qiongqiong Wang, Takafumi Koshinaka Data Science Research
Low-dimensional speech representation based on Factor Analysis and its applications! Najim Dehak and Stephen Shum! Spoken Language System Group! MIT Computer Science and Artificial Intelligence Laboratory!
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ {scott.rickard,radu.balan,justinian.rosca}@scr.siemens.com ABSTRACT
4 A Position error covariance for surface feature points For each surface feature point j, we first compute the normal û_j by using 9 of the neighboring points to fit a plane. Then we select m_u^j as a cross product between n_u^j and û_j to create an orthonormal basis: m_u^j = n_u^j × û_j (4). In order to create a 3D error ellipsoid
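The snippet above completes an orthonormal basis at a surface point by crossing the fitted normal with an independent unit direction. A minimal self-contained sketch of that step (the names `n`, `u_hat`, and `m` are illustrative, not from the excerpt):

```python
def cross(a, b):
    # Cross product of two 3-vectors.
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = sum(x * x for x in v) ** 0.5
    return [x / n for x in v]

# Given a unit normal n and an independent unit direction u_hat,
# m = n x u_hat is orthogonal to both, completing an orthonormal basis.
n = normalize([0.0, 0.0, 1.0])
u_hat = normalize([1.0, 0.0, 0.0])
m = cross(n, u_hat)
```

By construction `m` is perpendicular to both inputs, so `(u_hat, m, n)` forms a right-handed orthonormal frame when the inputs are orthonormal.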
US Army Corps of Engineers Hydrologic Engineering Center Comparative Analysis of Flood Routing Methods September 1980 Approved for Public Release. Distribution Unlimited. RD-24 REPORT DOCUMENTATION PAGE
Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer
Vol. 35, No. 5 ACTA AUTOMATICA SINICA May, 2009 Studies on Model Distance Normalization Approach in Text-independent Speaker Verification DONG Yuan LU Liang ZHAO Xian-Yu ZHAO Jian Abstract Model distance
Bayesian Analysis of Speaker Diarization with Eigenvoice Priors Patrick Kenny Centre de recherche informatique de Montréal Patrick.Kenny@crim.ca A year in the lab can save you a day in the library. Panu
Boundary Elements and Other Mesh Reduction Methods XXXV 13 Wire antenna model of the vertical grounding electrode D. Poljak & S. Sesnic University of Split, FESB, Split, Croatia Abstract A straight wire antenna
DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Closed-form and Numerical Reverberation and Propagation: Inclusion of Convergence Effects Chris Harrison Centre for Marine
i-vector and GMM-UBM Bie Fanhu CSLT, RIIT, THU 2013-11-18 Framework 1. GMM-UBM Features are extracted frame by frame, so the number of feature vectors is not fixed. Gaussian mixtures are used to fit all the features. The mixtures
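The GMM-UBM framework summarized above scores a variable-length sequence of per-frame features against a Gaussian mixture. A hedged sketch of the average per-frame log-likelihood computation, assuming a diagonal-covariance mixture (function names are mine, not from the slides):

```python
import math

def logsumexp(xs):
    # Numerically stable log(sum(exp(x))) over mixture components.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def diag_gauss_logpdf(x, mean, var):
    # log N(x; mean, diag(var)) for a single feature frame.
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
               for xi, mu, v in zip(x, mean, var))

def gmm_avg_loglik(frames, weights, means, vars_):
    # Average per-frame log-likelihood under the GMM (e.g. the UBM);
    # works for any number of frames, since features are per-frame.
    total = 0.0
    for x in frames:
        total += logsumexp([math.log(w) + diag_gauss_logpdf(x, mu, v)
                            for w, mu, v in zip(weights, means, vars_)])
    return total / len(frames)
```

A verification score would then compare this quantity under a speaker-adapted model against the same quantity under the UBM.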
MODEL SELECTION CRITERIA FOR ACOUSTIC SEGMENTATION Mauro Cettolo and Marcello Federico ITC-irst - Centro per la Ricerca Scientifica e Tecnologica I-385 Povo, Trento, Italy ABSTRACT Robust acoustic
Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard
INTERSPEECH 203 Automatic Regularization of Cross-entropy Cost for Speaker Recognition Fusion Ville Hautamäki, Kong Aik Lee 2, David van Leeuwen 3, Rahim Saeidi 3, Anthony Larcher 2, Tomi Kinnunen, Taufiq
Theory and Practice of Data Assimilation in Ocean Modeling Robert N. Miller College of Oceanic and Atmospheric Sciences Oregon State University Oceanography Admin. Bldg. 104 Corvallis, OR 97331-5503 Phone:
A wavelet-based generalization of the multifractal formalism from scalar to vector valued d- dimensional random fields : from theoretical concepts to experimental applications P. Kestener and A. Arneodo
Predictive Model for Archaeological Resources Marine Corps Base Quantico, Virginia John Haynes Jesse Bellavance Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the
Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }
Mixture Distributions for Modeling Lead Time Demand in Coordinated Supply Chains Barry Cobb Virginia Military Institute Alan Johnson Air Force Institute of Technology AFCEA Acquisition Research Symposium
Fleet Maintenance Simulation With Insufficient Data Zissimos P. Mourelatos Mechanical Engineering Department Oakland University mourelat@oakland.edu Ground Robotics Reliability Center (GRRC) Seminar 17
DISTRIBUTION STATEMENT A: Approved for public release; distribution is unlimited. SW06 Shallow Water Acoustics Experiment Data Analysis James F. Lynch MS #12, Woods Hole Oceanographic Institution Woods
Feature-space Speaker Adaptation for Probabilistic Linear Discriminant Analysis Acoustic Models Liang Lu, Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu,
Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short
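The lecture snippet above notes that acoustic feature extraction begins by sampling the signal and focusing on short frames. A minimal illustration of that framing step (the function name and parameters are my own, not from the lecture):

```python
def frame_signal(samples, frame_len, hop):
    """Split discrete samples into overlapping short-time frames.

    Frames that would run past the end of the signal are dropped,
    so every returned frame has exactly frame_len samples.
    """
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

In practice frame lengths of about 20-25 ms with a 10 ms hop are typical, and each frame is then windowed before spectral analysis.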
Environmental Sound Classification in Realistic Situations K. Haddad, W. Song Brüel & Kjær Sound and Vibration Measurement A/S, Skodsborgvej 307, 2850 Nærum, Denmark. X. Valero La Salle, Universitat Ramon
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Engineering Columbia University 2007 IEEE Workshop on Applications
Robust Sound Event Detection in Continuous Audio Environments Haomin Zhang 1, Ian McLoughlin 2,1, Yan Song 1 1 National Engineering Laboratory of Speech and Language Information Processing The University
Wavelet Spectral Finite Elements for Wave Propagation in Composite Plates Award no: AOARD-0904022 Submitted to Dr Kumar JATA Program Manager, Thermal Sciences Directorate of Aerospace, Chemistry and Materials
On Applying Point-Interval Logic to Criminal Forensics (Student Paper) Mashhood Ishaque Abbas K. Zaidi Alexander H. Levis Command and Control Research and Technology Symposium 1 Report Documentation Page
REPORT DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 Public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,
CRS Report for Congress Received through the CRS Web Order Code RS21396 Updated May 26, 2006 Summary Iraq: Map Sources Hannah Fischer Information Research Specialist Knowledge Services Group This report
DISTRIBUTION STATEMENT A: Approved for public release; distribution is unlimited. Analysis of South China Sea Shelf and Basin Acoustic Transmission Data Ching-Sang Chiu Department of Oceanography Naval
Thermo-Kinetic Model of Burning for Polymeric Materials Stanislav I. Stoliarov a, Sean Crowley b, Richard Lyon b a University of Maryland, Fire Protection Engineering, College Park, MD 20742 b FAA W. J.
Extension of the BLT Equation to Incorporate Electromagnetic Field Propagation Fredrick M. Tesche Chalmers M. Butler Holcombe Department of Electrical and Computer Engineering 336 Fluor Daniel EIB Clemson
2003 Conference on Information Sciences and Systems, The Johns Hopkins University, March 12-14, 2003 Efficient method for obtaining parameters of stable pulse in grating compensated dispersion-managed communication
Distributed Binary Quantizers for Communication Constrained Large-scale Sensor Networks Ying Lin and Biao Chen Dept. of EECS Syracuse University Syracuse, NY 13244, U.S.A. ylin20 {bichen}@ecs.syr.edu Peter
VLBA IMAGING OF SOURCES AT 24 AND 43 GHZ D.A. BOBOLTZ 1, A.L. FEY 1, P. CHARLOT 2,3 & THE K-Q VLBI SURVEY COLLABORATION 1 U.S. Naval Observatory 3450 Massachusetts Ave., NW, Washington, DC, 20392-5420,
Nonlinear Model Reduction of Differential Algebraic Equation (DAE) Systems Chuili Sun and Juergen Hahn Department of Chemical Engineering Texas A&M University College Station, TX 77843-3 hahn@tamu.edu Prepared for
Scattering of Internal Gravity Waves at Finite Topography Peter Muller University of Hawaii Department of Oceanography 1000 Pope Road, MSB 429 Honolulu, HI 96822 phone: (808)956-8081 fax: (808)956-9164
AFRL-MN-EG-TP-2006-7403 INFRARED SPECTROSCOPY OF HYDROGEN CYANIDE IN SOLID PARAHYDROGEN (BRIEFING CHARTS) C. Michael Lindsay, National Research Council, Post Doctoral Research Associate Mario E. Fajardo
Naval Research Laboratory Washington, DC 20375-5320 NRL/MR/5708--15-9629 Parametric Models of NIR Transmission and Reflectivity Spectra for Dyed Fabrics D. Aiken S. Ramsey T. Mayo Signature Technology
GMM-based classification from noisy features Alexey Ozerov, Mathieu Lagrange and Emmanuel Vincent INRIA, Centre de Rennes - Bretagne Atlantique STMS Lab IRCAM - CNRS - UPMC alexey.ozerov@inria.fr, mathieu.lagrange@ircam.fr,