Robust Speaker Modeling Based on Constrained Nonnegative Tensor Factorization


Qiang Wu, Liqing Zhang, and Guangchuan Shi
Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract. Nonnegative tensor factorization is an extension of nonnegative matrix factorization (NMF) to the multilinear case, where nonnegativity constraints are imposed on the PARAFAC/Tucker model. In this paper, to identify speakers in noisy environments, we propose a new method based on the PARAFAC model called constrained Nonnegative Tensor Factorization (cNTF). The speech signal is encoded as a general higher order tensor in order to learn the basis functions from multiple interrelated feature subspaces. We simulate a cochlear-like peripheral auditory stage motivated by the human auditory perception mechanism. A sparse speech feature representation is extracted by cNTF and used for robust speaker modeling. Orthogonality and nonsmooth sparseness-control constraints are further imposed on the PARAFAC model in order to preserve the useful information of each feature subspace in the higher order tensor. An alternating projection algorithm is applied to obtain a stable solution. Experimental results demonstrate that our method improves recognition accuracy, particularly in noisy environments.

1 Introduction

Speaker recognition is the task of determining the identity of a person from his or her voice, and it has many potential applications in industry, business, and security. For a speaker recognition system, feature extraction is one of the most important tasks; it aims at finding succinct, robust, and discriminative features from acoustic data. Acoustic features such as linear predictive cepstral coefficients (LPCC) [1], mel-frequency cepstral coefficients (MFCC) [1], and perceptual linear predictive coefficients (PLP) [2] are commonly used.
Conventional speaker modeling methods such as Gaussian mixture models (GMM) [3] achieve very high performance in speaker identification and verification tasks on high-quality data when training and testing conditions are well controlled. In real applications, however, such systems usually do not perform well on speech signals corrupted by adverse conditions such as environmental noise and channel distortions. Feature compensation techniques [2,4] such as CMS and RASTA have been developed for robust speech recognition. Spectral subtraction [5] and subspace-based filtering [6] techniques, which assume a priori knowledge of the noise spectrum, have been widely used because of their simplicity.

Recently, computational auditory nerve models and sparse coding have attracted much attention from both the neuroscience and speech signal processing communities. Smith et al. [7] proposed an algorithm for learning efficient auditory codes using a theoretical model for coding sound in terms of spikes. Much research on sparse coding and representation for sound and speech [8,9,10] has also proved useful for auditory modeling and speech separation, which makes it a potential route to robust speech feature extraction.

F. Sun et al. (Eds.): ISNN 2008, Part I, LNCS 5263. © Springer-Verlag Berlin Heidelberg 2008

As a powerful data modeling tool for pattern recognition, the multilinear algebra of higher order tensors has been proposed as a mathematical framework for manipulating the multiple factors underlying the observations. Common tensor decomposition methods currently include: (1) the CANDECOMP/PARAFAC model [11,12,13]; (2) the Tucker model [14,15]; (3) Nonnegative Tensor Factorization (NTF), which imposes a nonnegativity constraint on the CANDECOMP/PARAFAC model [16,17]. In computer vision applications, multilinear ICA [18] and tensor discriminant analysis [19] have been applied to image representation and recognition, where they improve recognition performance.

In this paper, we propose a new feature extraction method for robust speaker recognition based on an auditory periphery model and tensor factorization. A novel tensor factorization method called cNTF is derived by imposing orthogonality and nonnegativity constraints on the tensor structure. The advantages of our feature extraction method are the following: (1) simulating the human auditory perception mechanism provides a higher frequency resolution at low frequencies, which helps to obtain robust spectro-temporal features; (2) a supervised feature extraction procedure via cNTF learns the basis functions of multiple related feature subspaces, which preserve the individual spectro-temporal information in the tensor structure; furthermore, the orthogonality constraint minimizes redundancy between different basis functions; (3) the sparseness constraint on cNTF enhances the energy concentration of the speech signal, which preserves the useful features during noise reduction.
The sparse tensor feature extracted by cNTF can be further processed, via the discrete cosine transform, into a representation called the auditory-based nonnegative tensor feature (ANTF), which can be used as a feature for speaker recognition.

2 Method

2.1 Multilinear Algebra and PARAFAC Model

Multilinear algebra is the algebra of higher order tensors. A tensor is a higher order generalization of a matrix. Let $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ denote a tensor. The order of $\mathcal{X}$ is $M$. An element of $\mathcal{X}$ is denoted by $x_{n_1,n_2,\ldots,n_M}$, where $1 \le n_d \le N_d$ and $1 \le d \le M$. The mode-$d$ matricization, or matrix unfolding, of an $M$th-order tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ rearranges the elements of $\mathcal{X}$ to form the matrix $X_{(d)} \in \mathbb{R}^{N_d \times (N_{d+1} N_{d+2} \cdots N_M N_1 \cdots N_{d-1})}$, whose columns are the vectors in $\mathbb{R}^{N_d}$ obtained by varying the index $n_d$ while keeping the other indices fixed. Matricizing a tensor is similar to vectorizing a matrix.

The PARAFAC model was suggested independently by Carroll and Chang [11] under the name CANDECOMP (canonical decomposition) and by Harshman [12] under the name PARAFAC (parallel factor analysis), and it has gained increasing attention in the data mining field. The model has a structural resemblance to many physical models of common real-world data, and its uniqueness property implies that data following the PARAFAC model can be uniquely decomposed into individual contributions.
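The mode-d matricization described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function name `unfold` is hypothetical, modes are zero-based, and the columns follow the cyclic order N_{d+1}, ..., N_M, N_1, ..., N_{d-1} with the last of these indices varying fastest, which is an assumed convention (other conventions differ only by a column permutation).

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-d matricization with cyclic column ordering (zero-based mode).

    Rows are indexed by n_d; columns enumerate the remaining indices in the
    order d+1, ..., M-1, 0, ..., d-1 (assumed convention, last index fastest).
    """
    order = tensor.ndim
    axes = [mode] + [(mode + k) % order for k in range(1, order)]
    return np.transpose(tensor, axes).reshape(tensor.shape[mode], -1)

# A 2 x 3 x 4 tensor unfolds into 2 x 12, 3 x 8 and 4 x 6 matrices.
X = np.arange(24).reshape(2, 3, 4)
for d in range(X.ndim):
    print(d, unfold(X, d).shape)
```

Each row of `unfold(X, d)` collects all entries of the tensor that share one value of the index n_d, which is the property the later cost functions rely on.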

An $M$-way tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$ is rank-1 if it can be represented by the outer product of $M$ vectors:

$$\mathcal{X} = a^{(1)} \circ a^{(2)} \circ \cdots \circ a^{(M)}, \tag{1}$$

where $\circ$ is the outer product operator and $a^{(d)} \in \mathbb{R}^{N_d}$ for $d = 1, 2, \ldots, M$. The rank of a tensor $\mathcal{X}$, denoted $R = \mathrm{rank}(\mathcal{X})$, is the minimal number of rank-1 tensors required to yield $\mathcal{X}$:

$$\mathcal{X} = \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}, \tag{2}$$

where $A^{(d)}_{:,r}$ represents the $r$th column vector of the mode matrix $A^{(d)} \in \mathbb{R}^{N_d \times R}$. The PARAFAC model aims to find a rank-$R$ approximation of the tensor $\mathcal{X}$,

$$\mathcal{X} \approx \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}. \tag{3}$$

The PARAFAC model can also be written in matrix notation by use of the Khatri-Rao product, which gives the equivalent expressions

$$X_{(d)} \approx A^{(d)} \left[ A^{(d-1)} \odot \cdots \odot A^{(1)} \odot A^{(M)} \odot \cdots \odot A^{(d+1)} \right]^T, \tag{4}$$

where $\odot$ is the Khatri-Rao product operator.

2.2 Constrained Nonnegative Tensor Factorization

Given a nonnegative $M$-way tensor $\mathcal{X} \in \mathbb{R}^{N_1 \times N_2 \times \cdots \times N_M}$, nonnegative tensor factorization (NTF) seeks a factorization of $\mathcal{X}$ in the form

$$\mathcal{X} \approx \hat{\mathcal{X}} = \sum_{r=1}^{R} A^{(1)}_{:,r} \circ A^{(2)}_{:,r} \circ \cdots \circ A^{(M)}_{:,r}, \tag{5}$$

where the mode matrices $A^{(d)} \in \mathbb{R}^{N_d \times R}$ for $d = 1, \ldots, M$ are restricted to have only nonnegative elements. In order to find an approximate tensor factorization $\hat{\mathcal{X}}$, we can construct a least-squares cost function $J_{LS}$ and a KL-divergence cost function $J_{KL}$ based on the approximate factorization model (4). The cost functions with mode matrices $A^{(d)}$ are given by

$$J_{LS1}(A^{(d)}) = \frac{1}{2} \sum_{d=1}^{M} \left\| X_{(d)} - A^{(d)} Z^{(d)} \right\|_F^2 = \frac{1}{2} \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{\bar{N}_d} \left( [X_{(d)}]_{pq} - [A^{(d)} Z^{(d)}]_{pq} \right)^2 \tag{6}$$

$$J_{KL1}(A^{(d)}) = \sum_{d=1}^{M} D\left( X_{(d)} \,\|\, A^{(d)} Z^{(d)} \right) = \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{\bar{N}_d} \left( [X_{(d)}]_{pq} \log \frac{[X_{(d)}]_{pq}}{[A^{(d)} Z^{(d)}]_{pq}} - [X_{(d)}]_{pq} + [A^{(d)} Z^{(d)}]_{pq} \right) \tag{7}$$

where $Z^{(d)} = \left[ A^{(d-1)} \odot \cdots \odot A^{(1)} \odot A^{(M)} \odot \cdots \odot A^{(d+1)} \right]^T$ and $\bar{N}_d = \prod_{j \ne d}^{M} N_j$. These cost functions are quite similar to those of NMF [20]: they perform a matrix factorization in each mode and minimize the error over all modes.

To this model we can add a constraint that makes the basis functions as orthogonal as possible, i.e. that minimizes the redundancy between different basis functions. This orthogonality constraint can be imposed by minimizing $\sum_{p \ne q} [A^{(d)T} A^{(d)}]_{pq}$.

For traditional NMF methods, many approaches have been proposed to control the sparseness through additional constraints or penalization terms. These constraints or penalizations can be applied to the basis vectors or to both the basis and encoding vectors. The nsNMF model [22] proposed a factorization model $V = WSH$ with a smoothing matrix $S \in \mathbb{R}^{q \times q}$ given by

$$S = (1 - \theta) I + \frac{\theta}{q} \mathbf{1}\mathbf{1}^T, \tag{8}$$

where $I$ is the identity matrix, $\mathbf{1}$ is a vector of ones, and the parameter $\theta$ satisfies $0 \le \theta \le 1$. For $\theta = 0$, the model (8) is equivalent to the original NMF. As $\theta \to 1$, stronger smoothness is imposed by $S$, leading to stronger sparseness in both $W$ and $H$. With this nonsmooth approach, we can control the sparseness of the basis vectors and encoding vectors while maintaining the faithfulness of the model to the data. The same idea can be applied to NTF. The corresponding cost functions with orthogonality and sparseness-control constraints are then given by

$$J_{LS2}(A^{(d)}) = \frac{1}{2} \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{\bar{N}_d} \left( [X_{(d)}]_{pq} - [A^{(d)} S Z^{(d)}]_{pq} \right)^2 + \alpha \sum_{p \ne q} [A^{(d)T} A^{(d)}]_{pq} \tag{9}$$

$$J_{KL2}(A^{(d)}) = \sum_{d=1}^{M} \sum_{p=1}^{N_d} \sum_{q=1}^{\bar{N}_d} \left( [X_{(d)}]_{pq} \log \frac{[X_{(d)}]_{pq}}{[A^{(d)} S Z^{(d)}]_{pq}} - [X_{(d)}]_{pq} + [A^{(d)} S Z^{(d)}]_{pq} \right) + \alpha \sum_{p \ne q} [A^{(d)T} A^{(d)}]_{pq} \tag{10}$$

where $\alpha > 0$ is a balancing parameter between reconstruction and orthogonality.
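To make the least-squares objective concrete, the NumPy sketch below checks the matrix form (4) on a synthetic 3-way tensor, builds the smoothing matrix of Eq. (8), evaluates the single-mode term of cost (9), and applies one multiplicative step of the form the paper states as its LS update rule. Everything here is an illustrative assumption rather than the authors' implementation: the helper names are hypothetical, the unfolding uses a cyclic column order that differs from Eq. (4) only by a column permutation, and a small `eps` is added to the denominator to avoid division by zero.

```python
from functools import reduce
import numpy as np

def unfold(tensor, mode):
    """Mode-d unfolding, remaining axes in cyclic order (assumed convention)."""
    order = tensor.ndim
    axes = [mode] + [(mode + k) % order for k in range(1, order)]
    return np.transpose(tensor, axes).reshape(tensor.shape[mode], -1)

def khatri_rao(B, C):
    """Column-wise Kronecker product: row j*K + k holds B[j, :] * C[k, :]."""
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, B.shape[1])

def z_matrix(factors, mode):
    """Z^(d): transposed Khatri-Rao product of all factors except `mode`,
    ordered to match the column order used by `unfold`."""
    order = len(factors)
    others = [factors[(mode + k) % order] for k in range(1, order)]
    return reduce(khatri_rao, others).T

def smoothing_matrix(R, theta):
    """S = (1 - theta) I + (theta / R) 1 1^T, as in Eq. (8) with q = R."""
    return (1.0 - theta) * np.eye(R) + (theta / R) * np.ones((R, R))

def cost_ls(X, factors, mode, theta, alpha):
    """Single-mode term of the regularized least-squares cost (9)."""
    A = factors[mode]
    S = smoothing_matrix(A.shape[1], theta)
    resid = unfold(X, mode) - A @ S @ z_matrix(factors, mode)
    gram = A.T @ A
    penalty = gram.sum() - np.trace(gram)   # sum over p != q of [A^T A]_pq
    return 0.5 * np.sum(resid ** 2) + alpha * penalty

def update_ls(X, factors, mode, theta, alpha, eps=1e-12):
    """One multiplicative least-squares step for the mode matrix A^(d)."""
    A = factors[mode]
    S = smoothing_matrix(A.shape[1], theta)
    W = S @ z_matrix(factors, mode)                # R x (product of other dims)
    numer = unfold(X, mode) @ W.T
    others = A.sum(axis=1, keepdims=True) - A      # sum over p != j of A[i, p]
    denom = A @ (W @ W.T) + alpha * others + eps   # eps is an added safeguard
    return A * numer / denom

rng = np.random.default_rng(0)
true_factors = [rng.random((n, 3)) for n in (4, 5, 6)]
X = np.einsum("ir,jr,kr->ijk", *true_factors)     # exact rank-3 tensor
print(np.allclose(unfold(X, 0), true_factors[0] @ z_matrix(true_factors, 0)))
```

Because the update only multiplies nonnegative quantities, a nonnegative starting point stays nonnegative. In an alternating scheme the step would be applied to each mode in turn while the other mode matrices are held fixed.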
We can derive multiplicative learning algorithms for the mode matrices $A^{(d)}$ using the exponentiated gradient, similar to those in NMF. Element-wise update rules for minimizing the cost functions (9) and (10) are derived directly, as done in [16,17]:

LS:

$$A^{(d)}_{ij} \leftarrow A^{(d)}_{ij} \, \frac{[X_{(d)} Z^{(d)T} S^T]_{ij}}{[A^{(d)} S Z^{(d)} Z^{(d)T} S^T]_{ij} + \alpha \sum_{p \ne j} [A^{(d)T}]_{pi}} \tag{11}$$

KL:

$$A^{(d)}_{ij} \leftarrow A^{(d)}_{ij} \, \frac{\sum_{k} [S Z^{(d)}]_{jk} \, [X_{(d)}]_{ik} / [A^{(d)} S Z^{(d)}]_{ik}}{\sum_{k} [S Z^{(d)}]_{jk} + \alpha \sum_{p \ne j} [A^{(d)T}]_{pi}} \tag{12}$$

3 Feature Extraction Based on Auditory Model and Tensor Representation

The human auditory system has a powerful capability for speech recognition and speaker recognition. Much research on auditory models has shown that features based on a simulation of the auditory system are more robust than traditional features under noisy conditions. In our feature extraction framework, we calculate the frequency selectivity information by imitating the processing performed in the auditory periphery and pathway. The robust speech features are then obtained by projecting the extracted auditory information into multiple interrelated feature subspaces via cNTF. A diagram of the feature extraction and speaker recognition framework is shown in Figure 1.

Fig. 1. Feature extraction and recognition framework (block labels: Pre-Emphasis, Cochlear Filters, Nonlinearity, X, A, DCT, GMM, Recognition Result)

3.1 Feature Extraction Based on Auditory Model

We extract the features by imitating the processing that occurs in the auditory periphery and pathway, such as the outer ear, middle ear, basilar membrane, inner hair cells, auditory nerves, and cochlear nucleus. We apply traditional pre-emphasis to model the combined outer and middle ear functions, $x_{pre}(t) = x(t) - 0.97\,x(t-1)$, where $x(t)$ is the discrete-time speech signal, $t = 1, 2, \ldots$, and $x_{pre}(t)$ is the filtered output signal. The frequency selectivity of the peripheral auditory system, such as the basilar membrane, is simulated by a bank of cochlear filters, which have an impulse response of the following form:

$$g_i(t) = a_i \, t^{\,n-1} e^{-2\pi b_i \mathrm{ERB}(f_i) t} \cos(2\pi f_i t + \phi_i), \quad (1 \le i \le N), \tag{13}$$

where $n$ is the order of the filters and $N$ is the number of filters in the bank. For the $i$th filter, $f_i$ is the center frequency, $\mathrm{ERB}(f_i)$ is the equivalent rectangular bandwidth (ERB) of the auditory filter, $\phi_i$ is the phase, and $a_i, b_i \in \mathbb{R}$ are constants, where $b_i$ determines the rate of decay of the impulse response, which is related to the bandwidth. In order to model the nonlinearity of the inner hair cells, we compute the power of each band in every frame $k$ with a logarithmic nonlinearity:

$$P(i,k) = \log\Big( 1 + \gamma \sum_{t \in \text{frame } k} \{ x_g^i(t) \}^2 \Big), \tag{14}$$

where $P(i,k)$ is the output power, $\gamma$ is a scaling constant, and $x_g^i(t) = \sum_{\tau} x_{pre}(\tau)\, g_i(t - \tau)$ is the output of the $i$th gammatone filter. This model can be regarded as the average firing rates in the inner hair cells, which simulate the higher auditory pathway. The resulting power feature vector $P(i,k)$ at frame $k$, with component index $i$ corresponding to frequency $f_i$, constitutes the spectro-temporal power representation of the auditory response. Similar to the mel-scale processing in MFCC extraction, this power spectrum provides a much higher frequency resolution at low frequencies than at high frequencies.

3.2 Sparse Tensor Representation

In order to extract robust features based on the tensor structure, we model the cochlear power features of different speakers as a 3rd-order tensor $\mathcal{X} \in \mathbb{R}^{N_f \times N_t \times N_s}$. Each feature tensor is an array with three modes, frequency × time × speaker identity, which comprises the cochlear power feature matrices $X \in \mathbb{R}^{N_f \times N_t}$ of the different speakers. We then transform the auditory feature tensor into multiple interrelated subspaces by cNTF to learn the basis functions $A^{(d)}$ $(d = 1, 2, 3)$. Figure 2 shows the tensor model for the calculation of the basis functions. Compared with traditional subspace learning methods, the extracted tensor features may characterize the differences between speakers and preserve the discriminative information for classification.

Fig. 2. Tensor model for calculation of basis functions via cNTF

As described in Section 3.1, the cochlear power feature can be regarded as the response of neurons in the inner hair cells. The hair cells have receptive fields, which refer to a coding of sound frequency. Here we employ the sparse localized basis functions $A \in \mathbb{R}^{N_f \times R}$ in the time-frequency subspace to transform the auditory feature into the sparse feature subspace, where $R$ is the dimension of the sparse feature subspace. The representation of the auditory sparse feature $X_s$ is obtained via the following transformation:

$$X_s = \hat{A} X \tag{15}$$

where $\hat{A}$ consists of the nonnegative elements of $A^{-1}$, i.e. $\hat{A} = [A^{-1}]_+$. Figure 3(a) shows an example of basis functions in the spectro-temporal domain. From this result we can see that most elements of the basis functions are close to zero, which accords with the sparseness constraint of cNTF. Figure 3(b) gives several examples of the encoding feature vector after the transformation, which also show the sparse character of the feature.

Fig. 3. Results of cNTF applied to the clean speech data. (a) Basis functions (100 × 80) in the spectro-temporal domain. (b) Examples of encoding feature vectors.

Our feature extraction model is based on the fact that in sparse coding the energy of the signal is concentrated on only a few components, while the energy of additive noise remains uniformly spread over all the components. As in a soft-threshold operation, the absolute values of the sparse coding components are compressed towards zero. The noise is reduced while the signal is not strongly affected. We also impose an orthogonality constraint on cNTF, which helps to extract useful features by minimizing the redundancy between different basis functions.

4 Experimental Results

In this section we provide the evaluation results of a speaker identification system using ANTF. The Aurora2 speech corpus, which is designed to evaluate speech recognition algorithms in noisy conditions, is used to test the recognition performance. Different noise classes were considered to evaluate the performance of ANTF against MFCC, Mel-NMF, and Mel-PCA features, and the identification accuracy was assessed. In our experiments the sampling rate of the speech signals was 8 kHz. For the given speech signals, we employed a time window of 5 s. For computational simplicity, we selected 36 cochlear filter banks and a time duration of 10 samples (1.25 ms). The dimension of the speaker data is then 36 × 10 = 360.
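The auditory front end of Section 3.1, with the settings quoted above (8 kHz sampling, 36 cochlear filters, patches of 10 time steps), can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the ERB formula (Glasberg-Moore approximation), the filter order n = 4, the constant b = 1.019, the geometric spacing of center frequencies, the 80-sample frame length, and gamma = 1 are all hypothetical choices that the paper does not specify.

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """x_pre(t) = x(t) - 0.97 x(t-1); the first sample is passed through."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= coeff * x[:-1]
    return y

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (assumed Glasberg-Moore form)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, b=1.019, duration=0.032, phase=0.0):
    """Sampled impulse response of Eq. (13), scaled to unit peak (choice of a_i)."""
    t = np.arange(int(duration * fs)) / fs
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))

def cochlear_power(x, fs=8000, n_filters=36, frame_len=80, gamma=1.0):
    """Log band power P(i, k) of Eq. (14); frame_len is a hypothetical setting."""
    x_pre = pre_emphasis(x)
    centers = np.geomspace(100.0, 0.45 * fs, n_filters)  # assumed spacing
    n_frames = len(x_pre) // frame_len
    P = np.empty((n_filters, n_frames))
    for i, fc in enumerate(centers):
        y = np.convolve(x_pre, gammatone_ir(fc, fs))[: len(x_pre)]
        frames = y[: n_frames * frame_len].reshape(n_frames, frame_len)
        P[i] = np.log1p(gamma * np.sum(frames ** 2, axis=1))
    return P

signal = np.random.default_rng(0).standard_normal(8000)  # 1 s of noise at 8 kHz
P = cochlear_power(signal)
patch = P[:, :10].reshape(-1)   # 36 filters x 10 time steps -> 360 values
print(P.shape, patch.shape)
```

Stacking such 36 × 10 patches over time and over speakers gives the frequency × time × speaker tensor that Section 3.2 factorizes.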
We calculated the basis functions using cNTF after computing the cochlear power features. For learning the basis functions in the different subspaces, 550 sentences (5 sentences per person) were selected randomly as the training data, and a 200-dimensional sparse tensor representation was extracted. In order to estimate the speaker model and test the efficiency of our method, we used 5500 sentences (50 sentences per person) as training data and 1320 sentences (12 sentences per person), mixed with different kinds of noise, as testing data. The testing data were mixed with subway, babble, car, and exhibition hall noise at SNRs of 20 dB, 15 dB, 10 dB, and 5 dB. For the final feature set, 16 cepstral coefficients were extracted and used for speaker modeling. A GMM with 64 Gaussian mixtures was used to build the recognizer. For comparison, the performance of MFCC, Mel-NMF, and Mel-PCA with 16th-order cepstral coefficients was also tested. We use PCA and NMF to learn a part-based representation in the spectro-temporal domain after mel filtering, similar to [9]. The feature after PCA or NMF projection was further processed into the cepstral domain via the discrete cosine transform.

Table 1. Identification accuracy in four noisy conditions (subway, car noise, babble, exhibition hall) for the Aurora2 noise testing dataset. The table reports, for each noise type and SNR (dB), the accuracy (%) of ANTF, Mel-NMF, Mel-PCA, and MFCC.

Table 1 presents the identification accuracy obtained by ANTF and the baseline systems in all testing conditions. We can observe from Table 1 that the performance of ANTF degrades more slowly with increasing noise intensity than that of the other features. It performs better than the other three features in high-noise conditions such as the 5 dB condition. Figure 4 shows the identification rate in the four noisy conditions averaged over SNRs between 5 and 20 dB, and the overall average accuracy across all the conditions. The results suggest that this auditory-based tensor representation feature is robust against additive noise, which indicates the potential of the new feature for dealing with a wider variety of noisy conditions.

Fig. 4. Identification accuracy in four noisy conditions (Subway, Babble, Car noise, Exhibition hall) averaged over SNRs between 5 and 20 dB, and the overall average accuracy across all the conditions, for ANTF and the other three features (Mel-NMF, Mel-PCA, MFCC) using the Aurora2 noise testing dataset
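The last two steps of the pipeline, the projection of Eq. (15) followed by the discrete cosine transform that yields 16 cepstral coefficients, can be sketched as below. The sketch rests on assumptions: since the basis matrix is not square, the Moore-Penrose pseudoinverse is used here in place of the inverse in Eq. (15); the DCT is the unnormalized DCT-II; and the function names and the random stand-in data are hypothetical. Only the sizes (360-dimensional data, 200-dimensional sparse representation, 16 coefficients) come from the text.

```python
import numpy as np

def dct_matrix(n_out, n_in):
    """First n_out rows of the unnormalized DCT-II matrix for length-n_in input."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi / n_in * (n + 0.5) * k)

def antf_features(A, X, n_ceps=16):
    """Project data onto the sparse subspace, then keep n_ceps DCT coefficients.

    A: nonnegative basis matrix (data dimension x R).
    X: data matrix with one column per observation.
    """
    A_hat = np.maximum(np.linalg.pinv(A), 0.0)   # assumed reading of [A^-1]_+
    X_s = A_hat @ X                              # sparse features, R x n_obs
    return dct_matrix(n_ceps, X_s.shape[0]) @ X_s

rng = np.random.default_rng(0)
A = rng.random((360, 200))   # stand-in for a learned basis
X = rng.random((360, 7))     # stand-in for seven 360-dimensional patches
print(antf_features(A, X).shape)
```

The resulting 16-dimensional vectors play the role of the cepstral features passed to the GMM recognizer.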

5 Conclusion

In this paper, we presented a novel speech feature extraction framework that is robust to noise at different SNRs, intended for identification systems operating under a wide variety of conditions. The approach is primarily data-driven and extracts a robust speech feature, called ANTF, that is largely insensitive to noise types and to interference of different intensities. We derived a new feature extraction method called cNTF for robust speaker identification. The research focuses mainly on encoding speech in a general higher order tensor structure in order to extract robust auditory-based features from interrelated feature subspaces. The frequency selectivity features at the basilar membrane and inner hair cells were used to represent the speech signals in the spectro-temporal domain, and the cNTF algorithm was then employed to extract the sparse tensor representation for robust speaker modeling. The discriminative and robust information of different speakers may be preserved after the projection onto the multiple related subspaces. Experiments on Aurora2 have shown that the new method improves noise robustness in comparison with baseline systems trained on the same amount of information.

Acknowledgment

The work was supported by the National High-Tech Research Program of China (Grant No. 2006AA01Z125) and the National Natural Science Foundation of China (Grant No. ).

References

1. Rabiner, L.R., Juang, B.: Fundamentals of Speech Recognition. Prentice Hall, New Jersey (1996)
2. Hermansky, H., Morgan, N.: RASTA Processing of Speech. IEEE Trans. Speech Audio Process. 2 (1994)
3. Reynolds, D.A., Quatieri, T.F., Dunn, R.: Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 10 (2000)
4. Reynolds, D.A.: Experimental Evaluation of Features for Robust Speaker Identification. IEEE Trans. Speech Audio Process. 2 (1994)
5. Berouti, M., Schwartz, R., Makhoul, J.: Enhancement of Speech Corrupted by Acoustic Noise. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 1979, vol. 4 (1979)
6. Hermus, K., Wambacq, P., Van hamme, H.: A Review of Signal Subspace Speech Enhancement and Its Application to Noise Robust Speech Recognition. EURASIP Journal on Applied Signal Processing 1 (2007)
7. Smith, E., Lewicki, M.S.: Efficient Auditory Coding. Nature 439 (2006)
8. Kim, T., Lee, S.Y.: Learning Self-organized Topology-preserving Complex Speech Features at Primary Auditory Cortex. Neurocomputing 65 (2005)
9. Cho, Y.C., Choi, S.: Nonnegative Features of Spectro-temporal Sounds for Classification. Pattern Recognition Letters 26 (2005)
10. Asari, H., Pearlmutter, B.A., Zador, A.M.: Sparse Representations for the Cocktail Party Problem. Journal of Neuroscience 26 (2006)

11. Carroll, J.D., Chang, J.J.: Analysis of Individual Differences in Multidimensional Scaling via an n-way Generalization of Eckart-Young Decomposition. Psychometrika 35 (1970)
12. Harshman, R.A.: Foundations of the PARAFAC Procedure: Models and Conditions for an Explanatory Multi-modal Factor Analysis. UCLA Working Papers in Phonetics 16, 1-84 (1970)
13. Bro, R.: PARAFAC: Tutorial and Applications. Chemometrics and Intelligent Laboratory Systems 38 (1997)
14. De Lathauwer, L., De Moor, B., Vandewalle, J.: A Multilinear Singular Value Decomposition. SIAM Journal on Matrix Analysis and Applications 21 (2000)
15. Kim, Y.D., Choi, S.: Nonnegative Tucker Decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2007, pp. 1-8 (2007)
16. Welling, M., Weber, M.: Positive Tensor Factorization. Pattern Recognition Letters 22 (2001)
17. Shashua, A., Hazan, T.: Non-negative Tensor Factorization with Applications to Statistics and Computer Vision. In: Proceedings of the International Conference on Machine Learning (ICML) (2005)
18. Vasilescu, M.A.O., Terzopoulos, D.: Multilinear Independent Components Analysis. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1 (2005)
19. Tao, D.C., Li, X.L., Wu, X.D., Maybank, S.J.: General Tensor Discriminant Analysis and Gabor Feature for Gait Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007)
20. Lee, D.D., Seung, H.S.: Algorithms for Non-negative Matrix Factorization. Advances in Neural Information Processing Systems 13 (2001)
21. Li, S.Z., Hou, X.W., Zhang, H.J., Cheng, Q.S.: Learning Spatially Localized, Parts-based Representation. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 1-6 (2001)
22. Pascual-Montano, A., Carazo, J.M., Kochi, K., Lehmann, D., Pascual-Marqui, R.D.: Nonsmooth Nonnegative Matrix Factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006)
