Nearest Neighbor Discriminant Analysis for Robust Speaker Recognition


INTERSPEECH 2014

Nearest Neighbor Discriminant Analysis for Robust Speaker Recognition

Seyed Omid Sadjadi, Jason W. Pelecanos, Weizhong Zhu
IBM Research, Watson Group

Abstract

With the advent of i-vectors, linear discriminant analysis (LDA) has become an integral part of many state-of-the-art speaker recognition systems. Here, LDA is primarily employed to annihilate the non-speaker related (e.g., channel) directions, thereby maximizing the inter-speaker separation. The traditional approach for computing the LDA transform uses parametric representations for both intra- and inter-speaker scatter matrices that are based on the Gaussian distribution assumption. However, it is known that the actual distribution of i-vectors may not necessarily be Gaussian, in particular in the presence of noise and channel distortions. Motivated by this observation, we present an alternative non-parametric discriminant analysis (NDA) technique that measures both the within- and between-speaker variation on a local basis using the nearest neighbor rule. The effectiveness of the NDA method is evaluated in the context of noisy speaker recognition tasks using speech material from the DARPA Robust Automatic Transcription of Speech (RATS) program. Experimental results indicate that NDA is more effective than the traditional parametric LDA for speaker recognition under noisy and channel degraded conditions.

Index Terms: discriminant analysis, i-vector, nearest neighbor, speaker recognition

1. Introduction

Speaker recognition has evolved significantly over the past few years. The research trend in this domain has gradually migrated from joint factor analysis (JFA) based methods, which attempt to model the speaker and channel subspaces separately [1], towards the i-vector approach that models both speaker and channel variabilities in a single low-dimensional (e.g., a few hundred) space termed the total variability subspace [2].
Accordingly, there has been growing interest in designing algorithms for the compensation of the channel subspace in the i-vector paradigm. Here, irrespective of the back-end model, linear discriminant analysis (LDA) with the Fisher criterion [3] has been the most commonly adopted approach for inter-session variability compensation. Current state-of-the-art speaker recognition systems employ the Fisher LDA as a preprocessor to generate dimensionality-reduced and channel-compensated features from i-vectors [4, 5, 6, 7, 8, 9, 10]. The dimensionality-reduced vectors can then be conveniently modeled and scored with various classifiers such as support vector machines (SVM) [2], probabilistic LDA (PLDA) [11, 12, 13], and the simple yet effective cosine distance (CD) based method [2]. The Fisher LDA aims at finding the most discriminative feature subset through a linear transformation of the original input space. Such a transformation attempts to maximize the between-class (or inter-speaker) scatter while minimizing the within-class variation. Traditionally, parametric within- and between-class scatter matrices are formed based on the Gaussian distribution assumption for the samples in each class. However, if the class-conditional distributions are non-Gaussian, one cannot expect the use of such parametric forms to result in proper feature subsets that are capable of preserving complex structures within the data needed for classification (e.g., multi-modality). It is well known in the speaker recognition community that the actual distribution of i-vectors may not necessarily be Gaussian [12]. This is particularly problematic when speech recordings are collected in the presence of noise and channel distortions.

(This work was supported in part by Contract No. D11PC20192 DOI/NBC under the RATS program. The views, opinions, findings and recommendations contained in this article are those of the authors and do not necessarily reflect the official policy or position of the DOI/NBC.)
Hence, despite its popularity, the parametric LDA may not be the best choice here. To cope with the above noted issue associated with the parametric nature of the scatter matrices, in the seminal work of [14], a non-parametric discriminant analysis (NDA) approach was proposed for general two-class pattern recognition problems. It was later extended to multi-class problems and successfully applied in several other studies for face recognition tasks as well [15, 16]. NDA measures both the within- and between-class scatter matrices on a local basis using the k-nearest neighbor (k-NN) rule and, unlike LDA, is generally of full rank. Note that for a C class problem, the parametric LDA can provide at most C − 1 discriminant features (i.e., the number of classes minus 1). Nonetheless, this is less problematic in speaker recognition because typically the number of speakers in the training set exceeds the dimensionality of the total variability subspace. The non-parametric nature of the scatter matrices in NDA inherently results in features that can preserve the local structure (e.g., the class boundaries) within the data, which is important for classification.

In this study, we investigate the application of NDA for robust speaker recognition. In particular, we evaluate the effectiveness of NDA against the traditional LDA in the context of speaker recognition tasks under actual noisy and channel degraded conditions using speech material from the DARPA program, Robust Automatic Transcription of Speech (RATS). The RATS data, which is distributed by the Linguistic Data Consortium (LDC) [17], consists of conversational telephone speech (CTS) recordings that have been retransmitted (through LDC's Multi Radio-Link Channel Collection System) and captured over 8 extremely degraded communication channels with distinct distortion characteristics. The distortion type seen in RATS data is nonlinear and the noise is to some extent correlated with speech.
We conduct our speaker experiments with a state-of-the-art i-vector based system [10] and report the false-reject (miss) rate at a 2.5% false-alarm (FA) rate as the performance metric. We also explore the impact of different configuration parameters for NDA (e.g., the feature dimensionality as well as the number of neighbors in the k-NN analysis) on speaker recognition performance.

Copyright 2014 ISCA. September 2014, Singapore.

2. I-vector feature extraction

In this section we briefly describe the total variability modeling concept, which was inspired by the JFA approach that attempts to estimate separate subspaces for speaker and channel variabilities. Unlike JFA, the total variability approach assumes that both speaker and channel variabilities reside in the same low-dimensional subspace. This concept is closely related to the eigenvoice adaptation technique developed for automatic speech recognition (ASR), first proposed in [18] based on principal component analysis (PCA), and later in [19] based on probabilistic PCA (PPCA) [20]. Given a set of observations (i.e., frames) for each speech session, the eigenvoice adaptation technique computes an offset (linear shift) in the prior acoustic space represented by the Gaussian mixture model (GMM) mean supervectors obtained from a universal background model (UBM). The key idea here is that variability within and across sessions can be described via a small set of parameters (a.k.a. factors) in a low-dimensional subspace spanned by the columns of a low-rank rectangular matrix, T, entitled the total variability matrix. Mathematically, the adapted mean supervector, M, for a given set of observations can be modeled as

M = m + Tx + ε,   (1)

where m is the prior mean supervector, x ~ N(0, I) is a multivariate random variable termed an identity vector, or i-vector, and ε ~ N(0, Σ) is a residual noise term to account for the variability not captured via T (Σ is typically copied from the UBM). In other words, for the given observation set, the i-vector represents the coordinates in the total variability subspace. The procedure for training the i-vector extractor (i.e., the T matrix) is similar to the eigenvoice learning process described in [19] (a simplified version is presented in [21]), except each speech recording (session) is assumed to be produced by a unique speaker.
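Given the model in (1), the i-vector for an utterance has a standard closed-form posterior-mean (MAP) point estimate computed from the zeroth- and first-order Baum-Welch statistics. The paper does not spell out this step, so the sketch below is our own minimal illustration under a diagonal Σ, with all variable names hypothetical:

```python
import numpy as np

def extract_ivector(N, f, T, Sigma):
    """MAP point estimate of the i-vector x for one utterance.

    N     : (CF,) zeroth-order stats replicated per supervector dim
    f     : (CF,) centered first-order stats (F - N * m)
    T     : (CF, R) total variability matrix
    Sigma : (CF,) diagonal UBM covariances
    """
    TtSi = T.T / Sigma                         # T^T Sigma^{-1}, shape (R, CF)
    L = np.eye(T.shape[1]) + (TtSi * N) @ T    # posterior precision matrix
    return np.linalg.solve(L, TtSi @ f)        # posterior mean = i-vector
```

The posterior precision L grows with the amount of speech (through N), so short utterances yield i-vector estimates shrunk toward the prior mean, which is one reason duration mismatch matters in Section 5.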
3. Linear discriminant analysis (LDA)

LDA is widely adopted in pattern recognition problems as a preprocessing stage for feature selection and dimensionality reduction. It computes an optimum linear projection A: R^d → R^n by maximizing the ratio of the inter-class scatter to the intra-class variance:

y = A^T x,   (2)

where A is a rectangular matrix with n linearly independent columns. Here, the within- and between-class scatter matrices are used to formulate a class separability criterion which converts the matrices into a single statistic. This statistic takes on larger values when the between-class scatter is larger and the within-class variance is smaller. Several such class separability criteria are described in [22], of which the following is the most widely used,

Â = argmax_{A^T S_w A = I} tr(A^T S_b A),   (3)

where S_b and S_w denote the between- and within-class scatter matrices, respectively. The optimization problem in (3) has an analytical solution: a matrix whose columns are the n eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b.

(The term i-vector sometimes also refers to a vector of intermediate size, bigger than the underlying cepstral feature vector but much smaller than the GMM supervector.)

The within-class scatter matrix measures the scatter of samples in each class around the expected value of that class,

S_w = Σ_{i=1}^{C} p_i E[(x − µ_i)(x − µ_i)^T | C_i] = Σ_{i=1}^{C} p_i Σ_i,   (4)

where p_i, µ_i, and Σ_i are the a priori probability (proportional to the number of sessions per speaker), expected value, and covariance matrix for class i. The between-class scatter matrix, on the other hand, measures the scatter of class-conditional expected values around the global mean,

S_b = Σ_{i=1}^{C} p_i (µ_i − µ)(µ_i − µ)^T,   (5)

where µ is the expected value of the training samples, computed as

µ = E[x] = Σ_{i=1}^{C} p_i µ_i.   (6)

There are three disadvantages associated with the parametric nature of the scatter matrices in (4) and (5). First, the underlying distribution of classes is assumed to be Gaussian with a common covariance matrix for all classes.
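The optimization in (3), with the scatter matrices of (4)-(6), reduces to a generalized symmetric eigenproblem S_b a = λ S_w a. A minimal sketch of this computation (the small ridge term on S_w is our own addition for numerical stability, not part of the formulation above):

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(X, labels, n_components):
    """Fisher LDA projection matrix A solving eq. (3).

    X : (N, d) row vectors; labels : (N,) class ids.
    Returns A with n_components columns (eigenvectors of the
    generalized problem S_b a = lambda S_w a).
    """
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        p = len(Xc) / len(X)                       # a priori probability, eq. (4)
        mu_c = Xc.mean(axis=0)
        Sw += p * np.cov(Xc, rowvar=False, bias=True)
        Sb += p * np.outer(mu_c - mu, mu_c - mu)   # eq. (5)
    # eigh solves S_b a = lambda S_w a; eigenvalues come back ascending
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, ::-1][:, :n_components]
```

Note that `eigh` normalizes the eigenvectors so that A^T S_w A = I, matching the constraint in (3).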
Therefore, one cannot expect the parametric LDA to generalize well to non-Gaussian and multi-modal (as opposed to unimodal) distributions. Second, notice that the rank of S_b is C − 1, which means the parametric LDA can provide at most C − 1 discriminant features. However, this may not be sufficient in applications such as language recognition where the number of language classes is much smaller than the dimensionality of the i-vectors (for instance, there are only 5 target language categories in the RATS program). Nevertheless, this may not pose a challenge for speaker recognition tasks in which the number of training speakers exceeds the dimensionality of the total variability subspace. Finally, because only the class centroids are taken into account for computing S_b in (5), the parametric LDA cannot effectively capture the boundary structure between adjacent classes, which is essential for classification [22]. To overcome the above noted limitations of LDA, a nonparametric discriminant analysis technique was proposed in [14] that measures both the within- and between-class scatter on a local basis using a nearest neighbor rule. We provide a brief description of NDA in the next section.

4. Nonparametric discriminant analysis

To remedy the limitations identified for LDA, a nonparametric discriminant analysis technique was proposed in [14]. In NDA, the expected values that represent the global information about each class are replaced with local sample averages computed based on the k-NN of individual samples. More specifically, in the NDA approach, the between-class scatter matrix is defined as

S_b = Σ_{i=1}^{C} Σ_{j=1, j≠i}^{C} Σ_{l=1}^{N_i} w_{ij}^l (x_l^i − M_{ij}^l)(x_l^i − M_{ij}^l)^T,   (7)

where x_l^i denotes the l-th sample from class i, and M_{ij}^l is the local mean of the K-NN samples for x_l^i from class j, which is computed as

M_{ij}^l = (1/K) Σ_{k=1}^{K} NN_k(x_l^i, j),   (8)

Figure 1: Symbolic example illustrating the parametric versus nonparametric scatter between two classes. v_1 represents the global gradient of the class centroids. The vectors {v_2, ..., v_6} represent the local gradients.

where NN_k(x_l^i, j) is the k-th nearest neighbor of x_l^i in class j. The weighting function w_{ij}^l in (7) is defined as

w_{ij}^l = min{ d^α(x_l^i, NN_K(x_l^i, i)), d^α(x_l^i, NN_K(x_l^i, j)) } / ( d^α(x_l^i, NN_K(x_l^i, i)) + d^α(x_l^i, NN_K(x_l^i, j)) ),   (9)

where α ∈ R is a constant between zero and infinity, and d(·) denotes the Euclidean distance. The weighting function is introduced in (7) to deemphasize the local gradients that are large in magnitude, so as to mitigate their influence on the scatter matrix. The weight parameters approach 0.5 for samples near the classification boundary (e.g., see {v_2, v_3, v_5, v_6} in Figure 1), while dropping off to 0 for samples that are far from the boundary (e.g., see v_4 in Figure 1). The control parameter α determines how rapidly such decay in the weights occurs. The nonparametric within-class scatter matrix, S_w, is computed in a similar fashion as in (7), except the weighting function is set to 1 and the local gradients are computed within each class. The NDA transform is then formed by calculating the eigenvectors of S_w^{-1} S_b.

Three important observations can be made from a careful examination of the nonparametric between-class scatter matrix in (7). First, notice that as the number of nearest neighbors, K, approaches N_j, the total number of samples in class j, the local mean vector, M_{ij}^l, approaches the global mean of class j (i.e., µ_j). In this scenario, if we set the weight parameters to 1, the NDA transform essentially becomes the LDA projection, which means LDA is a special case of the more general NDA. Second, because all the samples are taken into account for the calculation of the nonparametric between-class scatter matrix (as opposed to only the class centroids), S_b is generally of full rank.
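Equations (7)-(9) can be written down almost verbatim in code. The brute-force sketch below is our own unoptimized illustration for small data sets; a practical system would use an efficient neighbor search and the one-versus-rest strategy mentioned in Section 5:

```python
import numpy as np

def nda_between_scatter(X, labels, K=3, alpha=2.0):
    """Nonparametric between-class scatter S_b of eqs. (7)-(9).

    For every sample x_l^i and every other class j, the class
    centroid is replaced by M_ij^l, the mean of the K nearest
    neighbors of x_l^i in class j; the pair is weighted by w_ij^l,
    which approaches 0.5 near the class boundary and 0 far from it.
    """
    d = X.shape[1]
    Sb = np.zeros((d, d))
    classes = np.unique(labels)
    for i in classes:
        Xi = X[labels == i]
        for j in classes:
            if j == i:
                continue
            Xj = X[labels == j]
            for x in Xi:
                # distance to the K-th NN in the own class
                # (index 0 of the sorted list is x itself)
                d_own = np.sort(np.linalg.norm(Xi - x, axis=1))[K]
                order = np.argsort(np.linalg.norm(Xj - x, axis=1))
                d_other = np.linalg.norm(Xj[order[K - 1]] - x)
                M = Xj[order[:K]].mean(axis=0)     # local mean, eq. (8)
                a, b = d_own ** alpha, d_other ** alpha
                w = min(a, b) / (a + b)            # weight, eq. (9)
                Sb += w * np.outer(x - M, x - M)   # eq. (7)
    return Sb
```

Setting K to the class size and w to 1 recovers the parametric S_b of (5) up to scaling, which is the sense in which LDA is a special case of NDA.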
This means that unlike LDA, which provides at most C − 1 discriminant features, NDA generally results in d-dimensional vectors (assuming a d-dimensional input space) for classification. As we discussed before, this is of great importance for applications such as language recognition, where the number of classes is much smaller than the dimensionality of the total subspace (or the input space in general). Finally, compared to LDA, NDA is more effective in preserving the complex structure (i.e., local and boundary structure) within and across different classes. As seen from the example shown in Figure 1 (where k is set to 1 for simplicity), LDA only uses the global gradient obtained with the centroids of the two classes (i.e., v_1) to measure the between-class scatter. On the other hand, NDA uses the local gradients (i.e., {v_2, ..., v_6}) that are emphasized along the boundary through the weighting function, w_{ij}^l. Hence, the boundary information becomes embedded into the resulting transformation.

5. Experiments

This section provides a description of our experimental setup, including the speech data as well as the speaker recognition system used in our evaluations. We conduct our speaker recognition experiments using actual noisy and channel degraded speech material available from the DARPA RATS program, which is distributed by the LDC [17]. The RATS data consists of CTS recordings that have been retransmitted and captured over 8 extremely degraded high-frequency (HF) radio channels, labeled A-H, with distinct noise characteristics. The type of distortion seen in RATS data is nonlinear (e.g., akin to clipping as well as amplitude compression effects) and the noise is to some extent correlated with speech. A total of five data releases are available from the LDC for the RATS speaker recognition task: LDC2012E49, LDC2012E63, LDC2012E69, LDC2012E85, and LDC2012E117, which contain speech spoken in five languages: Levantine Arabic, Dari, Farsi, Pashto, and Urdu.
We partitioned these data into two sets for our system training and evaluation (the latter consisting of enrollment and test). In the RATS program a 6-sided speaker enrollment scenario is assumed, and we follow this assumption in our development setup. For system evaluation, there are 8 duration-specific tasks, of which we report results for the following enrollment-test conditions: 120s, 30s, 10s, and 3s. It should be noted here that for the 120s condition, there are 6 enrollment sessions of 120s of speech and one test session of 120s of speech; this applies similarly to the other durations. The total numbers of trials for these conditions are 333k, 163k, 173k, and 172k, respectively. It is worth remarking that there are no cross-gender or cross-language trials in our evaluations.

For speech parameterization, we extract 19-dimensional power normalized cepstral coefficients (PNCC) [23] from 20 ms frames every 10 ms using a 24-channel Gammatone filterbank spanning the frequency range Hz. The first and second temporal cepstral derivatives are also computed over a 5-frame window and appended to the static features to capture the dynamic pattern of speech over time. This results in 57-dimensional feature vectors. For non-speech frame dropping, we employ an unsupervised speech activity detector (SAD) that generates frame-level decisions using multiple thresholds set on various basic speech features, including frame log-energy, spectral divergence, and signal-to-noise ratio. After dropping the non-speech frames, global (utterance-level) cepstral mean and variance normalization (CMVN) is applied to suppress the short-term linear channel effects.

We perform our experiments in the context of a state-of-the-art i-vector based speaker recognition system [10]. To learn the i-vector extractor, a gender-independent 1024-component GMM-UBM with diagonal covariance matrices is trained using a subset of the development set.
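The utterance-level CMVN step described above is simple enough to state exactly; the floor on the standard deviation is our own guard against constant feature dimensions:

```python
import numpy as np

def cmvn(features):
    """Global (utterance-level) cepstral mean and variance
    normalization: each feature dimension is shifted to zero mean
    and scaled to unit variance over the retained speech frames.

    features : (num_frames, num_dims) array of cepstral vectors.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, 1e-8)
```

Because the statistics are computed per utterance after SAD, a stationary linear channel (a fixed convolutive offset in the cepstral domain) is removed along with the mean.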
The zeroth and first order Baum-Welch statistics are then computed for each recording and used to learn a 400-dimensional total variability subspace. After extracting 400-dimensional i-vectors, we use either LDA or NDA for inter-session variability compensation. The dimensionality-reduced i-vectors are then centered (the mean is removed) and unit-length normalized. For scoring, a Gaussian PLDA model with a full covariance residual noise term [11, 13] is learned using the i-vectors extracted from 13,158 speech segments from 743 unique speakers. The eigenvoice subspace in the PLDA model is assumed full-rank. We include segments from the 120s, 30s and 10s cuts in the PLDA training to expose our model to the duration conditions seen in the evaluation data. The number of sessions per speaker ranges from 5 to 25 segments, and we employ a one-versus-rest strategy to compute the inter-speaker scatter matrix in (7). This provides flexibility on the number of nearest neighbors used for computing the local means.

Figure 2: Normalized and sorted eigenvalues of S_w^{-1} S_b for the LDA transform (dashed) and for the NDA transform (solid), obtained with our development i-vectors.

6. Results

In this section we summarize the results obtained with the experimental setup presented in Section 5. Figure 2 shows the normalized and sorted eigenvalues of S_w^{-1} S_b for LDA (dashed) and for NDA (solid), obtained with our development i-vectors. Comparing the two curves, it is seen that the decay in the eigenvalues of the LDA transform occurs more rapidly than that of the NDA transform. In other words, the speaker-discriminative information for LDA is confined within a lower dimensional subspace compared to NDA. Recall from our discussion in Section 4 that in NDA the complex local structure (as opposed to the global mean scatter) within the data is preserved and the class boundary information is embedded into the transform. Our hypothesis is that a larger subspace is required for NDA to properly represent such a complex structure. Accordingly, we expect NDA to perform best at larger feature dimensions (in the transformed space) compared to LDA.

Results of our speaker recognition experiments with LDA and NDA (K = 9) on RATS data are summarized in Tables 1 and 2, respectively. The results are reported in terms of false-reject (miss) rate at 2.5% false-alarm (FA) rate (miss@2.5%FA), which is a RATS program metric for the speaker recognition task. These results are generated by varying the dimensionality of the transformed subspace from 150 to 400 with an increment of 50. Two important observations can be made from the results presented in the tables. First, irrespective of the dimensionality of the feature subspace, NDA consistently performs better than LDA across the four duration-specific conditions.
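The curves in Figure 2 are simply the eigenvalues of S_w^{-1} S_b, sorted in descending order and normalized by the largest one. A sketch of that computation (the scatter matrices here are placeholders; in the experiment they come from the development i-vectors):

```python
import numpy as np

def normalized_eigenvalue_spectrum(Sw, Sb):
    """Sorted (descending) eigenvalues of S_w^{-1} S_b, normalized
    by the largest eigenvalue, as plotted in Figure 2."""
    vals = np.linalg.eigvals(np.linalg.solve(Sw, Sb))
    vals = np.sort(np.abs(vals))[::-1]   # eigenvalues of this product are real non-negative
    return vals / vals[0]
```

A slowly decaying spectrum indicates that discriminative energy is spread over many dimensions, which is the behavior reported for NDA.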
As we discussed before, this is due to the nonparametric representations for the scatter matrices in NDA, which make no assumption regarding the underlying class-conditional distributions. In addition, NDA is more effective in capturing the local structure and boundary information within and across different speakers. Second, LDA performs best with a 200-dimensional feature space, while the best performance for NDA is achieved with a 250-dimensional transform. As noted above, more dimensions are required for NDA to capture the boundary structure and local information.

We also investigated the impact of the number of nearest neighbors used in computing NDA on speaker recognition performance. The results are provided in Table 3 for K ranging from 3 to 13 with an increment of 2. Here, the dimensionality of the transformed subspace is fixed at 250. For the longer duration tasks (i.e., 120s and 30s), the best performance is obtained with K = 9, while for the 10s and 3s tasks the best performance is achieved with K = 7. It is evident that, for the best performance to be achieved, the number of nearest neighbors, K, should be tuned. In practice, this is typically accomplished using a development set.

Table 1: Speaker recognition performance across the four enrollment-test conditions with LDA and different feature dimensions ranging from 150 to 400, in terms of percent miss@2.5%FA.

Table 2: Speaker recognition performance across the four enrollment-test conditions with NDA (K = 9) and different feature dimensions ranging from 150 to 400.

Table 3: Speaker recognition performance for different numbers of nearest neighbors, K, in NDA with 250-dimensional features.

7. Conclusions

LDA has become an integral part of many state-of-the-art speaker recognition systems for inter-session variability compensation. Given that the actual distribution of i-vectors may not necessarily be Gaussian, as well as the limitations identified for the parametric LDA, we presented an alternative nonparametric discriminant analysis (NDA) technique that measures both the within- and between-speaker variation on a local basis using the nearest neighbor rule. Unlike LDA, the NDA approach makes no specific assumption regarding the underlying class-conditional distributions. To evaluate the efficacy of NDA, we conducted speaker recognition experiments using actual noisy and channel degraded data from the RATS program. Experimental results indicated the effectiveness of NDA over LDA for speaker recognition tasks. A clear advantage of NDA over LDA is that it is generally of full rank, making it attractive for speech applications (such as language recognition) with a limited number of classes. Our preliminary experiments also confirm the effectiveness of NDA for RATS language recognition tasks, where the number of language classes is much smaller than the dimensionality of the i-vectors.

8. References

[1] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 4.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 4.
[3] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2.
[4] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, and J. Černocký, "Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification," in Proc. IEEE ICASSP, Prague, Czech Republic, May 2011.
[5] A. Kanagasundaram, D. B. Dean, R. J. Vogt, M. McLaren, S. Sridharan, and M. Mason, "Weighted LDA techniques for i-vector based speaker verification," in Proc. IEEE ICASSP, Kyoto, Japan, March 2012.
[6] A. Kanagasundaram, D. B. Dean, S. Sridharan, and R. J. Vogt, "PLDA based speaker verification with weighted LDA techniques," in Proc. Odyssey 2012, Singapore, June 2012.
[7] J. Yang, C. Liang, L. Yang, H. Suo, J. Wang, and Y. Yan, "Factor analysis of Laplacian approach for speaker recognition," in Proc. IEEE ICASSP, Kyoto, Japan, March 2012.
[8] M. McLaren and D. van Leeuwen, "Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources," IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 3.
[9] M. McLaren, N. Scheffer, M. Graciarena, L. Ferrer, and Y. Lei, "Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion," in Proc. IEEE ICASSP, Vancouver, BC, May 2013.
[10] W. Zhu, S. Yaman, and J. Pelecanos, "The IBM RATS phase II speaker recognition system: overview and analysis," in Proc. INTERSPEECH, Lyon, France, August 2013.
[11] S. J. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. IEEE ICCV, Rio de Janeiro, October 2007.
[12] P. Kenny, "Bayesian speaker verification with heavy tailed priors," in Proc. Odyssey 2010, Brno, Czech Republic, June 2010.
[13] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. INTERSPEECH, Florence, Italy, August 2011.
[14] K. Fukunaga and J. Mantock, "Nonparametric discriminant analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 5, no. 6.
[15] M. Bressan and J. Vitrià, "Nonparametric discriminant analysis and nearest neighbor classification," Pattern Recognition Lett., vol. 24, no. 15.
[16] Z. Li, D. Lin, and X. Tang, "Nonparametric discriminant analysis for face recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 4.
[17] K. Walker and S. Strassel, "The RATS radio traffic collection system," in Proc. Odyssey 2012, Singapore, June 2012.
[18] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech Audio Process., vol. 8, no. 6.
[19] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Trans. Speech Audio Process., vol. 13, no. 3.
[20] M. E. Tipping and C. M. Bishop, "Probabilistic principal component analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 61, no. 3.
[21] D. Matrouf, N. Scheffer, B. G. Fauve, and J.-F. Bonastre, "A straightforward and efficient implementation of the factor analysis model for speaker verification," in Proc. INTERSPEECH, Antwerp, Belgium, August 2007.
[22] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic Press.
[23] C. Kim and R. M. Stern, "Power-normalized cepstral coefficients (PNCC) for robust speech recognition," in Proc. IEEE ICASSP, Kyoto, Japan, March 2012.


More information

Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1

Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rules 1 Steepest Descent Adaptation of Min-Max Fuzzy If-Then Rues 1 R.J. Marks II, S. Oh, P. Arabshahi Λ, T.P. Caude, J.J. Choi, B.G. Song Λ Λ Dept. of Eectrica Engineering Boeing Computer Services University

More information

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations

Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations Optimaity of Inference in Hierarchica Coding for Distributed Object-Based Representations Simon Brodeur, Jean Rouat NECOTIS, Département génie éectrique et génie informatique, Université de Sherbrooke,

More information

Session Variability Compensation in Automatic Speaker Recognition

Session Variability Compensation in Automatic Speaker Recognition Session Variability Compensation in Automatic Speaker Recognition Javier González Domínguez VII Jornadas MAVIR Universidad Autónoma de Madrid November 2012 Outline 1. The Inter-session Variability Problem

More information

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017

In-plane shear stiffness of bare steel deck through shell finite element models. G. Bian, B.W. Schafer. June 2017 In-pane shear stiffness of bare stee deck through she finite eement modes G. Bian, B.W. Schafer June 7 COLD-FORMED STEEL RESEARCH CONSORTIUM REPORT SERIES CFSRC R-7- SDII Stee Diaphragm Innovation Initiative

More information

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c)

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c) A Simpe Efficient Agorithm of 3-D Singe-Source Locaization with Uniform Cross Array Bing Xue a * Guangyou Fang b Yicai Ji c Key Laboratory of Eectromagnetic Radiation Sensing Technoogy, Institute of Eectronics,

More information

arxiv: v1 [cs.lg] 31 Oct 2017

arxiv: v1 [cs.lg] 31 Oct 2017 ACCELERATED SPARSE SUBSPACE CLUSTERING Abofaz Hashemi and Haris Vikao Department of Eectrica and Computer Engineering, University of Texas at Austin, Austin, TX, USA arxiv:7.26v [cs.lg] 3 Oct 27 ABSTRACT

More information

SVM-based Supervised and Unsupervised Classification Schemes

SVM-based Supervised and Unsupervised Classification Schemes SVM-based Supervised and Unsupervised Cassification Schemes LUMINITA STATE University of Pitesti Facuty of Mathematics and Computer Science 1 Targu din Vae St., Pitesti 110040 ROMANIA state@cicknet.ro

More information

Formulas for Angular-Momentum Barrier Factors Version II

Formulas for Angular-Momentum Barrier Factors Version II BNL PREPRINT BNL-QGS-06-101 brfactor1.tex Formuas for Anguar-Momentum Barrier Factors Version II S. U. Chung Physics Department, Brookhaven Nationa Laboratory, Upton, NY 11973 March 19, 2015 abstract A

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

Statistical Learning Theory: a Primer

Statistical Learning Theory: a Primer ??,??, 1 6 (??) c?? Kuwer Academic Pubishers, Boston. Manufactured in The Netherands. Statistica Learning Theory: a Primer THEODOROS EVGENIOU AND MASSIMILIANO PONTIL Center for Bioogica and Computationa

More information

STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM 1. INTRODUCTION

STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM 1. INTRODUCTION Journa of Sound and Vibration (996) 98(5), 643 65 STABILITY OF A PARAMETRICALLY EXCITED DAMPED INVERTED PENDULUM G. ERDOS AND T. SINGH Department of Mechanica and Aerospace Engineering, SUNY at Buffao,

More information

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with?

Bayesian Learning. You hear a which which could equally be Thanks or Tanks, which would you go with? Bayesian Learning A powerfu and growing approach in machine earning We use it in our own decision making a the time You hear a which which coud equay be Thanks or Tanks, which woud you go with? Combine

More information

Speaker Verification Using Accumulative Vectors with Support Vector Machines

Speaker Verification Using Accumulative Vectors with Support Vector Machines Speaker Verification Using Accumulative Vectors with Support Vector Machines Manuel Aguado Martínez, Gabriel Hernández-Sierra, and José Ramón Calvo de Lara Advanced Technologies Application Center, Havana,

More information

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones

ASummaryofGaussianProcesses Coryn A.L. Bailer-Jones ASummaryofGaussianProcesses Coryn A.L. Baier-Jones Cavendish Laboratory University of Cambridge caj@mrao.cam.ac.uk Introduction A genera prediction probem can be posed as foows. We consider that the variabe

More information

Speaker recognition by means of Deep Belief Networks

Speaker recognition by means of Deep Belief Networks Speaker recognition by means of Deep Belief Networks Vasileios Vasilakakis, Sandro Cumani, Pietro Laface, Politecnico di Torino, Italy {first.lastname}@polito.it 1. Abstract Most state of the art speaker

More information

Adjustment of automatic control systems of production facilities at coal processing plants using multivariant physico- mathematical models

Adjustment of automatic control systems of production facilities at coal processing plants using multivariant physico- mathematical models IO Conference Series: Earth and Environmenta Science AER OEN ACCESS Adjustment of automatic contro systems of production faciities at coa processing pants using mutivariant physico- mathematica modes To

More information

A. Distribution of the test statistic

A. Distribution of the test statistic A. Distribution of the test statistic In the sequentia test, we first compute the test statistic from a mini-batch of size m. If a decision cannot be made with this statistic, we keep increasing the mini-batch

More information

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION

NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION NEW DEVELOPMENT OF OPTIMAL COMPUTING BUDGET ALLOCATION FOR DISCRETE EVENT SIMULATION Hsiao-Chang Chen Dept. of Systems Engineering University of Pennsyvania Phiadephia, PA 904-635, U.S.A. Chun-Hung Chen

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Active Learning & Experimental Design

Active Learning & Experimental Design Active Learning & Experimenta Design Danie Ting Heaviy modified, of course, by Lye Ungar Origina Sides by Barbara Engehardt and Aex Shyr Lye Ungar, University of Pennsyvania Motivation u Data coection

More information

SVM: Terminology 1(6) SVM: Terminology 2(6)

SVM: Terminology 1(6) SVM: Terminology 2(6) Andrew Kusiak Inteigent Systems Laboratory 39 Seamans Center he University of Iowa Iowa City, IA 54-57 SVM he maxima margin cassifier is simiar to the perceptron: It aso assumes that the data points are

More information

Traffic data collection

Traffic data collection Chapter 32 Traffic data coection 32.1 Overview Unike many other discipines of the engineering, the situations that are interesting to a traffic engineer cannot be reproduced in a aboratory. Even if road

More information

Fitting Algorithms for MMPP ATM Traffic Models

Fitting Algorithms for MMPP ATM Traffic Models Fitting Agorithms for PP AT Traffic odes A. Nogueira, P. Savador, R. Vaadas University of Aveiro / Institute of Teecommunications, 38-93 Aveiro, Portuga; e-mai: (nogueira, savador, rv)@av.it.pt ABSTRACT

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain

Alberto Maydeu Olivares Instituto de Empresa Marketing Dept. C/Maria de Molina Madrid Spain CORRECTIONS TO CLASSICAL PROCEDURES FOR ESTIMATING THURSTONE S CASE V MODEL FOR RANKING DATA Aberto Maydeu Oivares Instituto de Empresa Marketing Dept. C/Maria de Moina -5 28006 Madrid Spain Aberto.Maydeu@ie.edu

More information

Two view learning: SVM-2K, Theory and Practice

Two view learning: SVM-2K, Theory and Practice Two view earning: SVM-2K, Theory and Practice Jason D.R. Farquhar jdrf99r@ecs.soton.ac.uk Hongying Meng hongying@cs.york.ac.uk David R. Hardoon drh@ecs.soton.ac.uk John Shawe-Tayor jst@ecs.soton.ac.uk

More information

Lecture 6: Moderately Large Deflection Theory of Beams

Lecture 6: Moderately Large Deflection Theory of Beams Structura Mechanics 2.8 Lecture 6 Semester Yr Lecture 6: Moderatey Large Defection Theory of Beams 6.1 Genera Formuation Compare to the cassica theory of beams with infinitesima deformation, the moderatey

More information

A Separability Index for Distance-based Clustering and Classification Algorithms

A Separability Index for Distance-based Clustering and Classification Algorithms IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL.?, NO.?, JANUARY 1? 1 A Separabiity Index for Distance-based Custering and Cassification Agorithms Arka P. Ghosh, Ranjan Maitra and Anna

More information

Two-sample inference for normal mean vectors based on monotone missing data

Two-sample inference for normal mean vectors based on monotone missing data Journa of Mutivariate Anaysis 97 (006 6 76 wwweseviercom/ocate/jmva Two-sampe inference for norma mean vectors based on monotone missing data Jianqi Yu a, K Krishnamoorthy a,, Maruthy K Pannaa b a Department

More information

LINEAR DETECTORS FOR MULTI-USER MIMO SYSTEMS WITH CORRELATED SPATIAL DIVERSITY

LINEAR DETECTORS FOR MULTI-USER MIMO SYSTEMS WITH CORRELATED SPATIAL DIVERSITY LINEAR DETECTORS FOR MULTI-USER MIMO SYSTEMS WITH CORRELATED SPATIAL DIVERSITY Laura Cottateucci, Raf R. Müer, and Mérouane Debbah Ist. of Teecommunications Research Dep. of Eectronics and Teecommunications

More information

Deakin Research Online

Deakin Research Online Deakin Research Onine This is the pubished version: Rana, Santu, Liu, Wanquan, Lazarescu, Mihai and Venkatesh, Svetha 2008, Recognising faces in unseen modes : a tensor based approach, in CVPR 2008 : Proceedings

More information

Process Capability Proposal. with Polynomial Profile

Process Capability Proposal. with Polynomial Profile Contemporary Engineering Sciences, Vo. 11, 2018, no. 85, 4227-4236 HIKARI Ltd, www.m-hikari.com https://doi.org/10.12988/ces.2018.88467 Process Capabiity Proposa with Poynomia Profie Roberto José Herrera

More information

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled.

Introduction. Figure 1 W8LC Line Array, box and horn element. Highlighted section modelled. imuation of the acoustic fied produced by cavities using the Boundary Eement Rayeigh Integra Method () and its appication to a horn oudspeaer. tephen Kirup East Lancashire Institute, Due treet, Bacburn,

More information

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers

Ant Colony Algorithms for Constructing Bayesian Multi-net Classifiers Ant Coony Agorithms for Constructing Bayesian Muti-net Cassifiers Khaid M. Saama and Aex A. Freitas Schoo of Computing, University of Kent, Canterbury, UK. {kms39,a.a.freitas}@kent.ac.uk December 5, 2013

More information

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems

Bayesian Unscented Kalman Filter for State Estimation of Nonlinear and Non-Gaussian Systems Bayesian Unscented Kaman Fiter for State Estimation of Noninear and Non-aussian Systems Zhong Liu, Shing-Chow Chan, Ho-Chun Wu and iafei Wu Department of Eectrica and Eectronic Engineering, he University

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

TRAVEL TIME ESTIMATION FOR URBAN ROAD NETWORKS USING LOW FREQUENCY PROBE VEHICLE DATA

TRAVEL TIME ESTIMATION FOR URBAN ROAD NETWORKS USING LOW FREQUENCY PROBE VEHICLE DATA TRAVEL TIME ESTIMATIO FOR URBA ROAD ETWORKS USIG LOW FREQUECY PROBE VEHICLE DATA Erik Jeneius Corresponding author KTH Roya Institute of Technoogy Department of Transport Science Emai: erik.jeneius@abe.kth.se

More information

Statistical Inference, Econometric Analysis and Matrix Algebra

Statistical Inference, Econometric Analysis and Matrix Algebra Statistica Inference, Econometric Anaysis and Matrix Agebra Bernhard Schipp Water Krämer Editors Statistica Inference, Econometric Anaysis and Matrix Agebra Festschrift in Honour of Götz Trenker Physica-Verag

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems

Componentwise Determination of the Interval Hull Solution for Linear Interval Parameter Systems Componentwise Determination of the Interva Hu Soution for Linear Interva Parameter Systems L. V. Koev Dept. of Theoretica Eectrotechnics, Facuty of Automatics, Technica University of Sofia, 1000 Sofia,

More information

Two-Stage Least Squares as Minimum Distance

Two-Stage Least Squares as Minimum Distance Two-Stage Least Squares as Minimum Distance Frank Windmeijer Discussion Paper 17 / 683 7 June 2017 Department of Economics University of Bristo Priory Road Compex Bristo BS8 1TU United Kingdom Two-Stage

More information

AST 418/518 Instrumentation and Statistics

AST 418/518 Instrumentation and Statistics AST 418/518 Instrumentation and Statistics Cass Website: http://ircamera.as.arizona.edu/astr_518 Cass Texts: Practica Statistics for Astronomers, J.V. Wa, and C.R. Jenkins, Second Edition. Measuring the

More information

Nonlinear Gaussian Filtering via Radial Basis Function Approximation

Nonlinear Gaussian Filtering via Radial Basis Function Approximation 51st IEEE Conference on Decision and Contro December 10-13 01 Maui Hawaii USA Noninear Gaussian Fitering via Radia Basis Function Approximation Huazhen Fang Jia Wang and Raymond A de Caafon Abstract This

More information

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University

Turbo Codes. Coding and Communication Laboratory. Dept. of Electrical Engineering, National Chung Hsing University Turbo Codes Coding and Communication Laboratory Dept. of Eectrica Engineering, Nationa Chung Hsing University Turbo codes 1 Chapter 12: Turbo Codes 1. Introduction 2. Turbo code encoder 3. Design of intereaver

More information

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems

Consistent linguistic fuzzy preference relation with multi-granular uncertain linguistic information for solving decision making problems Consistent inguistic fuzzy preference reation with muti-granuar uncertain inguistic information for soving decision making probems Siti mnah Binti Mohd Ridzuan, and Daud Mohamad Citation: IP Conference

More information

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents

MARKOV CHAINS AND MARKOV DECISION THEORY. Contents MARKOV CHAINS AND MARKOV DECISION THEORY ARINDRIMA DATTA Abstract. In this paper, we begin with a forma introduction to probabiity and expain the concept of random variabes and stochastic processes. After

More information

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization

Scalable Spectrum Allocation for Large Networks Based on Sparse Optimization Scaabe Spectrum ocation for Large Networks ased on Sparse Optimization innan Zhuang Modem R&D Lab Samsung Semiconductor, Inc. San Diego, C Dongning Guo, Ermin Wei, and Michae L. Honig Department of Eectrica

More information

A Small Footprint i-vector Extractor

A Small Footprint i-vector Extractor A Small Footprint i-vector Extractor Patrick Kenny Odyssey Speaker and Language Recognition Workshop June 25, 2012 1 / 25 Patrick Kenny A Small Footprint i-vector Extractor Outline Introduction Review

More information

Usually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However,

Usually the estimation of the partition function is intractable and it becomes exponentially hard when the complexity of the model increases. However, Odyssey 2012 The Speaker and Language Recognition Workshop 25-28 June 2012, Singapore First attempt of Boltzmann Machines for Speaker Verification Mohammed Senoussaoui 1,2, Najim Dehak 3, Patrick Kenny

More information

VI.G Exact free energy of the Square Lattice Ising model

VI.G Exact free energy of the Square Lattice Ising model VI.G Exact free energy of the Square Lattice Ising mode As indicated in eq.(vi.35), the Ising partition function is reated to a sum S, over coections of paths on the attice. The aowed graphs for a square

More information

Nonlinear Analysis of Spatial Trusses

Nonlinear Analysis of Spatial Trusses Noninear Anaysis of Spatia Trusses João Barrigó October 14 Abstract The present work addresses the noninear behavior of space trusses A formuation for geometrica noninear anaysis is presented, which incudes

More information

Robustness Techniques for Speech Recognition

Robustness Techniques for Speech Recognition Robustness Techniques for peech Recognition Berin Chen, 2004 References: 1. X. Huang et a. poken Language Processing (2001). Chapter 10 2. J. C. Junqua and J. P. Haton. Robustness in Automatic peech Recognition

More information

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix

Do Schools Matter for High Math Achievement? Evidence from the American Mathematics Competitions Glenn Ellison and Ashley Swanson Online Appendix VOL. NO. DO SCHOOLS MATTER FOR HIGH MATH ACHIEVEMENT? 43 Do Schoos Matter for High Math Achievement? Evidence from the American Mathematics Competitions Genn Eison and Ashey Swanson Onine Appendix Appendix

More information

BASIS COMPENSATION IN NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR SPEECH ENHANCEMENT

BASIS COMPENSATION IN NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR SPEECH ENHANCEMENT BASIS COMPENSATION IN NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR SPEECH ENHANCEMENT Hanwook Chung 1, Eric Pourde and Benoit Champagne 1 1 Dept. of Eectrica and Computer Engineering, McGi University, Montrea,

More information

Robustness Techniques for Speech Recognition

Robustness Techniques for Speech Recognition Robustness Techniques for peech Recognition Berin Chen, 2005 References: 1. X. Huang et a. poken Language Processing (2001). Chapter 10 2. J. C. Junqua and J. P. Haton. Robustness in Automatic peech Recognition

More information

Mathematical Scheme Comparing of. the Three-Level Economical Systems

Mathematical Scheme Comparing of. the Three-Level Economical Systems Appied Mathematica Sciences, Vo. 11, 2017, no. 15, 703-709 IKAI td, www.m-hikari.com https://doi.org/10.12988/ams.2017.7252 Mathematica Scheme Comparing of the Three-eve Economica Systems S.M. Brykaov

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

From Margins to Probabilities in Multiclass Learning Problems

From Margins to Probabilities in Multiclass Learning Problems From Margins to Probabiities in Muticass Learning Probems Andrea Passerini and Massimiiano Ponti 2 and Paoo Frasconi 3 Abstract. We study the probem of muticass cassification within the framework of error

More information

A Separability Index for Distance-based Clustering and Classification Algorithms

A Separability Index for Distance-based Clustering and Classification Algorithms A Separabiity Index for Distance-based Custering and Cassification Agorithms Arka P. Ghosh, Ranjan Maitra and Anna D. Peterson 1 Department of Statistics, Iowa State University, Ames, IA 50011-1210, USA.

More information

Evolutionary Product-Unit Neural Networks for Classification 1

Evolutionary Product-Unit Neural Networks for Classification 1 Evoutionary Product-Unit Neura Networs for Cassification F.. Martínez-Estudio, C. Hervás-Martínez, P. A. Gutiérrez Peña A. C. Martínez-Estudio and S. Ventura-Soto Department of Management and Quantitative

More information

A Comparison Study of the Test for Right Censored and Grouped Data

A Comparison Study of the Test for Right Censored and Grouped Data Communications for Statistica Appications and Methods 2015, Vo. 22, No. 4, 313 320 DOI: http://dx.doi.org/10.5351/csam.2015.22.4.313 Print ISSN 2287-7843 / Onine ISSN 2383-4757 A Comparison Study of the

More information

EXEMPLAR-BASED VOICE CONVERSION USING NON-NEGATIVE SPECTROGRAM DECONVOLUTION

EXEMPLAR-BASED VOICE CONVERSION USING NON-NEGATIVE SPECTROGRAM DECONVOLUTION 8th ISCA Speech Synthesis Worshop August 31 September 2, 2013 Barceona, Spain EXEMPLAR-BASED VOICE CONVERSION USING NON-NEGATIVE SPECTROGRAM DECONVOLUTION Zhizheng Wu 1,2, Tuomas Virtanen 3, Tomi Kinnunen

More information

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7

6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17. Solution 7 6.434J/16.391J Statistics for Engineers and Scientists May 4 MIT, Spring 2006 Handout #17 Soution 7 Probem 1: Generating Random Variabes Each part of this probem requires impementation in MATLAB. For the

More information

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract

Stochastic Complement Analysis of Multi-Server Threshold Queues. with Hysteresis. Abstract Stochastic Compement Anaysis of Muti-Server Threshod Queues with Hysteresis John C.S. Lui The Dept. of Computer Science & Engineering The Chinese University of Hong Kong Leana Goubchik Dept. of Computer

More information

Determining The Degree of Generalization Using An Incremental Learning Algorithm

Determining The Degree of Generalization Using An Incremental Learning Algorithm Determining The Degree of Generaization Using An Incrementa Learning Agorithm Pabo Zegers Facutad de Ingeniería, Universidad de os Andes San Caros de Apoquindo 22, Las Condes, Santiago, Chie pzegers@uandes.c

More information

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber

Approach to Identifying Raindrop Vibration Signal Detected by Optical Fiber Sensors & Transducers, o. 6, Issue, December 3, pp. 85-9 Sensors & Transducers 3 by IFSA http://www.sensorsporta.com Approach to Identifying Raindrop ibration Signa Detected by Optica Fiber ongquan QU,

More information

Unconditional security of differential phase shift quantum key distribution

Unconditional security of differential phase shift quantum key distribution Unconditiona security of differentia phase shift quantum key distribution Kai Wen, Yoshihisa Yamamoto Ginzton Lab and Dept of Eectrica Engineering Stanford University Basic idea of DPS-QKD Protoco. Aice

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Limited magnitude error detecting codes over Z q

Limited magnitude error detecting codes over Z q Limited magnitude error detecting codes over Z q Noha Earief choo of Eectrica Engineering and Computer cience Oregon tate University Corvais, OR 97331, UA Emai: earief@eecsorstedu Bea Bose choo of Eectrica

More information

DISTRIBUTION OF TEMPERATURE IN A SPATIALLY ONE- DIMENSIONAL OBJECT AS A RESULT OF THE ACTIVE POINT SOURCE

DISTRIBUTION OF TEMPERATURE IN A SPATIALLY ONE- DIMENSIONAL OBJECT AS A RESULT OF THE ACTIVE POINT SOURCE DISTRIBUTION OF TEMPERATURE IN A SPATIALLY ONE- DIMENSIONAL OBJECT AS A RESULT OF THE ACTIVE POINT SOURCE Yury Iyushin and Anton Mokeev Saint-Petersburg Mining University, Vasiievsky Isand, 1 st ine, Saint-Petersburg,

More information

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel

Sequential Decoding of Polar Codes with Arbitrary Binary Kernel Sequentia Decoding of Poar Codes with Arbitrary Binary Kerne Vera Miosavskaya, Peter Trifonov Saint-Petersburg State Poytechnic University Emai: veram,petert}@dcn.icc.spbstu.ru Abstract The probem of efficient

More information

Moreau-Yosida Regularization for Grouped Tree Structure Learning

Moreau-Yosida Regularization for Grouped Tree Structure Learning Moreau-Yosida Reguarization for Grouped Tree Structure Learning Jun Liu Computer Science and Engineering Arizona State University J.Liu@asu.edu Jieping Ye Computer Science and Engineering Arizona State

More information

Trainable fusion rules. I. Large sample size case

Trainable fusion rules. I. Large sample size case Neura Networks 19 (2006) 1506 1516 www.esevier.com/ocate/neunet Trainabe fusion rues. I. Large sampe size case Šarūnas Raudys Institute of Mathematics and Informatics, Akademijos 4, Vinius 08633, Lithuania

More information

Adaptive Noise Cancellation Using Deep Cerebellar Model Articulation Controller

Adaptive Noise Cancellation Using Deep Cerebellar Model Articulation Controller daptive Noise Canceation Using Deep Cerebear Mode rticuation Controer Yu Tsao, Member, IEEE, Hao-Chun Chu, Shih-Wei an, Shih-Hau Fang, Senior Member, IEEE, Junghsi ee*, and Chih-Min in, Feow, IEEE bstract

More information

Structured sparsity for automatic music transcription

Structured sparsity for automatic music transcription Structured sparsity for automatic music transcription O'Hanon, K; Nagano, H; Pumbey, MD; Internationa Conference on Acoustics, Speech and Signa Processing (ICASSP 2012) For additiona information about

More information

Simplified analysis of EXAFS data and determination of bond lengths

Simplified analysis of EXAFS data and determination of bond lengths Indian Journa of Pure & Appied Physics Vo. 49, January 0, pp. 5-9 Simpified anaysis of EXAFS data and determination of bond engths A Mishra, N Parsai & B D Shrivastava * Schoo of Physics, Devi Ahiya University,

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/Q10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of using one or two eyes on the perception

More information

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage

Related Topics Maxwell s equations, electrical eddy field, magnetic field of coils, coil, magnetic flux, induced voltage Magnetic induction TEP Reated Topics Maxwe s equations, eectrica eddy fied, magnetic fied of cois, coi, magnetic fux, induced votage Principe A magnetic fied of variabe frequency and varying strength is

More information

Numerical solution of one dimensional contaminant transport equation with variable coefficient (temporal) by using Haar wavelet

Numerical solution of one dimensional contaminant transport equation with variable coefficient (temporal) by using Haar wavelet Goba Journa of Pure and Appied Mathematics. ISSN 973-1768 Voume 1, Number (16), pp. 183-19 Research India Pubications http://www.ripubication.com Numerica soution of one dimensiona contaminant transport

More information

Fast Blind Recognition of Channel Codes

Fast Blind Recognition of Channel Codes Fast Bind Recognition of Channe Codes Reza Moosavi and Erik G. Larsson Linköping University Post Print N.B.: When citing this work, cite the origina artice. 213 IEEE. Persona use of this materia is permitted.

More information