An I-Vector Backend for Speaker Verification


Patrick Kenny,1 Themos Stafylakis,1 Jahangir Alam,1 and Marcel Kockmann2
1 CRIM, Canada, {patrick.kenny, themos.stafylakis, jahangir.alam}@crim.ca
2 VoiceTrust, Canada, marcel.kockmann@voicetrust.com

Abstract

We propose a new approach to the problem of uncertainty modeling in text-dependent speaker verification where speaker factors are used as the feature representation. The state-of-the-art backend in this situation consists in using point estimates of speaker factors to model the joint distribution of pairs of enrollment and test feature vectors under the same-speaker hypothesis. We develop a version of this backend that works with Baum-Welch statistics instead of point estimates. The likelihood ratio calculations for speaker verification turn out to be formally equivalent to evidence calculations with i-vector extractors having non-standard normal priors. Experiments show that this i-vector backend performs well on Part III of the RSR2015 dataset.

1. Introduction

This paper is concerned with backend modeling in text-dependent speaker recognition where speaker factors are used as the feature representation, and specifically with the case where enrollment and test utterances consist of random digit strings. We focus on the problem that point estimates of speaker factors corresponding to individual spoken digits are noisy in the statistical sense because digits are of very short duration. In fact, it turns out that if the uncertainty associated with the point estimate of the y-vector corresponding to a spoken digit is quantified by a posterior covariance matrix as in [1, 2], then this uncertainty is generally almost as large as the population covariance matrix. We believe that this explains why our attempts to use PLDA with uncertainty propagation in text-dependent speaker recognition produced mixed results compared with ordinary PLDA [3, 4]. This has led us to adopt a more fundamental approach to the problem of uncertainty modeling, which we work out in this paper.
Our experience has been that the best strategy for developing a backend classifier when speaker factors (or y-vectors in the notation of our previous work [5, 6]) are used as features for text-dependent speaker recognition is to model the joint distribution of enrollment and test y-vectors [7]. This approach, which we refer to as the Joint Density Backend, was introduced in [8], where it was shown to perform equivalently to PLDA (without uncertainty propagation) in the text-independent case. It is actually more natural in text-dependent speaker recognition than in the text-independent situation, since it is unnecessary to resort to i-vector averaging in order to equalize the number of enrollment and test feature vectors in a trial in the text-dependent case.

Thus we propose to compensate for the uncertainty in y-vector point estimates in the Joint Density Backend (rather than in PLDA). The standard approach to quantifying the uncertainty in the point estimate of a test y-vector is to calculate a posterior covariance matrix using the zero order Baum-Welch statistics extracted from the test utterance [1, 2]. In this paper we will adopt a different approach that quantifies this uncertainty in a trial-dependent way which takes account of the joint distribution of the enrollment and test y-vectors under the same-speaker hypothesis. The rationale here is that conditioning a random variable on another always results in a reduction of variance on average.1 Thus, in the case of a trial involving a relatively large amount of enrollment data, the uncertainty in the test utterance ought to be relatively small even if the test utterance is short, provided that the calculation which serves to quantify the uncertainty takes account of the enrollment data as well as the test data. This approach to uncertainty modeling led us to develop a version of the Joint Density Backend which works with Baum-Welch statistics rather than point estimates of y-vectors.
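The footnoted appeal to the law of total variance can be checked numerically. The following sketch (illustrative values of our choosing, not taken from the paper's system) simulates a jointly Gaussian pair and verifies that the average conditional variance is smaller than the unconditional variance:

```python
import numpy as np

# Law of total variance: Var(X) = E[Var(X | Y)] + Var(E[X | Y]),
# so conditioning on Y reduces variance on average.
rng = np.random.default_rng(0)

# Toy joint model: Y ~ N(0, 1), X | Y ~ N(0.8 * Y, 0.5).
y = rng.normal(0.0, 1.0, size=200_000)
x = 0.8 * y + rng.normal(0.0, np.sqrt(0.5), size=y.shape)

total_var = x.var()          # analytically 0.8^2 * 1 + 0.5 = 1.14
avg_cond_var = 0.5           # Var(X | Y) is constant in this model
explained = (0.8 * y).var()  # Var(E[X | Y]), analytically 0.64
```

The two right-hand pieces recover the total variance, and the residual (conditional) variance is strictly smaller than the unconditional one, which is the effect the backend exploits when enrollment data is plentiful.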
It turns out that the likelihood ratio evaluation for speaker verification is formally equivalent to performing evidence calculations with an i-vector extractor having two non-standard normal priors (one for the same-speaker hypothesis and one for the different-speaker hypothesis). Thus we refer to the new classifier as the I-Vector Backend. To be clear, the I-Vector Backend is used only for probability calculations and not as a feature extractor.

For our experiments we used the Part III (random digit) development test set in the RSR2015 corpus; we used the digit portion of the background set to train the JFA model which served as the y-vector extractor and to train the backend classifiers. This speaker recognition task is described in [9] and in the companion paper [10].

2. Background

2.1. y-vectors

We model speakers' pronunciation of digits with a tied-mixture HMM (one set of mixture weights for each digit) combined with a JFA model of digit supervectors of the form

    m + V y(d) + U x_r.    (1)

Here d is used to indicate a generic digit. The matrices U and V are rectangular of low rank, and x_r and y(d) have standard normal priors. The hidden variable x_r varies from one recording to another and so models channel effects, whereas the y-vectors serve as features for speaker recognition. We set the number of mixture components (which we denote by C) to 128 and the rank of V (which we denote by R) to 300, and trained this JFA model on the RSR2015 background digit data. There are only 97 speakers in the background set, but we were able to train a model of higher rank because the JFA model was configured to model speaker-digit combinations rather than speakers as such.

1 This is a consequence of the law of total variance. http://en.wikipedia.org/wiki/Law_of_total_variance
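The supervector synthesis in (1) can be sketched in a few lines of numpy. The dimensions below are toy values (the paper uses C = 128 mixture components and R = 300 speaker factors), and the variable names are ours, not those of any toolkit:

```python
import numpy as np

C, F, R_speaker, R_channel = 4, 3, 2, 2  # mixtures, feature dim, subspace ranks
supervector_dim = C * F

rng = np.random.default_rng(1)
m = rng.normal(size=supervector_dim)                 # UBM mean supervector
V = rng.normal(size=(supervector_dim, R_speaker))    # eigenvoice matrix (low rank)
U = rng.normal(size=(supervector_dim, R_channel))    # eigenchannel matrix (low rank)

y_d = rng.normal(size=R_speaker)  # speaker factors for digit d (standard normal prior)
x_r = rng.normal(size=R_channel)  # channel factors for recording r

# Digit-specific supervector: the speaker offset V y(d) is tied across
# recordings, while the channel offset U x_r changes per recording.
s = m + V @ y_d + U @ x_r
```

The point of the low-rank structure is that only y(d) carries speaker information; it is these R-dimensional y-vectors that serve as features downstream.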

In the terminology of the companion paper [10], the y-vectors in this paper are local in the sense that they vary from one digit to another. Eigenvoice modeling as originally conceived used global y-vectors, so that a speaker's pronunciation of all of the digits was modeled by a single y-vector [11]. This is not appropriate for the task at hand, since not all digits are represented in a test utterance.

2.2. The Joint Density Backend

At enrollment time, each speaker utters the ten digits in random order several times. Because y-vectors are tied across all utterances by a speaker, the enrollment process results in one y-vector per digit (regardless of the number of recordings available for enrollment). Denote these y-vectors by y_e(d) (e for enrollment and d for digit). Similarly, for each digit d appearing in a test utterance, we obtain a y-vector y_t(d). The Joint Density Backend forms a likelihood ratio for speaker verification of the form

    ∏_d P_T(y_e(d), y_t(d)) / P_N(y_e(d), y_t(d))

where d ranges over the digits in the test utterance, P_T refers to the joint distribution of feature vector pairs occurring in target trials and P_N to the joint distribution in non-target trials. We assume that the denominator factorizes as P_T(y_e(d)) P_T(y_t(d)). We model the numerator as a multivariate Gaussian of dimension 2R with a full 2R x 2R covariance matrix. We estimate this joint distribution by re-arranging the RSR2015 background digit strings into a collection of target trials.

2.3. The I-Vector Backend

The I-Vector Backend, which we will describe in detail in the next two sections, can be viewed as a hidden version of the Joint Density Backend which works with Baum-Welch statistics rather than point estimates of y-vectors. It stands in the same relation to JFA as the Joint Density Backend does to PLDA.
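The per-digit Joint Density Backend score is the log of the ratio above. A minimal numpy sketch (function names are ours; the target model is a Gaussian over the stacked pair, and the non-target model is the product of its marginals):

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian with dense covariance."""
    d = x.size
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def jdb_score(y_e, y_t, mean, cov):
    """Per-digit Joint Density Backend log likelihood ratio.

    mean (2R) and cov (2R x 2R) parametrize the target-trial joint
    distribution P_T of stacked (enrollment, test) y-vectors; the
    denominator uses the two marginals of P_T, per the factorization
    assumed in the text.
    """
    R = y_e.size
    joint = np.concatenate([y_e, y_t])
    num = gaussian_logpdf(joint, mean, cov)
    den = (gaussian_logpdf(y_e, mean[:R], cov[:R, :R])
           + gaussian_logpdf(y_t, mean[R:], cov[R:, R:]))
    return num - den

# Sanity check: with zero cross-correlation the ratio is exactly 1 (log 0).
A = np.array([[1.0, 0.2], [0.2, 1.0]])
cov0 = np.zeros((4, 4)); cov0[:2, :2] = A; cov0[2:, 2:] = A
score0 = jdb_score(np.array([0.5, -0.3]), np.array([1.0, 0.2]),
                   np.zeros(4), cov0)
```

A full trial score sums these per-digit log ratios over the digits occurring in the test utterance.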
The main difference between the Joint Density Backend and PLDA is that the marginal distributions of y-vectors extracted from enrollment and test data are not assumed to be the same; the I-Vector Backend is constructed from the JFA model which serves to extract y-vectors by a similar type of relaxation. In other respects, scoring a speaker verification trial with the I-Vector Backend is similar in spirit to JFA as it was originally formulated [5].

3. I-Vector Extractors with Non-Standard Priors

In order to explain how likelihood ratios are calculated with the I-Vector Backend, we first have to explain how to perform evidence calculations with i-vector extractors having non-standard priors. (The evidence is just the likelihood of the data when the hidden variable, i.e. the i-vector, is integrated out.) Thus we assume that we have an i-vector extractor with total variability matrix T of rank R corresponding to a UBM with C mixture components (indexed by c) of dimension F, and a prior distribution on i-vectors given by a mean vector µ and precision matrix P. We denote the prior by Π(w). As usual, we assume that the Baum-Welch statistics have been pre-whitened, so that the precision matrix associated with each mixture component can be taken to be the identity matrix and the mean vector to be zero (e.g. [12]). Given an utterance, we denote the zero and first order statistics associated with a mixture component c by N_c and F_c. In the case of a non-standard prior, the i-vector posterior distribution Q(w) is given in terms of the Baum-Welch statistics by

    Cov(w, w) = (P + Σ_c N_c T_c' T_c)^(-1)
    w̄ = Cov(w, w) (P µ + T' F).    (2)

A point estimate of the associated supervector is given by

    s̄ = T w̄.    (3)

3.1. Evidence criterion

Evidence calculations with i-vector extractors can be performed directly (Proposition 2 of [6]), but it is convenient to use the formula for the variational lower bound on the log evidence [13], namely

    ln P(O | w̄) − D(Q(w) ‖ Π(w))    (4)

where O refers to the collection of acoustic observations.
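The posterior computation (2) can be sketched directly in numpy. This is an illustrative rendering under our own naming conventions (per-mixture blocks T_c stacked in a 3-d array), not CRIM's implementation:

```python
import numpy as np

def ivector_posterior(P, mu, T_blocks, N, F_stacked):
    """Posterior of w under a non-standard normal prior N(mu, P^{-1}),
    following (2); Baum-Welch statistics assumed pre-whitened.

    T_blocks:  shape (C, F, R), the per-mixture blocks T_c of T.
    N:         zero-order statistics, shape (C,).
    F_stacked: first-order statistics, shape (C, F).
    """
    C = len(N)
    # Posterior precision: P + sum_c N_c T_c' T_c
    prec = P + sum(N[c] * T_blocks[c].T @ T_blocks[c] for c in range(C))
    cov = np.linalg.inv(prec)
    # T' F term: sum_c T_c' F_c
    TtF = sum(T_blocks[c].T @ F_stacked[c] for c in range(C))
    mean = cov @ (P @ mu + TtF)
    return mean, cov

# With empty statistics the posterior collapses to the prior, as it should.
P = np.diag([2.0, 4.0])
mu = np.array([1.0, -1.0])
mean, cov = ivector_posterior(P, mu, np.ones((1, 3, 2)),
                              np.array([0.0]), np.zeros((1, 3)))
```

Note how the prior enters twice: its precision P stiffens the posterior covariance, and its mean µ contributes the P µ term to the posterior mean, which is what distinguishes this from the standard-normal-prior case.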
(Since we are in a position to calculate exact posteriors, the lower bound gives the log evidence exactly.) The divergence can be calculated using the formula for the divergence between two R-dimensional Gaussians, giving

    D(Q(w) ‖ Π(w)) = −R/2 − (1/2) ln |P Cov(w, w)| + (1/2) tr(P Cov(w, w)) + (1/2) (w̄ − µ)' P (w̄ − µ).    (5)

To evaluate the first term in (4), set ε_nc = O_n − s̄_c, where O_n denotes the nth observation vector and s̄_c is the part of the supervector s̄ that corresponds to mixture component c. Ignoring the constant terms in the Gaussian kernels, we have

    ln P(O | w̄) = −(1/2) Σ_c Σ_n γ_n(c) ε_nc' ε_nc

and, ignoring the contributions of terms which involve only the zero order and second order statistics (they are not needed to calculate evidence ratios), we can write this as

    −(1/2) Σ_c (−2 s̄_c' F_c + N_c s̄_c' s̄_c + N_c tr(Cov(s_c, s_c))).    (6)

The term involving the covariances can be evaluated as

    −(1/2) tr((Σ_c N_c T_c' T_c) Cov(w, w)).    (7)

For future reference, it is useful to point out that the computations are most naturally organized by calculating the statistics

    T' F  and  Σ_c N_c T_c' T_c.    (8)

The i-vector posterior distribution is obtained by combining these with the prior according to (2); (3), (5) and (7) follow directly; and the contribution of the first order statistics to (6) can be handled by writing Σ_c s̄_c' F_c in the form w̄' T' F.
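The divergence term (5) is the closed-form KL divergence between Gaussians with the prior given by its mean and precision. A minimal numpy rendering (our naming) is:

```python
import numpy as np

def gaussian_kl(mean_q, cov_q, mean_p, prec_p):
    """D(Q || Pi) between R-dimensional Gaussians as in (5); the prior Pi
    is parametrized by its mean and precision matrix."""
    R = mean_q.size
    diff = mean_q - mean_p
    _, logdet = np.linalg.slogdet(prec_p @ cov_q)  # ln |P Cov(w, w)|
    return 0.5 * (-R - logdet
                  + np.trace(prec_p @ cov_q)
                  + diff @ prec_p @ diff)

# Sanity check: the divergence of a distribution from itself is zero.
prec = np.diag([2.0, 4.0])
m = np.array([0.3, -0.7])
kl_self = gaussian_kl(m, np.linalg.inv(prec), m, prec)
```

Subtracting this divergence from ln P(O | w̄) gives the bound (4), which, since the posterior is exact here, equals the log evidence.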

3.2. Estimating the non-standard priors

Aggregating the negative divergences in the variational lower bound over all utterances in a training set gives the auxiliary function which is optimized to estimate the prior Π(w). The re-estimation formulas are

    µ = (1/S) Σ_s w̄(s)
    P^(−1) = (1/S) Σ_s w(s) w'(s) − µ µ'    (9)

where S is the total number of training utterances and, for each utterance s, w(s) is the corresponding i-vector. These updates are performed iteratively and the evidence is guaranteed to increase from one iteration to the next.

4. Likelihood Ratio Calculations in the I-Vector Backend

4.1. Baum-Welch statistics

In enrolling a speaker, we create a set of synthetic Baum-Welch statistics for each digit d by taking the raw Baum-Welch statistics in each recording, removing the channel effects and pooling over the enrollment recordings (and similarly for a test utterance). If the raw Baum-Welch statistics for recording r and mixture component c are denoted by N_cr(d) and F_cr(d), then the synthetic zero and first order statistics are

    N_c(d) = Σ_r N_cr(d)
    F_c(d) = Σ_r (F_cr(d) − N_cr(d) U_c x̄_r)    (10)

where x̄_r is a point estimate of the hidden variable x_r in (1).

It is well known that inserting a non-linear length normalization step between a JFA-based feature extractor and a Gaussian backend such as PLDA or the Joint Density Backend effectively compensates for the unsatisfactory Gaussian assumptions on which JFA is based [14]. The question arises as to whether a similar type of normalization should be performed on the synthetic Baum-Welch statistics before presenting them to the I-Vector Backend. Since the y-vectors produced by JFA are supposed to have a standard normal distribution, (1/R) ‖y(d)‖² ought to be equal to 1 on average. This suggests scaling the first order synthetic statistics in such a way that, for each spoken digit, (1/R) ‖y(d)‖² = 1. (We leave the zero order statistics unchanged.)

4.2. Joint likelihood ratio

We can model target trials with an i-vector extractor of dimension 2CF x 2R defined by setting

    T = ( V  0
          0  V )    (11)

where V is copied from the JFA model (1). Each digit in a target trial contributes a set of Baum-Welch statistics obtained by concatenating a set of synthetic Baum-Welch statistics extracted from the enrollment utterances and a set of synthetic Baum-Welch statistics extracted from the test utterance. (Thus supervectors are of dimension 2CF rather than CF.) Similarly, the hidden variable w (the "i-vector") can be thought of as a concatenation of two JFA speaker factor vectors, y_e(d) and y_t(d). (Thus w is of dimension 2R rather than R.) Analogously to the Joint Density Backend, the correlations between y_e(d) and y_t(d) under the same-speaker hypothesis are learned by training a non-standard normal prior Π_T(w), using a set of target trials as a training set and the minimum divergence update formulas (9). As for non-target trials, we can model them using the same i-vector extractor and a prior Π_N(w) obtained by zeroing out the cross correlations in the covariance matrix which defines Π_T(w) (that is, by treating the enrollment and test utterances as being statistically independent).

Thus we obtain a likelihood ratio for speaker verification by evaluating the evidence as in Section 3.1 in two ways, once with the prior Π_T(w) and once with the prior Π_N(w). Because this computation is rather extravagant (the i-vector extractor is of rank 2R rather than R), we found it necessary to use a restricted test consisting only of difficult trials for our pilot experiments.

4.3. Predictive likelihood ratio

We have seen how to construct a likelihood ratio of the form P(E, T) / P(E) P(T), where E stands for the enrollment data and T for the test data. A more efficient approach is to calculate a predictive likelihood ratio of the form P(T | E) / P(T).
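The prior re-estimation (9) and the construction of Π_N from Π_T by zeroing cross-correlations can be sketched as follows. Function names are ours; in the update we use the posterior second moment (covariance plus outer product of the mean), which is the natural reading of (9) given that exact posteriors are available:

```python
import numpy as np

def minimum_divergence_update(post_means, post_covs):
    """Re-estimation (9): mu is the average posterior mean; P^{-1} is the
    average posterior second moment <w w'> minus mu mu'."""
    mu = np.mean(post_means, axis=0)
    second = np.mean([c + np.outer(m, m)
                      for m, c in zip(post_means, post_covs)], axis=0)
    return mu, np.linalg.inv(second - np.outer(mu, mu))

def nontarget_prior_from_target(mu_T, prec_T, R):
    """Pi_N: zero the enrollment/test cross-correlations in the covariance
    defining Pi_T, i.e. treat the two halves of w as independent."""
    cov = np.linalg.inv(prec_T)
    cov[:R, R:] = 0.0
    cov[R:, :R] = 0.0
    return mu_T, np.linalg.inv(cov)

# Example: a 2R = 2 target prior with correlated halves loses its
# cross-correlation, leaving the two marginals intact.
cov_T = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
mu_N, prec_N = nontarget_prior_from_target(np.zeros(2), np.linalg.inv(cov_T), R=1)
```

Scoring a trial then amounts to running the evidence calculation of Section 3.1 twice, once under each prior, and subtracting the two log evidences.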
This can also be derived from the priors Π_T(w) and Π_N(w) which we have just described, but the calculation at verification time involves an i-vector extractor of rank R rather than 2R. Fix a digit d and a speaker s. To calculate the numerator distribution P(T | E), take the synthetic first order Baum-Welch statistics for the digit from the speaker's enrollment data and pad them with 0's to obtain a full set of statistics of dimension 2CF, and similarly for the synthetic zero order statistics. Using (2), calculate an i-vector posterior of dimension 2R. Interpreting the i-vector as a concatenation of an enrollment y-vector and a test y-vector, this yields a distribution on test y-vectors by marginalization. Denote this marginal distribution by Π_{T,d,s}(y). For the denominator calculation, we simply marginalize Π_N(w) to obtain a distribution on test y-vectors which we denote by Π_N(y). At verification time, we perform evidence calculations with the i-vector extractor defined by setting T = V and these two non-standard priors. Note that (unlike the joint ratio approach) the predictive likelihood ratio can benefit from pre-computing the statistics (8) for each test utterance, and that the denominator of the likelihood ratio need only be evaluated once per test utterance.

5. Experiments

These experiments were conducted with a standard, 60-dimensional PLP front end on the RSR2015 Part III development test set. Utterances of less than 1 second duration or SNR of less than 15 dB were rejected, so that not all trials were performed [10]. For pilot experiments with the I-Vector Backend we used a subset of the female test set consisting of the target trials and 100 K high-scoring non-target trials. To train the Joint Density Backend and the i-vector priors referred to in Section 4.2, we devised a backend training set by organizing the RSR2015 Part III background digit data into a set of 14 K target trials intended to simulate the trials in the development test set.

5.1. Benchmark results

Table 1 reports results obtained on the restricted test set with a GMM/UBM system (line 1) and the Joint Density Backend (lines 2, 3 and 4).

         EER     DCF 2008
    1    13.5%   0.531
    2    12.6%   0.519
    3    15.1%   0.586
    4    13.1%   0.539

    Table 1: Joint Density Backend and GMM/UBM benchmarks, restricted development set, no score normalization

Line 2 refers to a straightforward implementation of the Joint Density Backend, with a single, digit-independent, full covariance matrix of dimension 2R x 2R (R = 300). Modeling the joint density of enrollment and test y-vectors with a single Gaussian distribution common to all digits is unsatisfactory, but the backend training set is too small to enable full covariance matrices of dimension 600 x 600 to be estimated for each digit. This raises the question of whether full covariance matrices are actually necessary. Assuming that standard minimum divergence estimation has been used in training the JFA model that serves as the y-vector extractor, the y-vector components will be uncorrelated. This suggests that diagonal constraints could be imposed in modeling the joint distribution of enroll-test y-vector pairs. (Specifically, we would expect that the joint 2R x 2R covariance matrix could be modeled by diagonal blocks of dimension R x R.) It turns out, however, that imposing such diagonal constraints on the joint covariance matrix leads to a substantial degradation in performance (line 3). These constraints make it possible to estimate digit-dependent covariance matrices (line 4), but performance is not as good as with a single, full covariance matrix (line 2). Note that analogous constraints can be placed on the i-vector priors referred to in Section 4.2 by a straightforward modification of (9).

5.2. I-vector backend

The results we obtained on the restricted test set with variants of the I-Vector Backend are summarized in Table 2.

         LR      diag?   digit-dep?   EER     DCF 2008
    1    joint                        13.1%   0.571
    2    pred.                        12.6%   0.550
    3    joint                        14.7%   0.624
    4    pred.                        12.2%   0.535
    5    pred.   yes                  12.2%   0.526
    6    pred.   yes     yes          10.8%   0.484

    Table 2: I-Vector Backend, restricted development set, no score normalization

For our first experiments we used digit-independent priors without diagonal constraints.
We compared joint likelihood ratios with predictive likelihood ratios and investigated the effect of the normalization procedure described in Section 4.1. Lines 1 and 2 contain results obtained with the joint likelihood ratio and predictive likelihood ratio when the synthetic Baum-Welch statistics (10) are not normalized; the corresponding results with normalization are given in lines 3 and 4. It appears that predictive likelihood ratios perform better than joint likelihood ratios (line 2 vs. line 1 and line 4 vs. line 3) and that normalizing the Baum-Welch statistics is slightly beneficial (line 4 vs. line 2).

The interesting results in Table 2 are in lines 5 and 6, where we imposed diagonal constraints (analogous to those discussed in the previous section) on the priors. We used predictive likelihood ratios in both cases. For line 5, we used a digit-independent prior with diagonal constraints. This works just as well as the unconstrained digit-independent prior (line 4); contrast this behavior with the major performance degradation we observed in performing the analogous experiment with the Joint Density Backend (line 3 of Table 1). For line 6, we used diagonally constrained digit-dependent priors and obtained a substantial improvement, again contrary to the behavior we observed with the Joint Density Backend (line 4 of Table 1).

For all of these experiments, the predictive likelihood ratio for a trial was normalized by the duration of the test utterance (with a similar normalization that handles the enrollment and test data symmetrically for the joint likelihood ratio). Interestingly, this type of normalization turns out not to be helpful: slightly better results can be obtained by removing it.

5.3. Full development set

Table 3 compares results obtained with the GMM/UBM system (line 1) and the best configurations of the Joint Density Backend and the I-Vector Backend on the full development set, using s-norm score normalization in each case.

                          EER (M/F)    DCF 2008 (M/F)
    1    GMM/UBM          4.8%/8.0%    0.217/0.356
    2    JDB              5.7%/6.2%    0.244/0.326
    3    IVB              5.0%/6.3%    0.215/0.310
    4    IVB digit-dep.   4.7%/5.9%    0.205/0.297

    Table 3: Joint Density Backend (JDB) vs. I-Vector Backend (IVB), full development set, with male and female results broken out

Line 2 gives the Joint Density benchmark (single full covariance matrix); lines 3 and 4 refer to the I-Vector Backend with a digit-independent prior and digit-dependent priors. (Diagonal constraints were imposed in both cases, and predictive likelihood ratios were not subject to duration normalization.)

6. Conclusion

Unlike Part I of the RSR2015 data, the Part III development test set is a hard task and it is not easy to outperform a GMM/UBM system using subspace methods trained on the background data alone [9, 10]. (The RSR background does not contain sufficiently many speakers to train the JFA model (1) adequately, so that y-vectors need to be supplemented by other features in order to achieve low error rates [10].) We have shown that when y-vectors are used as the sole feature representation, the I-Vector Backend enables us to achieve a large gain in performance in the case of female speakers (who are particularly problematic in this task) and modest gains in the case of males. Comparing lines 2, 3 and 4 of Table 3 shows that the I-Vector Backend yields substantial gains over the Joint Density Backend. Best results are obtained by making the i-vector priors digit-dependent, although most of the gains are obtained with a digit-independent prior. Digit-dependence is achieved by imposing diagonal constraints on the i-vector priors, and this works well even in the digit-independent case. Interestingly, analogous constraints do not work in the case of the Joint Density Backend. It appears that crude priors work better at the level of hidden variables than at the level of observations.

7. References

[1] S. Cumani, O. Plchot, and P. Laface, "On the use of i-vector posterior distributions in probabilistic linear discriminant analysis," IEEE Transactions on Audio, Speech and Language Processing, vol. 22, no. 4, pp. 846-857, 2014.
[2] P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam, and P. Dumouchel, "PLDA for Speaker Verification with Utterances of Arbitrary Duration," in Proc. ICASSP, Vancouver, Canada, May 2013.
[3] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, and M. Kockmann, "Text-dependent speaker recognition using PLDA with uncertainty propagation," in Proc. Interspeech, Lyon, France, Sept. 2013.
[4] T. Stafylakis, P. Kenny, P. Ouellet, J. Perez, M. Kockmann, and P. Dumouchel, "I-Vector/PLDA Variants for Text-Dependent Speaker Recognition," Aug. 2013. [Online]. Available: http://www.crim.ca/perso/patrick.kenny
[5] P. Kenny, G. Boulianne, and P. Dumouchel, "Eigenvoice modeling with sparse training data," IEEE Trans. Speech Audio Processing, vol. 13, no. 3, pp. 345-359, May 2005.
[6] P. Kenny, "Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms," Tech. Report CRIM-06/08-13, 2005. [Online]. Available: http://www.crim.ca/perso/patrick.kenny
[7] P. Kenny, T. Stafylakis, J. Alam, and M. Kockmann, "JFA modeling with left-to-right structure and a new backend for text-dependent speaker recognition," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.
[8] S. Cumani and P. Laface, "Generative Pairwise Models for Speaker Recognition," in Proc. Odyssey Speaker and Language Recognition Workshop, Joensuu, Finland, June 2014.
[9] A. Larcher, K. A. Lee, B. Ma, and H. Li, "Text-dependent speaker verification: Classifiers, databases and RSR2015," Speech Communication, vol. 60, pp. 56-77, 2014.
[10] T. Stafylakis, P. Kenny, J. Alam, and M. Kockmann, "JFA for speaker verification with random digit strings," in Proc. Interspeech, Dresden, Germany, Sept. 2015.
[11] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, "Rapid speaker adaptation in eigenvoice space," IEEE Trans. Speech Audio Processing, vol. 8, pp. 695-707, Nov. 2000.
[12] P. Kenny, "A small footprint i-vector extractor," in Proc. Odyssey 2012, Singapore, June 2012. [Online]. Available: http://www.crim.ca/perso/patrick.kenny
[13] C. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer Science+Business Media, LLC, 2006.
[14] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Proc. Interspeech 2011, Florence, Italy, Aug. 2011.