Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition

1 Uncertainty Modeling without Subspace Methods for Text-Dependent Speaker Recognition
Patrick Kenny, Themos Stafylakis, Md. Jahangir Alam and Marcel Kockmann
Odyssey Speaker and Language Recognition Workshop, Bilbao, Spain, June
P. Kenny, T. Stafylakis, J. Alam et al. Uncertainty Modeling without Subspace Methods

2 Uncertainty Modeling in Text-Dependent Speaker Recognition
- Large numbers of mixture components are surprisingly effective in text-dependent speaker recognition, where utterances are typically of 1 or 2 seconds duration
- The number of times a mixture component is observed is typically << 1 and could be 0 (particularly at test time), so observations ought to be treated as being noisy in the statistical sense
- Some progress has been made in uncertainty modeling in text-independent speaker recognition with subspace methods (i-vectors, speaker factors), but these are of limited use in text-dependent speaker recognition
- We tackle the problem of uncertainty modeling without resorting to subspace methods

3 RSR2015 Part III (Random Digits)
- Background set (97 speakers) used for JFA and backend training
- Results reported on development set
- Enrollment consists of 3 utterances of the 10 digits in random order
- Each test utterance consists of a random string of 5 digits
- Error rates are much higher than on Part I
- Counterintuitively, it is hard to beat a naive GMM/UBM benchmark using HMMs
- We focus on backend modeling with a standard 60-dimensional PLP front end

4 JFA for Speaker Recognition with Digits
- Given a speaker and a collection of enrollment recordings, the recordings are modeled by supervectors of the form
  $$m + U x_r + D z \qquad (1)$$
- Speakers are characterized by z-vectors (supervector-sized); the x-vectors (low-dimensional) model channel effects
- To perform speaker recognition, for each digit d in a test utterance, compare the supervectors $z_e$ and $z_t$, where
  - $z_e$ is extracted from the enrollment utterances
  - $z_t$ is extracted from the test utterance
- z-vectors may be digit-independent (global) or digit-dependent (local)
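As a rough illustration of the supervector model (1), the sketch below builds one recording's supervector from toy quantities. The dimensions, the random values and the function name `jfa_supervector` are all hypothetical, and D is assumed to be diagonal (stored as a vector), which the slide does not state explicitly.

```python
import numpy as np

def jfa_supervector(m, U, x_r, d, z):
    """Supervector m + U x_r + D z for one recording; D is diagonal, stored as the vector d."""
    return m + U @ x_r + d * z

rng = np.random.default_rng(0)
n_components, feat_dim, channel_rank = 4, 3, 2   # toy sizes, chosen only for illustration
sv_dim = n_components * feat_dim

m = rng.normal(size=sv_dim)                  # speaker-independent mean supervector
U = rng.normal(size=(sv_dim, channel_rank))  # low-rank channel matrix
d = rng.uniform(0.5, 1.5, size=sv_dim)       # diagonal of D (assumed diagonal here)
z = rng.normal(size=sv_dim)                  # speaker vector: supervector-sized
x_r = rng.normal(size=channel_rank)          # channel vector: low-dimensional, one per recording

supervector = jfa_supervector(m, U, x_r, d, z)
```

The point of the example is the size contrast: z has as many entries as the supervector, while x_r lives in a small channel subspace.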

5 The Joint Density Backend uses point estimates of $z_e$ and $z_t$
The Hidden Supervector Backend treats $z_e$ and $z_t$ as latent variables. Inference requires
- Baum-Welch statistics
- A joint prior distribution (under the same-speaker hypothesis) $P(w)$ where $w = (z_e, z_t)$
- Calculating the posterior of $w$ given the Baum-Welch statistics

6 Joint Density Backend
- The joint distribution for target trials, $P_T(z_e, z_t)$, is modeled by a Gaussian for each mixture component
- Insufficient data to train full-covariance Gaussians, and diagonal Gaussians are obviously incorrect
- Semi-diagonal constraints (see paper)
- Gaussians estimated by arranging the background set into a collection of target trials
- For non-target trials, assume statistical independence, i.e. $P_N(z_e, z_t) = P_T(z_e) P_T(z_t)$
- Likelihood ratio for speaker verification:
  $$\prod \frac{P_T(z_e, z_t)}{P_N(z_e, z_t)}$$
  where the product ranges over the digits in the test utterance and the mixture components in the UBM
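The likelihood ratio above can be sketched for a single mixture component as follows. This is a minimal illustration, assuming a full joint covariance for the target Gaussian (the slides use semi-diagonal constraints, which are not reproduced here); the function names and the toy numbers are hypothetical.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def jdb_llr(z_e, z_t, mean, cov):
    """Log P_T(z_e, z_t) - log P_N(z_e, z_t) for one mixture component.

    mean and cov describe the joint Gaussian over (z_e, z_t) for target trials;
    the non-target density is the product of its two marginals.
    """
    dim = len(z_e)
    target = gauss_logpdf(np.concatenate([z_e, z_t]), mean, cov)
    nontarget = (gauss_logpdf(z_e, mean[:dim], cov[:dim, :dim])
                 + gauss_logpdf(z_t, mean[dim:], cov[dim:, dim:]))
    return target - nontarget

# Toy one-dimensional component with strongly correlated enrollment/test vectors.
toy_mean = np.zeros(2)
toy_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
llr_agree = jdb_llr(np.array([1.0]), np.array([1.0]), toy_mean, toy_cov)
llr_disagree = jdb_llr(np.array([1.0]), np.array([-1.0]), toy_mean, toy_cov)
```

In the log domain the product over digits and mixture components becomes a sum of such per-component terms.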

7 Hidden Supervector Backend
- For each mixture component, treat $z_e$, $z_t$ as a pair of hidden mean vectors which are correlated in the case of a target trial
- Use an i-vector extractor to do probability calculations (not to extract factors)
- The i-vector $w$ is the pair $(z_e, z_t)$, so its dimension is twice that of the acoustic feature vectors
- The i-vector model has full rank, so we can take the total variability matrix to be the identity and shift the burden of modeling the correlation between $z_e$ and $z_t$ to the prior
- The prior cannot be standard normal, so it needs to be estimated

8 Posterior Calculations
For an i-vector extractor with a non-standard prior,
$$\mathrm{Cov}(w, w) = \Big(P + \sum_c N_c T_c^{\top} T_c\Big)^{-1}$$
$$\langle w \rangle = \mathrm{Cov}(w, w)\Big(P\mu + \sum_c T_c^{\top} F_c\Big)$$
where $\mu$ is the prior expectation and $P$ the precision. (In the standard case, $\mu = 0$ and $P = I$.)
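The posterior formulas on this slide translate fairly directly into NumPy. The sketch below assumes the first-order statistics are already centered and whitened, so that no UBM covariance appears; the array layout and the name `posterior` are illustrative choices, not taken from the paper.

```python
import numpy as np

def posterior(N, F, T, mu, P):
    """Posterior mean and covariance of w under a Gaussian prior N(mu, P^-1).

    N:  (C,)       zero-order statistics per mixture component
    F:  (C, D)     first-order statistics (assumed centered and whitened)
    T:  (C, D, K)  per-component blocks T_c of the total variability matrix
    mu: (K,)       prior mean;  P: (K, K) prior precision
    """
    precision = P + np.einsum("c,cdk,cdl->kl", N, T, T)   # P + sum_c N_c T_c^T T_c
    cov = np.linalg.inv(precision)
    mean = cov @ (P @ mu + np.einsum("cdk,cd->k", T, F))  # Cov (P mu + sum_c T_c^T F_c)
    return mean, cov

# Standard prior, one component, T_c = I: the posterior shrinks F by 1 / (1 + N).
post_mean, post_cov = posterior(np.array([3.0]), np.array([[2.0, 4.0]]),
                                np.eye(2)[None], np.zeros(2), np.eye(2))
```

With no data (all statistics zero) the function returns the prior itself, which is a convenient sanity check.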

9 Minimum Divergence Estimation of the Prior
- We need to supply the mean $\mu$ and precision matrix $P$ that specify the prior distribution of i-vectors for same-speaker trials.
- Arrange the background set into a collection of target trials indexed by $s = 1, \ldots, S$ and let $w(s)$ be the i-vector for trial $s$. Then
  $$\mu = \frac{1}{S} \sum_s \langle w(s) \rangle$$
  $$P^{-1} = \frac{1}{S} \sum_s \langle w(s) w(s)^{\top} \rangle - \mu \mu^{\top}$$
  where $\langle \cdot \rangle$ denotes a posterior expectation.
- Minor modifications make $\mu$ and $P$ digit-dependent or impose semi-diagonal constraints.
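One way to read the update for the prior is sketched below, under the assumption that the averages are taken over posterior moments, so that the second moment of each trial is its posterior covariance plus the outer product of its posterior mean. The function name and the array shapes are hypothetical.

```python
import numpy as np

def min_divergence_prior(post_means, post_covs):
    """Re-estimate the prior mean and precision from per-trial posteriors.

    post_means: (S, K)     posterior expectation of w for each target trial
    post_covs:  (S, K, K)  posterior covariance of w for each target trial
    """
    mu = post_means.mean(axis=0)
    # <w w^T> = Cov(w, w) + <w><w>^T, averaged over the S trials
    second_moment = (post_covs + np.einsum("sk,sl->skl", post_means, post_means)).mean(axis=0)
    prior_cov = second_moment - np.outer(mu, mu)
    return mu, np.linalg.inv(prior_cov)

# Toy data: 50 trials, 4-dimensional w, identical posterior covariances.
rng = np.random.default_rng(2)
toy_means = rng.normal(size=(50, 4))
toy_covs = np.tile(0.1 * np.eye(4), (50, 1, 1))
mu_hat, P_hat = min_divergence_prior(toy_means, toy_covs)
```

In practice this step would alternate with recomputing the posteriors under the updated prior; the number of iterations is not specified on the slide.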

10 For the different-speaker hypothesis, treat $z_e$ and $z_t$ as being statistically independent.
In other words, suppress the cross-correlations in the covariance matrix $P^{-1}$ that defines the prior under the same-speaker hypothesis.
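Suppressing the cross-correlations amounts to zeroing the off-diagonal blocks of the prior covariance and inverting again. A minimal sketch, assuming w stacks the enrollment vector first and the test vector second, each of dimension `dim`; the function name is hypothetical:

```python
import numpy as np

def different_speaker_precision(P_same, dim):
    """Precision of the different-speaker prior derived from the same-speaker precision."""
    cov = np.linalg.inv(P_same)   # same-speaker prior covariance P^-1
    cov[:dim, dim:] = 0.0         # drop Cov(z_e, z_t)
    cov[dim:, :dim] = 0.0         # drop Cov(z_t, z_e)
    return np.linalg.inv(cov)

# Toy same-speaker prior over a 4-dimensional w (two 2-dimensional halves).
rng = np.random.default_rng(3)
A = rng.normal(size=(4, 4))
same_cov = A @ A.T + np.eye(4)
P_diff = different_speaker_precision(np.linalg.inv(same_cov), dim=2)
```

The marginal covariances of the enrollment and test vectors are unchanged; only their coupling is removed.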

11 Likelihood Ratio
- Given data and a probability model with hidden variables, the evidence is the likelihood of the data calculated by integrating out the hidden variables
- For an i-vector model, the integral can be evaluated in closed form (it is a Gaussian integral) and expressed in terms of the Baum-Welch statistics (see paper)
- To evaluate the likelihood ratio for a speaker verification trial, evaluate the evidence twice
  - Using the prior for the same-speaker hypothesis
  - Using the prior for the different-speaker hypothesis
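The closed-form evidence can be sketched for one mixture component as follows. This is an illustration rather than the paper's exact expression: it assumes the total variability matrix is the identity, that the statistics are centered and whitened, and it drops additive terms that do not depend on the prior (they cancel in the ratio). The names `log_evidence` and `hsb_llr` are hypothetical.

```python
import numpy as np

def log_evidence(n, f, mu, P):
    """Log of the Gaussian integral over w, up to a prior-independent constant.

    n: (K,) occupancy attached to each coordinate of w (N_e repeated, then N_t repeated)
    f: (K,) stacked first-order statistics (F_e, F_t)
    mu, P: prior mean and precision of w
    """
    A = P + np.diag(n)            # posterior precision
    b = P @ mu + f
    _, logdet_P = np.linalg.slogdet(P)
    _, logdet_A = np.linalg.slogdet(A)
    return 0.5 * (logdet_P - logdet_A - mu @ P @ mu + b @ np.linalg.solve(A, b))

def hsb_llr(n, f, mu, P_same, P_diff):
    """Log-likelihood ratio: the evidence evaluated under each of the two priors."""
    return log_evidence(n, f, mu, P_same) - log_evidence(n, f, mu, P_diff)

# Toy trial with one-dimensional z_e and z_t whose statistics point the same way.
P_same = np.linalg.inv(np.array([[1.0, 0.8], [0.8, 1.0]]))
P_diff = np.eye(2)
llr = hsb_llr(np.array([5.0, 5.0]), np.array([5.0, 5.0]), np.zeros(2), P_same, P_diff)
```

Because the occupancies enter through the posterior precision, components that are rarely or never observed contribute little to the score, which is the sense in which the backend accounts for uncertainty.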

12 Preparing the Baum-Welch Statistics
- For each speaker, we have a collection of (enrollment or test) recordings indexed by $r$
- For each mixture component $c$, the zero and first order statistics are denoted by $N_c^r$ and $F_c^r$
- Remove the channel effects from each recording and pool over recordings:
  $$N_c = \sum_r N_c^r$$
  $$F_c = \sum_r \big(F_c^r - N_c^r U_c \hat{x}_r\big)$$
  where $\hat{x}_r$ is a point estimate of the hidden variable $x_r$ in (1)
- One set of synthetic statistics per speaker (regardless of the number of recordings)
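The pooling step can be sketched as below, assuming the per-recording statistics and the channel point estimates are already available as arrays. The array layout and the name `pool_statistics` are illustrative.

```python
import numpy as np

def pool_statistics(N_r, F_r, U, x_hat):
    """Channel-compensate each recording's statistics and pool them per speaker.

    N_r:   (R, C)     zero-order statistics per recording and component
    F_r:   (R, C, D)  first-order statistics per recording and component
    U:     (C, D, K)  per-component blocks U_c of the channel matrix
    x_hat: (R, K)     point estimate of the channel vector for each recording
    """
    N = N_r.sum(axis=0)
    channel_offset = np.einsum("cdk,rk->rcd", U, x_hat)        # U_c x_r for every r, c
    F = (F_r - N_r[:, :, None] * channel_offset).sum(axis=0)   # sum_r (F_c^r - N_c^r U_c x_r)
    return N, F

# Toy example: 2 recordings, 2 components, 3-dimensional features, rank-2 channel space.
rng = np.random.default_rng(4)
toy_N = np.array([[1.0, 2.0], [3.0, 0.0]])
toy_F = rng.normal(size=(2, 2, 3))
toy_U = rng.normal(size=(2, 3, 2))
toy_x = rng.normal(size=(2, 2))
N_pooled, F_pooled = pool_statistics(toy_N, toy_F, toy_U, toy_x)
```

Whatever the number of recordings, the output is a single pair of statistics per component, matching the last point on the slide.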

13 Length Normalization of the Synthetic Statistics
- In the JFA model (1), $z_c$ is a hidden variable
- The posterior covariance and expectation, $C_c$ and $\langle z_c \rangle$, are given by
  $$C_c = (I + N_c D_c^{\top} D_c)^{-1}$$
  $$\langle z_c \rangle = C_c D_c^{\top} F_c$$
  so that
  $$\langle \|z_c\|^2 \rangle = \|\langle z_c \rangle\|^2 + \mathrm{trace}(C_c) \qquad (2)$$
- For each speaker, we scale the synthetic first order statistics so that $\sum_c \langle \|z_c\|^2 \rangle$ is the same for all speakers
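One possible reading of this normalization is sketched below. It assumes that D_c is diagonal, that only the first-order statistics are rescaled (so the trace term is untouched), and that the common target value exceeds the summed traces; none of these details is stated on the slide, and the function names and the target value are hypothetical.

```python
import numpy as np

def expected_sq_norm_parts(N, F, d):
    """Return (sum_c ||<z_c>||^2, sum_c trace(C_c)) for diagonal D_c stored as d[c].

    N: (C,) zero-order statistics;  F: (C, D) first-order statistics;  d: (C, D) diagonals of D_c
    """
    C_diag = 1.0 / (1.0 + N[:, None] * d ** 2)   # diagonal of C_c = (I + N_c D_c^T D_c)^-1
    z_mean = C_diag * d * F                      # <z_c> = C_c D_c^T F_c
    return np.sum(z_mean ** 2), np.sum(C_diag)

def length_normalize(N, F, d, target):
    """Scale F so that sum_c <||z_c||^2> equals target (assumed reading of the slide)."""
    mean_part, trace_part = expected_sq_norm_parts(N, F, d)
    if target <= trace_part or mean_part == 0.0:
        raise ValueError("target must exceed the summed traces and F must be non-zero")
    alpha = np.sqrt((target - trace_part) / mean_part)
    return alpha * F

# Toy speaker: 3 components, 2-dimensional features.
rng = np.random.default_rng(5)
toy_N = np.array([0.5, 0.0, 2.0])
toy_F = rng.normal(size=(3, 2))
toy_d = rng.uniform(0.5, 1.5, size=(3, 2))
toy_target = 7.0   # hypothetical common value; the trace sum here is at most 6
F_scaled = length_normalize(toy_N, toy_F, toy_d, toy_target)
```

Since the trace term depends only on the occupancies, it acts as a speaker-specific offset that the scaling of F has to compensate for.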

14 The dominant term in (2) is $\mathrm{trace}(C_c)$
An experiment in Appendix A demonstrates its usefulness
The posterior covariance matrix $C_c$ depends critically on the relevance factor

15 128 Mixture Components, Global z-vectors

    System  norm.?  EER (M/F)   DCF (M/F)
1   GMM     -       4.8%/8.0%   0.217/
    JDB     -       4.8%/7.6%   0.219/
    HSB             4.5%/6.8%   0.201/
    HSB             3.9%/6.1%   0.177/0.307

Table 1: Results on the development set obtained with 128 Gaussians. The systems are a GMM/UBM system, the Joint Density Backend (JDB) and the Hidden Supervector Backend (HSB), both with global z-vectors. Baum-Welch statistics normalization is indicated by norm.

16 512 Components, Global z-vectors

    System  r   EER (M/F)   DCF (M/F)
1   GMM     2   4.7%/8.2%   0.195/
    JDB     2   4.3%/6.1%   0.196/
    HSB     1   3.3%/4.6%   0.148/0.234

Table 2: Results on the development set obtained with 512 Gaussians and global z-vectors.

17 512 Components, Local z-vectors

System                   EER (M/F)   DCF (M/F)
JDB (component fusion)   3.9%/5.2%   0.184/0.259
HSB (component fusion)   3.6%/3.9%   0.152/0.197
HSB (forced alignment)   3.5%/4.0%   0.152/0.197

Table 3: Results on the development set obtained with 512 Gaussians, local z-vectors and digit-dependent backends.

18 Fusion of Local and Global

Set    System   EER (M/F)   DCF (M/F)
dev    local    3.7%/3.8%   0.149/0.193
dev    global   3.2%/4.5%   0.148/0.232
dev    fusion   2.9%/3.6%   0.131/0.186
eval   local    2.6%/4.5%   0.134/0.211
eval   global   2.7%/4.7%   0.140/0.236
eval   fusion   2.3%/4.0%   0.122/0.192

Table 4: Results on the development and evaluation sets obtained with local and global Hidden Supervector systems, 512 components.

19 Conclusion
- Modeling uncertainty yields error rate reductions of up to 25% compared with the Joint Density Backend, consistently across all experiments on the RSR2015 Part III task
- This can be achieved without resorting to subspace methods, although the approach can be seen as applying the same idea as the I-Vector Backend (Interspeech 2015) at the level of individual mixture components
- Unlike the I-Vector Backend, the Hidden Supervector Backend can be configured in a way that makes very modest computational demands
- With semi-diagonal constraints on the prior, the run-time linear algebra involves only diagonal matrices
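As a quick check on the size of the reported gains, the relative EER reduction of the HSB over the JDB can be computed from the 512-component, global z-vector figures in Table 2. The helper name below is hypothetical; the EER values are copied from that table.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent, e.g. of an equal error rate."""
    return 100.0 * (baseline - improved) / baseline

# EER in percent for male (M) and female (F) trials, from Table 2.
jdb_eer = {"M": 4.3, "F": 6.1}
hsb_eer = {"M": 3.3, "F": 4.6}

reductions = {gender: relative_reduction(jdb_eer[gender], hsb_eer[gender])
              for gender in jdb_eer}
```

This gives roughly 23% for male and 25% for female trials, in line with the "up to 25%" figure quoted in the conclusion.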
