Joint Factor Analysis for Speaker Verification
Mengke Hu
ASPITRG Group, ECE Department, Drexel University
mengke.hu@gmail.com
October 12, 2012
Outline
1. Speaker Verification
   - Baseline System
   - Session Variation
2. Joint Factor Analysis
   - Hidden Markov Model
   - Factor Analysis Model
   - Principal Components Analysis (PCA)
   - Probabilistic PCA
3. JFA for Speaker Verification
   - General Steps
   - Hyperparameter Estimation
Baseline System
[Block diagram: input speech → pre-processing → feature extraction → scored against H0 (target model) and H1 (background model)]
Baseline System
1. Given a speech segment X, we test two hypotheses:
   H0: X is from the claimed target speaker S (GMM)
   H1: X is not from speaker S; it is from the background (UBM)
2. Decision rule:
   Score = log [ p(X | TargetModel) / p(X | UBM) ]; accept H0 if Score > Threshold, otherwise accept H1
Note: Score = log p(X | TargetModel) − log p(X | UBM)
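The decision rule can be sketched numerically. This is a minimal illustration, not part of the slides: single 1-D Gaussians stand in for the full target GMM and UBM, and all parameter values and the threshold are made up.

```python
import math

def log_gauss(x, mu, var):
    # Log-density of a 1-D Gaussian; stands in for log p(x | model).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def verify(x, target, ubm, threshold=0.0):
    # Score = log p(x | target) - log p(x | UBM); accept H0 if above threshold.
    score = log_gauss(x, *target) - log_gauss(x, *ubm)
    return score, score > threshold

# Hypothetical models: target (mean, variance) and a broader UBM.
score, accept = verify(1.0, target=(1.0, 1.0), ubm=(0.0, 4.0))
```

The observation sits exactly at the target mean, so the log-likelihood ratio is positive and H0 is accepted.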
Baseline Experiment
1. Feature extraction (MFCC)
2. Train the UBM
3. Adapt the UBM to obtain the target speaker's GMM
4. Test trials against the two hypotheses
5. Scoring
6. DET (Detection Error Tradeoff) curve: false-accept rate vs. false-reject rate

Problem: how can we cancel the channel effect?
Session Variation
- Inter-speaker variation: two utterances are from different speakers.
- Inter-session variation: two utterances are from the same speaker.
  - Channel effects: utterances are recorded over different channels.
  - Intra-speaker variation: utterances vary with the speaker's health, emotional state, etc.
Gaussian Mixture Model Review
Recall the GMM:
  p(s | z_s) = ∏_{i=1}^{M_s} N(s; µ_i^s, Σ_i^s)^{z_{s,i}}
  p(z_s) = ∏_{i=1}^{M_s} (π_i^s)^{z_{s,i}}
  p(s) = Σ_{z_s} p(z_s) p(s | z_s) = Σ_{i=1}^{M_s} π_i^s N(s; µ_i^s, Σ_i^s)
z_s is a hidden variable indicating which Gaussian mixture component is active.
Remark: z_s uses 1-of-M_s coding, so exactly one of the {z_{s,i}}, i = 1, ..., M_s, equals 1.
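The mixture density p(s) = Σ_i π_i N(s; µ_i, Σ_i) is usually evaluated in the log domain with the log-sum-exp trick. A toy 1-D sketch, not from the slides; all weights and component parameters here are made up.

```python
import math

def gmm_logpdf(s, weights, means, variances):
    # log p(s) = log sum_i pi_i N(s; mu_i, sigma_i^2) for a 1-D GMM.
    comps = [math.log(w) - 0.5 * (math.log(2 * math.pi * v) + (s - m) ** 2 / v)
             for w, m, v in zip(weights, means, variances)]
    mx = max(comps)  # log-sum-exp for numerical stability
    return mx + math.log(sum(math.exp(c - mx) for c in comps))

# Hypothetical 2-component mixture evaluated at s = 0.5.
lp = gmm_logpdf(0.5, weights=[0.3, 0.7], means=[0.0, 1.0], variances=[1.0, 2.0])
```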
Hidden Markov Model
[Graphical model: a Markov chain of hidden states z_1 → z_2 → ... → z_i → z_{i+1}, each z_i emitting an observation s_i]
  P(z_n | z_{n−1}, ..., z_1) = P(z_n | z_{n−1})
HMMs are often used in speaker recognition.
Hidden Markov Model
We have the following joint probability:
  p(X, Z | θ) = p(z_1 | π) [ ∏_{n=2}^N p(z_n | z_{n−1}, A) ] ∏_{m=1}^N p(x_m | z_m, φ)
where A is the transition probability matrix and
  p(z_n | z_{n−1}, A) = ∏_{k=1}^K ∏_{j=1}^K A_{jk}^{z_{n−1,j} z_{n,k}}
  p(z_1 | π) = ∏_{k=1}^K π_k^{z_{1,k}},   Σ_k π_k = 1
  p(x_n | z_n, φ) = ∏_{k=1}^K p(x_n | φ_k)^{z_{n,k}}
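Once a state sequence is fixed, the factorization above can be evaluated term by term. A toy sketch, not from the slides; the 2-state parameters and emission log-likelihoods are hypothetical.

```python
import math

def hmm_joint_logprob(states, emis_logp, pi, A):
    # log p(x, z) = log pi[z_1] + sum_n log A[z_{n-1}][z_n] + sum_n log p(x_n | z_n)
    # states: hidden-state index per frame; emis_logp[n][k] = log p(x_n | z_n = k).
    lp = math.log(pi[states[0]]) + emis_logp[0][states[0]]
    for n in range(1, len(states)):
        lp += math.log(A[states[n - 1]][states[n]]) + emis_logp[n][states[n]]
    return lp

# Hypothetical 2-state HMM over 3 frames.
pi = [0.6, 0.4]
A = [[0.9, 0.1], [0.2, 0.8]]
emis = [[-1.0, -2.0], [-1.5, -0.5], [-2.0, -1.0]]
lp = hmm_joint_logprob([0, 0, 1], emis, pi, A)
```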
Supervector Definition
Given the GMM mean vectors m_c ∈ R^F, c ∈ {1, ..., C}, where C is the total number of mixture components and F is the dimension of the feature vector, the supervector is:
  m = (m_1^T, ..., m_C^T)^T ∈ R^{CF}
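Forming a supervector is just concatenation of the component means. A tiny sketch, not from the slides, with made-up values for C = 3 components of dimension F = 2.

```python
def supervector(means):
    # Stack the C mixture means (each a list of length F) into one CF-dim vector.
    return [x for m in means for x in m]

# Hypothetical means: C = 3, F = 2, so the supervector has length CF = 6.
sv = supervector([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```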
Speaker and Channel Dependent Supervector
[Diagram: the supervector M_h lies at the intersection of contributions from the speaker space and the channel space]
M_h is the speaker- and channel-dependent supervector.
Notations
- s: speaker ID
- Speaker factors: the components of y(s)
- Channel factors: the components of x_h(s)
- Speaker space: the range of v, translated by m (an affine subspace)
- Channel space: the range of u
- v, u: loading matrices for the speaker factors and the channel factors
- h = 1, ..., H(s): index over the set of recordings of speaker s
- C: total number of mixture components for a fixed GMM structure
- F: dimension of the acoustic feature vectors
- R_C: channel rank
- R_S: speaker rank
- Σ(s): covariance of the observations from the GMM, given speaker s and recording h
- d: covariance term for the observations from the GMM, given speaker s
Joint Factor Analysis Model
JFA model:
  M(s) = m + v y(s) + d z(s)
  M_h(s) = M(s) + u x_h(s)
- m ∈ R^{CF}: given an HMM/GMM structure with C mixture components, concatenate the mean vectors m_1, ..., m_C to obtain m
- M(s): the speaker-dependent supervector
- M_h(s): the speaker- and channel-dependent supervector
- u and v are speaker independent
- d is a diagonal matrix
- z(s) is normally distributed
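The two JFA equations translate directly into code. A toy numpy sketch, not from the slides: the dimensions are tiny and m, v, u, d, and the factors are random stand-ins, not trained hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
CF, Rs, Rc = 12, 3, 2           # toy sizes: supervector dim, speaker rank, channel rank
m = rng.normal(size=CF)         # UBM supervector (stand-in)
v = rng.normal(size=(CF, Rs))   # speaker loading matrix
u = rng.normal(size=(CF, Rc))   # channel loading matrix
d = np.diag(rng.uniform(0.1, 1.0, size=CF))  # diagonal residual matrix

y = rng.normal(size=Rs)    # speaker factors
z = rng.normal(size=CF)    # residual speaker factors
x_h = rng.normal(size=Rc)  # channel factors for recording h

M_s = m + v @ y + d @ z    # speaker-dependent supervector M(s)
M_h = M_s + u @ x_h        # speaker-and-channel-dependent supervector M_h(s)
```

The difference M_h(s) − M(s) is exactly the channel offset u x_h(s), which is what session compensation removes.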
JFA Model
[Diagram: the speaker-dependent supervector M(s) lies in the speaker space; adding the channel offset u x_h(s) moves it to M_h along the channel space]
Problem
Purpose: estimate the hyperparameters Λ = (m, u, v, d, Σ).
- The number of GMM components is large: C = 2048
- The dimension of the feature vector is F = 39
- CF = 79872, so m is a 79872-dimensional vector and Σ is 79872 × 79872

Problem: Σ is very large and not full rank; how can we estimate it?
Principal Components Analysis Technique
[Figure: data points x_n in the (x_1, x_2) plane, projected onto the principal direction u_1 to give projected points x̃_n]
PCA finds a principal subspace (the magenta line) such that the variance of the projected points x̃_n is maximized.
Maximum Variance Formulation
Find the principal components of the principal subspace.
Given observations x_1, ..., x_N, we want to find a principal subspace with M basis vectors, where M is less than the data dimension.
Sample mean x̄ and sample covariance S:
  x̄ = (1/N) Σ_{n=1}^N x_n
  S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T
Let the M basis vectors of the principal subspace be u_1, ..., u_M with u_i^T u_i = 1 for i ∈ [M], and let
  P = [u_1, ..., u_M]
Maximum Variance Formulation
Optimization problem: find the 1st principal component by the Lagrange method.
Maximize u_1^T S u_1 subject to u_1^T u_1 = 1:
  L(u_1, λ_1) = u_1^T S u_1 + λ_1 (1 − u_1^T u_1)
Take the derivative and set it to zero (S is symmetric, so S + S^T = 2S):
  ∂L/∂u_1 = (S + S^T) u_1 − 2 λ_1 u_1 = 2 S u_1 − 2 λ_1 u_1 = 0  ⇒  S u_1 = λ_1 u_1
Solution: λ_1 is the largest eigenvalue of S, and the corresponding eigenvector u_1 is the first principal component (since then u_1^T S u_1 = λ_1 is maximized).
Maximum Variance Formulation
Find M principal components: find the M largest eigenvalues and their corresponding eigenvectors u_i, i ∈ [M], such that:
  u_i^T u_i = 1 for all i ∈ [M]
  u_i ⊥ u_j for i ≠ j
Eigen-decompose S, take the M largest eigenvalues sorted in decreasing order, and collect the corresponding eigenvectors u_1, u_2, ..., u_M.
Remark: [u_1, u_2, ..., u_M]^T S [u_1, u_2, ..., u_M] = P^T S P
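The eigen-decomposition recipe above translates directly into code. A minimal numpy sketch, not from the slides; the data here is random and for illustration only.

```python
import numpy as np

def pca(X, M):
    # X: N x D data matrix. Returns the M leading principal directions
    # (columns of P) and the projections of the centered data onto them.
    Xc = X - X.mean(axis=0)             # center the data
    S = Xc.T @ Xc / len(X)              # sample covariance (1/N convention)
    vals, vecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:M]  # indices of the M largest eigenvalues
    P = vecs[:, order]                  # D x M matrix of principal directions
    return P, Xc @ P

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))  # toy data: N = 200 points in D = 5 dimensions
P, Z = pca(X, 2)
```

The columns of P are orthonormal, and the variance of the first projected coordinate is at least that of the second, matching the decreasing-eigenvalue ordering.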
Probabilistic Model
PCA model (D ≫ M):
  x = W z + µ + ε,  where x ∈ R^D, W ∈ R^{D×M}, z ∈ R^M
- x is the D-dimensional observation vector
- z is the M-dimensional hidden variable
We are given the following probability distributions:
  p(z) = N(z | 0, I)
  p(x | z) = N(x | W z + µ, σ² I)
Probabilistic PCA
p(x) is Gaussian:
  p(x) = ∫ p(x | z) p(z) dz = N(x | µ, C),  C = W W^T + σ² I
Mean and covariance of p(x):
  E[x] = E[W z + µ + ε] = µ
  cov[x] = E[(W z + ε)(W z + ε)^T] = E[W z z^T W^T] + E[ε ε^T] = W W^T + σ² I
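The identity cov[x] = W W^T + σ² I can be checked by Monte Carlo sampling from the generative model x = W z + µ + ε. A sketch, not from the slides; D, M, σ², and W are made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, sigma2 = 4, 2, 0.5
W = rng.normal(size=(D, M))   # hypothetical loading matrix
mu = rng.normal(size=D)       # hypothetical mean

# Marginal covariance of x under the PPCA model: C = W W^T + sigma^2 I.
C = W @ W.T + sigma2 * np.eye(D)

# Sample x = W z + mu + eps with z ~ N(0, I) and eps ~ N(0, sigma^2 I),
# then compare the empirical mean and covariance against mu and C.
N = 100_000
Z = rng.normal(size=(N, M))
E = rng.normal(scale=np.sqrt(sigma2), size=(N, D))
X = Z @ W.T + mu + E
```

With 100k samples the empirical moments match µ and C to within sampling error, as the marginalization formula predicts.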
Probabilistic PCA
[Graphical model: latent variable z_n and parameters W, µ, σ² generate the observation x_n]
- Each observation x_n is associated with a value of the latent variable z_n.
- p(x_n) is obtained by marginalizing over z_n.
- The EM algorithm is used to estimate the parameters of the PCA model (i.e., to train the PCA model).
5 Steps for a JFA Speaker Verification System
1. Train the UBM.
2. Train the JFA/PCA model: estimate the speaker-independent hyperparameters Λ = (m, u, v, d, Σ) from a large database in which each speaker is recorded in multiple sessions.
3. Adapt Λ from one speaker population to another.
4. Enroll a speaker: estimate the speaker-dependent hyperparameters Λ(s) = (m(s), u(s), v(s), d(s), Σ(s)).
5. Test: given a test utterance X and a hypothesized speaker s, score with
  log [ P_{Λ(s)}(X) / P_Λ(X) ]
Train the JFA/PCA Model
Estimate Λ:
- Training set: several speakers, with multiple recordings per speaker.
- Use EM algorithms to estimate Λ:
  - Maximum likelihood approach (slow)
  - Divergence minimization approach (faster, but needs good initialization)
- Both algorithms fit the entire collection of speakers in the training data.
- Total likelihood: ∏_s P_Λ(X(s)), where s ranges over the speakers in the training set; it increases from one iteration to the next.
Adapt from One Speaker Population to Another
- Adaptation is necessary since the data set is limited: for a given speaker, there are at most 2 recordings.
- Keep the channel-space hyperparameters fixed (u and Σ), and re-estimate only the speaker-space hyperparameters (m, v, d).
Remark: this assumes the channel-space hyperparameters are speaker independent.
Enroll a Target Speaker
Estimate Λ(s). Recall the JFA model:
  M(s) = m + v y(s) + d z(s)
  M_h(s) = M(s) + u x_h(s)
- Calculate the posterior distribution of M(s).
- Adjust Λ(s) to fit this posterior.
- Adopt the minimum divergence approach.
Likelihood Function
Hyperparameters: Λ = (m, u, v, d, Σ).
  P_Λ(X(s)) = ∫ P_Λ(X(s) | X̃) N(X̃ | 0, I) dX̃
where:
- X(s) (observable) is the collection of labeled frames over the recordings of speaker s:
  X(s) = (X_1(s), ..., X_{H(s)}(s))^T
- X̃ (unobservable) is the vector of hidden variables:
  X̃ = (x_1(s), ..., x_{H(s)}(s), y(s), z(s))^T
- N(X̃ | 0, I) is the standard Gaussian kernel:
  N(X̃ | 0, I) = N(x_1 | 0, I) ··· N(x_{H(s)} | 0, I) N(y | 0, I) N(z | 0, I)
Likelihood Ratio
Given speech data X uttered by speaker t, test H0: t = s against H1: t ≠ s using the length-normalized log-likelihood ratio
  (1/T) log [ P_{Λ(s)}(X) / P_Λ(X) ]
where T is the number of frames in X.