Session Variability Compensation in Automatic Speaker Recognition

1 Session Variability Compensation in Automatic Speaker Recognition
Javier González Domínguez
VII Jornadas MAVIR, Universidad Autónoma de Madrid, November 2012

2 Outline
1. The Inter-session Variability Problem
2. From Eigenfaces to Joint Factor Analysis
3. Factor Analysis in Speaker and Language Recognition:
   I. Theory
   II. Where and How
   III. Efficiency
4. Total Variability
5. PLDA
6. Results (NIST SRE10, SRE 12)

4 The Inter-Session Variability Problem: Causes
Inter-session variability: all phenomena that cause two recordings of the same identity to differ.
- Transmission channel effects (GSM, landline, ...)
- Transducer characteristics (microphone type, ...)
- Environment noise (traffic, people speaking, ...)
- Intra-speaker variability (age, illness, emotions, ...)

5 Factor Analysis: Basics
Principles:
1. Variability is treated as a continuous source rather than a discrete one.
2. Both session variability and inter-speaker/language variability are modelled.
Assumption:
1. Variability lies in a lower-dimensional subspace.

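7 Eigenfaces, working scheme
[Figure: eigenfaces pipeline in three stages]
- Development stage: dev data (m samples, D x M); PCA(C) yields the basis A (D x K); the first three principal components are visualized.
- Train stage: train data M (t samples, D x T) is coded as A^T M (K x T); the reconstructed training images are visualized.
- Test stage: a test sample t (D x 1) is coded as A^T t (K x 1); the distances d(t, m) give the scores S (T x 1). A score s_i within the threshold is accepted; s_i > threshold is rejected.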
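The three stages of the eigenfaces scheme can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are random vectors rather than images, the Euclidean distance stands in for d(t, m), and the names `fit_pca_basis`, `encode`, `distances` and the value of `threshold` are hypothetical, not taken from the slides.

```python
import numpy as np

def fit_pca_basis(dev, k):
    """Development stage: dev is D x M (one sample per column); returns mean (D x 1) and basis A (D x K)."""
    mean = dev.mean(axis=1, keepdims=True)
    # Left singular vectors of the centered data are the principal directions.
    u, _, _ = np.linalg.svd(dev - mean, full_matrices=False)
    return mean, u[:, :k]

def encode(basis, mean, x):
    """Coding: project the columns of x (D x N) onto the basis, i.e. A^T (x - mean)."""
    return basis.T @ (x - mean)

def distances(train_codes, test_code):
    """Euclidean distance between one test code (K x 1) and every training code (K x T)."""
    return np.linalg.norm(train_codes - test_code, axis=0)

rng = np.random.default_rng(0)
D, M, T, K = 50, 200, 5, 10            # illustrative sizes only
dev = rng.normal(size=(D, M))          # development data
train = rng.normal(size=(D, T))        # enrolled samples
test = train[:, [2]] + 0.01 * rng.normal(size=(D, 1))  # noisy copy of enrolled sample 2

mean, A = fit_pca_basis(dev, K)
s = distances(encode(A, mean, train), encode(A, mean, test))
threshold = 1.0                        # hypothetical decision threshold
accepted = s <= threshold              # accept where the distance is within the threshold
```

The test sample is a slightly perturbed copy of the third enrolled sample, so its smallest distance is to that sample.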

8 The GMM-UBM Framework: Maximum a Posteriori (MAP)
[Figure: UBM components c_i and c_j, the speaker's data points (x), and the resulting speaker model; panels A and B]

9 The GMM-UBM Framework: The Supervector Concept
UBM = $\{w_i, \mu_i, \Sigma_i\}$
Dimensionality:
- M: number of mixtures
- F: feature dimension (20-60)
- MF: supervector dimension (~20k-50k)
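As a rough illustration of how a supervector arises, the sketch below applies a mean-only relevance MAP update to the UBM means and then stacks the M adapted means (each of dimension F) into one vector of length MF. The slides do not spell out the update, so the classic form $\hat{\mu}_i = \alpha_i E_i[x] + (1 - \alpha_i)\mu_i$ with $\alpha_i = n_i / (n_i + r)$ is an assumption here, as are the function name `map_adapt_means`, the relevance factor `r = 16` and the toy sizes.

```python
import numpy as np

def map_adapt_means(ubm_means, posteriors, feats, relevance=16.0):
    """Mean-only relevance MAP (assumed form).

    ubm_means:  M x F UBM component means
    posteriors: N x M frame-to-component responsibilities
    feats:      N x F feature vectors
    """
    n = posteriors.sum(axis=0)                     # soft counts per component, shape (M,)
    first = posteriors.T @ feats                   # first-order statistics, shape (M, F)
    ex = first / np.maximum(n, 1e-10)[:, None]     # data mean per component
    alpha = (n / (n + relevance))[:, None]         # adaptation weight in [0, 1)
    return alpha * ex + (1.0 - alpha) * ubm_means

rng = np.random.default_rng(0)
M, F, N = 4, 3, 10                                 # toy sizes, far below the slide's ranges
ubm_means = rng.normal(size=(M, F))
feats = rng.normal(size=(N, F))
posteriors = np.zeros((N, M))
posteriors[:, 0] = 1.0                             # all frames assigned to component 0

adapted = map_adapt_means(ubm_means, posteriors, feats)
supervector = adapted.reshape(-1)                  # stack the M means into one MF vector
```

Components that see no data keep their UBM mean, which is why only part of the supervector moves away from the UBM for short utterances.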

10 Eigenvoices & Eigenchannels
Notation: s speaker; h utterance; $\mu$ UBM supervector; $\mu_s$ speaker supervector; $\mu_{sh}$ target model supervector.
GMM-UBM (MAP): $\mu_{sh} = \mu + D z_{sh}$
- D: full-rank diagonal (scaling factor)
- z: speaker component
Eigenvoices: $\mu_s = \mu + V y_s$
- V: speaker variability subspace (low-rank)
- y: corresponding weights for a given speaker (speaker factors)
Eigenchannels: $\mu_{sh} = \mu_s + U x_h$
- U: session variability subspace (low-rank)
- x: corresponding weights for a given utterance/speaker (channel factors)

11 Joint Factor Analysis: Eigenvoices + Eigenchannels + MAP
Model = Speaker/language + Session [Kenny 04].
$\mu_{sh} = \mu + V y_s + D z_s + U x_{sh}$
- s: speaker; h: utterance
- $\mu$: UBM supervector
- $\mu_s$: speaker supervector
- $\mu_{sh}$: target model supervector
- V: speaker variability subspace
- U: session variability subspace
- x: channel factors
- y: speaker factors
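The JFA decomposition is a sum of an offset and three structured terms, which the following sketch makes explicit. It assumes D is stored as a vector holding its diagonal, and all sizes and the function name `jfa_supervector` are illustrative choices rather than anything given in the slides.

```python
import numpy as np

def jfa_supervector(mu, V, y, d, z, U, x):
    """Compose mu_sh = mu + V y + D z + U x, with D given by its diagonal d."""
    return mu + V @ y + d * z + U @ x

rng = np.random.default_rng(0)
MF, R_V, R_U = 12, 3, 2              # toy supervector size and subspace ranks
mu = rng.normal(size=MF)             # UBM supervector
V = rng.normal(size=(MF, R_V))       # speaker variability subspace (low-rank)
U = rng.normal(size=(MF, R_U))       # session variability subspace (low-rank)
d = np.abs(rng.normal(size=MF))      # diagonal of D
y = rng.normal(size=R_V)             # speaker factors
z = rng.normal(size=MF)              # speaker residual component
x = rng.normal(size=R_U)             # channel factors

mu_sh = jfa_supervector(mu, V, y, d, z, U, x)
mu_s = mu_sh - U @ x                 # removing the session term leaves the speaker part
```

Subtracting `U @ x` is the simplest view of session compensation at the model level: what remains depends only on the speaker terms.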

12 Outline
1. The Inter-session Variability Problem
2. From Eigenfaces to Joint Factor Analysis
3. Factor Analysis in Speaker and Language Recognition:
   I. Theory
   II. Where and How
   III. Efficiency
4. Total Variability
5. PLDA
6. A link with new machine learning paradigms
7. Results (NIST SRE10, SRE 12)

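13 Factor Analysis: Graphical model
- $x_n$: observed variables (speaker supervectors).
- $z_n$: latent variables (channel or speaker factors).
- Hyperparameters $(\mu, L, \Psi)$: $\mu$ = UBM supervector; $L$ = $U$, $V$; $\Psi$ = UBM covariance.
[Figure: graphical model with latent node $z_n$ and observed node $x_n$ inside a plate of size N, governed by $\mu$ and $L$]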

14 Joint Factor Analysis: Point estimate of latent factors x, y
A point estimate (mean of the posterior) of x, y can be computed as in classic relevance MAP:
$E[z \mid x] = (I + L^T \Psi^{-1} N L)^{-1} L^T \Psi^{-1} \tilde{f}$
with $P(x \mid z) \sim \mathcal{N}(\mu + Lz, \Psi)$.
[Figure: prior p(z), posterior p(z | x) and its mean E[z | x] in the latent variables domain (D = 1), mapped to the observations domain (D = 2, axes x_1 and x_2)]
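For the toy setting drawn on this slide (one latent dimension, two observed dimensions), the posterior mean can be computed directly. The sketch below assumes plain single-observation factor analysis with a standard normal prior on z and a diagonal $\Psi$, so it omits the occupation-count weighting used with GMM statistics; the numbers and the name `posterior_mean` are illustrative.

```python
import numpy as np

def posterior_mean(x, mu, L, psi_diag):
    """E[z | x] for x ~ N(mu + L z, diag(psi_diag)) and z ~ N(0, I)."""
    lt_psi_inv = L.T / psi_diag                       # L^T Psi^{-1}, shape (R, D)
    precision = np.eye(L.shape[1]) + lt_psi_inv @ L   # posterior precision of z
    return np.linalg.solve(precision, lt_psi_inv @ (x - mu))

mu = np.array([1.0, 2.0])          # offset in the observation domain (D = 2)
L = np.array([[2.0], [1.0]])       # loading matrix, latent dimension R = 1
psi = np.array([0.5, 0.5])         # diagonal noise variances
x = np.array([3.0, 3.0])           # one observation

z_hat = posterior_mean(x, mu, L, psi)
x_hat = mu + L @ z_hat             # the point estimate mapped back to the observation domain
```

Solving the small R x R system is cheap because R is the subspace rank, not the observation dimension; this is what makes the point estimate fast in the supervector setting.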

16 Factor Analysis: Where and How
- Two classifiers (acoustic systems): GMM and SVM
- Three different levels (domains): feature domain, statistics domain, model domain (supervectors)
[Figure: SVM maximum-margin hyperplane, with support vectors and margin]

18 Efficiency: ATVS system at NIST SRE2008 tel-tel

| Step | SVM | JFA | JFA-LS |
|---|---|---|---|
| Development | | | |
| UBM training (2M feature vectors, gender dependent) | 4h | 4h | 4h |
| Training variability subspace U/V | 1h | 1h/1h | 1h/1h |
| Feature extraction (per ~265s file) | | | |
| MFCC | 2s | 2s | 2s |
| Training (per ~265s file) | | | |
| GMM-train | 8s | 8s | 8s |
| FA point-estimate | | 0.1s | 0.1s |
| SVM-train | 120s | | |
| Total (train) | 130.1s | 10.1s | 10.1s |
| xRT train (CPU/speech) | 0.50RT | 0.04RT | 0.04RT |
| Testing (per ~265s file) | | | |
| SV-train | 8s | | |
| FA point-estimate | | 0.1s | 0.1s |
| Scoring (frame by frame / linear scoring) | 3.2s | 0.2s | 1 x 10^-4 s |
| t-norm (100 models) | 320s | 20s | 1 x 10^-2 s |
| Total (test) | 331.2s | 22.2s | 2.02s |
| xRT test (CPU/speech) | 1.24RT | 0.08RT | 7.5 x 10^-3 RT |

Training: 1 min of speech is processed in 30s (SVM), 2.4s (JFA), 2.4s (JFA-LS).
Testing: 1 min of speech is processed in 74.4s (SVM), 4.8s (JFA), 0.45s (JFA-LS).
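The xRT rows follow from dividing CPU time by the duration of the processed speech. The short sketch below recomputes them from the per-file totals, assuming the nominal 265s file length; the helper names are hypothetical. The results agree with the slide's figures up to rounding.

```python
def real_time_factor(cpu_seconds, speech_seconds):
    """CPU time spent per second of speech (xRT)."""
    return cpu_seconds / speech_seconds

def seconds_per_minute_of_speech(cpu_seconds, speech_seconds):
    """CPU seconds needed to process one minute of speech."""
    return 60.0 * real_time_factor(cpu_seconds, speech_seconds)

FILE_SECONDS = 265.0  # nominal file length from the table
train_totals = {"SVM": 130.1, "JFA": 10.1, "JFA-LS": 10.1}
test_totals = {"SVM": 331.2, "JFA": 22.2, "JFA-LS": 2.02}

train_xrt = {k: real_time_factor(v, FILE_SECONDS) for k, v in train_totals.items()}
test_xrt = {k: real_time_factor(v, FILE_SECONDS) for k, v in test_totals.items()}
test_per_minute = {
    k: seconds_per_minute_of_speech(v, FILE_SECONDS) for k, v in test_totals.items()
}
```

Most of the gap between the SVM and JFA-LS test times comes from the t-norm row, where linear scoring replaces frame-by-frame scoring against 100 cohort models.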

20 Total Variability: $m_s = m + T w_{sh}$
- Limited real data restrictions → the estimate of U might include speaker information.
- Total Variability: T represents both session and target information.
- Disentangling phase in the w domain (LDA, WCCN).
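One of the disentangling steps named here, WCCN, can be sketched as follows. The sketch assumes the usual formulation: estimate the average within-speaker covariance W of the i-vectors, factor $W^{-1} = B B^T$ by Cholesky decomposition, and map each i-vector w to $B^T w$. The data are synthetic and the function names are hypothetical.

```python
import numpy as np

def within_class_cov(ivectors, labels):
    """Average of the per-speaker covariance matrices."""
    classes = np.unique(labels)
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        x = ivectors[labels == c]
        xc = x - x.mean(axis=0)          # remove the speaker mean
        W += xc.T @ xc / len(x)
    return W / len(classes)

def wccn_projection(ivectors, labels):
    """Return B with W^{-1} = B B^T, so that B^T w has identity within-class covariance."""
    W = within_class_cov(ivectors, labels)
    return np.linalg.cholesky(np.linalg.inv(W))

rng = np.random.default_rng(0)
n_speakers, n_sessions, dim = 5, 20, 4
labels = np.repeat(np.arange(n_speakers), n_sessions)
speaker_means = rng.normal(size=(n_speakers, dim))
session_mix = rng.normal(size=(dim, dim))              # correlates the session noise
ivectors = speaker_means[labels] + rng.normal(size=(len(labels), dim)) @ session_mix

B = wccn_projection(ivectors, labels)
projected = ivectors @ B                               # rows are (B^T w)^T
```

After the projection, session variability has unit variance in every direction, so directions dominated by session effects no longer dominate a cosine or dot-product score.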

22 PLDA: FA over i-vectors
$w_{ij} = \bar{w} + F h_i + G k_{ij} + e_{ij}$
- i: speaker; j: utterance
- $\bar{w}$: mean i-vector
- $w_{ij}$: target model i-vector
- F: speaker variability subspace
- G: session variability subspace
- k: channel factors
- h: speaker factors
- e: noise term

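23 PLDA: FA over i-vectors
[Figure: graphical model with speaker factors h in a plate over I speakers, and channel factors k and i-vectors w in a plate over H utterances; parameters $\Theta = \{\mu, F, G, \Sigma\}$]
Legend: I speakers; H utterances; w i-vector; F speaker variability subspace; G session variability subspace; k channel factors; h speaker factors; e noise term.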
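The generative reading of the PLDA model can be illustrated by sampling from it. The sketch below assumes standard normal priors on the speaker and channel factors and a diagonal Gaussian noise term, which is the usual choice but is not stated on the slides; the sizes and the function name `sample_plda` are illustrative.

```python
import numpy as np

def sample_plda(mean, F, G, noise_std, n_speakers, n_sessions, rng):
    """Draw i-vectors w_ij = mean + F h_i + G k_ij + e_ij.

    Returns an array of shape (n_speakers, n_sessions, dim).
    """
    dim = mean.shape[0]
    h = rng.standard_normal((n_speakers, F.shape[1]))               # one h_i per speaker
    k = rng.standard_normal((n_speakers, n_sessions, G.shape[1]))   # one k_ij per utterance
    e = noise_std * rng.standard_normal((n_speakers, n_sessions, dim))
    speaker_part = (h @ F.T)[:, None, :]    # shared by all utterances of a speaker
    session_part = k @ G.T                  # differs from utterance to utterance
    return mean + speaker_part + session_part + e

rng = np.random.default_rng(0)
dim, r_f, r_g = 6, 3, 2
mean = rng.normal(size=dim)
F = rng.normal(size=(dim, r_f))
G = rng.normal(size=(dim, r_g))

w = sample_plda(mean, F, G, 0.1, n_speakers=4, n_sessions=3, rng=rng)
# With no session subspace and no noise, every utterance of a speaker coincides.
w_clean = sample_plda(mean, F, np.zeros((dim, r_g)), 0.0, 4, 3, rng)
```

The second call shows the role of each term: once G and the noise are switched off, only the speaker term $F h_i$ separates the samples.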

25 Experimental Results: SRE 10
- Organizer: National Institute of Standards and Technology (NIST)
- Competitive participants: MIT-LL, SRI, IBM
- Relevant data about the task:
  - 2 channel types involved (telephone, microphone)
  - 2 speech styles involved (conversational, interview)
  - Different vocal effort (high, neutral, low)
  - ~150s train/test
  - ~900 speakers
  - ~800 models
  - ~900 test files
  - > Trials

26 Experimental Results: SRE 10, some samples
- Telephone data
- Microphone data: close mic, far mic
- Vocal effort: low vocal effort, high vocal effort

27 Experimental Results: SRE 10

| Condition | Description | EER_male | EER_female | EER_all |
|---|---|---|---|---|
| C01_ext | int vs. int, matched mic | 0.54 | 0.96 | 0.84 |
| C02_ext | int vs. int, mismatched mic | 0.47 | 1.27 | 1.11 |
| C03_ext | int vs. tel, mic vs. phn | 1.88 | 3.54 | 2.71 |
| C04_ext | int vs. tel, mic vs. mic | 1.37 | 3.69 | 2.52 |
| C05_ext | tel vs. tel, phn vs. phn | 2.02 | 2.36 | 2.22 |
| C06_ext | tel vs. tel, normal vs. high vel | 2.9 | 3.54 | 3.37 |
| C07_ext | mic vs. mic, normal vs. high vel | 4.03 | 4.54 | 4.27 |
| C08_ext | tel vs. tel, normal vs. low vel | 1.34 | 1.86 | 1.69 |
| C09_ext | mic vs. mic, normal vs. low vel | 4.37 | 3.56 | 4.58 |
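The table reports equal error rates (EER), the operating point where the miss rate equals the false-alarm rate. The slides do not say how the EER was computed, so the following is only a simple threshold-sweep sketch on toy scores; the function name `equal_error_rate` is hypothetical, and evaluation toolkits typically interpolate on the DET curve instead.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    # Miss rate: targets scored below the threshold are rejected.
    fnr = np.array([(target_scores < t).mean() for t in thresholds])
    # False-alarm rate: non-targets at or above the threshold are accepted.
    fpr = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(fnr - fpr))
    return 0.5 * (fnr[idx] + fpr[idx])

eer_separable = equal_error_rate([2.0, 3.0], [0.0, 1.0])   # no overlap between classes
eer_overlap = equal_error_rate([1.0, 3.0], [0.0, 2.0])     # one target below one non-target
```

The function returns a fraction, so it has to be multiplied by 100 to be read as a percentage.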

28 Experimental Results: SRE 12
- Drastic changes from past SRE evaluations:
  - Multitraining (different number of files for model training)
  - Test duration variability (20s to 160s)
  - Noisy conditions
- Large amount of test files under noisy conditions (10 dB, 0 dB SNR)
- Reverberation
- An industrial task:
  - ~2K speakers
  - ~1.8K models
  - ~2.5K test files
  - ~2M trials (core); ~88M

29 Experimental Results: SRE 12, some samples
- Noisy files: 10 dB, 0 dB

30 Experimental Results: SRE 12, noise robustness

31 Questions
