Fast speaker diarization based on binary keys. Xavier Anguera and Jean François Bonastre


1 Fast speaker diarization based on binary keys Xavier Anguera and Jean François Bonastre

2 Outline: Introduction; Speaker diarization; Binary speaker modeling; Binary speaker diarization system; Experiments; Conclusions and future work

3 What is speaker diarization? (in case no one has told you already) Given a multi-speaker recording, identify who speaks when, labeling each speaker with a generic ID. No a priori information is given about the number of speakers or their identities.

4 Standard speaker diarization approaches

5 Standard Speaker Diarization system

6 State of the art. Speaker diarization has reached very competitive accuracy levels: 7-10% for broadcast news (LIMSI, RT04) and 12-14% for meetings (I2R, RT09). However, it is currently too slow for many real-life applications. Standard: >> 1xRT; ICSI (mono-core): 0.97xRT; ICSI (GPU): 0.07xRT.

7 What do we want? To dramatically speed up the processing while maintaining the accuracy level (DER). How do we do it? By adapting a recently proposed binary speaker modeling technique to diarization.

8 Review of binary speaker modeling

9 Typical speaker modeling using GMM. Training: x[n] -> acoustic parameters -> EM-ML training -> GMM model λ. Testing: y[n] -> acoustic parameters -> model evaluation -> Lkld(y[n] | λ).

10 Problems of using GMM modeling for diarization: lack of precision in modeling a particular speaker; strong dependence on the model initialization or on the UBM it is adapted from; statistical features usually model the most frequent characteristics instead of speaker-specific information; very slow when using iterative EM-ML and Viterbi.

11 A new modeling paradigm. Constraints: fast to compare two speaker models; should allow modeling a speaker dynamically -> more than 1 vector per speaker file; noise robustness; it should be possible to EXPLAIN a decision. Solution: a large space, to be discriminant between speakers, but with reduced quantization -> binary.

12 Binary speaker modeling (I). Acoustic data -> acoustic parameter extraction -> binary key computation (using the Binary Key Background Model, KBM) -> binary key.

13 Binary speaker modeling (II). Figure labels: general KBM components; selected KBM component (defines the in-interest subspace); in-interest area for the input data. For a given input data, different sub-areas of the acoustic space are selected (each corresponding to one UBM component).

14 Binary speaker modeling (III). Selection of the n best specificities. Figure labels: output values (brown data); output values (green data).

15 Obtaining the binary fingerprint

16 Similarity between binary vectors. It is very fast to compute. Any binary measure can be used, for example: $S(v_1, v_2) = \frac{\sum_{i=1}^{N} (v_1[i] \wedge v_2[i])}{\sum_{i=1}^{N} (v_1[i] \vee v_2[i])}$. Example from the slide figure: $S(v_1, v_2) = 2/12 = 0.166$.
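As a minimal sketch of the example measure on this slide (positions set to 1 in both keys, divided by positions set to 1 in either key), the following assumes binary keys are plain Python lists of 0/1. The function name and the two 16-bit keys are made up for illustration; the keys are chosen so that the ratio is 2/12, the value shown on the slide.

```python
def binary_similarity(v1, v2):
    """Ratio of positions set in both keys to positions set in either key."""
    if len(v1) != len(v2):
        raise ValueError("binary keys must have the same length")
    both = sum(1 for a, b in zip(v1, v2) if a and b)
    either = sum(1 for a, b in zip(v1, v2) if a or b)
    # Two all-zero keys share nothing; returning 0.0 here is an assumption.
    return both / either if either else 0.0


# Hypothetical keys: 2 positions in common, 12 positions set in at least one key.
v1 = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
v2 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
similarity = binary_similarity(v1, v2)  # 2 / 12
```

Because the measure only needs AND/OR and two counts, it can also be computed on packed bit vectors, which is one plausible reason the slide calls it very fast.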

17 Preliminary speaker modeling experiments. Initial experiments on a small database show that binary speaker models are quite discriminant for a KBM with more than 512 Gaussians.

18 Binary speaker diarization system

19 Speaker diarization main blocks NOTE: we are still using the agglomerative clustering approach, but performed over binary keys

20 Acoustic + binary processing. Acoustic modeling is only used in the initialization step. Thereafter everything is done in the binary space. We use standard acoustic features: 19 MFCC (no energy, no deltas), extracted every 10 ms with a 25 ms window.

21 KBM model. It is a special UBM trained from the test data; no external data is used. Its complexity is N >= 512 Gaussians; performance does not usually improve above N = 2000 Gaussians. Standard divisive (EM-ML) training approaches cannot be used, as the Gaussian means do not represent particular speakers but rather averages of all of them.

22 Building the KBM model

23 KBM training for diarization. We aim at training the KBM from the test data with no a priori knowledge on the speakers. Starting from a Gaussian pool: (1) select the 1st Gaussian, $\theta_{1st}$, as $\arg\max_i \mathrm{Lkld}(x_i, \theta_i)$; (2) initialize $v_{KL2}[i] = \mathrm{KL2}(\theta_i, \theta_{1st})\ \forall i$; (3) iterate until N Gaussians are selected: select the Gaussian $\theta'$ with the biggest KL2 distance, then update the KL2 distances, $v_{KL2}[i] = \min(v_{KL2}[i], \mathrm{KL2}(\theta', \theta_i))$.
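The selection loop on this slide can be sketched as a greedy "farthest from what is already selected" procedure. The sketch below is an illustration under several assumptions: each pool Gaussian is a `(means, variances)` pair with diagonal covariance, `lkld[i]` is a precomputed likelihood score for Gaussian i, and KL2 is taken to be the symmetric Kullback-Leibler divergence. The names `kl2`, `select_kbm`, `pool` and `lkld` are hypothetical.

```python
def kl2(g1, g2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    (m1, var1), (m2, var2) = g1, g2
    total = 0.0
    for a, s, b, t in zip(m1, var1, m2, var2):
        total += 0.5 * (s / t + t / s - 2.0 + (a - b) ** 2 * (1.0 / s + 1.0 / t))
    return total


def select_kbm(pool, lkld, n):
    """Greedily pick n Gaussians from the pool and return their indices."""
    # Step 1: start from the Gaussian with the highest likelihood score.
    first = max(range(len(pool)), key=lambda i: lkld[i])
    selected = [first]
    # Step 2: distance of every pool Gaussian to the first selected one.
    v_kl2 = [kl2(g, pool[first]) for g in pool]
    # Step 3: repeatedly take the Gaussian farthest from the selected set.
    while len(selected) < min(n, len(pool)):
        remaining = [i for i in range(len(pool)) if i not in selected]
        nxt = max(remaining, key=lambda i: v_kl2[i])
        selected.append(nxt)
        # Keep, for each Gaussian, its distance to the closest selected one.
        v_kl2 = [min(d, kl2(pool[nxt], g)) for d, g in zip(v_kl2, pool)]
    return selected


# Toy 1-D pool (made-up values): means 0, 0.1, 5 and 10, unit variance.
pool = [([0.0], [1.0]), ([0.1], [1.0]), ([5.0], [1.0]), ([10.0], [1.0])]
lkld = [-1.0, -0.5, -2.0, -3.0]
chosen = select_kbm(pool, lkld, 3)  # indices 1, 3, 2
```

In the toy pool, the near-duplicate Gaussian at mean 0 is skipped in favor of the distant ones, which is the behavior the min/max update is meant to produce: the selected Gaussians spread over the acoustic space.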

24 Efficient binarization. For speaker diarization, many binary keys need to be computed with different sets of acoustic features. We split the process in 2 steps: 1. Compute the K-best KBM Gaussians for each acoustic feature vector (only done once). 2. For any subset of K-best binarized vectors, compute the binary key as usual.

25 [Figure: MFCC feature vectors and the KBM with N Gaussians] Initially we have a set of acoustic features and the KBM model.

26 [Figure: MFCC feature vectors; KBM with N Gaussians, positions 0 to N-1] For each feature vector we obtain a binary vector with a 1 on the Gaussians with the highest Lkld values.

27 [Figure: MFCC feature vectors; KBM with N Gaussians, positions 0 to N-1]

28 [Figure: MFCC feature vectors; KBM with N Gaussians, positions 0 to N-1] Such binarized vectors can be stored in memory in a compact way by just storing the positions of the most relevant Gaussians for each feature vector.

29 [Figure: MFCC feature vectors; KBM with N Gaussians, positions 0 to N-1] To obtain a fingerprint for any segment, we first accumulate the counts of all previously selected Gaussians.

30 [Figure: MFCC feature vectors; KBM with N Gaussians, positions 0 to N-1] And finally, we get the binary key by setting the best cells to 1.
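The two-step binarization walked through in slides 24 to 30 can be sketched as follows. This is a toy illustration, not the authors' implementation: it assumes the per-frame likelihoods against each KBM Gaussian are already available as lists, and the names `top_k_gaussians`, `binary_key`, `k` and `n_ones` are hypothetical (the slides do not say how many cells are set to 1).

```python
from collections import Counter


def top_k_gaussians(frame_lklds, k):
    """Step 1: positions of the k Gaussians with the highest likelihood for one frame."""
    order = sorted(range(len(frame_lklds)), key=lambda i: frame_lklds[i], reverse=True)
    return order[:k]


def binary_key(top_lists, n_gauss, n_ones):
    """Step 2: accumulate stored positions over a segment, set the best cells to 1."""
    counts = Counter()
    for positions in top_lists:
        counts.update(positions)
    key = [0] * n_gauss
    # Only Gaussians selected at least once can be switched on.
    for pos, _ in counts.most_common(n_ones):
        key[pos] = 1
    return key


# Made-up likelihoods: 3 frames scored against a 5-Gaussian KBM.
frames = [
    [0.1, 0.9, 0.8, 0.2, 0.0],
    [0.0, 0.7, 0.1, 0.9, 0.2],
    [0.3, 0.8, 0.1, 0.9, 0.0],
]
stored = [top_k_gaussians(f, 2) for f in frames]      # computed once, kept compactly
segment_key = binary_key(stored, n_gauss=5, n_ones=2)  # key for the whole segment
```

Only `stored` depends on the acoustic features, so a key for any other subset of frames (for example `binary_key(stored[:1], 5, 2)`) reuses it without touching the likelihoods again.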

31 Clustering initialization. We need to define a set of N_init initial clusters. We reuse the information in the KBM to do so. [Figure: KBM Gaussians ranked 1st to 6th; the acoustic features and the KBM model feed a Viterbi/segmental assignment]

32 Agglomerative clustering. Initial clusters -> clusters training (obtain the fingerprint for the frames associated to each cluster) -> segmental assignment -> clusters training -> closest pair merging (compute the binary distance between all cluster pairs and merge the most similar) -> reached one cluster? If no, repeat; if yes, select the best clustering.
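The closest-pair step of the loop can be sketched as below. This covers only the pair search, not cluster retraining or segment reassignment. It assumes each cluster's binary key is packed into a Python int (one bit per KBM Gaussian) and reuses the intersection-over-union measure as the binary comparison; the names `jaccard`, `closest_pair` and the three example keys are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two binary keys packed into ints (one bit per Gaussian)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 0.0


def closest_pair(cluster_keys):
    """Return the ids of the two clusters whose binary keys are most similar."""
    pairs = combinations(sorted(cluster_keys), 2)
    return max(pairs, key=lambda p: jaccard(cluster_keys[p[0]], cluster_keys[p[1]]))


# Made-up 8-bit keys for three clusters.
cluster_keys = {"c1": 0b11110000, "c2": 0b11100001, "c3": 0b00001111}
to_merge = closest_pair(cluster_keys)  # ("c1", "c2")
```

After a merge, the slide's flow retrains the merged cluster's key from the frames now assigned to it, so the comparison always runs on fresh fingerprints.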

33 Segmental assignment. We perform a fast assignment of segments to clusters based on signature similarities. [Figure: consecutive 1-second segments are compared against the binary cluster models and labeled Clus. 3, Clus. 1, Clus. 3, Clus. 2]
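A minimal sketch of this assignment, assuming each short segment already has its own binary key and that similarity is the intersection-over-union measure; the cluster and segment keys are invented so that the labels come out in the order shown in the slide figure.

```python
def similarity(v1, v2):
    """Positions set in both keys divided by positions set in either key."""
    both = sum(1 for a, b in zip(v1, v2) if a and b)
    either = sum(1 for a, b in zip(v1, v2) if a or b)
    return both / either if either else 0.0


def assign_segments(segment_keys, cluster_keys):
    """Label each segment with the cluster whose binary key is most similar."""
    return [
        max(cluster_keys, key=lambda c: similarity(seg, cluster_keys[c]))
        for seg in segment_keys
    ]


# Hypothetical 6-bit cluster models and four consecutive segment keys.
cluster_keys = {
    "Clus. 1": [1, 1, 0, 0, 0, 0],
    "Clus. 2": [0, 0, 1, 1, 0, 0],
    "Clus. 3": [0, 0, 0, 0, 1, 1],
}
segment_keys = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 1, 1, 1, 0, 0],
]
labels = assign_segments(segment_keys, cluster_keys)
```

No acoustic model is evaluated here: each decision is a handful of binary comparisons, which is where the speed-up over Viterbi decoding with GMMs would come from.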

34 Best clustering selection. From N_init clusters down to 1, we select the optimum clustering using the Student's t-test metric T_s, inspired by [1]. The intra-cluster and inter-cluster distances are used to obtain two distributions to compare: D_1 (intra-cluster distances) and D_2 (inter-cluster distances). We select the clustering with the biggest T_s: $T_s = \frac{\mu_1 - \mu_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$, where $\mu_k$, $\sigma_k^2$ and $n_k$ are the mean, variance and number of values of $D_k$. Note that all segment distances need to be pre-computed just once at the beginning. [1] "T-test distance and clustering criterion for speaker diarization", Trung Hieu Nguyen, Eng Siong Chng and Haizhou Li, in Proc. Interspeech, 2008.
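As a rough sketch of the selection rule, the code below computes a two-sample t statistic from the intra-cluster and inter-cluster values of each candidate clustering and keeps the candidate with the largest value. Several things are assumptions: the values are treated as similarity scores (so intra-cluster values are the larger ones and T_s is positive), the sample variance is used, and the candidate data are invented.

```python
from math import sqrt
from statistics import mean, variance


def t_s(d1, d2):
    """Two-sample t statistic between intra-cluster (d1) and inter-cluster (d2) values."""
    return (mean(d1) - mean(d2)) / sqrt(variance(d1) / len(d1) + variance(d2) / len(d2))


# Hypothetical candidates: number of clusters -> (intra values, inter values).
candidates = {
    3: ([0.8, 0.9, 0.7, 0.8], [0.2, 0.3, 0.1, 0.2]),
    2: ([0.6, 0.9, 0.4, 0.7], [0.3, 0.5, 0.2, 0.4]),
}
best = max(candidates, key=lambda n: t_s(*candidates[n]))  # 3 for these toy values
```

The 3-cluster candidate wins because its two distributions are both farther apart and tighter. Since segment-to-segment values are computed once up front, scoring each candidate only requires regrouping them into intra and inter sets.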

35 Some cluster selection examples. [Figure markers: #clusters = #speakers; optimum diarization result]

36 Evaluation. We used all NIST Rich Transcription datasets. We evaluate using: diarization error rate (DER), the percentage of time where the wrong label is assigned, including overlap; and the real-time factor (computed over the speech data). For comparison we use a baseline acoustic-based system similar to [2]. [2] "A robust speaker clustering algorithm", Jitendra Ajmera and Chuck Wooters, in Proc. of IEEE ASRU, US Virgin Islands, USA, Dec

37 Results (I). Standard GMM-like training of the KBM. Optimum results when the stopping criterion is perfect.

38 Results (II) DER as a function of # Gaussians in the KBM

39 Comparison results Meeting-by-meeting comparison between baseline (blue) and proposed system (red)

40 Conclusions and future work. Progress in speaker diarization seems stagnant and doomed to long processing times. We propose a very fast system by using a recently proposed binary speaker modeling technique. We achieve DER scores that are close to GMM-based DER. Next we are working on: improving the binary key fingerprint; finding a better stopping criterion; further speeding up the system.

41 Thanks! Xavier Anguera
