Fast speaker diarization based on binary keys
Xavier Anguera and Jean-François Bonastre
Outline Introduction Speaker diarization Binary speaker modeling Binary speaker diarization system Experiments Conclusions and future work
What is speaker diarization? (in case no one has told you already)
Given a multi-speaker recording, identify who speaks when, assigning each speaker a generic ID. No a priori information is given regarding the number of speakers or their identities.
Standard speaker diarization approaches
Standard Speaker Diarization system
State of the art
Speaker diarization has reached very competitive accuracy levels:
7-10% DER for Broadcast News (LIMSI, RT04)
12-14% DER for Meetings (I2R, RT09)
but it is currently too slow for many real-life applications:
standard systems: >> 1xRT
ICSI (single core): 0.97xRT
ICSI (GPU): 0.07xRT
What do we want?
To dramatically speed up the processing while maintaining the accuracy level (DER)
How do we do it?
By adapting a recently proposed binary speaker modeling technique to diarization
Review of binary speaker modeling
Typical speaker modeling using GMM
Training: x[n] -> acoustic parameterization -> EM-ML training -> GMM model λ
Testing: y[n] -> acoustic parameterization -> model evaluation -> Lkld(y[n] | λ)
Problems of using GMM modeling for diarization
Lack of precision in modeling a particular speaker
Very dependent on the model initialization or on the UBM it is adapted from
Statistical features usually model the most frequent characteristics instead of speaker-specific information
Very slow when using iterative EM-ML training and Viterbi decoding
A new modeling paradigm
Constraints:
Fast comparison of two speaker models
Should allow modeling a speaker dynamically -> more than 1 vector per speaker file
Noise robustness
It should be possible to EXPLAIN a decision
Solution:
A large space, to be discriminant between speakers, but with a reduced quantization -> binary
Binary speaker modeling (I)
Acoustic data -> acoustic parameter extraction -> binary key computation (using the background model, KBM) -> binary keys
(figure: example binary key, e.g. 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0)
Binary speaker modeling (II)
For a given input data, different sub-areas of the acoustic space are selected, each corresponding to one KBM component.
(figure: a selected KBM component defines the in-interest subspace for the input data)
Binary speaker modeling (III)
Selection of the n best specificities
Output values (brown data): 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0
Output values (green data): 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0
Obtaining the binary fingerprint
Similarity between binary vectors
It is very fast to compute. Any binary similarity measure can be used, for example:
S(v1, v2) = Σ_{i=1}^{N} (v1[i] ∧ v2[i]) / Σ_{i=1}^{N} (v1[i] ∨ v2[i])
Example:
v1 = 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0
v2 = 0 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1
v1 ∧ v2 = 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 (2 ones)
v1 ∨ v2 = 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1 (12 ones)
S(v1, v2) = 2/12 = 0.166
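As a hedged illustration (not the authors' code), this AND/OR ratio can be computed directly on NumPy arrays; the function name `binary_similarity` is ours:

```python
import numpy as np

def binary_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Ratio of shared 1-bits (AND) to total active bits (OR),
    a Jaccard-type similarity between two binary key vectors."""
    intersection = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    return float(intersection) / float(union) if union else 0.0

# The example vectors from the slide: 2 shared bits, 12 active bits
v1 = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
v2 = np.array([0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1])
print(round(binary_similarity(v1, v2), 3))  # 2/12 -> 0.167
```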
Preliminary speaker modeling experiments
Initial experiments on a small database show that binary speaker models are quite discriminant for KBMs of more than 512 Gaussians
Binary speaker diarization system
Speaker diarization main blocks NOTE: we are still using the agglomerative clustering approach, but performed over binary keys
Acoustic + binary processing
Acoustic modeling is only used in the initialization step; thereafter everything is done in the binary space.
We use standard acoustic features: 19 MFCCs (no energy, no deltas) extracted every 10 ms with a 25 ms window.
KBM model
It is a special UBM trained from the test data; no external data is used.
Its complexity is N >= 512 Gaussians; performance does not usually improve above N = 2000 Gaussians.
Standard divisive (EM-ML) training approaches cannot be used, as the resulting Gaussian means would not represent particular speakers but rather averages over all of them.
Building the KBM model
KBM training for diarization
We aim at training the KBM from the test data with no a priori knowledge of the speakers:
1. Select the 1st Gaussian from the pool: argmax_i Lkld(x | θ_i)
2. Initialize the distance vector: v_KL2[i] = KL2(θ_i, θ_1st) for all i
3. Iterate until N Gaussians are selected: select the pool Gaussian θ' with the biggest KL2 distance, then update v_KL2[i] = min(v_KL2[i], KL2(θ', θ_i))
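A minimal sketch of this greedy selection, assuming diagonal-covariance Gaussians stored as (mean, variance) pairs and a symmetric KL2 distance; the names (`kl2`, `select_kbm`) and the pre-selected first component at index 0 are illustrative assumptions:

```python
import numpy as np

def kl2(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal-covariance Gaussians."""
    def kl(m1, v1, m2, v2):
        return 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + np.log(v2 / v1))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)

def select_kbm(pool, n_components):
    """Greedy KBM selection: starting from a pre-selected first Gaussian
    (index 0, e.g. the one with highest likelihood on the data), repeatedly
    pick the pool Gaussian farthest (in KL2) from everything selected so far."""
    selected = [0]
    # v_kl2[i]: distance from Gaussian i to its closest selected Gaussian
    v_kl2 = np.array([kl2(*pool[0], *pool[i]) for i in range(len(pool))])
    while len(selected) < n_components:
        best = int(np.argmax(v_kl2))
        selected.append(best)
        for i in range(len(pool)):
            v_kl2[i] = min(v_kl2[i], kl2(*pool[best], *pool[i]))
        v_kl2[best] = -np.inf  # never re-select the same component
    return selected

# Toy pool of four 1-D Gaussians; the selection favors well-separated means
pool = [(np.array([0.0]), np.array([1.0])),
        (np.array([5.0]), np.array([1.0])),
        (np.array([0.1]), np.array([1.0])),
        (np.array([10.0]), np.array([1.0]))]
print(select_kbm(pool, 3))  # [0, 3, 1]
```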
Efficient binarization
For speaker diarization many binary keys need to be computed over different sets of acoustic features. We split the process in 2 steps:
1. Compute the K best KBM Gaussians for each acoustic feature vector (done only once)
2. For any subset of K-best binarized vectors, compute the binary key as usual
Initially we have a set of acoustic features (MFCC feature vectors) and the KBM model with N Gaussians.
For each feature vector we obtain a binary vector of dimension N with a 1 at the positions of the Gaussians with the highest likelihood values.
Such binarized vectors can be stored compactly in memory by keeping, for each feature vector, just the positions of its most relevant Gaussians, e.g. over time: 2 5 11 12 16 / 3 4 8 13 17 / 1 5 10 13 16 / 4 7 10 14 16 / 2 7 11 16 17 / 0 4 10 12 14
To obtain a fingerprint for any segment we first accumulate, for each of the N Gaussians, the counts of all previously selected positions over the segment's frames.
And finally, we obtain the binary key by setting to 1 the cells with the highest counts.
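The two-step binarization described above can be sketched as follows; the parameter names K (Gaussians kept per frame) and M (bits set in the final key), as well as the function names, are our notation, not the paper's:

```python
import numpy as np

def top_gaussians(likelihoods: np.ndarray, k: int) -> np.ndarray:
    """Step 1 (done only once): for every frame, keep the indices of the
    K KBM Gaussians with the highest likelihood.
    `likelihoods` is a (frames, n_gauss) matrix of per-component scores."""
    return np.argsort(likelihoods, axis=1)[:, -k:]

def binary_key(top_idx: np.ndarray, n_gauss: int, m: int) -> np.ndarray:
    """Step 2 (per segment): accumulate how often each Gaussian was
    selected over the segment's frames, then set the M most frequently
    selected positions to 1."""
    counts = np.bincount(top_idx.ravel(), minlength=n_gauss)
    key = np.zeros(n_gauss, dtype=np.uint8)
    key[np.argsort(counts)[-m:]] = 1
    return key

# Toy usage: 100 frames scored against a 512-component KBM
rng = np.random.default_rng(0)
scores = rng.standard_normal((100, 512))
top5 = top_gaussians(scores, k=5)         # shape (100, 5), stored compactly
key = binary_key(top5, n_gauss=512, m=20)
print(int(key.sum()))  # 20 bits set
```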
Clustering initialization
We need to define a set of N_init initial clusters. We reuse the information in the KBM to do so (figure: the best-ranked KBM components, 1st to 6th, seed clusters 1 to 6), followed by Viterbi/segmental assignment of the acoustic features.
Agglomerative clustering
Starting from the initial clusters, iterate:
1. Cluster training: obtain the fingerprint for the frames associated to each cluster
2. Segmental assignment of frames to clusters
3. Compute the binary distance between all cluster pairs and merge the closest (most similar) pair
Once one cluster is reached, select the best clustering among all iterations.
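One merge step of this loop could look like the sketch below; note that merging by OR-ing two keys is a simplification for illustration only, since the actual system recomputes the merged cluster's fingerprint from its assigned frames:

```python
import numpy as np

def merge_closest(keys):
    """One agglomerative step: find the most similar pair of cluster
    binary keys (Jaccard-style AND/OR ratio) and merge them.
    Returns the new, shorter list of cluster keys."""
    best, pair = -1.0, (0, 1)
    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            inter = np.logical_and(keys[i], keys[j]).sum()
            union = np.logical_or(keys[i], keys[j]).sum()
            s = float(inter) / float(union) if union else 0.0
            if s > best:
                best, pair = s, (i, j)
    # Simplified merge: OR the two keys (the real system re-trains the key)
    merged_key = np.logical_or(keys[pair[0]], keys[pair[1]]).astype(np.uint8)
    return [k for t, k in enumerate(keys) if t not in pair] + [merged_key]

# Toy example: two identical cluster keys get merged first
keys = [np.array([1, 1, 0, 0], dtype=np.uint8),
        np.array([1, 1, 0, 0], dtype=np.uint8),
        np.array([0, 0, 1, 1], dtype=np.uint8)]
merged = merge_closest(keys)
print(len(merged))  # 3 clusters -> 2 after one merge
```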
Segmental assignment
We perform a fast assignment of 1-second segments to clusters based on binary signature similarities: each segment's binary key is compared against all binary cluster models and the segment is assigned to the most similar one.
Best clustering selection
From N_init clusters down to 1, we select the optimum clustering using a Student's t-test metric T_s, inspired by [1]. The intra-cluster distances (D1, over 1 sec segments) and the inter-cluster distances (D2) are used as the two distributions to compare, and we select the clustering with the biggest T_s:
T_s = (µ1 - µ2) / sqrt(σ1²/n1 + σ2²/n2)
Note that all segment distances need to be pre-computed just once, at the beginning.
[1] "T-test distance and clustering criterion for speaker diarization", Trung Hieu Nguyen, Eng Siong Chng and Haizhou Li, in Proc. Interspeech, 2008.
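Under the assumption that D1 and D2 hold binary-key similarities (so a well-separated clustering yields a large positive T_s), the metric is a one-liner; the function name and the toy values below are purely illustrative:

```python
import numpy as np

def t_s(d1, d2):
    """Student-t style separation between the intra-cluster (D1) and
    inter-cluster (D2) distributions, following the slide formula:
    T_s = (mu1 - mu2) / sqrt(var1/n1 + var2/n2)."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    return (d1.mean() - d2.mean()) / np.sqrt(d1.var() / d1.size + d2.var() / d2.size)

# Toy similarities: tight clusters (high intra), well separated (low inter)
print(round(t_s([0.8, 0.9, 0.85], [0.2, 0.3, 0.25]), 1))  # 18.0
```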
Some cluster selection examples
(figures: T_s is maximum where #clusters = #speakers, i.e. at the optimum diarization result)
Evaluation
We used all NIST Rich Transcription datasets.
We evaluate using:
Diarization Error Rate (DER): percentage of time where a wrong label is assigned, including overlap regions
Real-time factor (computed over the speech data)
For comparison we use a baseline acoustic-based system similar to [2]
[2] "A robust speaker clustering algorithm", Jitendra Ajmera and Chuck Wooters, in Proc. IEEE ASRU, US Virgin Islands, USA, Dec. 2003.
Results (I) Standard GMM-like training of the KBM Optimum results when stopping criterion is perfect
Results (II) DER as a function of # Gaussians in the KBM
Comparison results Meeting-by-meeting comparison between baseline (blue) and proposed system (red)
Conclusions and future work
Progress in speaker diarization seems stagnant, and systems remain tied to long processing times.
We propose a very fast system that uses a recently proposed binary speaker modeling technique.
We achieve DER scores that are close to those of GMM-based systems.
Next we are working on:
Improving the binary key fingerprint
Finding a better stopping criterion
Further speeding up the system
Thanks! Xavier Anguera xanguera@tid.es www.xavieranguera.com