Around the Speaker De-Identification (Speaker Diarization for De-Identification ++)
Itshak Lapidot, Moez Ajili, Jean-Francois Bonastre
The 2 Parts
- HDM-based diarization system
- The homogeneity measure
Outline 1: HDM-Based Diarization System
- Viterbi-Based HMM: Short Summary
- Is the Fudge Factor Just a Trick?
- Hidden-Distortion-Models (HDM): Motivation
- General HDM
- Speaker Diarization with HDM
- Experiments and Results
- Conclusions
Viterbi-Based HMM: Short Summary

Motivation
HMMs have proven themselves in machine learning on sequential data for segmentation and clustering. They have been applied successfully to:
- Speaker diarization
- Video segmentation
- Bio-signal (ECG, EEG, fMRI) coding and segmentation
- DNA and protein-chain modeling
- Seismic signals
Viterbi-Based HMM: Short Summary (cont.)

HMM Parameters
A K-state HMM model $M$ is defined by:
1. $\pi$: vector of probabilities of being in each state at time $n=1$, $\pi_k = P(s_{n=1}=k)$.
2. $A$: matrix of transition probabilities, $a_{qk} = P(s_n = q \mid s_{n-1} = k)$.
3. $b$: vector of probability density functions, one per state, $b_k(x) = p(x \mid s = k)$.
$$M = \{\pi, A, b\}$$
Viterbi-Based HMM: Short Summary (cont.)

Viterbi-based HMM problems
1. Given an observation sequence $X = (x_1, \dots, x_N)$ and the model $M$, find the optimal state sequence $S = (s_1, \dots, s_N)$: $S^* = \operatorname{argmax}_S\, p(X, S \mid M)$.
2. Estimate the model parameters $M = \{\pi, A, b\}$ that maximize $p(X, S^* \mid M)$: $M^* = \operatorname{argmax}_M\, p(X, S^* \mid M)$.
Viterbi-Based HMM: Short Summary (cont.)

1st problem solution
Given an observation sequence $X = (x_1, \dots, x_N)$ and the model $M$, find the optimal state sequence $S = (s_1, \dots, s_N)$: $S^* = \operatorname{argmax}_S\, p(X, S \mid M)$.
$$P(X, S \mid M) = \pi_{s_1} b_{s_1}(x_1)\, a_{s_2 s_1} b_{s_2}(x_2) \cdots a_{s_N s_{N-1}} b_{s_N}(x_N) = \pi_{s_1} \prod_{n=2}^{N} a_{s_n s_{n-1}} \prod_{n=1}^{N} b_{s_n}(x_n)$$
$S^*$ depends both on the transitions $a_{s_n s_{n-1}}$ and on the emissions $b_{s_n}(x_n)$.
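The first problem is solved by Viterbi dynamic programming. Below is a minimal log-domain sketch, assuming the transition convention of the slides ($a_{qk} = P(s_n = q \mid s_{n-1} = k)$); the function name and the precomputed emission matrix `log_B` are illustrative assumptions, not the authors' code.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Find S* = argmax_S P(X, S | M) for a discrete-state HMM.

    log_pi : (K,)   ln pi_k, initial state log-probabilities
    log_A  : (K, K) log_A[q, k] = ln a_qk = ln P(s_n = q | s_{n-1} = k)
    log_B  : (N, K) log_B[n, k] = ln b_k(x_n), precomputed emission scores
    """
    N, K = log_B.shape
    delta = np.empty((N, K))           # best path log-score ending in state k at time n
    psi = np.zeros((N, K), dtype=int)  # back-pointers to the best predecessor
    delta[0] = log_pi + log_B[0]
    for n in range(1, N):
        # scores[q, k]: arrive in state q at time n from state k at time n-1
        scores = log_A + delta[n - 1][np.newaxis, :]
        psi[n] = scores.argmax(axis=1)
        delta[n] = scores.max(axis=1) + log_B[n]
    # backtrack the optimal state sequence S*
    s = np.empty(N, dtype=int)
    s[-1] = delta[-1].argmax()
    for n in range(N - 2, -1, -1):
        s[n] = psi[n + 1][s[n + 1]]
    return s
```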
Viterbi-Based HMM: Short Summary (cont.)

2nd problem solution
Estimate the model parameters $M = \{\pi, A, b\}$ that maximize $p(X, S^* \mid M)$: $M^* = \operatorname{argmax}_M\, p(X, S^* \mid M)$. Solution using Viterbi statistics:
$$P(X, S \mid M) = \pi_{s_1} \prod_{n=2}^{N} a_{s_n s_{n-1}} \prod_{n=1}^{N} b_{s_n}(x_n)$$
Cost function:
$$C(X, S) = \ln P(X, S \mid M) = \ln \pi_{s_1} + \sum_{n=2}^{N} \ln a_{s_n s_{n-1}} + \sum_{n=1}^{N} \ln b_{s_n}(x_n)$$
Viterbi-Based HMM: Short Summary (cont.)

2nd problem solution (cont.)
Objective function to maximize with respect to the transition probabilities (Lagrange multipliers enforce $\sum_q a_{qk} = 1$):
$$J(A) = \sum_{n=2}^{N} \ln a_{s_n s_{n-1}} + \sum_{k=1}^{K} \lambda_k \Bigl(1 - \sum_{q=1}^{K} a_{qk}\Bigr) \quad\Longrightarrow\quad a_{qk} = \frac{N_{qk}}{N_k}$$
where $N_{qk}$ counts the transitions from state $k$ to state $q$ along $S^*$ and $N_k = \sum_q N_{qk}$. The emission probabilities are trained, for example, as GMMs via Expectation Maximization (EM) using the data associated with each state.
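A minimal sketch of the closed-form transition re-estimation $a_{qk} = N_{qk}/N_k$ from a Viterbi path; the small `eps` guard against states with no outgoing transitions is my addition.

```python
import numpy as np

def reestimate_transitions(s, K, eps=1e-12):
    """a_qk = N_qk / N_k: count the k -> q transitions along the Viterbi path s."""
    N_qk = np.zeros((K, K))
    for n in range(1, len(s)):
        N_qk[s[n], s[n - 1]] += 1.0
    N_k = N_qk.sum(axis=0, keepdims=True)  # total transitions leaving each state k
    return N_qk / np.maximum(N_k, eps)     # eps guards empty states (my addition)
```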
Viterbi-Based HMM: Short Summary (cont.)

The problem
[Figure: (a) emission PDFs and (b) log-likelihood ratio as a function of $\alpha$, for frequent vs. rare state changes.]
In Viterbi, the decision of which state to move to depends on the ratio:
$$\frac{a_{qk}\, b_q(x)}{a_{rk}\, b_r(x)}$$
Frequent changes ($a_{11} = a_{22} = 59/60$, $a_{12} = a_{21} = 1/60$):
$$\ln \frac{a_{11}}{a_{21}} = \ln \frac{59/60}{1/60} = \ln 59 \approx 4.1$$
Rare changes ($a_{11} = a_{22} = 799/800$, $a_{12} = a_{21} = 1/800$):
$$\ln \frac{a_{11}}{a_{21}} = \ln \frac{799/800}{1/800} = \ln 799 \approx 6.7$$
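A quick numeric check of the two log-ratios above, using the slide's transition settings. One plausible reading of the problem: the transition penalty for leaving a state is only a few nats, so per-frame emission log-likelihood differences can dominate the decision unless the two terms are rescaled, which is what the fudge factor does.

```python
import numpy as np

# Log-ratio between staying in a state and leaving it, for the two
# transition settings on the slide (frequent vs. rare state changes).
for stay, leave in [(59 / 60, 1 / 60), (799 / 800, 1 / 800)]:
    print(f"ln({stay:.4f} / {leave:.6f}) = {np.log(stay / leave):.1f}")
# prints ~4.1 and ~6.7: a penalty of only a few nats, easily swamped by
# per-frame emission log-likelihoods unless the terms are rescaled.
```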
Hidden-Distortion-Models (HDM): Motivation
- Unbalanced transition and emission probabilities: parameter scaling is required.
- The emission models are not always probabilistic:
  - VQ for data transmission (Euclidean distance)
  - Binary data (Hamming distance)
- In such cases the transitions can be non-probabilistic costs.
- A more general model should be defined.
Speaker Diarization with HDM

Speaker Diarization Definition
- The goal is to separate the conversation into R clusters; each cluster, hopefully, contains the data of a single speaker.
- Additional clusters can be added for non-speech, simultaneous speech, etc.
- The number of speakers, R, can be known; otherwise it has to be estimated.
- In our application R = 2.
Speaker Diarization with HDM (cont.)

General Blocks
[Figure: general block diagram of the diarization system.]
Experiments and Results

Experiments Setup
- Experiments on telephone conversations.
- 3-state HMM: non-speech / speaker 1 / speaker 2.
- 5 iterations with 20 tied states (200 ms), then 1 iteration with 10 tied states.
- 108 conversations: LDC.
- 2048 conversations: NIST-05 (Dev. set 500; Eval. set 1548).
- Models:
  1. HMM
  2. HDM with different constraints
- State models:
  1. SOM, 6×10
  2. GMM, 21 full-covariance Gaussians, EM training
Experiments and Results (cont.)

Speaker Diarization with SOM: LDC
[Bar chart comparing No Cost, Geometrical Mean (0.5), Powered Inverse Sum (1.0), Scaled Inverse Sum (1.0), and Scaled Log-Likelihood (0.2) against the Baseline: ~26.0% improvement.]
Experiments and Results (cont.)

Speaker Diarization with SOM (cont.): Full NIST-05 with LDC optimization
[Bar chart comparing No Cost, Geometrical Mean (0.5), Powered Inverse Sum (1.0), Scaled Inverse Sum (1.0), and Scaled Log-Likelihood (0.2) against the Baseline: no improvement.]
Experiments and Results (cont.)

Speaker Diarization with SOM (cont.): Eval NIST-05 with NIST Dev optimization
[Bar chart comparing No Cost, Geometrical Mean (50), Powered Inverse Sum (1.5), Scaled Inverse Sum (1.5), and Scaled Log-Likelihood (0.8) against the Baseline: ~1.8% improvement.]
Experiments and Results (cont.)

Speaker Diarization with GMM: LDC
[Bar chart comparing No Cost, Geometrical Mean (0.15), Powered Inverse Sum (0.4), Scaled Inverse Sum (0.2), and Scaled Log-Likelihood (0.3) against the Baseline: ~10.4% improvement.]
Conclusions
1. On LDC, the HMM costs are far from optimal: fudge factor = 0.2, i.e., the costs are multiplied by a factor of five.
2. The scaling is data dependent: the optimal parameters for LDC lead to poor performance on NIST-05.
3. On NIST-05 there is almost no improvement, as the HMM costs are almost optimal: fudge factor = 0.8, i.e., the costs are multiplied by a factor of 1.25.
4. The cost and the state models should be chosen together with the hyper-parameter, and both are task dependent.
5. Much more work is needed to reach a deeper understanding of the parameter relations in the HDM.
Outline 2: The Homogeneity Measure
- Motivation
- The homogeneity measure
- Experiments and Results
- Conclusions
Motivation
Can we compare these two identities? Is the decision meaningful even if the answer is correct?
Not enough common data for comparison.
Motivation (cont.)
Can we compare these two identities? Is the decision meaningful even if the answer is correct?
Not enough data for comparison.
Motivation (cont.)
Can we compare these two identities? Is the decision meaningful even if the answer is wrong?
Sufficient amount of common data for comparison.
Motivation (cont.)

What are the conditions for a good homogeneity measure?
1. We need a sufficient amount of common data for comparison.
2. The measure should rely on the data itself.
3. The measure has to be uncorrelated with the score.
4. The measure has to be correlated with the system performance.
The homogeneity measure
Can we compare these two identities? Which decision is meaningful?
The homogeneity measure (cont.)

GMM:
$$\lambda = \{\omega_m, \mu_m, \Sigma_m\}_{m=1}^{M}$$
Posterior probability of the m-th Gaussian:
$$\gamma(m, x) = \frac{\omega_m\, \mathcal{N}(x; \mu_m, \Sigma_m)}{\sum_{\kappa=1}^{M} \omega_\kappa\, \mathcal{N}(x; \mu_\kappa, \Sigma_\kappa)}$$
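A minimal sketch of the posterior computation $\gamma(m, x)$ using scipy; the array shapes and the function name are assumptions for illustration, not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(X, weights, means, covs):
    """gamma(m, x_n): posterior of the m-th Gaussian for every frame.

    X: (N, D) frames; weights: (M,); means: (M, D); covs: (M, D, D).
    Returns an (N, M) matrix whose rows sum to 1."""
    weighted = np.column_stack([
        weights[m] * multivariate_normal.pdf(X, mean=means[m], cov=covs[m])
        for m in range(len(weights))
    ])                                        # omega_m * N(x_n; mu_m, Sigma_m)
    return weighted / weighted.sum(axis=1, keepdims=True)
```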
The homogeneity measure (cont.)

Definitions
Datasets of the two utterances:
$$X_A = \{x_1, \dots, x_{N_A}\}, \qquad X_B = \{x_1, \dots, x_{N_B}\}$$
Gaussian mixture occupation:
$$\gamma_Q(m) = \sum_{x_n \in X_Q} \gamma(m, x_n), \quad Q \in \{A, B\}, \qquad \gamma(m) = \gamma_A(m) + \gamma_B(m)$$
Bit distribution:
$$p_m = \frac{\gamma_A(m)}{\gamma(m)}, \qquad \bar{p}_m = 1 - p_m = \frac{\gamma_B(m)}{\gamma(m)}$$
Bit entropy:
$$H(p_m) = -p_m \log_2 p_m - \bar{p}_m \log_2 \bar{p}_m, \qquad 0 \le H(p_m) \le 1$$
The homogeneity measure (cont.)

Measures
Normalized measure:
$$\hat{H} = E\{H(p_m)\} = \sum_{m=1}^{M} \omega_m H(p_m)$$
1. The measure is bounded, $0 \le \hat{H} \le 1$:
   - Same utterances: $X_A = X_B \Rightarrow \hat{H} = 1$
   - Disjoint data: $\gamma_A(m)\,\gamma_B(m) = 0\ \forall m \Rightarrow \hat{H} = 0$
2. It does not take into account the amount of information in the data.
The homogeneity measure (cont.)

Measures
Non-normalized measure:
$$\hat{H}_{non} = N \hat{H}, \qquad N = \#X_A + \#X_B$$
1. If the data is doubled, the measure also doubles, but without any new information.
2. It works.
Alternative option:
$$\hat{H}_{non} = M \hat{H}$$
where M is the number of Gaussian mixtures needed to reach Y% of the total occupation (N). A sketch of both measures follows below.
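Putting the definitions together, a hedged sketch of both measures from the per-utterance occupations $\gamma_A(m)$ and $\gamma_B(m)$. Two points are interpretations on my part: the expectation weights in $\hat{H}$ are taken as the normalized total occupations $\gamma(m)/N$ (a natural empirical stand-in for $\omega_m$), and Y is left as a parameter since the slides do not fix it.

```python
import numpy as np

def homogeneity(gamma_A, gamma_B, Y=0.9):
    """Normalized and non-normalized homogeneity from the per-utterance
    mixture occupations gamma_A(m) and gamma_B(m) (arrays of length M)."""
    gamma = gamma_A + gamma_B              # gamma(m) = gamma_A(m) + gamma_B(m)
    total = gamma.sum()                    # N = total occupation = #X_A + #X_B
    occ = gamma > 0
    p = gamma_A[occ] / gamma[occ]          # bit distribution p_m
    # bit entropy H(p_m), taking 0 * log2(0) = 0
    H = np.zeros_like(p)
    mid = (p > 0) & (p < 1)
    H[mid] = -(p[mid] * np.log2(p[mid]) + (1 - p[mid]) * np.log2(1 - p[mid]))
    w = gamma[occ] / total                 # assumed weights: empirical omega_m
    H_hat = float(np.sum(w * H))           # normalized measure, 0 <= H_hat <= 1
    # M_eff: Gaussians needed to reach fraction Y of the total occupation
    cum = np.cumsum(np.sort(gamma)[::-1]) / total
    M_eff = int(np.searchsorted(cum, Y) + 1)
    return H_hat, M_eff * H_hat            # (H_hat, H_hat_non)
```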
Experiments and results

Baseline SV system:
1. Features: 19 Mel-cepstrum + 11 features + variance normalization.
2. 512-Gaussian UBM-GMM.
3. 400-dimensional i-vectors.
4. PLDA scoring.
Experiment protocol:
1. NIST 2008, det1, short2-short3.
2. 39433 tests: 8290 targets, 31143 non-targets.
Experiments and results (cont.)

Evaluation of the measure:
1. For every trial the homogeneity measure is calculated.
2. All the measures are sorted.
3. The sorted measures are split into chunks of size 1500 with a sliding window of size 1000.
4. For every chunk, False Alarm (FA), False Reject (FR), and min Cllr (the minimal log-likelihood-ratio cost) are calculated.
   - Cllr measures the system's soft decision cost rather than a hard operating point like EER; the lower the Cllr, the better the system performance.
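A sketch of this chunked evaluation under my reading of the protocol (1500-trial chunks sliding by 1000 trials); the LLR decision threshold of 0 is an assumption, and the min Cllr computation is omitted for brevity.

```python
import numpy as np

def chunked_error_rates(measures, scores, labels, chunk=1500, step=1000, thr=0.0):
    """Sort trials by homogeneity, then compute FA/FR over sliding chunks.

    measures, scores : (T,) homogeneity measure and system LLR score per trial
    labels           : (T,) True for target trials, False for non-targets
    thr              : assumed hard decision threshold on the LLR score
    """
    order = np.argsort(measures)
    results = []
    for start in range(0, len(order) - chunk + 1, step):
        idx = order[start:start + chunk]        # one chunk of sorted trials
        tgt = labels[idx]                       # assumes both classes appear per chunk
        fr = np.mean(scores[idx][tgt] < thr)    # false rejects among targets
        fa = np.mean(scores[idx][~tgt] >= thr)  # false alarms among non-targets
        results.append((measures[idx[0]], fa, fr))
    return results  # per-chunk min Cllr omitted; it needs the full LLR cost curve
```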
Experiments and results (cont.)

[Plots: system performance as a function of the Normalized and the Non-Normalized Homogeneity Measures.]
Conclusions
- A correct result obtained for the wrong reasons is bad.
- For a valid examination, the data being compared must be comparable.
- For a valid examination, the data being compared must also be sufficient; this is why the normalized measure does not work.
- A high homogeneity measure leads to a low min Cllr, i.e., high system performance.