Fast speaker diarization based on binary keys. Xavier Anguera and Jean-François Bonastre


Outline:
- Introduction
- Speaker diarization
- Binary speaker modeling
- Binary speaker diarization system
- Experiments
- Conclusions and future work

What is speaker diarization? (in case no one has told you already) Given a multi-speaker recording, identify who speaks when, labeling each speaker with a generic ID. No a priori information is given regarding the number of speakers or their identities.

Standard speaker diarization approaches

Standard Speaker Diarization system

State of the art. Speaker diarization has reached very competitive accuracy levels: 7-10% for broadcast news (LIMSI, RT04) and 12-14% for meetings (I2R, RT09). But it is currently too slow for many real-life applications: standard systems run at >> 1xRT, ICSI (mono-core) at 0.97xRT, and ICSI (GPU) at 0.07xRT.

What do we want? To dramatically speed up the processing while maintaining the accuracy level (DER). How do we do it? By adapting a recently proposed binary speaker modeling technique to diarization.

Review of binary speaker modeling

Typical speaker modeling using GMMs. Training: acoustic parameters are extracted from the signal x[n] and a GMM model λ is trained with EM-ML. Testing: acoustic parameters are extracted from y[n] and evaluated against the model, yielding the likelihood Lkld(y[n] | λ).
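
As a reference point, here is a minimal sketch of this conventional pipeline, using scikit-learn's EM-based GaussianMixture and hypothetical `train_feats`/`test_feats` arrays (the talk does not prescribe a toolkit):

```python
# Minimal sketch of the conventional GMM speaker-modeling pipeline.
# train_feats / test_feats are hypothetical (n_frames, 19) MFCC arrays.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_feats = rng.standard_normal((5000, 19))   # stand-in for real MFCCs
test_feats = rng.standard_normal((1000, 19))

# EM-ML training of the speaker model (lambda)
gmm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=100)
gmm.fit(train_feats)

# Model evaluation: average per-frame log-likelihood Lkld(y[n] | lambda)
print(gmm.score(test_feats))
```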

Problems of using GMM modeling for diarization:
- Lack of precision in modeling a particular speaker.
- Very dependent on the model initialization or the UBM it is adapted from.
- Statistical features usually model the most frequent characteristics instead of speaker-specific information.
- Very slow when using iterative EM-ML training and Viterbi decoding.

A new modeling paradigm. Constraints:
- Fast to compare two speaker models.
- Should allow modeling a speaker dynamically (more than one vector per speaker file).
- Noise robustness.
- Should make it possible to EXPLAIN a decision.
Solution: a large space, to be discriminant between speakers, but with reduced quantization -> binary.

Binary speaker modeling (I). Pipeline: acoustic data -> acoustic parameter extraction -> binary key computation against the background model (KBM) -> binary key, a large vector of 0s with 1s at a few selected positions.

Binary speaker modeling (II). For given input data, different sub-areas of the acoustic space are selected, each corresponding to one KBM component; the selected components define the subspace of interest for that data.

Binary speaker modeling (III). For each input, the n best specificities are selected. In the slide's example, the brown data and the green data each produce a binary vector with 1s at overlapping but distinct positions.

Obtaining the binary fingerprint

Similarity between binary vectors. It is very fast to compute, and any binary similarity measure can be used, for example:

S(v_1, v_2) = \frac{\sum_{i=1}^{N} (v_1[i] \wedge v_2[i])}{\sum_{i=1}^{N} (v_1[i] \vee v_2[i])}

Example:
v_1       = 1 0 1 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0
v_2       = 0 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 1
v_1 ∧ v_2 = 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0  (2 ones)
v_1 ∨ v_2 = 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 0 1 1  (12 ones)

S(v_1, v_2) = 2/12 ≈ 0.166
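
A direct translation of this measure into Python (a sketch; the function name is ours):

```python
# Intersection-over-union of set bits (a Jaccard-style measure).
import numpy as np

def binary_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """S(v1, v2) = sum(v1 AND v2) / sum(v1 OR v2)."""
    inter = np.logical_and(v1, v2).sum()
    union = np.logical_or(v1, v2).sum()
    return inter / union if union else 0.0

v1 = np.array([1,0,1,0,1,1,0,0,0,0,1,0,1,0,0,0,1,0], dtype=bool)
v2 = np.array([0,1,1,0,0,0,1,0,0,1,1,0,0,1,0,0,0,1], dtype=bool)
print(binary_similarity(v1, v2))  # 2/12 = 0.1666...
```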

Preliminary speaker modeling experiments. Initial experiments on a small database show that binary speaker models are quite discriminant for KBMs with more than 512 Gaussians.

Binary speaker diarization system

Speaker diarization main blocks. NOTE: we still use the agglomerative clustering approach, but performed over binary keys.

Acoustic + binary processing. Acoustic modeling is only used in the initialization step; thereafter everything is done in the binary space. We use standard acoustic features: 19 MFCCs (no energy, no deltas) extracted every 10 ms with a 25 ms window.
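
For illustration, a comparable configuration could be obtained with librosa (an assumption; the talk does not name a feature-extraction toolkit, and dropping c0 below stands in for "no energy"):

```python
# Sketch: 19 MFCCs, 25 ms window, 10 ms step, no energy term, no deltas.
import librosa

y, sr = librosa.load("meeting.wav", sr=16000)    # hypothetical input file
n_fft = int(0.025 * sr)                          # 25 ms analysis window
hop_length = int(0.010 * sr)                     # 10 ms frame step
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=n_fft, hop_length=hop_length)
feats = mfcc[1:].T                               # drop c0 -> (n_frames, 19)
```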

KBM model:
- It is a special UBM trained from the test data; no external data is used.
- Its complexity is N >= 512 Gaussians; performance does not usually improve above N = 2000 Gaussians.
- Standard divisive (EM-ML) training approaches cannot be used, as the resulting Gaussian means do not represent particular speakers but rather averages over all of them.

Building the KBM model

KBM training for diarization. We aim at training the KBM from the test data with no a priori knowledge of the speakers. Starting from a pool of candidate Gaussians, we proceed greedily:
1. Select the first Gaussian: the θ_i maximizing Lkld(x_i | θ_i).
2. Initialize the distance vector: v_KL2[i] = KL2(θ_i, θ_1st) for every Gaussian i in the pool.
3. Iterate until N Gaussians are selected: pick the Gaussian θ' with the biggest KL2 distance in v_KL2, then update v_KL2[i] = min(v_KL2[i], KL2(θ', θ_i)).
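
A sketch of this greedy (farthest-first) selection, assuming a pool of diagonal-covariance Gaussians given by hypothetical `means`/`variances` arrays and the standard closed-form symmetric KL (KL2) between diagonal Gaussians:

```python
import numpy as np

def kl2(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    d2 = (m1 - m2) ** 2
    return 0.5 * np.sum((v1 + d2) / v2 + (v2 + d2) / v1 - 2.0)

def select_kbm(means, variances, first, n_gauss):
    """Greedy farthest-first selection of KBM Gaussians from the pool.
    `first` is the index of the Gaussian with the highest data likelihood."""
    selected = [first]
    v_kl2 = np.array([kl2(means[i], variances[i],
                          means[first], variances[first])
                      for i in range(len(means))])
    while len(selected) < n_gauss:
        nxt = int(np.argmax(v_kl2))      # biggest KL2 distance to selection
        selected.append(nxt)
        v_kl2 = np.minimum(v_kl2, [kl2(means[i], variances[i],
                                       means[nxt], variances[nxt])
                                   for i in range(len(means))])
        v_kl2[nxt] = -np.inf             # exclude from re-selection
    return selected
```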

Efficient binarization. For speaker diarization, many binary keys need to be computed over different sets of acoustic features. We split the process into two steps: 1. Compute the K best KBM Gaussians for each acoustic feature vector (done only once). 2. For any subset of binarized vectors, compute the binary key as usual.

Initially we have the set of MFCC feature vectors and the KBM model with N Gaussians.

For each feature vector we obtain a binary vector of length N with a 1 at the positions of the Gaussians with the highest likelihood values.


Such binarized vectors can be stored in memory compactly by storing only the positions of the most relevant Gaussians for each feature vector, e.g. per frame: (2 5 11 12 16), (3 4 8 13 17), (1 5 10 13 16), (4 7 10 14 16), (2 7 11 16 17), (0 4 10 12 14).

To obtain a fingerprint for any segment we first accumulate, over the segment's frames, the counts of all previously selected Gaussians.

Finally, we get the binary key by setting to 1 the cells with the highest accumulated counts.
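
A sketch of the two-step procedure, with illustrative names (`loglkld` would hold the per-frame log-likelihood of every KBM Gaussian):

```python
import numpy as np

def topk_gaussians(loglkld, k):
    """Step 1 (done once): indices of the K best KBM Gaussians per frame.
    loglkld: (n_frames, n_gauss) frame-vs-Gaussian log-likelihoods."""
    return np.argsort(loglkld, axis=1)[:, -k:]

def binary_key(topk, frame_idx, n_gauss, n_set):
    """Step 2: accumulate top-K counts over the chosen frames, then set
    the n_set most-counted cells to 1."""
    counts = np.bincount(topk[frame_idx].ravel(), minlength=n_gauss)
    key = np.zeros(n_gauss, dtype=bool)
    key[np.argsort(counts)[-n_set:]] = True
    return key
```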

Clustering initialization. We need to define a set of N_init initial clusters. We reuse the information already in the KBM to do so: Gaussians are taken in their KBM selection order (1st, 2nd, ...) and the acoustic features are distributed among the N_init clusters via a Viterbi/segmental assignment.

Agglomerative clustering. Starting from the initial clusters, we iterate: train the clusters (obtain the fingerprint for the frames associated with each cluster), perform a segmental assignment of frames to clusters, then compute the binary distance between all cluster pairs and merge the most similar pair. When one cluster is reached, the loop stops and the best clustering is selected.
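
Putting the pieces together, a compact sketch of the loop (reusing `binary_key` and `binary_similarity` from the earlier sketches; the segmental-reassignment step is omitted for brevity and all names are illustrative):

```python
import numpy as np

def agglomerate(topk, init_clusters, n_gauss, n_set):
    """init_clusters: dict cluster_id -> np.ndarray of frame indices.
    Returns one clustering snapshot per iteration, scored afterwards."""
    clusters = {c: np.asarray(f) for c, f in init_clusters.items()}
    history = [dict(clusters)]
    while len(clusters) > 1:
        # cluster training: one binary fingerprint per cluster
        keys = {c: binary_key(topk, f, n_gauss, n_set)
                for c, f in clusters.items()}
        # closest-pair merging: most similar pair of fingerprints
        ids, best, best_s = list(clusters), None, -1.0
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                s = binary_similarity(keys[a], keys[b])
                if s > best_s:
                    best, best_s = (a, b), s
        a, b = best
        clusters[a] = np.concatenate([clusters[a], clusters.pop(b)])
        history.append(dict(clusters))
    return history
```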

Segmental assignment. We perform a fast assignment of segments to clusters based on signature similarities: the audio is split into 1-second segments, each segment's binary key is compared with every binary cluster model, and the segment is assigned to the closest cluster.

Best clustering selection. From N_init clusters down to 1, we select the optimum clustering using the Student's t-test metric T_s, inspired by [1]. The intra-cluster distances (distribution D_1) and the inter-cluster distances (distribution D_2) are the two distributions being compared, and we select the clustering with the biggest T_s:

T_s = \frac{\mu_1 - \mu_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}

where μ_k, σ_k^2 and n_k are the mean, variance and size of distribution D_k (distances computed over 1-second segments). Note that all segment distances need to be pre-computed only once, at the beginning.

[1] "T-test distance and clustering criterion for speaker diarization", Trung Hieu Nguyen, Eng Siong Chng and Haizhou Li, in Proc. Interspeech, 2008.
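
The metric itself is a few lines of numpy (a sketch; `d1`/`d2` are the pooled intra- and inter-cluster segment distances):

```python
import numpy as np

def t_s(d1, d2):
    """Welch-style t statistic between intra-cluster (d1) and
    inter-cluster (d2) distance samples; bigger = better separation."""
    se = np.sqrt(d1.var(ddof=1) / d1.size + d2.var(ddof=1) / d2.size)
    return (d1.mean() - d2.mean()) / se

# Select the clustering (from N_init down to 1) with the biggest T_s.
```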

Some cluster selection examples (figures): the selected clustering is compared with the point where #clusters = #speakers and with the optimum diarization result.

Evaluation. We used all NIST Rich Transcription datasets. We evaluate using: diarization error rate (DER), the percentage of time a wrong label is assigned, including overlap; and the real-time factor (computed over the speech data). For comparison we use a baseline acoustic-based system similar to [2]. [2] "A robust speaker clustering algorithm", Jitendra Ajmera and Chuck Wooters, in Proc. IEEE ASRU, US Virgin Islands, USA, Dec. 2003.

Results (I) (table): standard GMM-like training of the KBM; optimum results when the stopping criterion is perfect.

Results (II) DER as a function of # Gaussians in the KBM

Comparison results Meeting-by-meeting comparison between baseline (blue) and proposed system (red)

Conclusions and future work:
- Progress in speaker diarization seems stagnant and doomed to long processing times.
- We propose a very fast system using a recently proposed binary speaker modeling technique.
- We achieve DER scores that are close to GMM-based DER.
Next we are working on: improving the binary key fingerprint, finding a better stopping criterion, and further speeding up the system.

Thanks! Xavier Anguera xanguera@tid.es www.xavieranguera.com