Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan Niv
(Source: lilt.cslt.org/talks/160520-Deep%20Speaker%20Embedding%20for%20SRE.pptx)
Speech carries many kinds of information, each with its own recognition task:
- Speaker Recognition - who spoke?
- Speech Recognition - what was spoken?
- Language Recognition - what language was spoken?
- Accent Recognition - where is he/she from?
- Emotion Recognition - happy? sad? positive? negative?
- Gender Recognition - male or female?
Speaker Recognition: motivation
Speaker recognition can be used for:
- Authentication - enabling the use of an app only by a specific user
- Computer games
- Improving other speech-signal tasks
The System
MFCC (mel-frequency cepstral coefficient) front-end processing
Mel-frequency transform: a frequency warping modeled on human ear perception - roughly linear at low frequencies and logarithmic at high frequencies.
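As a small sketch of the Hz-to-mel warping, here is the commonly used 2595 * log10(1 + f/700) formula; the slides themselves do not give exact constants, so these are the conventional ones, not necessarily the paper's:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (common O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mapping is roughly linear below ~1 kHz and logarithmic above,
# mirroring how human pitch perception compresses high frequencies.
print(hz_to_mel(1000.0))   # close to 1000 mel by construction
```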
The System
Average Layer
The average layer averages the frame-level features over the time axis:
h = (1/T) * Σ_{t=0}^{T-1} x_t
where T is the number of frames in the utterance.
Affine Layer
The affine layer is a standard fully connected layer; it reduces the feature dimension.
Normalization Layer
Normalizes the embeddings to unit length so they are suitable for the triplet loss.
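The average, affine, and normalization layers can be sketched in a few lines of NumPy; the dimensions below are illustrative, not the paper's exact sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

T, feat_dim, emb_dim = 200, 2048, 512          # illustrative sizes
frames = rng.standard_normal((T, feat_dim))    # stand-in for frame-level network features

# Average layer: h = (1/T) * sum_t x_t, pooling over the time axis
h = frames.mean(axis=0)                        # shape (feat_dim,)

# Affine layer: a plain fully connected projection down to the embedding size
W = rng.standard_normal((emb_dim, feat_dim)) * 0.01
b = np.zeros(emb_dim)
emb = W @ h + b

# Normalization layer: length-normalize so embeddings lie on the unit sphere
emb = emb / np.linalg.norm(emb)
print(emb.shape, np.linalg.norm(emb))          # (512,) and norm 1.0
```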
The System
Triplet Loss
Triplet Loss FaceNet
Triplet Loss
(Figure: learning pulls the positive toward the anchor and pushes the negative away from it.)
Triplet Loss: Hard Negatives
The constraint is easily met, so most random triplets are uninformative. To find anchor-negative pairs that do not yet satisfy the constraint, candidates are mined across multiple GPUs.
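A minimal sketch of a triplet loss on cosine similarities, in the spirit of the slides; the margin value here is an arbitrary illustration, not the paper's setting:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss pushing s(anchor, positive) above s(anchor, negative) by `margin`."""
    s_ap = cosine(anchor, positive)
    s_an = cosine(anchor, negative)
    return max(0.0, s_an - s_ap + margin)

rng = np.random.default_rng(1)
a, p, n = (rng.standard_normal(512) for _ in range(3))
print(triplet_loss(a, p, n))
# Triplets whose loss is already zero contribute no gradient, which is why
# mining hard (constraint-violating) pairs matters for training.
```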
The System Softmax
Softmax vs Triplet Loss
Triplet loss performs better than softmax, and triplet loss with softmax pretraining performs better still. (Plots: error as a function of epochs trained.)
The System ResCNN
ResCNN 24 million parameters in total.
The System GRU
LSTM (review)
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent; σ_c: hyperbolic tangent or a variation
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)
Peephole LSTM
The gates look at the cell state c_{t-1} instead of the hidden state h_{t-1}:
f_t = σ_g(W_f x_t + U_f c_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i c_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o c_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent; σ_c: hyperbolic tangent or a variation
(Source: Graves, A., Mohamed, A.-R., Hinton, G., "Speech recognition with deep recurrent neural networks," ICASSP 2013, IEEE)
GRU (Gated Recurrent Unit)
z_t = σ_g(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ_g(W_r x_t + U_r h_{t-1} + b_r)
h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ σ_h(W_h x_t + U_h (r_t ∘ h_{t-1}) + b_h)
z_t: update gate; r_t: reset gate
σ_g: sigmoid function; σ_h: hyperbolic tangent
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)
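A single GRU step following these equations can be sketched in NumPy; the sizes and parameter initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate z, reset gate r, candidate state, blend."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde               # interpolate old and new

rng = np.random.default_rng(2)
d_in, d_h = 64, 128                                       # illustrative sizes
params = tuple(
    rng.standard_normal(shape) * 0.1
    for shape in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3
)
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), params)
print(h.shape)   # (128,)
```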
GRU (Gated recurrent unit) 23 million parameters in total.
The Datasets
Chinese:
- UIDs: anonymized voice search queries from anonymous phone users
- The wake phrase "Xiaodu, Xiaodu" (like "Okay, Google," but in Chinese)
English:
- MTurk: scripted English utterances collected via Amazon Mechanical Turk
EER (equal error rate): the operating point at which the false-acceptance rate equals the false-rejection rate; lower is better.
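A simple way to estimate the EER from a set of trial scores - a sketch of the metric, not the scoring code used in the paper:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: threshold where false-accept rate meets false-reject rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)              # sweep thresholds from low to high
    scores, labels = scores[order], labels[order]
    # At a threshold just above scores[i], trials 0..i are rejected.
    fr = np.cumsum(labels) / labels.sum()                    # rejected targets
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()    # accepted nontargets
    i = np.argmin(np.abs(fr - fa))
    return (fr[i] + fa[i]) / 2.0

# Overlapping score distributions push the EER up; separated ones push it toward 0.
tgt = np.array([0.9, 0.8, 0.85, 0.7])
non = np.array([0.1, 0.2, 0.3, 0.75])
print(eer(tgt, non))   # one of four nontargets outscores one target: 0.25
```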
DNN i-vector
The i-vector approach assumes that the sound decomposes as:
s = m + T·w
where s is the conversation-side supervector, m is the speaker-independent component, T is the total variability matrix, and w is the i-vector.
The DNN i-vector system learns m and T. To decide who the speaker is, find the enrolled i-vector closest to w.
Prior to this paper, the DNN i-vector system was the state of the art.
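The factor-analysis model and nearest-vector scoring can be sketched as follows; the dimensions, random parameters, and cosine scoring helper are illustrative assumptions, not the paper's trained system:

```python
import numpy as np

rng = np.random.default_rng(3)
sv_dim, iv_dim = 1024, 100                 # illustrative sizes

m = rng.standard_normal(sv_dim)            # speaker-independent mean supervector
T = rng.standard_normal((sv_dim, iv_dim))  # total variability matrix
w = rng.standard_normal(iv_dim)            # low-dimensional i-vector for one utterance

s = m + T @ w   # conversation-side supervector under the factor-analysis model

def cosine(a, b):
    """Cosine similarity, a common i-vector scoring rule."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scoring: pick the enrolled speaker whose i-vector is closest to the test one.
enrolled = {
    "spk1": rng.standard_normal(iv_dim),            # unrelated speaker
    "spk2": w + 0.01 * rng.standard_normal(iv_dim), # near-copy of the test vector
}
best = max(enrolled, key=lambda k: cosine(enrolled[k], w))
print(best)   # the enrolled speaker whose i-vector is closest to w
```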
Results
(Plots: EER for Mandarin text-independent speaker recognition and for English text-independent speaker recognition on the MTurk dataset.)
Conclusions
The experiments show that Deep Speaker significantly improves text-independent speaker recognition compared with the traditional DNN-based i-vector approach:
- On the Mandarin dataset (UIDs), the EER decreases by roughly 50% relative and error decreases by 60%.
- On the English dataset (MTurk), the EER decreases by about 30% relative and error decreases by 50%.
Resources / References
Papers:
- Deep Speaker: an End-to-End Neural Speaker Embedding System
- FaceNet: A Unified Embedding for Face Recognition and Clustering
- Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," 2014
Slides and graphs:
- lilt.cslt.org/talks/160520-deep%20speaker%20embedding%20for%20sre.pptx
- http://colah.github.io/posts/2015-08-Understanding-LSTMs
EER graph:
- Tronci, R., Giacinto, G., Roli, F., "Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method," MLDM 2009, pp. 163-177