Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan Niv
(Source: lilt.cslt.org/talks/160520-Deep%20Speaker%20Embedding%20for%20SRE.pptx)
Speech carries many kinds of information, each with its own recognition task:
- Speaker Recognition - who spoke?
- Speech Recognition - what was spoken?
- Language Recognition - what language was spoken?
- Accent Recognition - where is he/she from?
- Emotion Recognition - happy? sad? positive? negative?
- Gender Recognition - male or female?
Speaker Recognition: motivation
Speaker recognition can be used for:
- Authentication - enabling the use of an app only by a specific user
- Computer games
- Improving other speech-signal tasks
The System
MFCC (mel-frequency cepstral coefficient) front-end processing
Mel-frequency transform: a frequency warping modeled on human ear perception - roughly linear at low frequencies and logarithmic at high frequencies.
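As a small sketch of the Hz-to-mel warping, here is the commonly used 2595 * log10(1 + f/700) formula; the slides themselves do not give exact constants, so these are the conventional ones, not necessarily the paper's:

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (common O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping: mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The mapping is roughly linear below ~1 kHz and logarithmic above,
# mirroring how human pitch perception compresses high frequencies.
print(hz_to_mel(1000.0))   # close to 1000 mel by construction
```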
The System
Average Layer
The average layer averages the frame-level features over the time axis:
h = (1/T) * Σ_{t=0}^{T-1} x_t
where T is the number of frames in the utterance.
Affine Layer
The affine layer is a standard fully connected layer; it reduces the feature dimension.
Normalization Layer
Normalizes the embeddings to unit length so they are suitable for the triplet loss.
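The average, affine, and normalization layers can be sketched in a few lines of NumPy; the dimensions below are illustrative, not the paper's exact sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

T, feat_dim, emb_dim = 200, 2048, 512          # illustrative sizes
frames = rng.standard_normal((T, feat_dim))    # stand-in for frame-level network features

# Average layer: h = (1/T) * sum_t x_t, pooling over the time axis
h = frames.mean(axis=0)                        # shape (feat_dim,)

# Affine layer: a plain fully connected projection down to the embedding size
W = rng.standard_normal((emb_dim, feat_dim)) * 0.01
b = np.zeros(emb_dim)
emb = W @ h + b

# Normalization layer: length-normalize so embeddings lie on the unit sphere
emb = emb / np.linalg.norm(emb)
print(emb.shape, np.linalg.norm(emb))          # (512,) and norm 1.0
```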
The System
Triplet Loss
Triplet Loss FaceNet
Triplet Loss
(Figure: learning pulls the positive toward the anchor and pushes the negative away from it.)
Triplet Loss: Hard Negatives
The constraint is easily met, so most random triplets are uninformative. To find anchor-negative pairs that do not yet satisfy the constraint, candidates are mined across multiple GPUs.
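A minimal sketch of a triplet loss on cosine similarities, in the spirit of the slides; the margin value here is an arbitrary illustration, not the paper's setting:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge loss pushing s(anchor, positive) above s(anchor, negative) by `margin`."""
    s_ap = cosine(anchor, positive)
    s_an = cosine(anchor, negative)
    return max(0.0, s_an - s_ap + margin)

rng = np.random.default_rng(1)
a, p, n = (rng.standard_normal(512) for _ in range(3))
print(triplet_loss(a, p, n))
# Triplets whose loss is already zero contribute no gradient, which is why
# mining hard (constraint-violating) pairs matters for training.
```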
The System Softmax
Softmax vs Triplet Loss
Triplet loss performs better than softmax, and triplet loss with softmax pretraining performs better still. (Plots: error as a function of epochs trained.)
The System ResCNN
ResCNN 24 million parameters in total.
The System GRU
LSTM (review)
f_t = σ_g(W_f x_t + U_f h_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent; σ_c: hyperbolic tangent or a variation
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)
Peephole LSTM
The gates look at the cell state c_{t-1} instead of the hidden state h_{t-1}:
f_t = σ_g(W_f x_t + U_f c_{t-1} + b_f)
i_t = σ_g(W_i x_t + U_i c_{t-1} + b_i)
o_t = σ_g(W_o x_t + U_o c_{t-1} + b_o)
c_t = f_t ∘ c_{t-1} + i_t ∘ σ_c(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent; σ_c: hyperbolic tangent or a variation
(Source: Graves, A., Mohamed, A.-R., Hinton, G., "Speech recognition with deep recurrent neural networks," ICASSP 2013, IEEE)
GRU (Gated Recurrent Unit)
z_t = σ_g(W_z x_t + U_z h_{t-1} + b_z)
r_t = σ_g(W_r x_t + U_r h_{t-1} + b_r)
h_t = z_t ∘ h_{t-1} + (1 - z_t) ∘ σ_h(W_h x_t + U_h (r_t ∘ h_{t-1}) + b_h)
z_t: update gate; r_t: reset gate
σ_g: sigmoid function; σ_h: hyperbolic tangent
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)
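A single GRU step following these equations can be sketched in NumPy; the sizes and parameter initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate z, reset gate r, candidate state, blend."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde               # interpolate old and new

rng = np.random.default_rng(2)
d_in, d_h = 64, 128                                       # illustrative sizes
params = tuple(
    rng.standard_normal(shape) * 0.1
    for shape in [(d_h, d_in), (d_h, d_h), (d_h,)] * 3
)
h = gru_step(rng.standard_normal(d_in), np.zeros(d_h), params)
print(h.shape)   # (128,)
```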
GRU (Gated recurrent unit) 23 million parameters in total.
The Datasets
Chinese:
- UIDs: anonymized voice search queries from anonymous phone users
- The wake phrase "Xiaodu, Xiaodu" (like "Okay, Google," but in Chinese)
English:
- MTurk: scripted English utterances collected via Amazon Mechanical Turk
EER (equal error rate): the operating point at which the false-acceptance rate equals the false-rejection rate; lower is better.
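A simple way to estimate the EER from a set of trial scores - a sketch of the metric, not the scoring code used in the paper:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: threshold where false-accept rate meets false-reject rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)              # sweep thresholds from low to high
    scores, labels = scores[order], labels[order]
    # At a threshold just above scores[i], trials 0..i are rejected.
    fr = np.cumsum(labels) / labels.sum()                    # rejected targets
    fa = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()    # accepted nontargets
    i = np.argmin(np.abs(fr - fa))
    return (fr[i] + fa[i]) / 2.0

# Overlapping score distributions push the EER up; separated ones push it toward 0.
tgt = np.array([0.9, 0.8, 0.85, 0.7])
non = np.array([0.1, 0.2, 0.3, 0.75])
print(eer(tgt, non))   # one of four nontargets outscores one target: 0.25
```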
DNN i-vector
The i-vector approach assumes that the sound decomposes as:
s = m + T·w
where s is the conversation-side supervector, m is the speaker-independent component, T is the total variability matrix, and w is the i-vector.
The DNN i-vector system learns m and T. To decide who the speaker is, find the enrolled i-vector closest to w.
Prior to this paper, the DNN i-vector system was the state of the art.
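The factor-analysis model and nearest-vector scoring can be sketched as follows; the dimensions, random parameters, and cosine scoring helper are illustrative assumptions, not the paper's trained system:

```python
import numpy as np

rng = np.random.default_rng(3)
sv_dim, iv_dim = 1024, 100                 # illustrative sizes

m = rng.standard_normal(sv_dim)            # speaker-independent mean supervector
T = rng.standard_normal((sv_dim, iv_dim))  # total variability matrix
w = rng.standard_normal(iv_dim)            # low-dimensional i-vector for one utterance

s = m + T @ w   # conversation-side supervector under the factor-analysis model

def cosine(a, b):
    """Cosine similarity, a common i-vector scoring rule."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Scoring: pick the enrolled speaker whose i-vector is closest to the test one.
enrolled = {
    "spk1": rng.standard_normal(iv_dim),            # unrelated speaker
    "spk2": w + 0.01 * rng.standard_normal(iv_dim), # near-copy of the test vector
}
best = max(enrolled, key=lambda k: cosine(enrolled[k], w))
print(best)   # the enrolled speaker whose i-vector is closest to w
```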
Results
(Plots: EER for Mandarin text-independent speaker recognition and for English text-independent speaker recognition on the MTurk dataset.)
Conclusions
The experiments show that Deep Speaker significantly improves text-independent speaker recognition compared with the traditional DNN-based i-vector approach:
- On the Mandarin dataset (UIDs), the EER decreases by roughly 50% relative and error decreases by 60%.
- On the English dataset (MTurk), the EER decreases by about 30% relative and error decreases by 50%.
Resources / References
Papers:
- Deep Speaker: an End-to-End Neural Speaker Embedding System
- FaceNet: A Unified Embedding for Face Recognition and Clustering
- Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation," 2014
Slides and graphs:
- lilt.cslt.org/talks/160520-deep%20speaker%20embedding%20for%20sre.pptx
- http://colah.github.io/posts/2015-08-Understanding-LSTMs
EER graph:
- Tronci, R., Giacinto, G., Roli, F., "Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method," MLDM 2009, pp. 163-177