Presented By: Omer Shmueli and Sivan Niv

1 Deep Speaker: an End-to-End Neural Speaker Embedding System
Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu
Presented By: Omer Shmueli and Sivan Niv

2 (Source: lilt.cslt.org/talks/Deep%20Speaker%20Embedding%20for%20SRE.pptx)
Accent Recognition: Where is he/she from?
Language Recognition: What language was spoken?
Speech Recognition: What was spoken?
Emotion Recognition: Positive? Negative? Happy? Sad?
Gender Recognition: Male or Female?
Speaker Recognition: Who spoke?

3 Speaker Recognition motivation
Speaker recognition may be used for authentication.
It can restrict the use of an app to a specific user.
Computer games.
It can improve other speech signal tasks.

4 The System

5 MFCC (mel-frequency cepstral coefficients) front-end processing

6 Mel-frequency Transform
Models human ear perception.
[Figure labels: logarithmic, linear]
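The mel scale is roughly linear at low frequencies and logarithmic at high ones. The slide does not give a formula, so the sketch below assumes one widely used conversion, m = 2595 log10(1 + f / 700); the function names are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed 2595*log10(1 + f/700) variant)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


# Equal steps in Hz shrink on the mel axis as frequency grows.
for f in (100, 500, 1000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

With this variant, 1000 Hz maps to approximately 1000 mel, and a 1 kHz step near 8 kHz covers far fewer mels than the same step near 0 Hz.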

7 The System

8 Average Layer
The average layer in the network averages all feature dimensions over the time axis:
$h = \frac{1}{T} \sum_{t=0}^{T-1} x(t)$
T is the number of frames in the utterance.
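As a small illustration of the average layer, the sketch below averages frame-level features over the time axis to get one utterance-level vector. It uses plain Python lists to stay dependency-free; the function name is illustrative.

```python
def temporal_average(frames):
    """Average T frame-level feature vectors into one utterance-level vector.

    frames: list of T vectors (lists of floats), all of the same dimension.
    """
    if not frames:
        raise ValueError("need at least one frame")
    num_frames = len(frames)
    dim = len(frames[0])
    # h[d] = (1/T) * sum over t of x(t)[d]
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


# Two frames of 2-dimensional features.
print(temporal_average([[1.0, 2.0], [3.0, 4.0]]))
```

The output has the same dimension as a single frame, regardless of how many frames the utterance contains.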

9 Affine Layer
The affine layer is a normal fully connected layer.
It reduces the feature dimension.
Normalization Layer
Normalizes the inputs for the triplet loss.
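A minimal sketch of these two layers, assuming the normalization is a scaling to unit length (L2 norm); the slide only says the inputs are normalized, so that choice and all names and weights here are illustrative.

```python
import math


def affine(h, W, b):
    """Fully connected layer: returns W h + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def length_normalize(v, eps=1e-12):
    """Scale v to unit L2 norm (assumed form of the normalization layer)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


# Hypothetical weights that reduce a 3-dimensional feature to 2 dimensions.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0]]
b = [0.5, -1.0]
embedding = length_normalize(affine([1.0, 2.0, 3.0], W, b))
print(embedding)
```

Unit-length embeddings make the cosine similarity between two speakers a plain dot product.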

10 The System

11 Triplet Loss

12 Triplet Loss FaceNet

13 Triplet Loss
[Figure: anchor, positive and negative examples before and after learning]

14 Triplet Loss Hard Negative
The constraint is easily met. To find AP that do not satisfy the constraint, multiple GPUs are used.
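A minimal sketch of a triplet loss and of picking a hard negative, assuming cosine similarity between embeddings and a loss of the form max(0, s_an - s_ap + alpha). The margin value, the function names and the single-machine search are illustrative assumptions; the slides only state that the search is spread over multiple GPUs.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def triplet_loss(anchor, positive, negative, alpha=0.1):
    """Zero when the positive is closer to the anchor than the negative by alpha."""
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, s_an - s_ap + alpha)


def hardest_negative(anchor, negatives):
    """Return the negative most similar to the anchor (the hardest one)."""
    return max(negatives, key=lambda n: cosine_similarity(anchor, n))


anchor = [1.0, 0.0]
easy = triplet_loss(anchor, [1.0, 0.0], [0.0, 1.0])  # constraint met, no loss
hard = triplet_loss(anchor, [1.0, 1.0], [1.0, 0.1])  # constraint violated
print(easy, hard)
```

Triplets like the first one contribute nothing to the gradient, which is why training looks for negatives that still violate the constraint.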

15 The System Softmax

16 Softmax vs Triplet Loss
Triplet loss performs better than softmax.
Triplet loss with softmax pre-training is even better.
[Figures: x-axis "Epoch trained" on both plots]

17 The System ResCNN

18 ResCNN 24 million parameters in total.

19 The System GRU

20 LSTM (review)
$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$
$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \circ \sigma_h(c_t)$
$\sigma_g$: sigmoid function
$\sigma_h$: hyperbolic tangent function
$\sigma_c$: hyperbolic tangent function or a variation
(Source: LSTMs)
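The LSTM equations can be read as a single time step. The sketch below follows them term by term in plain Python, assuming tanh for both $\sigma_c$ and $\sigma_h$; the parameter dictionary layout and function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(W, U, b, x, h, act):
    """Compute act(W x + U h + b) row by row (W and U are lists of rows)."""
    return [
        act(sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h))
            + bias)
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_f', 'U_f', 'b_f' to parameters."""
    f = gate(p["W_f"], p["U_f"], p["b_f"], x, h_prev, sigmoid)    # forget gate
    i = gate(p["W_i"], p["U_i"], p["b_i"], x, h_prev, sigmoid)    # input gate
    o = gate(p["W_o"], p["U_o"], p["b_o"], x, h_prev, sigmoid)    # output gate
    cand = gate(p["W_c"], p["U_c"], p["b_c"], x, h_prev, math.tanh)
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, cand)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# Hypothetical all-zero parameters with one input and one hidden unit.
zero = {f"{m}_{g}": [[0.0]] for m in ("W", "U") for g in ("f", "i", "o", "c")}
zero.update({f"b_{g}": [0.0] for g in ("f", "i", "o", "c")})
h, c = lstm_step([1.0], [0.0], [1.0], zero)
print(h, c)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved, which makes the step easy to check by hand.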

21 Peephole LSTM
$f_t = \sigma_g(W_f x_t + U_f c_{t-1} + b_f)$
$i_t = \sigma_g(W_i x_t + U_i c_{t-1} + b_i)$
$o_t = \sigma_g(W_o x_t + U_o c_{t-1} + b_o)$
$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \circ \sigma_h(c_t)$
$\sigma_g$: sigmoid function
$\sigma_h$: hyperbolic tangent function
$\sigma_c$: hyperbolic tangent function or a variation
(Source: Graves, A., Mohamed, A. R., & Hinton, G. "Speech recognition with deep recurrent neural networks." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013)

22 LSTM (review)
$f_t = \sigma_g(W_f x_t + U_f h_{t-1} + b_f)$
$i_t = \sigma_g(W_i x_t + U_i h_{t-1} + b_i)$
$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$
$c_t = f_t \circ c_{t-1} + i_t \circ \sigma_c(W_c x_t + U_c h_{t-1} + b_c)$
$h_t = o_t \circ \sigma_h(c_t)$
$\sigma_g$: sigmoid function
$\sigma_h$: hyperbolic tangent function
$\sigma_c$: hyperbolic tangent function or a variation
(Source: LSTMs)

23 GRU (Gated recurrent unit)
$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \sigma_h(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
Also on this slide, carried over from the LSTM slide:
$o_t = \sigma_g(W_o x_t + U_o h_{t-1} + b_o)$
$h_t = o_t \circ \sigma_h(c_t)$
$\sigma_g$: sigmoid function
$\sigma_h$: hyperbolic tangent function
$\sigma_c$: hyperbolic tangent function or a variation
(Source: LSTMs)

24 GRU (Gated recurrent unit)
$z_t = \sigma_g(W_z x_t + U_z h_{t-1} + b_z)$
$r_t = \sigma_g(W_r x_t + U_r h_{t-1} + b_r)$
$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \sigma_h(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)$
$z_t$: update gate
$r_t$: reset gate
$\sigma_g$: sigmoid function
$\sigma_h$: hyperbolic tangent function
$\sigma_c$: hyperbolic tangent function or a variation
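A single GRU time step, written as a plain-Python sketch that follows the update gate, reset gate and hidden-state equations of this slide term by term. The parameter dictionary layout and function names are illustrative, and tanh is assumed for $\sigma_h$.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(W, U, b, x, h, act):
    """Compute act(W x + U h + b) row by row (W and U are lists of rows)."""
    return [
        act(sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h))
            + bias)
        for w_row, u_row, bias in zip(W, U, b)
    ]


def gru_step(x, h_prev, p):
    """One GRU step; p maps names such as 'W_z', 'U_z', 'b_z' to parameters."""
    z = gate(p["W_z"], p["U_z"], p["b_z"], x, h_prev, sigmoid)  # update gate
    r = gate(p["W_r"], p["U_r"], p["b_r"], x, h_prev, sigmoid)  # reset gate
    reset_h = [rt * hp for rt, hp in zip(r, h_prev)]            # r_t * h_{t-1}
    cand = gate(p["W_h"], p["U_h"], p["b_h"], x, reset_h, math.tanh)
    # z close to 1 keeps the old state, z close to 0 takes the candidate.
    return [zt * hp + (1.0 - zt) * ct for zt, hp, ct in zip(z, h_prev, cand)]


# Hypothetical all-zero parameters with one input and one hidden unit.
zero = {f"{m}_{g}": [[0.0]] for m in ("W", "U") for g in ("z", "r", "h")}
zero.update({f"b_{g}": [0.0] for g in ("z", "r", "h")})
h_new = gru_step([1.0], [1.0], zero)
print(h_new)
```

Compared with the LSTM, there is no separate cell state and no output gate: the hidden state itself is interpolated between its previous value and the candidate.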

25 GRU (Gated recurrent unit) 23 million parameters in total.

26 The Datasets
Chinese: anonymized voice search queries from anonymous phone users.
Chinese: the phrase "Xiaodu, Xiaodu" (like "Okay, Google", but in Chinese).
English: scripted English utterances collected via Amazon Mechanical Turk.

27 EER (equal error rate)
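The equal error rate is the operating point at which the false acceptance rate (impostors accepted) equals the false rejection rate (genuine speakers rejected). The sketch below estimates it with a simple threshold sweep over hypothetical similarity scores; the function name, the scores and the tie-breaking rule are illustrative assumptions.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from same-speaker and different-speaker scores.

    A trial is accepted when its score is >= the threshold. Returns the
    (EER estimate, threshold) at which FAR and FRR are closest.
    """
    if not genuine or not impostor:
        raise ValueError("need both genuine and impostor scores")
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        far = sum(s >= thr for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < thr for s in genuine) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Hypothetical scores: higher means "more likely the same speaker".
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

A lower EER means the genuine and impostor score distributions overlap less.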

28 DNN i-vector
The i-vector approach assumes that the sound is composed of:
$s = m + T w$
s: conversation-side supervector
m: speaker-independent component
T: total variability matrix
w: i-vector
The DNN i-vector system learns m and T.
To determine who the speaker is, find the anchor i-vector closest to w.
Prior to this paper, the DNN i-vector system was the state of the art.
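A toy sketch of the decomposition s = m + T w and of picking the closest anchor i-vector. Cosine similarity is assumed as the closeness measure, and the speaker labels, dimensions and values are hypothetical; estimating m and T from data is not shown.

```python
import math


def supervector(m, T, w):
    """Compose s = m + T w, with T given as a list of rows."""
    return [m_i + sum(t * w_j for t, w_j in zip(row, w)) for m_i, row in zip(m, T)]


def cosine_similarity(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norms, eps)


def identify(w, anchors):
    """Return the label of the anchor i-vector most similar to w."""
    return max(anchors, key=lambda label: cosine_similarity(w, anchors[label]))


# Hypothetical 2-dimensional example.
s = supervector([1.0, 1.0], [[1.0, 0.0], [0.0, 2.0]], [3.0, 4.0])
anchors = {"speaker_a": [1.0, 0.0], "speaker_b": [0.0, 1.0]}
print(s, identify([1.0, 0.1], anchors))
```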

29 Results
[Result panels: Mandarin text-independent speaker; English text-independent speaker (MTurk dataset)]

30 Conclusions
The experiments show that the Deep Speaker algorithm significantly improves text-independent speaker recognition compared to the traditional DNN-based i-vector approach.
On the Mandarin dataset (UIDs), the EER decreases by roughly 50% relative and the error decreases by 60%.
On the English dataset (MTurk), the EER decreases by 30% relative and the error decreases by 50%.

32 Resources / References
Papers:
Deep Speaker: an End-to-End Neural Speaker Embedding System
FaceNet: A Unified Embedding for Face Recognition and Clustering
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al.
Slides and graphs:
lilt.cslt.org/talks/deep%20speaker%20embedding%20for%20sre.pptx
The EER graph: Tronci R, Giacinto G, Roli F. Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method. In MLDM 2009 Jul 21 (pp ).
