
Deep Speaker: an End-to-End Neural Speaker Embedding System
Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu
Presented By: Omer Shmueli and Sivan Niv

(Source: lilt.cslt.org/talks/160520-Deep%20Speaker%20Embedding%20for%20SRE.pptx)
Accent Recognition: Where is he/she from?
Language Recognition: What language was spoken?
Speech Recognition: What was spoken?
Emotion Recognition: Positive? Negative? Happy? Sad?
Gender Recognition: Male or Female?
Speaker Recognition: Who spoke?

Speaker Recognition: Motivation
Speaker recognition may be used for authentication, e.g. enabling the use of an app only by a specific user.
Computer games.
Improving other speech-signal tasks.

The System

MFCC (mel-frequency cepstral coefficient) front-end processing

Mel-frequency transform
Motivated by human-ear perception: the mel scale is roughly linear at low frequencies and logarithmic at high frequencies.
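As a concrete illustration (not taken from the paper), here is a minimal sketch of MFCC extraction with librosa; the file name is hypothetical and the window, hop, and coefficient settings are librosa defaults rather than the front-end configuration the authors used.

import librosa

# Hypothetical input file; 16 kHz is a common rate for speech but an assumption here.
y, sr = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame with librosa's default framing; shape is (n_mfcc, num_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)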

The System

Average Layer
The average layer averages the frame-level features over the time axis:
h = (1/T) Σ_{t=0}^{T−1} x_t
where T is the number of frames in the utterance.

Affine Layer
The affine layer is a normal fully connected layer; it reduces the feature dimension.
Normalization Layer
Normalizes the embeddings before the triplet loss.
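A minimal PyTorch sketch of these last three steps (average over time, affine projection, length normalization); the feature and embedding sizes are placeholders, not the paper's values.

import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Average layer + affine layer + normalization layer (dimensions are assumed)."""
    def __init__(self, feat_dim=2048, emb_dim=512):
        super().__init__()
        self.affine = nn.Linear(feat_dim, emb_dim)  # fully connected, reduces dimension

    def forward(self, x):                  # x: (batch, T, feat_dim) frame-level features
        h = x.mean(dim=1)                  # average over the time axis: h = (1/T) sum_t x_t
        e = self.affine(h)                 # affine layer
        return F.normalize(e, p=2, dim=1)  # unit-length embeddings for the triplet loss

frames = torch.randn(4, 300, 2048)         # 4 utterances, 300 frames each
print(EmbeddingHead()(frames).shape)       # torch.Size([4, 512])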

The System

Triplet Loss

Triplet Loss (FaceNet)

Triplet Loss
[Diagram: an anchor with a positive (same speaker) and a negative (different speaker) example; learning pulls the anchor and positive closer together and pushes the anchor and negative further apart.]

Triplet Loss: Hard Negatives
For most randomly selected triplets the constraint is easily met, so they contribute little to training. In order to find triplets that do not yet satisfy the constraint (hard examples), mining is performed across multiple GPUs.
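A minimal sketch of a cosine-similarity triplet loss with in-batch hard-negative selection. The hinge form max(0, s_an − s_ap + α) follows the paper's description of the objective, but the batch-level mining here is only a stand-in for the multi-GPU triplet search mentioned above; the margin value and tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def triplet_loss_hard_negative(anchor, positive, negative, margin=0.1):
    """anchor, positive, negative: (B, D) length-normalized embeddings."""
    s_ap = (anchor * positive).sum(dim=1)        # cosine similarity anchor-positive, (B,)
    s_an_all = anchor @ negative.t()             # anchor vs. every candidate negative, (B, B)
    s_an, _ = s_an_all.max(dim=1)                # hardest (most similar) negative per anchor
    return F.relu(s_an - s_ap + margin).mean()   # hinge: max(0, s_an - s_ap + margin)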

The System Softmax

Softmax vs. Triplet Loss
Triplet loss achieves better performance than softmax alone, and triplet loss after softmax pre-training is better still.
[Plots: performance over epochs trained for each training scheme.]

The System ResCNN

ResCNN: 24 million parameters in total.
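For intuition, a minimal sketch of one residual convolutional block of the kind a ResCNN stacks; the 3x3 kernels, batch normalization, channel count, and plain ReLU (the paper uses a clipped variant) are assumptions, not the exact architecture.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut (channel count preserved)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + x)   # residual (identity) connection

print(ResBlock(64)(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])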

The System GRU

LSTM (review)
f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent function; σ_c: hyperbolic tangent function or a variation
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)
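To make the recurrence concrete, a minimal NumPy sketch of one LSTM step following the equations above; the parameter dictionary, its key names, and the shapes are our own convention.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step. p holds W_*, U_*, b_* for * in {f, i, o, c};
    W_* is (H, D), U_* is (H, H), b_* is (H,)."""
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t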

Peephole LSTM
f_t = σ_g(W_f x_t + U_f c_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i c_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o c_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent function; σ_c: hyperbolic tangent function or a variation
(Source: Graves, A., Mohamed, A.-R., & Hinton, G. "Speech recognition with deep recurrent neural networks." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.)

LSTM (review)
f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)
c_t = f_t ∘ c_{t−1} + i_t ∘ σ_c(W_c x_t + U_c h_{t−1} + b_c)
h_t = o_t ∘ σ_h(c_t)
σ_g: sigmoid function; σ_h: hyperbolic tangent function; σ_c: hyperbolic tangent function or a variation
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)

GRU (Gated Recurrent Unit)
Starting from the LSTM, the separate cell state and the output gate are removed:
z_t = σ_g(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ_g(W_r x_t + U_r h_{t−1} + b_r)
h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ σ_h(W_h x_t + U_h (r_t ∘ h_{t−1}) + b_h)
(the LSTM's output gate o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o) and output h_t = o_t ∘ σ_h(c_t) no longer appear)
σ_g: sigmoid function; σ_h: hyperbolic tangent function
(Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs)

GRU (Gated Recurrent Unit)
z_t = σ_g(W_z x_t + U_z h_{t−1} + b_z)   (update gate)
r_t = σ_g(W_r x_t + U_r h_{t−1} + b_r)   (reset gate)
h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ σ_h(W_h x_t + U_h (r_t ∘ h_{t−1}) + b_h)
σ_g: sigmoid function; σ_h: hyperbolic tangent function
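And the matching sketch for one GRU step, following the slide's convention h_t = z_t ∘ h_{t−1} + (1 − z_t) ∘ h̃_t; again the parameter names and shapes are our own convention.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step. p holds W_z, U_z, b_z, W_r, U_r, b_r, W_h, U_h, b_h;
    W_* is (H, D), U_* is (H, H), b_* is (H,)."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # reset gate
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    return z_t * h_prev + (1.0 - z_t) * h_tilde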

GRU (Gated Recurrent Unit): 23 million parameters in total.

The Datasets
UIDs (Mandarin): anonymized voice-search queries from anonymous phone users.
XiaoDu (Mandarin): utterances of the wake phrase "Xiaodu, Xiaodu" (like "Okay, Google", but in Chinese).
MTurk (English): scripted English utterances collected through Amazon Mechanical Turk.

EER (equal error rate): the operating point at which the false-acceptance rate equals the false-rejection rate.
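A minimal sketch of how EER can be computed from trial scores, assuming scikit-learn's ROC utilities; this is a generic recipe, not the evaluation code used in the paper.

import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for same-speaker trials, 0 otherwise; scores: higher = more similar."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                                  # false-rejection rate
    idx = np.nanargmin(np.abs(fnr - fpr))            # threshold where FAR is closest to FRR
    return (fpr[idx] + fnr[idx]) / 2.0

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.4])
print(equal_error_rate(labels, scores))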

DNN i-vector
The i-vector approach assumes that the conversation-side supervector decomposes as
s = m + T w
where s is the conversation-side supervector, m is the speaker-independent component, T is the total-variability matrix, and w is the i-vector.
The DNN i-vector system learns m and T.
To decide who the speaker is, find the enrolled (anchor) i-vector closest to w.
Prior to this paper, the DNN i-vector system was the state of the art.

Results
[Tables: Mandarin text-independent speaker recognition; English text-independent speaker recognition (MTurk dataset).]

Conclusions
The experiments show that the Deep Speaker algorithm significantly improves text-independent speaker recognition compared to the traditional DNN-based i-vector approach.
On the Mandarin UIDs dataset, EER decreases by roughly 50% relative and error decreases by about 60%.
On the English MTurk dataset, EER decreases by about 30% relative and error decreases by about 50%.

Resources / References
Papers:
Deep Speaker: an End-to-End Neural Speaker Embedding System
FaceNet: A Unified Embedding for Face Recognition and Clustering
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Cho et al., 2014
Slides and graphs:
lilt.cslt.org/talks/160520-Deep%20Speaker%20Embedding%20for%20SRE.pptx
http://colah.github.io/posts/2015-08-Understanding-LSTMs
Tronci, R., Giacinto, G., & Roli, F. "Dynamic Score Combination: A Supervised and Unsupervised Score Combination Method." MLDM 2009, pp. 163-177 (the EER graph).