Segmental Recurrent Neural Networks for End-to-end Speech Recognition


Segmental Recurrent Neural Networks for End-to-end Speech Recognition
Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals
TTI-Chicago, UoE, CMU and UW
9 September 2016

Background: a new wave of sequence modelling
- I. Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014
- D. Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015
- A. Graves and N. Jaitly, Towards End-to-End Speech Recognition with Recurrent Neural Networks, ICML 2014

Background: maybe it is time to review sequence modelling for speech. Why is speech recognition special?
- monotonic alignment
- long input sequences
- the output sequence (words/phonemes) is much shorter than the input

Speech Recognition: properties and the challenges they pose
- monotonic alignment: a challenge for attention models
- long input sequences: expensive for globally (sequence-level) normalised models
- output sequence much shorter (words/phonemes): length mismatch; an explicit alignment model, or not?

Speech Recognition
Hidden Markov Model: monotonic alignment; long input sequences handled by local (frame-level) normalisation; length mismatch handled by hidden states.
Connectionist Temporal Classification: monotonic alignment; long input sequences handled by local normalisation; length mismatch handled by the blank state.

Speech Recognition: locally normalised models
- conditional independence assumption
- label bias problem
- better results come from sequence training: local → global normalisation
Question: why not stick with a globally normalised model from scratch?
[1] D. Andor et al., Globally Normalized Transition-Based Neural Networks, ACL, 2016.
[2] D. Povey et al., Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI, Interspeech, 2016.

(Segmental) Conditional Random Field
[Figure: CRF vs. segmental CRF]

(Segmental) Conditional Random Field
CRF [Lafferty et al. 2001]:
P(y_{1:L} \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_j \exp\left( w^\top \Phi(y_j, x_{1:T}) \right)   (1)
where L = T.
Segmental (semi-Markov) CRF [Sarawagi and Cohen 2004]:
P(y_{1:L}, E \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_j \exp\left( w^\top \Phi(y_j, e_j, x_{1:T}) \right)   (2)
where e_j = \langle s_j, n_j \rangle denotes the beginning (s_j) and end (n_j) time tags of y_j, and E = \{e_{1:L}\} is the latent segmentation.
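For a zeroth-order segmental CRF (no dependencies between segment labels), the partition function Z(x_{1:T}) in (2) can be computed with a forward dynamic programme over segment end points. A minimal sketch in plain Python, with a toy additive segment score standing in for the learned w^⊤Φ(·) (the score function and all names here are illustrative, not the paper's model):

```python
import math

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def segment_score(x, y, s, e):
    """Toy stand-in for w . Phi(y, <s, e>, x): sum of per-frame scores for label y."""
    return sum(frame[y] for frame in x[s:e])

def log_partition(x, num_labels, max_len):
    """log Z(x): forward pass over all segmentations with segments of <= max_len frames."""
    T = len(x)
    alpha = [float("-inf")] * (T + 1)  # alpha[t] = log-sum over segmentations of x[:t]
    alpha[0] = 0.0
    for t in range(1, T + 1):
        terms = [alpha[t - d] + segment_score(x, y, t - d, t)
                 for d in range(1, min(max_len, t) + 1)
                 for y in range(num_labels)]
        alpha[t] = log_sum_exp(terms)
    return alpha[T]

def log_partition_brute(x, num_labels, max_len):
    """Brute-force check: enumerate every (segmentation, labelling) pair explicitly."""
    T, terms = len(x), []
    def rec(start, score):
        if start == T:
            terms.append(score)
            return
        for d in range(1, min(max_len, T - start) + 1):
            for y in range(num_labels):
                rec(start + d, score + segment_score(x, y, start, start + d))
    rec(0, 0.0)
    return log_sum_exp(terms)

x = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]  # 3 frames, 2 labels
```

The forward pass costs O(T · max_len · |labels|), versus the exponential enumeration of the brute-force check; replacing log-sum-exp with max in the same recursion gives the Viterbi segmentation.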

(Segmental) Conditional Random Field
\frac{1}{Z(x_{1:T})} \prod_j \exp\left( w^\top \Phi(y_j, x_{1:T}) \right)
- w is the learnable parameter vector
- the feature function \Phi(\cdot) must be engineered
- designing \Phi(\cdot) is much harder for speech than for NLP

Segmental Recurrent Neural Network
Using (recurrent) neural networks to learn the feature function \Phi(\cdot).
[Figure: inputs x_1 ... x_6 grouped into segments, each segment mapped to a label y_1, y_2, y_3]

Segmental Recurrent Neural Network: training criteria
Conditional maximum likelihood:
L(\theta) = \log P(y_{1:L} \mid x_{1:T}) = \log \sum_E P(y_{1:L}, E \mid x_{1:T})   (3)
Hinge loss, similar to structured SVMs: not studied yet!
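The marginal in (3) sums over all segmentations E consistent with the given label sequence y_{1:L}. A sketch of this "clamped" forward pass, again assuming a toy additive segment score in place of the learned features (names are illustrative):

```python
import math

def log_sum_exp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def segment_score(x, y, s, e):
    # toy stand-in for w . Phi: sum of per-frame scores for label y
    return sum(frame[y] for frame in x[s:e])

def clamped_forward(x, labels, max_len):
    """log sum_E exp(score(labels, E, x)): only segmentations consistent with `labels`."""
    T, L = len(x), len(labels)
    NEG = float("-inf")
    # alpha[t][j] = log-sum over segmentations of x[:t] into the first j labels
    alpha = [[NEG] * (L + 1) for _ in range(T + 1)]
    alpha[0][0] = 0.0
    for t in range(1, T + 1):
        for j in range(1, min(L, t) + 1):
            terms = [alpha[t - d][j - 1] + segment_score(x, labels[j - 1], t - d, t)
                     for d in range(1, min(max_len, t) + 1)
                     if alpha[t - d][j - 1] > NEG]
            if terms:
                alpha[t][j] = log_sum_exp(terms)
    return alpha[T][L]
```

Subtracting the log partition function \log Z(x_{1:T}) from this clamped quantity gives the conditional log-likelihood L(\theta).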

Segmental Recurrent Neural Network: Viterbi decoding
Partial Viterbi decoding:
y^*_{1:L} = \arg\max_{y_{1:L}} \log \sum_E P(y_{1:L}, E \mid x_{1:T})   (4)
Fully Viterbi decoding:
(y^*_{1:L}, E^*) = \arg\max_{y_{1:L}, E} \log P(y_{1:L}, E \mid x_{1:T})   (5)
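Fully Viterbi decoding (5) uses the same forward recursion as the partition function, with max in place of log-sum-exp plus a backtrace. A minimal sketch with a toy per-frame segment score (illustrative, not the paper's learned features):

```python
def viterbi_decode(x, num_labels, max_len):
    """argmax over (labels, segmentation); returns segments (y, start, end) and score."""
    def segment_score(y, s, e):
        # toy stand-in for w . Phi: sum of per-frame scores for label y
        return sum(frame[y] for frame in x[s:e])

    T = len(x)
    best = [float("-inf")] * (T + 1)  # best[t] = best score over segmentations of x[:t]
    back = [None] * (T + 1)           # back[t] = (segment start, label) of the last segment
    best[0] = 0.0
    for t in range(1, T + 1):
        for d in range(1, min(max_len, t) + 1):
            for y in range(num_labels):
                score = best[t - d] + segment_score(y, t - d, t)
                if score > best[t]:
                    best[t], back[t] = score, (t - d, y)
    segments, t = [], T
    while t > 0:  # trace back the best segmentation
        s, y = back[t]
        segments.append((y, s, t))
        t = s
    return list(reversed(segments)), best[T]
```

Partial Viterbi decoding (4) replaces only the outer max, keeping the sum over segmentations E, so it needs the clamped forward pass inside the search over label sequences.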

Related work
- (segmental) CRFs for speech
- neural CRFs
- structured SVMs
Two good review papers:
- M. Gales, S. Watanabe and E. Fosler-Lussier, Structured Discriminative Models for Speech Recognition, IEEE Signal Processing Magazine, 2012
- E. Fosler-Lussier et al., Conditional Random Fields in Speech, Audio, and Language Processing, Proceedings of the IEEE, 2013

Comparison to CTC
[1] A. Senior et al., Acoustic Modelling with CD-CTC-sMBR LSTM RNNs, ASRU 2015.

Comparison to CTC
[Figure: each frame x_1 ... x_4 emits one output ŷ_1 ... ŷ_4]
p(y|x) = p(ŷ_1|x_1) · p(ŷ_2|x_2) · p(ŷ_3|x_3) · p(ŷ_4|x_4)

Comparison to CTC
[Figure: the first frame emits the blank symbol ⟨b⟩]
p(y|x) = p(⟨b⟩|x_1) · p(ŷ_2|x_2) · p(ŷ_3|x_3) · p(ŷ_4|x_4), with p(⟨b⟩|x_1) = 1

Comparison to CTC
p(y|x) = p(⟨b⟩|x_1) · p(ŷ_2|x_2) · p(ŷ_3|x_3) · p(ŷ_4|x_4)

Comparison to CTC
[Figure: the first two frames emit blanks]
p(y|x) = p(⟨b⟩|x_1) · p(⟨b⟩|x_2) · p(ŷ_3|x_3) · p(ŷ_4|x_4), with p(⟨b⟩|x_1) = p(⟨b⟩|x_2) = 1

Comparison to CTC
p(y|x) = p(⟨b⟩|x_1) · p(⟨b⟩|x_2) · p(ŷ_3|x_3) · p(ŷ_4|x_4)

Comparison to CTC
[Figure: blanks ⟨b⟩ ⟨b⟩ followed by ŷ_3 ŷ_4 over x_1 ... x_4]
The CTC loss may do some kind of segmental modelling.
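Each frame-level product above is the probability of one alignment; the actual CTC loss sums over every alignment whose blank-and-repeat collapse yields the target sequence. A brute-force sketch for toy-sized inputs (exponential in T, for illustration only):

```python
from itertools import product

def collapse(path, blank=0):
    """CTC collapse: merge consecutive repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out

def ctc_prob(target, posteriors, blank=0):
    """p(target | x): sum over all frame-level paths that collapse to `target`."""
    T, V = len(posteriors), len(posteriors[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path, blank) == target:
            p = 1.0
            for t, s in enumerate(path):
                p *= posteriors[t][s]  # frame-level (locally normalised) posterior
            total += p
    return total
```

In practice this sum is computed with the CTC forward-backward recursion rather than by enumeration.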

Experiment: TIMIT dataset
- 3696 training utterances (~3 hours)
- core test set (192 test utterances)
- trained on 48 phonemes, mapped down to 39 for scoring
- log filterbank features (FBANK)
- LSTMs used as the RNN implementation

Experiment: speeding up
- limit the lengths of segments
- recurrent subsampling networks: over 10x speedup
[Figure: subsampling over x_1 ... x_4: a) concatenate / add, b) skip]
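The two subsampling variants in the figure can be sketched as plain functions over a list of frame vectors (a toy illustration of the idea, not the paper's network):

```python
def subsample_concat(frames, rate=2):
    """a) concatenate: merge each group of `rate` adjacent frames into one longer vector."""
    assert len(frames) % rate == 0, "pad the sequence first"
    return [sum((frames[i + k] for k in range(rate)), [])
            for i in range(0, len(frames), rate)]

def subsample_skip(frames, rate=2):
    """b) skip: keep every `rate`-th frame and drop the rest."""
    return frames[::rate]
```

Each level shortens the sequence by the subsampling rate, so stacking a few levels shrinks the quadratic-in-T segmental dynamic programme enough to give the reported order-of-magnitude speedup.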

Experiment: a large model with dropout works the best
Table: Results of dropout.
Dropout  Layers  Hidden  PER (%)
0.2      3       128     21.2
0.2      3       250     20.1
0.2      6       250     19.3
0.1      3       128     21.3
0.1      3       250     20.9
0.1      6       250     20.4
none     6       250     21.9

Experiment
Table: Results of three types of acoustic features.
Features       Deltas  d(x_t)  PER (%)
24-dim FBANK   yes     72      19.3
40-dim FBANK   yes     120     18.9
Kaldi          no      40      17.3
Kaldi features: 39-dimensional MFCCs spliced with a context window of 7 frames, followed by LDA and MLLT transforms, and feature-space speaker-dependent MLLR.

Experiment
Table: Comparison to related works (SD = speaker-dependent features).
System                                  PER (%)
HMM-DNN                                 18.5
CTC [Graves 2013]                       18.4
RNN transducer [Graves 2013]            17.7
Attention-based RNN [Chorowski 2015]    17.6
Segmental RNN (FBANK)                   18.9
Segmental RNN (SD Kaldi features)       17.3

Conclusion
- segmental CRFs with recurrent neural networks
- potential for end-to-end training
- computational cost is the main bottleneck
- needs evaluation on large-vocabulary tasks

Thank you! Questions?