Segmental Recurrent Neural Networks for End-to-end Speech Recognition

1 Segmental Recurrent Neural Networks for End-to-end Speech Recognition
Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals
TTI-Chicago, UoE, CMU and UW
9 September 2016

2 Background
A new wave of sequence modelling:
- I. Sutskever et al., "Sequence-to-Sequence Learning with Neural Networks", NIPS 2014
- D. Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
- A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks", ICML 2014

3 Background
It may be time to review sequence modelling for speech. Why is speech recognition special?
- monotonic alignment
- long input sequence
- output sequence is much shorter (word/phoneme)

4 Speech Recognition
- monotonic alignment: a challenge for attention models
- long input sequence: expensive for globally (sequence-level) normalised models
- output sequence is much shorter (word/phoneme): length mismatch, so alignment model or not?

5 Speech Recognition
Hidden Markov Model
- monotonic alignment
- long input sequence: locally (frame-level) normalised
- length mismatch: hidden states
Connectionist Temporal Classification
- monotonic alignment
- long input sequence: locally normalised
- length mismatch: blank state

6 Speech Recognition
Locally normalised models:
- conditional independence assumption
- label bias problem
- better results given by sequence training: from local to global normalisation
Question: why not stick to globally normalised models from scratch?
[1] D. Andor et al., "Globally Normalized Transition-Based Neural Networks", ACL 2016
[2] D. Povey et al., "Purely sequence-trained neural networks for ASR based on lattice-free MMI", Interspeech 2016

7 (Segmental) Conditional Random Field
[Figure: CRF and segmental CRF]

8 (Segmental) Conditional Random Field
CRF [Lafferty et al. 2001]:
\[ P(y_{1:L} \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, x_{1:T}) \right) \tag{1} \]
where $L = T$.
Segmental (semi-Markov) CRF [Sarawagi and Cohen 2004]:
\[ P(y_{1:L}, E \mid x_{1:T}) = \frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, e_j, x_{1:T}) \right) \tag{2} \]
where $e_j = \langle s_j, n_j \rangle$ denotes the beginning ($s_j$) and end ($n_j$) time tags of $y_j$; $E = \{e_{1:L}\}$ is the latent segment label.
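To make the normaliser $Z(x_{1:T})$ in eq. (2) concrete, here is a minimal Python sketch that computes it twice: by enumerating every labelled segmentation, and by a dynamic program over segment end points. The function `seg_score` is a hypothetical toy stand-in for $w^\top \Phi(y_j, e_j, x_{1:T})$, and all function names are illustrative assumptions, not taken from the talk.

```python
import itertools
import math


def seg_score(label, start, end, x):
    """Toy stand-in for w^T Phi(y_j, e_j, x_{1:T}); not a learned score."""
    fit = -sum((v - label) ** 2 for v in x[start:end + 1])
    return fit - 0.5  # fixed per-segment penalty (assumption of this sketch)


def segmentations(T):
    """Yield every split of frames 0..T-1 into contiguous (start, end) segments."""
    for k in range(T):  # k = number of internal boundaries
        for cuts in itertools.combinations(range(1, T), k):
            bounds = (0,) + cuts + (T,)
            yield [(bounds[i], bounds[i + 1] - 1) for i in range(len(bounds) - 1)]


def unnormalised(labels, segs, x):
    """Numerator of eq. (2): product over segments of exp(score)."""
    return math.exp(sum(seg_score(y, s, n, x) for y, (s, n) in zip(labels, segs)))


def partition_brute_force(x, label_set):
    """Z as an explicit sum over all segmentations and all labellings."""
    total = 0.0
    for segs in segmentations(len(x)):
        for labels in itertools.product(label_set, repeat=len(segs)):
            total += unnormalised(labels, segs, x)
    return total


def partition_dp(x, label_set):
    """Z by dynamic programming: alpha[t] covers the first t frames."""
    T = len(x)
    alpha = [1.0] + [0.0] * T
    for t in range(1, T + 1):
        for s in range(t):  # last segment spans frames s..t-1
            seg_mass = sum(math.exp(seg_score(y, s, t - 1, x)) for y in label_set)
            alpha[t] += alpha[s] * seg_mass
    return alpha[T]


x = [0.0, 0.0, 1.0, 1.0]
z = partition_dp(x, (0, 1))
p = unnormalised([0, 1], [(0, 1), (2, 3)], x) / z  # P(y, E | x) for one hypothesis
```

The enumeration grows exponentially with the number of frames, while the dynamic program is quadratic in the number of frames (times the label set size), which is why only the latter is practical beyond toy inputs.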

9 (Segmental) Conditional Random Field
\[ \frac{1}{Z(x_{1:T})} \prod_{j} \exp\left( w^\top \Phi(y_j, x_{1:T}) \right) \]
- Learnable parameter $w$
- Engineering the feature function $\Phi(\cdot)$
- Designing $\Phi(\cdot)$ is much harder for speech than for NLP

10 Segmental Recurrent Neural Network
Using (recurrent) neural networks to learn the feature function $\Phi(\cdot)$.
[Figure: labels $y_1, y_2, y_3$ over input frames $x_1, \dots, x_6$]

11 Segmental Recurrent Neural Network
Training criteria
- Conditional maximum likelihood:
\[ \mathcal{L}(\theta) = \log P(y_{1:L} \mid x_{1:T}) = \log \sum_{E} P(y_{1:L}, E \mid x_{1:T}) \tag{3} \]
- Hinge loss, similar to structured SVM. Not studied yet!
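The marginalisation over segmentations $E$ in eq. (3) can be computed with a forward recursion in the log domain. The sketch below is one way to do it under stated assumptions: `seg_score` is a hypothetical toy replacement for the learned segment score, and the function names are illustrative rather than taken from the talk.

```python
import math


def seg_score(label, start, end, x):
    """Toy stand-in for w^T Phi(y_j, e_j, x_{1:T}); not a learned score."""
    fit = -sum((v - label) ** 2 for v in x[start:end + 1])
    return fit - 0.5  # fixed per-segment penalty (assumption of this sketch)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_numerator(labels, x):
    """log of the sum over E of prod_j exp(score), for a fixed label sequence."""
    T, L = len(x), len(labels)
    alpha = [[-math.inf] * (T + 1) for _ in range(L + 1)]
    alpha[0][0] = 0.0
    for j in range(1, L + 1):
        for t in range(j, T + 1):
            # the j-th segment spans frames s..t-1 and carries labels[j-1]
            alpha[j][t] = logsumexp(
                [alpha[j - 1][s] + seg_score(labels[j - 1], s, t - 1, x)
                 for s in range(j - 1, t)]
            )
    return alpha[L][T]


def log_partition(x, label_set):
    """log Z(x): sums over all segmentations and all label sequences."""
    T = len(x)
    alpha = [0.0] + [-math.inf] * T
    for t in range(1, T + 1):
        alpha[t] = logsumexp(
            [alpha[s] + seg_score(y, s, t - 1, x)
             for s in range(t) for y in label_set]
        )
    return alpha[T]


def log_likelihood(labels, x, label_set):
    """Eq. (3): log P(y | x) = log sum_E P(y, E | x)."""
    return log_numerator(labels, x) - log_partition(x, label_set)


x = [0.0, 1.0]
ll = log_likelihood([0, 1], x, (0, 1))
```

As a sanity check, the probabilities of all label sequences of length 1 up to T should sum to one, because the numerators together cover exactly the terms of the partition function.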

12 Segmental Recurrent Neural Network
Viterbi decoding
- Partially Viterbi decoding:
\[ y^{*}_{1:L} = \arg\max_{y_{1:L}} \log \sum_{E} P(y_{1:L}, E \mid x_{1:T}) \tag{4} \]
- Fully Viterbi decoding:
\[ y^{*}_{1:L}, E^{*} = \arg\max_{y_{1:L}, E} \log P(y_{1:L}, E \mid x_{1:T}) \tag{5} \]
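Fully Viterbi decoding, eq. (5), can be sketched as a dynamic program that replaces the sum over segmentations with a max and keeps back-pointers. This is a minimal illustration only: `seg_score` is a hypothetical toy score, and the optional `max_len` cap on segment length is an assumed parameter name. Because $Z(x_{1:T})$ does not depend on $y$ or $E$, the sketch maximises the unnormalised score. It does not implement the partially Viterbi rule of eq. (4).

```python
import math


def seg_score(label, start, end, x):
    """Toy stand-in for w^T Phi(y_j, e_j, x_{1:T}); not a learned score."""
    fit = -sum((v - label) ** 2 for v in x[start:end + 1])
    return fit - 0.5  # fixed per-segment penalty (assumption of this sketch)


def viterbi_decode(x, label_set, max_len=None):
    """Return (labels, segments, score) maximising the summed segment scores."""
    T = len(x)
    best = [-math.inf] * (T + 1)  # best[t]: best score covering the first t frames
    back = [None] * (T + 1)       # back[t]: (segment start, label) of the last segment
    best[0] = 0.0
    for t in range(1, T + 1):
        first = 0 if max_len is None else max(0, t - max_len)
        for s in range(first, t):
            for y in label_set:
                cand = best[s] + seg_score(y, s, t - 1, x)
                if cand > best[t]:
                    best[t], back[t] = cand, (s, y)
    # Follow the back-pointers from the last frame to recover labels and segments.
    labels, segs, t = [], [], T
    while t > 0:
        s, y = back[t]
        labels.append(y)
        segs.append((s, t - 1))
        t = s
    labels.reverse()
    segs.reverse()
    return labels, segs, best[T]


labels, segs, score = viterbi_decode([0.0, 0.0, 1.0, 1.0], (0, 1))
# labels == [0, 1], segs == [(0, 1), (2, 3)], score == -1.0
```

Capping the segment length with `max_len` shrinks the inner loop over start points, so the cost per frame no longer grows with the length of the utterance.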

13 Related works
- (Segmental) CRFs for speech
- Neural CRFs
- Structured SVMs
Two good review papers:
- M. Gales, S. Watanabe and E. Fosler-Lussier, "Structured Discriminative Models for Speech Recognition", IEEE Signal Processing Magazine, 2012
- E. Fosler-Lussier et al., "Conditional random fields in speech, audio, and language processing", Proceedings of the IEEE, 2013

14 Comparison to CTC
[1] A. Senior et al., "Acoustic Modelling with CD-CTC-sMBR LSTM RNNs", ASRU 2015

15 Comparison to CTC
[Figure: labels $\hat{y}_1, \dots, \hat{y}_4$ over frames $x_1, \dots, x_4$]
\[ p(y \mid x) = p(\hat{y}_1 \mid x_1)\, p(\hat{y}_2 \mid x_2)\, p(\hat{y}_3 \mid x_3)\, p(\hat{y}_4 \mid x_4) \]

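16 Comparison to CTC
[Figure: labels $\hat{y}_2, \hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]
\[ p(y \mid x) = \underbrace{p(b \mid x_1)}_{=1}\, p(\hat{y}_2 \mid x_2)\, p(\hat{y}_3 \mid x_3)\, p(\hat{y}_4 \mid x_4) \]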

17 Comparison to CTC
[Figure: labels $\hat{y}_2, \hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]
\[ p(y \mid x) = p(b \mid x_1)\, p(\hat{y}_2 \mid x_2)\, p(\hat{y}_3 \mid x_3)\, p(\hat{y}_4 \mid x_4) \]

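18 Comparison to CTC
[Figure: labels $\hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]
\[ p(y \mid x) = \underbrace{p(b \mid x_1)}_{=1}\, \underbrace{p(b \mid x_2)}_{=1}\, p(\hat{y}_3 \mid x_3)\, p(\hat{y}_4 \mid x_4) \]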

19 Comparison to CTC
[Figure: labels $\hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]
\[ p(y \mid x) = p(b \mid x_1)\, p(b \mid x_2)\, p(\hat{y}_3 \mid x_3)\, p(\hat{y}_4 \mid x_4) \]

20 Comparison to CTC
[Figure: labels $\hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]

21 Comparison to CTC
[Figure: labels $\hat{y}_3, \hat{y}_4$ over frames $x_1, \dots, x_4$]
CTC loss may do some kind of segmental modelling
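The frame-wise factorisation on the preceding slides can be illustrated with a short Python sketch. It assumes that the symbol `b` in the formulas is the CTC blank label, and the posterior values below are made up for illustration. `collapse` applies the standard CTC mapping (merge repeats, then drop blanks), and `path_probability` multiplies per-frame posteriors along one alignment path.

```python
BLANK = "b"  # assumed to be the blank label written as b in the slides


def collapse(path):
    """Standard CTC mapping: merge repeated symbols, then remove blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def path_probability(path, frame_posteriors):
    """Product over frames of p(symbol_t | x_t) for a single alignment path."""
    p = 1.0
    for sym, dist in zip(path, frame_posteriors):
        p *= dist[sym]
    return p


# Made-up posteriors: the first two frames put all their mass on blank,
# so they contribute factors of exactly 1 to the product.
posteriors = [
    {"b": 1.0, "a": 0.0, "c": 0.0},
    {"b": 1.0, "a": 0.0, "c": 0.0},
    {"b": 0.1, "a": 0.9, "c": 0.0},
    {"b": 0.2, "a": 0.0, "c": 0.8},
]
path = ["b", "b", "a", "c"]
prob = path_probability(path, posteriors)  # approximately 0.9 * 0.8
labels = collapse(path)                    # ["a", "c"]
```

When blank frames have probability one, the path probability depends only on the frames that emit labels, which is the sense in which the CTC loss behaves like a model over segments.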

22 Experiment
TIMIT dataset
- 3696 training utterances (about 3 hours)
- core test set (192 testing utterances)
- trained on 48 phonemes, mapped to 39 for scoring
- log filterbank features (FBANK)
- LSTM used as the RNN implementation

23 Experiment
- Limit the lengths of segments
- Recurrent subsampling networks
- Over 10x speedup
[Figure: two subsampling schemes over frames $x_1, \dots, x_4$: a) concatenate / add, b) skip]
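The two subsampling schemes named in the figure can be sketched on plain Python lists of frame vectors. This is an illustrative reading of "concatenate / add" and "skip", not the authors' implementation: the function names, the choice to drop a trailing unpaired frame, and the choice to keep the second frame of each pair when skipping are all assumptions.

```python
def subsample_concat(frames):
    """a) Concatenate each pair of adjacent frames into one longer vector."""
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


def subsample_add(frames):
    """a) Add each pair of adjacent frames element-wise."""
    return [
        [u + v for u, v in zip(frames[i], frames[i + 1])]
        for i in range(0, len(frames) - 1, 2)
    ]


def subsample_skip(frames):
    """b) Keep only the second frame of each adjacent pair."""
    return frames[1::2]


frames = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
concat = subsample_concat(frames)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
added = subsample_add(frames)      # [[4, 6], [12, 14]]
skipped = subsample_skip(frames)   # [[3, 4], [7, 8]]
```

Each variant halves the sequence length. Stacking several such layers shortens the input further, which reduces the number of candidate segments that the segmental model has to score.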

24 Experiment
Large model with dropout works the best
Table: Results of dropout.
Dropout | layers | hidden | PER

25 Experiment
Table: Results of three types of acoustic features.
Features | Deltas | $d(x_t)$ | PER
Feature sets compared: 24-dim FBANK, a second FBANK setting, and Kaldi features.
Kaldi features: 39-dimensional MFCCs spliced by a context window of 7, followed by LDA and MLLT transforms, with feature-space speaker-dependent MLLR.

26 Experiment
Table: Comparison to related works. LM = language model, SD = speaker dependent feature.
System                               | LM | SD | PER
HMM-DNN                              |    |    | 18.5
CTC [Graves 2013]                    |    |    | 18.4
RNN transducer [Graves 2013]         |    |    | 17.7
Attention-based RNN [Chorowski 2015] |    |    | 17.6
Segmental RNN                        |    |    | 18.9
Segmental RNN                        |    |    |

27 Conclusion
- Segmental CRFs with recurrent neural networks
- Potential for end-to-end training
- Computational cost is the main bottleneck
- Need to evaluate on large vocabulary tasks

28 Thank you! Questions?

arxiv: v4 [cs.cl] 5 Jun 2017

Multitask Learning with CTC and Segmental CRF for Speech Recognition
Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith
Toyota Technological Institute at Chicago, USA School of Computer Science, Carnegie