End-to-end Automatic Speech Recognition


1 End-to-end Automatic Speech Recognition. Markus Nussbaum-Thom, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA. February 22, 2017.

2 Contents. 1. Introduction 2. Connectionist Temporal Classification (CTC) 3. Attention Model 4. References

3 Terminology. Features: x, x_t, x_1^T := x_1, ..., x_T. Words: w, u, v, w_m, w_1^M := w_1, ..., w_M. Word sequences: W, W_n, V. States: s, s_t, s_1^T := s_1, ..., s_T. Class conditional posterior probability: p(s_t | x_t), p(w, s_1^T | x_1^T).

4 Bayes Decision Rule

5 Towards End-to-End Automatic Speech Recognition

6 Towards End-to-End Automatic Speech Recognition

7 End-to-End Automatic Speech Recognition

8 End-to-End Approach. End-to-end: training all modules to optimize a global performance criterion (LeCun et al., 98). Easy: classes do not have a sub-structure, e.g. image classification. Difficult: classes have a sub-structure (sequences, graphs), e.g. automatic speech recognition, automatic handwriting recognition, machine translation. Segmentation problem: which part of the input relates to which part of the sub-structure?

9 Towards End-to-end Automatic Speech Recognition. End-to-end acoustic model: using characters instead of phonemes; connectionist temporal classification using recurrent or convolutional neural networks; purely neural attention model. End-to-end feature extraction: feature extraction integrated into the acoustic model; using the raw time signal; learning a specific type of filter. Towards real end-to-end modeling: using words as targets instead of characters or phonemes, and a massive amount of data.

10 Basic Problem. Input: X = x_1^T = (x_1, ..., x_T). Neural network outputs: p(. | x_1), ..., p(. | x_T). Target: W = w_1^M = (w_1, ..., w_M), but M << T. How do we solve this? Connectionist Temporal Classification (CTC) [Graves et al., 2006, Graves et al., 2009, CTC]. Attention models [Bahdanau et al., 2016, Chorowski et al., 2015, Chorowski et al., 2015, Attention]. Inverted hidden Markov models [Doetsch et al., 2016, Inverted HMM - a Proof of Concept].

11 Overview. CTC concept. Training. Recognition.

12 Connectionist Temporal Classification (CTC). Given X = (x_1, ..., x_5) and W = (a, b, c). Introduce a blank state "-" and allow word repetitions, so that every frame x_1, ..., x_5 is assigned either a label of W or the blank, e.g. the length-5 alignments (a, -, b, c, -), (a, a, b, c, -) and (a, b, b, c, -). Blank and repetition removal B: B(a, -, b, c, -) = (a, b, c).
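To make the mapping B concrete, here is a minimal Python sketch (not from the slides) that first merges repetitions and then drops blanks; the symbol "-" for the blank is an assumption of the example.

```python
# A minimal sketch of the collapse mapping B:
# merge repeated labels, then drop blanks.
def collapse(alignment, blank="-"):
    """Apply B: remove label repetitions, then remove blanks."""
    out = []
    prev = None
    for s in alignment:
        if s != prev:          # drop immediate repetitions
            if s != blank:     # drop blanks
                out.append(s)
        prev = s
    return tuple(out)

assert collapse(("a", "-", "b", "c", "-")) == ("a", "b", "c")
assert collapse(("a", "a", "b", "b", "c")) == ("a", "b", "c")
```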

13 Connectionist Temporal Classification (CTC). Posterior for sentence W = w_1^M and features X = x_1^T:
p(W | X) = \sum_{s_1^T \in B^{-1}(W)} p(s_1^T | X) := \sum_{s_1^T \in B^{-1}(W)} \prod_{t=1}^{T} p(s_t | x_t)
Training criterion for training samples (X_n, W_n), n = 1, ..., N:
F_{CTC}(\Lambda) = \frac{1}{N} \sum_{n=1}^{N} \log p_\Lambda(W_n | X_n)
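The sum over B^{-1}(W) can be spelled out by brute force for a toy example. The sketch below (illustration only, reusing collapse() from the previous sketch, with made-up frame posteriors) enumerates all length-T alignments and sums the products of frame posteriors; a real implementation uses the forward-backward recursions of the following slides instead.

```python
# Brute-force CTC posterior: enumerate all length-T alignments,
# keep those that collapse to W, and sum the path probabilities.
import itertools

labels = ["a", "b", "c", "-"]            # "-" is the blank
T = 5
W = ("a", "b", "c")

# hypothetical per-frame posteriors p(s | x_t), each row sums to 1
post = [
    {"a": 0.7, "b": 0.1, "c": 0.1, "-": 0.1},
    {"a": 0.2, "b": 0.2, "c": 0.1, "-": 0.5},
    {"a": 0.1, "b": 0.6, "c": 0.1, "-": 0.2},
    {"a": 0.1, "b": 0.1, "c": 0.6, "-": 0.2},
    {"a": 0.1, "b": 0.1, "c": 0.2, "-": 0.6},
]

p_W_given_X = 0.0
for path in itertools.product(labels, repeat=T):
    if collapse(path) == W:              # collapse() as sketched above
        prob = 1.0
        for t, s in enumerate(path):
            prob *= post[t][s]
        p_W_given_X += prob
print(p_W_given_X)
```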

14 Overview. CTC concept. Training. Recognition.

15 Forward-Backward Decomposition (CTC).
\alpha(t, m, v): sum over s_1^t \in B^{-1}(w_1^m) for given x_1^t, ending in v.
\beta(t, m, v): sum over s_t^T \in B^{-1}(w_m^M) for given x_t^T, starting in v.
Choose t \in \{1, ..., T\}:
p(w_1^M | x_1^T) = \sum_{s_1^T \in B^{-1}(w_1^M)} p(s_1^T | x_1^T) = ... = \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)}

16 Forward Algorithm (CTC)

17 Forward Algorithm (CTC). Compute \alpha(1, m, v) = p(v | x_1).

18 Forward Algorithm (CTC). Compute \alpha(2, m, v).

19 Forward Algorithm (CTC). Compute \alpha(3, m, v).

20 Forward Algorithm (CTC). Compute \alpha(4, m, v).

21 Forward Algorithm (CTC). Compute \alpha(5, m, v).

22 Forward Algorithm (CTC). Compute \alpha(6, m, v).

23 Forward Algorithm (CTC). Compute \alpha(7, m, v).

24 Forward Algorithm (CTC). Compute \alpha(8, m, v).

25 Forward Algorithm (CTC). Compute \alpha(9, m, v).

26 Forward Algorithm (CTC). Compute \alpha(10, m, v).

27 Forward Algorithm (CTC). Compute \alpha(11, m, v).

28 Forward Algorithm (CTC). Compute \alpha(12, m, v).

29 Backward Algorithm (CTC)

30 Backward Algorithm (CTC). Compute \beta(12, M, v) = p(v | x_12).

31 Backward Algorithm (CTC). Compute \beta(11, m, v).

32 Backward Algorithm (CTC). Compute \beta(10, m, v).

33 Backward Algorithm (CTC). Compute \beta(9, m, v).

34 Backward Algorithm (CTC). Compute \beta(8, m, v).

35 Backward Algorithm (CTC). Compute \beta(7, m, v).

36 Backward Algorithm (CTC). Compute \beta(6, m, v).

37 Backward Algorithm (CTC). Compute \beta(5, m, v).

38 Backward Algorithm (CTC). Compute \beta(4, m, v).

39 Backward Algorithm (CTC). Compute \beta(3, m, v).

40 Backward Algorithm (CTC). Compute \beta(2, m, v).

41 Backward Algorithm (CTC). Compute \beta(1, m, v).

42 Posterior Decomposition (CTC). Choose t \in \{1, ..., T\}:
p(w_1^M | X) = \sum_{s_1^T \in B^{-1}(w_1^M)} p(s_1^T | X)   (definition)
= \sum_{s_1^T \in B^{-1}(w_1^M)} \prod_{\tau=1}^{T} p(s_\tau | x_\tau)   (model assumption)
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \sum_{s_1^T \in B^{-1}(w_1^M), s_t = v} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau) \; p(s_t | x_t) \prod_{\rho=t+1}^{T} p(s_\rho | x_\rho)   (decomposition around t)

43 Posterior Decomposition (CTC).
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \sum_{s_1^T \in B^{-1}(w_1^M), s_t = v} \frac{\prod_{\tau=1}^{t} p(s_\tau | x_\tau)}{p(v | x_t)} \; p(v | x_t) \; \frac{\prod_{\rho=t}^{T} p(s_\rho | x_\rho)}{p(v | x_t)}

44 Posterior Decomposition (CTC).
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\sum_{s_1^t \in B^{-1}(w_1^m), s_t = v} \prod_{\tau=1}^{t} p(s_\tau | x_\tau)}{p(v | x_t)} \; p(v | x_t) \; \frac{\sum_{s_t^T \in B^{-1}(w_m^M), s_t = v} \prod_{\rho=t}^{T} p(s_\rho | x_\rho)}{p(v | x_t)}
= \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)}
\alpha(t, m, v): sum over s_1^t \in B^{-1}(w_1^m) for given x_1^t, ending in v.
\beta(t, m, v): sum over s_t^T \in B^{-1}(w_m^M) for given x_t^T, starting in v.

45 Forward Path Decomposition (CTC). Consider a path s_1^t \in B^{-1}(w_1^m) with s_t = v. If s_t = w_m, the prefix s_1^{t-1} can end in w_m (a repetition), in the blank, or in w_{m-1}, i.e. s_1^{t-1} collapses to w_1^m or to w_1^{m-1}. If s_t = -, the prefix s_1^{t-1} can end in the blank or in w_m, i.e. s_1^{t-1} collapses to w_1^m.

46 Forward Probabilities (CTC).
\alpha(t, m, v) = \sum_{s_1^t \in B^{-1}(w_1^m), s_t = v} \prod_{\tau=1}^{t} p(s_\tau | x_\tau)
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m-1}, -\}} \sum_{s_1^{t-1} \in B^{-1}(w_1^{m-1}), s_{t-1} = u} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau) + \sum_{s_1^{t-1} \in B^{-1}(w_1^m), s_{t-1} = w_m} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau), & v = w_m \\ \sum_{u \in \{w_m, -\}} \sum_{s_1^{t-1} \in B^{-1}(w_1^m), s_{t-1} = u} \prod_{\tau=1}^{t-1} p(s_\tau | x_\tau), & v = - \end{cases}

47 Forward Probabilities (CTC).
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m-1}, -\}} \alpha(t-1, m-1, u) + \alpha(t-1, m, w_m), & v = w_m \\ \sum_{u \in \{w_m, -\}} \alpha(t-1, m, u), & v = - \end{cases}
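As an illustration, the sketch below implements the same dynamic program in the standard blank-extended formulation of Graves et al. (blanks interleaved with the labels), which is equivalent to the \alpha(t, m, v) bookkeeping above; it reuses post and W from the brute-force sketch and should reproduce its result up to rounding.

```python
# Minimal sketch of the CTC forward (alpha) recursion in the standard
# blank-extended formulation (equivalent to alpha(t, m, v) above).
# `post` is assumed to be a list of dicts p(s | x_t), as in the earlier sketch.
def ctc_forward(post, W, blank="-"):
    ext = [blank]
    for w in W:                      # interleave blanks: -, w1, -, w2, ..., wM, -
        ext += [w, blank]
    T, S = len(post), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = post[0][blank]     # start in the leading blank
    alpha[0][1] = post[0][ext[1]]    # or directly in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                      # stay (repetition)
            if s >= 1:
                a += alpha[t - 1][s - 1]             # advance by one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]             # skip the blank between different labels
            alpha[t][s] = post[t][ext[s]] * a
    return alpha[T - 1][S - 1] + alpha[T - 1][S - 2]  # end in last blank or last label

# agrees with the brute-force enumeration above (up to float rounding)
print(ctc_forward(post, W))
```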

48 Backward Probabilities (CTC). \beta(t, m, v): sum over all paths s_t^T \in B^{-1}(w_m^M) for given x_t^T starting in v:
\beta(t, m, v) = \sum_{s_t^T \in B^{-1}(w_m^M), s_t = v} \prod_{\tau=t}^{T} p(s_\tau | x_\tau)
= p(v | x_t) \cdot \begin{cases} \sum_{u \in \{w_{m+1}, -\}} \beta(t+1, m+1, u) + \beta(t+1, m, w_m), & v = w_m \\ \sum_{u \in \{w_m, -\}} \beta(t+1, m, u), & v = - \end{cases}
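A companion sketch of the \beta recursion in the same blank-extended formulation; combining it with ctc_forward above via the posterior decomposition of slide 44 gives p(W | X) for any choice of t.

```python
# Backward (beta) recursion, mirroring ctc_forward above.
def ctc_backward(post, W, blank="-"):
    ext = [blank]
    for w in W:
        ext += [w, blank]
    T, S = len(post), len(ext)
    beta = [[0.0] * S for _ in range(T)]
    beta[T - 1][S - 1] = post[T - 1][blank]       # end in the trailing blank
    beta[T - 1][S - 2] = post[T - 1][ext[S - 2]]  # or in the last label
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1][s]                               # stay
            if s + 1 < S:
                b += beta[t + 1][s + 1]                      # advance
            if s + 2 < S and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1][s + 2]                      # skip a blank
            beta[t][s] = post[t][ext[s]] * b
    return beta

# sanity check: for any t, sum_s alpha[t][s] * beta[t][s] / p(ext[s] | x_t)
# equals p(W | X) from ctc_forward.
```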

49 Derivatives (CTC). Derivative of the posterior:
\frac{\partial p(W | X)}{\partial p(s | x_t)} = \frac{\partial}{\partial p(s | x_t)} \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \frac{\alpha(t, m, v)}{p(v | x_t)} \; p(v | x_t) \; \frac{\beta(t, m, v)}{p(v | x_t)} = \sum_{m=1}^{M} \sum_{v \in \{w_m, -\}} \delta(v, s) \; \frac{\alpha(t, m, s) \, \beta(t, m, s)}{p^2(s | x_t)}
\frac{\partial \log p(W | X)}{\partial p(s | x_t)} = \frac{1}{p(W | X)} \; \frac{\partial p(W | X)}{\partial p(s | x_t)}
Derivative of the training criterion:
\frac{\partial F_{CTC}(\Lambda)}{\partial p(s | x_t)} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{p(W_n | X_n)} \; \frac{\partial p(W_n | X_n)}{\partial p(s | x_t)}

50 CTC Architectures. What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs. Subsampling: reducing the frame rate through the network.

51 Subsampling. Join input frames. Reshape the input to the next layer: return every 2nd frame to the next layer. CNNs: use strides.
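Two common ways to realize such subsampling on a [T, D] feature matrix are sketched below with NumPy (illustrative names and factors, not the lecture's code): either skip frames or stack neighbouring frames into wider vectors.

```python
import numpy as np

def subsample_skip(x, factor=2):
    """Keep every `factor`-th frame: [T, D] -> [ceil(T/factor), D]."""
    return x[::factor]

def subsample_stack(x, factor=2):
    """Stack neighbouring frames instead of dropping them:
    [T, D] -> [T // factor, factor * D]."""
    T, D = x.shape
    T = (T // factor) * factor          # drop a trailing remainder frame if any
    return x[:T].reshape(T // factor, factor * D)

feats = np.random.randn(100, 40)        # e.g. 100 frames of 40-dim features
print(subsample_skip(feats).shape)      # (50, 40)
print(subsample_stack(feats).shape)     # (50, 80)
```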

52 Peaking Behavior. [Graves and Jaitly, 2014, Towards End-to-End Speech Recognition with Recurrent Neural Networks]

53 Overview. CTC concept. Training. Recognition.

54 Hybrid Recognition (CTC). Hybrid model:
p(x | s) = \frac{p(s | x) \, p(x)}{p(s)} \propto \frac{p(s | x)}{p(s)}
Decoding:
\hat{w}_1^N = \arg\max_{w_1^N} \left\{ p(w_1^N) \max_{s_1^T} \prod_{t=1}^{T} p(x_t | s_t) \right\}
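A minimal sketch of the hybrid trick (illustrative, with made-up sizes): the network posteriors p(s | x_t) are divided by label priors p(s), so the resulting scaled likelihoods can replace p(x_t | s_t) up to a constant during decoding.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors, prior_scale=1.0):
    """log_posteriors: [T, S] network outputs; log_priors: [S] label priors
    (e.g. relative label frequencies in the training alignments)."""
    return log_posteriors - prior_scale * log_priors

T, S = 200, 30                                   # hypothetical sizes
log_post = np.log(np.random.dirichlet(np.ones(S), size=T))
log_prior = np.log(np.full(S, 1.0 / S))          # flat prior just for the demo
scores = scaled_log_likelihoods(log_post, log_prior)
print(scores.shape)                              # (200, 30)
```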

55 Weighted Finite State Transducer Recognition (WFST). Token transducer T. Language model G. Lexicon L. Search space: S = T \circ \min(\det(L \circ G))

56 Resources for CTC.
Keras:
  Tensorflow backend: master/keras/backend/tensorflow_backend.py
  Theano backend: master/keras/backend/theano_backend.py
  Example: master/examples/image_ocr.py
Baidu: https://github.com/baidu-research/ba-dls-deepspeech
Eesen.
Kaldi.

57 Further Literature on CTC. [Miao et al., 2015, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding] [Collobert et al., 2016, Wav2Letter: an End-to-End ConvNet-based Speech Recognition System] [Zhang et al., 2017, Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks] [Senior et al., 2015, Acoustic modelling with CD-CTC-SMBR LSTM RNNS] [Soltau et al., 2016, Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition]

58 Attention Model. Encoder-decoder architecture: Encoder: performs a feature extraction/encoding based on the input. Decoder: produces the output label sequence from the encoded features.

59 Attention Encoder-Decoder Architecture. What kind of encoders? DNNs, (bidirectional) LSTMs, CNNs. (Figure: output - decoder - MLP glimpse - encoder.)

60 Attention Encoder-Decoder Architecture. Encoder-decoder:
Input: x_1^T = x_1, ..., x_T
Encoder: h_1^{T/4} = h_1, ..., h_{T/4} = Encoder(x_1^T)
Decoder, for m = 1, ..., M:
Attention: \alpha_m = Attend(h_1^{T/4}, s_{m-1}, \alpha_{m-1})
Glimpse: g_m = \sum_{t=1}^{T/4} \alpha_{m,t} h_t
Generator: y_m = Generator(g_m, s_{m-1}), with c_m = RNN(c_{m-1}, g_m, s_{m-1}) and y_m = Softmax(c_m)
Transition: s_m = RNN(s_{m-1}, y_m, g_m)
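The decoder loop can be sketched in NumPy as below; the Attend function, the RNN cells and the output projection are stand-ins with random weights, purely to show the data flow of attention weights, glimpse, generator and transition.

```python
import numpy as np

rng = np.random.default_rng(0)
Tq, D, V = 25, 16, 30                        # encoder length T/4, state size, vocab size
H = rng.standard_normal((Tq, D))             # encoder states h_1, ..., h_{T/4}
Wc = rng.standard_normal((D, 2 * D)) * 0.1   # generator-RNN weights (stand-in)
Ws = rng.standard_normal((D, 2 * D)) * 0.1   # transition-RNN weights (stand-in)
Wy = rng.standard_normal((V, D)) * 0.1       # output projection (stand-in)

def attend(H, s_prev):
    # placeholder scorer; slide 61 gives the content/location-based versions
    scores = H @ s_prev
    e = np.exp(scores - scores.max())
    return e / e.sum()

s = np.zeros(D)
for m in range(5):                           # emit 5 labels
    alpha = attend(H, s)                     # attention weights alpha_m
    g = alpha @ H                            # glimpse g_m = sum_t alpha_{m,t} h_t
    c = np.tanh(Wc @ np.concatenate([g, s])) # simplified generator state (the slide's RNN also takes c_{m-1})
    y = np.argmax(Wy @ c)                    # y_m from Softmax(c_m), taking the arg max
    s = np.tanh(Ws @ np.concatenate([g, s])) # simplified transition (the slide's RNN also takes y_m)
    print(m, int(y), int(alpha.argmax()))
```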

61 Attention Mechanism.
Content based (weights E, W, V and bias b):
\epsilon_{m,t} = E \tanh(W s_{m-1} + V h_t + b)
Location based (weights E, W, V, U and bias b):
f_m = F * \alpha_{m-1}
\epsilon_{m,t} = E \tanh(W s_{m-1} + V h_t + U f_{m,t} + b)
Renormalization (sharpening \gamma):
\alpha_{m,t} = \frac{\exp(\gamma \epsilon_{m,t})}{\sum_{\tau=1}^{T/4} \exp(\gamma \epsilon_{m,\tau})}
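A small NumPy sketch of the content-based score and the sharpened renormalization, with random matrices standing in for the learned E, W, V, b and an assumed sharpening factor gamma = 2:

```python
import numpy as np

rng = np.random.default_rng(1)
Tq, Dh, Ds, Da = 25, 16, 16, 8           # encoder length, h dim, s dim, attention dim
H = rng.standard_normal((Tq, Dh))        # encoder states h_t
s_prev = rng.standard_normal(Ds)         # previous decoder state s_{m-1}
E = rng.standard_normal(Da)
W = rng.standard_normal((Da, Ds))
V = rng.standard_normal((Da, Dh))
b = np.zeros(Da)
gamma = 2.0                              # sharpening factor

eps = np.array([E @ np.tanh(W @ s_prev + V @ h + b) for h in H])  # eps_{m,t}
alpha = np.exp(gamma * eps - np.max(gamma * eps))
alpha /= alpha.sum()                     # alpha_{m,t}, sums to 1 over t
print(alpha.round(3))
```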

62 Window Around Median. Compute the median of the previous attention weights:
\tau_m = \arg\min_{k=1,...,T/4} \left| \sum_{\rho=1}^{k} \alpha_{m-1,\rho} - \sum_{\theta=k+1}^{T/4} \alpha_{m-1,\theta} \right|
Compute attention only in a window around the median, T_m = \{\tau_m - \omega_{left}, ..., \tau_m + \omega_{right}\}:
\alpha_{m,t} = \frac{\exp(\gamma \epsilon_{m,t})}{\sum_{\tau \in T_m} \exp(\gamma \epsilon_{m,\tau})} if t \in T_m, and 0 otherwise.
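A sketch of the median-window restriction (window sizes w_left = w_right = 5 are illustrative, not from the slides): the median of the previous attention weights is located via the cumulative sum, and the sharpened softmax is renormalized inside the window only.

```python
import numpy as np

def median_window_weights(eps, alpha_prev, gamma=1.0, w_left=5, w_right=5):
    """eps: scores eps_{m,t}; alpha_prev: previous weights alpha_{m-1}."""
    Tq = len(alpha_prev)
    # median frame: left and right cumulative masses are as balanced as possible
    cum = np.cumsum(alpha_prev)
    tau = int(np.argmin(np.abs(cum - (cum[-1] - cum))))
    lo, hi = max(0, tau - w_left), min(Tq, tau + w_right + 1)
    alpha = np.zeros(Tq)
    e = np.exp(gamma * eps[lo:hi] - np.max(gamma * eps[lo:hi]))
    alpha[lo:hi] = e / e.sum()           # renormalize inside the window only
    return alpha

alpha_prev = np.full(25, 1.0 / 25)
eps = np.random.default_rng(2).standard_normal(25)
print(median_window_weights(eps, alpha_prev).round(3))
```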

63 Other Techniques (Attention).
Monotonic regularization:
r_m = \max\left(0, \sum_{\tau=1}^{T/4} \left( \sum_{i=1}^{\tau} \alpha_{m,i} - \sum_{i=1}^{\tau} \alpha_{m-1,i} \right) \right)
Curriculum learning: start with shorter sequences and gradually increase the sequence length.
Flat start: initial positions are chosen according to the speaker's speed.

64 Resources for Attention. Theano + Bricks + Blocks. Tensorflow. Keras.

65 Further Literature on Attention. [Bahdanau et al., 2016, End-to-end attention-based large vocabulary speech recognition] [Chorowski et al., 2015, Attention-Based Models for Speech Recognition] [Chorowski et al., 2015, End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results] [Kim et al., 2016, Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning]

66 Evaluation Framework. Development and evaluation set different from the training set. Levenshtein: minimum number of insertions, deletions and substitutions. (Figure: alignment of the spoken reference S P E A K E R with the recognized hypothesis T E A C H E R, each position marked as correct, substitution, insertion or deletion.)

67 Evaluation Framework. Levenshtein: minimum number of insertions, deletions and substitutions,
L(w_1^N, v_1^M) = \min_{s,t} \left\{ \sum_{i=1}^{\lambda} \left( 1 - \delta(w_{s(i)}, v_{t(i)}) \right) \right\}
with the Kronecker delta \delta(w, v) = 1 if v = w, and 0 if v \neq w.
Word Error Rate (WER):
WER(Spoken_1^R, Recognized_1^R) = \frac{\sum_{r=1}^{R} L(Spoken_r, Recognized_r)}{\sum_{r=1}^{R} |Spoken_r|}
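For completeness, a small Python sketch (standard dynamic programming, not the lecture's code) of the Levenshtein distance and the resulting WER over a list of reference/hypothesis sentence pairs:

```python
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(d[j] + 1,                     # deletion
                      d[j - 1] + 1,                 # insertion
                      prev + (0 if r == h else 1))  # match / substitution
            prev, d[j] = d[j], cur
    return d[-1]

def wer(references, hypotheses):
    errors = sum(levenshtein(r.split(), h.split()) for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words

print(wer(["the cat sat"], ["the cat sat down"]))   # 1 insertion / 3 words
```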

68 Experimental Results.
Model                                   CER    WER
Bahdanau et al. (2015)
  Attention
  Attention + bigram LM
  Attention + trigram LM
  Attention + extended trigram LM
Graves and Jaitly (2014)
  CTC
Hannun et al. (2014)
  CTC + bigram LM                       n/a    14.1
Miao et al. (2015)
  CTC + bigram LM                       n/a    26.9
  CTC for phonemes + lexicon            n/a    26.9
  CTC for phonemes + trigram LM         n/a     7.3
  CTC + trigram LM                      n/a     9.0
  Hybrid BGRU (15 h)                    n/a     2.0

69 Attention Modeling Example from Handwriting

70 Inverted Hidden Markov Model (HMM). Traditional HMM:
p(w_1^N, x_1^T) = p(w_1^N) \, p(x_1^T | w_1^N) = \prod_{n=1}^{N} p(w_n | w_1^{n-1}) \sum_{s_1^T} p(s_1^T, x_1^T | w_1^N) = \prod_{n=1}^{N} p(w_n | w_1^{n-1}) \sum_{s_1^T} \prod_{t=1}^{T} p(s_t, x_t | s_1^{t-1}, x_1^{t-1}, w_1^N)
Inverted HMM:
p(w_1^N | x_1^T) = \sum_{t_1^N} p(w_1^N, t_1^N | x_1^T) = \sum_{t_1^N} \prod_{n=1}^{N} p(w_n, t_n | w_1^{n-1}, t_1^{n-1}, x_1^T)
[Doetsch et al., 2016, Inverted HMM - a Proof of Concept]

71 Unsolved Problems for End-to-End ASR. Error rates: still higher than traditional HMM-based systems (with one exception). Global search: still a transducer-based or HMM-based search. Acoustic model: word- and character-based end-to-end learning. Language model: no integration with the language model in training yet.

72 References.
Bahdanau, D., Chorowski, J., Serdyuk, D., Brakel, P., and Bengio, Y. (2016). End-to-end attention-based large vocabulary speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada.
Collobert, R., Puhrsch, C., and Synnaeve, G. (2016). Wav2letter: an end-to-end convnet-based speech recognition system. CoRR.
Doetsch, P., Hegselmann, S., Schlüter, R., and Ney, H. (2016). Inverted HMM - a proof of concept. In Neural Information Processing Systems Workshop, Barcelona, Spain.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, New York, NY, USA. ACM.
Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, June 2014.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for unconstrained handwriting recognition. IEEE Trans. Pattern Anal. Mach. Intell., 31(5).
Kim, S., Hori, T., and Watanabe, S. (2016). Joint CTC-attention based end-to-end speech recognition using multi-task learning. CoRR.
Miao, Y., Gowayyed, M., and Metze, F. (2015). EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015.
Senior, A. W., Sak, H., de Chaumont Quitry, F., Sainath, T. N., and Rao, K. (2015). Acoustic modelling with CD-CTC-SMBR LSTM RNNs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2015, Scottsdale, AZ, USA, December 13-17, 2015.
Soltau, H., Liao, H., and Sak, H. (2016). Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. CoRR.
Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., and Courville, A. C. (2017). Towards end-to-end speech recognition with deep convolutional neural networks. CoRR.
