Large Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks

Large Vocabulary Continuous Speech Recognition with Long Short-Term Recurrent Networks
Haşim Sak, Andrew Senior, Oriol Vinyals, Georg Heigold, Erik McDermott, Rajat Monga, Mark Mao, Françoise Beaufays
andrewsenior/hasim@google.com
See Sak et al. [2014b,a]

Overview
- Recurrent neural networks
- Training RNNs
- Long short-term memory recurrent neural networks
- Distributed training of LSTM RNNs
- Acoustic modeling experiments
- Sequence training of LSTM RNNs

Recurrent neural networks
- An extension of feed-forward neural networks: the output is fed back as input with a time delay.
- A dynamic, time-varying neural network; recurrent layer activations encode a state.
- Sequence labelling, classification, prediction, mapping.
- Speech recognition [Robinson et al., 1993].
[Figure: a recurrent network with inputs x_1..x_4, recurrent units r_1..r_6 and outputs y_1..y_5.]

Backpropagation through time
- Unroll the recurrent network through time. [Figure: network unrolled over inputs x_1..x_4 and outputs y_1..y_4.]
- Truncated at some limit of bptt_steps, it looks like a DNN.
- External gradients are provided at the outputs, e.g. the gradient of the cross-entropy loss.
- Internal gradients are computed with the chain rule (backpropagation).

Simple RNN
Simple RNN architecture in two alternative representations: (a) the RNN, and (b) the RNN unrolled in time. [Figure: input x_t, hidden h_t and output y_t, connected by weight matrices W_hx, W_hh and W_yh.]
RNN hidden and output layer activations:
h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b_h)
y_t = \phi(W_{yh} h_t + b_y)
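As a concrete illustration of these two equations, here is a minimal NumPy sketch (not the authors' code) of the forward pass, with \sigma taken as tanh and \phi as a softmax; rnn_forward and its argument names are hypothetical.

```python
import numpy as np

def rnn_forward(xs, W_hx, W_hh, W_yh, b_h, b_y):
    """xs: sequence of input vectors x_1..x_T; returns hidden states and softmax outputs."""
    h = np.zeros(W_hh.shape[0])                     # h_0 = 0
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_hx @ x + W_hh @ h + b_h)      # h_t = sigma(W_hx x_t + W_hh h_{t-1} + b_h)
        a = W_yh @ h + b_y
        e = np.exp(a - a.max())                     # y_t = phi(W_yh h_t + b_y), phi = softmax here
        hs.append(h); ys.append(e / e.sum())
    return hs, ys
```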

Training RNNs
- Forward pass: calculate activations for each input sequentially and update the network state.
- Backward pass: calculate the error and back-propagate it through the network and through time (back-propagation through time, BPTT).
- Update each weight with its gradient summed over all time steps.
- Truncated BPTT: the error is truncated after a specified number of back-propagation time steps (see the sketch below).
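A minimal sketch of truncated BPTT for the simple RNN above (an illustration, not the talk's implementation): the hidden state is carried across chunks of bptt_steps frames, gradients are only propagated within a chunk, and the weights are updated once per chunk. A tanh hidden layer and a softmax/cross-entropy output are assumed; tbptt_train_utterance and the parameter-dictionary keys are hypothetical names.

```python
import numpy as np

def tbptt_train_utterance(xs, ts, p, bptt_steps=20, lr=0.01):
    """xs: list of T input vectors; ts: list of T integer targets; p: dict of parameters."""
    h = np.zeros(p["W_hh"].shape[0])                     # state carried across chunks
    for s in range(0, len(xs), bptt_steps):
        cx, ct = xs[s:s + bptt_steps], ts[s:s + bptt_steps]
        h_start, hs, ys = h, [], []
        for x in cx:                                     # forward pass over the chunk
            h = np.tanh(p["W_hx"] @ x + p["W_hh"] @ h + p["b_h"])
            a = p["W_yh"] @ h + p["b_y"]
            e = np.exp(a - a.max())
            hs.append(h); ys.append(e / e.sum())
        g = {k: np.zeros_like(v) for k, v in p.items()}  # backward pass, truncated at chunk start
        dh_next = np.zeros_like(h)
        for t in reversed(range(len(cx))):
            da = ys[t].copy(); da[ct[t]] -= 1.0          # grad of cross-entropy w.r.t. logits
            g["W_yh"] += np.outer(da, hs[t]); g["b_y"] += da
            dz = (p["W_yh"].T @ da + dh_next) * (1.0 - hs[t] ** 2)   # through tanh
            g["W_hx"] += np.outer(dz, cx[t]); g["b_h"] += dz
            g["W_hh"] += np.outer(dz, hs[t - 1] if t > 0 else h_start)
            dh_next = p["W_hh"].T @ dz                   # no gradient flows past the chunk boundary
        for k in p:                                      # one SGD update per chunk
            p[k] -= lr * g[k]
    return p
```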

Backpropagation through time
[Figure: acoustic features are processed in chunks; the state posteriors, external gradients and internal gradients of successive chunks are computed with successive parameter versions θ, θ', θ'', θ''', and the summed per-chunk gradients δθ, δθ', δθ'', δθ''' are accumulated.]

Long Short-Term Memory (LSTM) RNN
- Learning long-term dependencies is difficult with simple RNNs; training is unstable due to the vanishing gradients problem [Hochreiter, 1991].
- Limited capability (5-10 time steps) to model long-term dependencies.
- The LSTM RNN architecture was designed to address these problems [Hochreiter and Schmidhuber, 1997].
- LSTM memory block: a memory cell storing the temporal state of the network and 3 multiplicative units (gates) controlling the flow of information.

Long Short-Term Memory Recurrent Neural Networks
- Replace the units of an RNN with memory cells gated by sigmoid input, forget and output gates. [Figure: LSTM memory block with input, input gate, forget gate, cell state, output gate and output.]
- Enables long-term dependency learning.
- Reduces the vanishing/exploding gradient problems.
- Roughly 4× more parameters than a simple RNN of the same size.

LSTM RNN Architecture
- Input gate: controls the flow of input activations into the cell.
- Output gate: controls the output flow of cell activations.
- Forget gate: resets the cell state, allowing the network to process continuous input streams [Gers et al., 2000].
- Peephole connections added from the cells to the gates to learn precise timing of outputs [Gers et al., 2003].
[Figure: LSTM memory block with input x_t, gates i_t, f_t, o_t, cell c_t, nonlinearities g and h, cell output m_t and output y_t.]

LSTM RNN Related Work
- LSTM performs better than the RNN for learning context-free and context-sensitive languages [Gers and Schmidhuber, 2001].
- Bidirectional LSTM for phonetic labelling of acoustic frames on TIMIT [Graves and Schmidhuber, 2005].
- Online and offline handwriting recognition with bidirectional LSTM, better than an HMM-based system [Graves et al., 2009].
- Deep LSTM (a stack of multiple LSTM layers), combined with CTC and an RNN transducer predicting phone sequences, achieves state-of-the-art results on TIMIT [Graves et al., 2013].

LSTM RNN Activation Equations
An LSTM network computes a mapping from an input sequence x = (x_1, ..., x_T) to an output sequence y = (y_1, ..., y_T) by calculating the network unit activations using the following equations iteratively from t = 1 to T:
i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1} + W_{ic} c_{t-1} + b_i)   (1)
f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1} + W_{fc} c_{t-1} + b_f)   (2)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} m_{t-1} + b_c)   (3)
o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1} + W_{oc} c_t + b_o)   (4)
m_t = o_t \odot h(c_t)   (5)
y_t = \phi(W_{ym} m_t + b_y)   (6)
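For concreteness, a minimal NumPy sketch (not the authors' code) of one LSTM step following equations (1)-(6): \sigma is the logistic sigmoid, g and h are taken as tanh, \phi as a softmax, and the peephole terms W_ic, W_fc, W_oc are treated as diagonal (element-wise) weights, which is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, m_prev, c_prev, W, b):
    """W: dict of weight matrices/vectors, b: dict of biases (keys match the equation names)."""
    i = sigmoid(W["ix"] @ x + W["im"] @ m_prev + W["ic"] * c_prev + b["i"])   # (1) input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m_prev + W["fc"] * c_prev + b["f"])   # (2) forget gate
    c = f * c_prev + i * np.tanh(W["cx"] @ x + W["cm"] @ m_prev + b["c"])     # (3) cell state
    o = sigmoid(W["ox"] @ x + W["om"] @ m_prev + W["oc"] * c + b["o"])        # (4) output gate
    m = o * np.tanh(c)                                                        # (5) cell output
    a = W["ym"] @ m + b["y"]
    y = np.exp(a - a.max()); y /= y.sum()                                     # (6) softmax output
    return y, m, c
```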

Proposed LSTM Projected (LSTMP) RNN
- O(N) learning computational complexity per time step with stochastic gradient descent (SGD), where N is the number of parameters.
- Recurrent connections from the cell output units (n_c of them) to the cell inputs, input gates, output gates and forget gates; the cell output units are also connected to the network output units.
- The learning computational complexity is dominated by the n_c × (4 n_c + n_o) recurrent and output parameters.
- For more effective use of parameters, add a recurrent projection layer with n_r linear projections (n_r < n_c) after the LSTM layer; the dominant term becomes n_r × (4 n_c + n_o). (A numeric example follows.)
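To make the comparison concrete, a back-of-the-envelope count of only these dominant terms (input-side weights and biases ignored), using sizes that appear later in the talk purely as example values:

```python
# Illustrative dominant parameter counts for LSTM vs. LSTMP; n_c, n_r, n_o are
# borrowed from the system described later in the talk (800 cells, 512-unit
# projection, 14,000 context-dependent output states) as example values only.
n_c, n_r, n_o = 800, 512, 14000

lstm_dominant = n_c * (4 * n_c + n_o)    # recurrent + output weights, plain LSTM
lstmp_dominant = n_r * (4 * n_c + n_o)   # same connections taken from the n_r-dim projection
print(f"LSTM  dominant term: {lstm_dominant / 1e6:.1f}M parameters")   # ~13.8M
print(f"LSTMP dominant term: {lstmp_dominant / 1e6:.1f}M parameters")  # ~8.8M
```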

LSTM RNN Architectures
[Figure: (a) LSTM: input → LSTM layer → output; (b) LSTMP: input → LSTM layer → recurrent projection layer → output.]

LSTMP RNN Activation Equations
With the proposed LSTMP architecture, the equations for the activations of the network units change slightly: the m_{t-1} activation vector is replaced with r_{t-1}, and equation (12) is added:
i_t = \sigma(W_{ix} x_t + W_{im} r_{t-1} + W_{ic} c_{t-1} + b_i)   (7)
f_t = \sigma(W_{fx} x_t + W_{fm} r_{t-1} + W_{fc} c_{t-1} + b_f)   (8)
c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cm} r_{t-1} + b_c)   (9)
o_t = \sigma(W_{ox} x_t + W_{om} r_{t-1} + W_{oc} c_t + b_o)   (10)
m_t = o_t \odot h(c_t)   (11)
r_t = W_{rm} m_t   (12)
y_t = \phi(W_{yr} r_t + b_y)   (13)
where r_t denotes the recurrent (projection) unit activations.
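The change relative to the lstm_step sketch above is small, so the sketch below simply reuses it: r_{t-1} is fed in place of m_{t-1} (so the recurrent matrices W_*m now have n_r columns), lstm_step's own softmax output is discarded, and equations (12)-(13) are applied on top. W["rm"], W["yr"] and lstmp_step are hypothetical names.

```python
# Sketch of the LSTMP step built on the lstm_step() sketch above: equations
# (7)-(11) are unchanged except that r_{t-1} replaces m_{t-1}; the projection
# (12) and output (13) are added, and lstm_step's own output is ignored.
import numpy as np

def lstmp_step(x, r_prev, c_prev, W, b):
    _, m, c = lstm_step(x, r_prev, c_prev, W, b)   # (7)-(11): r_{t-1} in place of m_{t-1}
    r = W["rm"] @ m                                # (12): linear projection, dim(r) = n_r < n_c
    a = W["yr"] @ r + b["y"]
    y = np.exp(a - a.max()); y /= y.sum()          # (13): softmax over the projected vector
    return y, r, c
```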

Deep LSTM RNN Architectures
[Figure: (a) LSTM: a single LSTM layer; (b) DLSTM: a stack of LSTM layers; (c) LSTMP: an LSTM layer followed by a recurrent projection layer; (d) DLSTMP: stacked LSTM layers, each followed by a recurrent projection layer.]

Distributed Training of LSTM RNNs
- Asynchronous stochastic gradient descent (ASGD) to optimize the network parameters.
- Google Brain's distributed parameter server: stores, reads and updates the model parameters (50 shards).
- Training replicas on 200 machines (data parallelism).
- 3 synchronized threads in each machine (data parallelism).
- Each thread operates on a mini-batch of 4 sequences simultaneously.
- TBPTT: 20 time steps of forward and backward pass.
- Training step: read fresh parameters, process 3 × 4 × 20 time steps of input, send gradients to the parameter server (see the sketch below).
- Clip cell activations to the [-50, 50] range for long utterances.
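A rough sketch of what one replica's training threads do under the configuration above. The ParameterServer and model interfaces (ps.read_parameters, ps.send_gradients, model.forward_backward) are hypothetical placeholders, not Google's actual parameter-server API, and the [-50, 50] cell clipping is assumed to happen inside the forward pass.

```python
from threading import Thread

BPTT_STEPS, UTTS_PER_THREAD, THREADS_PER_REPLICA = 20, 4, 3

def training_thread(ps, chunk_stream, model):
    states = [None] * UTTS_PER_THREAD             # per-utterance LSTM states, carried across chunks
    for chunks in chunk_stream:                   # each item: 4 utterances x 20 frames
        theta = ps.read_parameters()              # fresh (possibly stale) parameters
        grads, states = model.forward_backward(chunks, states, theta)
        ps.send_gradients(grads)                  # asynchronous: no locking against other threads/replicas

def run_replica(ps, streams, model):
    threads = [Thread(target=training_thread, args=(ps, s, model))
               for s in streams[:THREADS_PER_REPLICA]]
    for t in threads: t.start()
    for t in threads: t.join()
```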

Asynchronous Stochastic Gradient Descent
[Figure: one replica with three threads (Thread 0-2), each processing 4 utterances; their internal gradients are sent to the parameter server shards.]

[Figure, continued: the same picture with 199 more replicas, all sending gradients to the same parameter server shards.]

Asynchrony
Three forms of asynchrony:
- Within a replica, each bptt_steps-frame chunk is computed with different parameters; the state is carried over from one chunk to the next.
- Each replica updates independently.
- Each shard of the parameter server is updated independently.

System
- Google Voice Search in US English.
- 3M (1,900 hours) anonymized 8 kHz training utterances; 600M 25 ms frames (10 ms offset).
- Normalized 40-dimensional log-filterbank energy features.
- 3-state HMMs with 14,000 context-dependent states.
- Cross-entropy loss; targets from a DNN Viterbi forced alignment; 5-frame output delay.
- Hybrid unidirectional DLSTMP: 2 layers of 800 cells with a 512-unit linear projection layer; 13M parameters.

Evaluation
- Scale posteriors by priors for inference; deweight the silence prior (see the sketch below).
- Evaluate ASR on a test set of 22,500 utterances.
- First-pass LM of 23 million n-grams; lattice rescoring with an LM of 1 billion 5-grams.
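A minimal sketch of the standard hybrid-decoding step this refers to: the frame state posteriors p(s | x_t) from the network are divided by the state priors p(s), in the log domain, to give pseudo log-likelihoods for the decoder. The additional deweighting of the silence prior is not reproduced here, since the talk does not give the factor; pseudo_log_likelihoods is a hypothetical name.

```python
import numpy as np

def pseudo_log_likelihoods(log_posteriors, state_priors):
    """log_posteriors: (T, S) frame log-posteriors; state_priors: (S,) prior p(s)."""
    # log p(x_t|s) = log p(s|x_t) - log p(s) + const; the constant is irrelevant for decoding
    return log_posteriors - np.log(state_priors)
```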

Results for LSTM RNN Acoustic Models
WERs and frame accuracies on development and training sets. L: number of layers, for shallow (1L) and deep (2, 4, 5, 7L) networks; C: number of memory cells; N: total number of parameters.

C    Depth  N    Dev acc. (%)  Train acc. (%)  WER (%)
840  5L     37M  67.7          70.7            10.9
440  5L     13M  67.6          70.1            10.8
600  2L     13M  66.4          68.5            11.3
385  7L     13M  66.2          68.5            11.2
750  1L     13M  63.3          65.5            12.4

Results for LSTMP RNN Acoustic Models
WERs and frame accuracies on development and training sets. L: number of layers, for shallow (1L) and deep (2L, 3L) networks; C: number of memory cells; P: number of recurrent projection units; N: total number of parameters.

C     P    Depth  N    Dev acc. (%)  Train acc. (%)  WER (%)
6000  800  1L     36M  67.3          74.9            11.8
2048  512  2L     22M  68.8          72.0            10.8
1024  512  3L     20M  69.3          72.5            10.7
1024  512  2L     15M  69.0          74.0            10.7
800   512  2L     13M  69.0          72.7            10.7
2048  512  1L     13M  67.3          71.8            11.3

LSTMP RNN models with various depths and sizes

C     P    Depth  N    WER (%)
1024  512  3L     20M  10.7
1024  512  2L     15M  10.7
800   512  2L     13M  10.7
700   400  2L     10M  10.8
600   350  2L     8M   10.9

Sequence training
- Conventional training minimizes the frame-level cross-entropy between the output and the target distribution given by a forced alignment.
- Alternative criteria come closer to approximating the word error rate and take the language model into account: instead of driving the output probabilities closer to the targets, adjust the parameters to correct the mistakes we see when decoding actual utterances.
- Since these criteria are computed on whole sequences, this is sequence-discriminative training [Kingsbury, 2009], e.g. Maximum Mutual Information or state-level Minimum Bayes Risk.

Sequence training criteria
Maximum mutual information is defined as:
F_{MMI}(\theta) = \frac{1}{T} \sum_u \log \frac{p_\theta(X_u \mid W_u)^\kappa \, p(W_u)}{\sum_W p_\theta(X_u \mid W)^\kappa \, p(W)}   (14)
State-level Minimum Bayes Risk (sMBR) is the expected frame state accuracy:
F_{sMBR}(\theta) = \frac{1}{T} \sum_u \sum_W \frac{p_\theta(X_u \mid W)^\kappa \, p(W)}{\sum_{W'} p_\theta(X_u \mid W')^\kappa \, p(W')} \sum_t \delta(s_t^W, s_{ut})   (15)

Sequence training details
- Discard frames with state occupancy close to zero [Veselý et al., 2013].
- Use a weak language model p(W_u) and attach the reciprocal of the language model weight, κ, to the acoustic model.
- No regularization (such as l_2-regularization around the initial network) or smoothing (such as the H-criterion [Su et al., 2013]).

Computing gradients
Figure: pipeline to compute the outer derivatives. [A speech utterance's features pass through the LSTM input, hidden and output layers; aligning / lattice-rescoring the reference gives the numerator lattice, and HCLG decoding / lattice rescoring gives the denominator lattice. A forward-backward pass over each lattice, combined as f(numerator, denominator), yields the MMI / sMBR outer derivatives.]
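For reference, the standard form of the MMI outer derivative (as in Veselý et al. [2013]; the talk itself does not write it out, and the 1/T normalisation is omitted here): the gradient with respect to the log acoustic score of state s at frame t of utterance u is the difference of the numerator and denominator occupancies produced by the two forward-backward passes, scaled by κ.

```latex
% Difference of numerator and denominator state occupancies, scaled by kappa
\frac{\partial \mathcal{F}_{\mathrm{MMI}}}{\partial \log p_\theta(x_{ut} \mid s)}
  = \kappa \left( \gamma^{\mathrm{num}}_{ut}(s) - \gamma^{\mathrm{den}}_{ut}(s) \right)
```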

Algorithm
U ← the data set of utterances with transcripts
U ← randomize(U)
θ: the model parameters
for all u ∈ U do
    θ ← read from the parameter server
    calculate the (MMI / sMBR) outer derivatives, scaled by κ, for u over the full sequence
    for all s ∈ subsequences(u, bptt_steps) do
        θ ← read from the parameter server
        forward_pass(s, θ, bptt_steps)
        ∇θ ← backward_pass(s, θ, bptt_steps)
        ∇θ ← sum_gradients(∇θ, bptt_steps)
        send ∇θ to the parameter server
    end for
end for
[Figure: the acoustic features, state posteriors and external gradients are computed over the full sequence, while the internal gradients are computed per bptt_steps batch, each batch possibly with different parameters θ', θ'', θ''', θ''''.]

Asynchronous sequence training system
Figure: Asynchronous SGD: model replicas asynchronously fetch parameters θ and push gradients ∇θ to the parameter server, which applies θ' = θ - η∇θ. [Each model replica runs an LSTM producing posteriors for a decoder, which computes MMI / sMBR outer gradients from the utterance features and transcript; the distributed trainer then runs BPTT and sends ∇θ.]

Results: Choice of Language Model
How powerful should the sequence training language model be?
Table: WERs for sMBR training with LMs of various n-gram orders.

CE    1-gram  2-gram  3-gram
10.7  10.9    10.0    10.1

Results
Table: WERs for sMBR training of the LSTM RNN, bootstrapped with CE training, on DNN versus LSTM RNN alignments.

Alignment           CE    sMBR
DNN alignment       10.7  10.1
LSTM RNN alignment  10.7  10.0

Switching from CE to sequence training
Table: WERs achieved by MMI/sMBR training for around 3 days when we switch from CE training at different times before convergence. * indicates the best WER achieved after 2 weeks of sMBR training.

CE WER at switch  MMI   sMBR
15.9              13.8  -
14.9              12.0  -
12.0              10.8  10.7
11.2              10.8  10.3
10.7              10.5  10.0 (9.8*)

An 85M-parameter DNN achieves 11.3% WER (CE) and 10.4% WER (sMBR).

Conclusions
- LSTMs for LVCSR outperform much larger DNNs, both CE-trained (5%) and sequence-trained (6%).
- Distributed sequence training for LSTMs was a straightforward extension of DNN sequence training.
- LSTMs for LVCSR improved (8% relative) by sequence training.
- sMBR gives better results than MMI.
- Sequence training needs to start from a converged model.

Ongoing work
- Alternative architectures: bidirectional; modelling units.
- Other tasks: noise robustness, speaker ID, language ID, pronunciation modelling, language modelling, keyword spotting.

Bibliography
Felix A. Gers and Jürgen Schmidhuber. LSTM recurrent networks learn simple context free and context sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333-1340, 2001.
Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115-143, March 2003.
Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 12:5-6, 2005.
Alex Graves, Marcus Liwicki, Santiago Fernandez, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855-868, 2009.
Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP, 2013.
S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master's thesis, T.U. München, 1991.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, November 1997.
B. Kingsbury. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 3761-3764, Taipei, Taiwan, April 2009.
A. J. Robinson, L. Almeida, J.-M. Boite, H. Bourlard, F. Fallside, M. Hochberg, D. Kershaw, P. Kohn, Y. Konig, N. Morgan, J. P. Neto, S. Renals, M. Saerens, and C. Wooters. A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project. In Proc. EUROSPEECH '93, pages 1941-1944, 1993.
H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao. Sequence discriminative distributed training of long short-term memory recurrent neural networks. In Proc. Interspeech, 2014a.
Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proc. Interspeech, 2014b.
H. Su, G. Li, D. Yu, and F. Seide. Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 6664-6668, 2013.
K. Veselý, A. Ghoshal, L. Burget, and D. Povey. Sequence-discriminative training of deep neural networks. In Proc. Interspeech, 2013.