Lecture 14. Advanced Neural Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom


Lecture 14: Advanced Neural Networks
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{picheny,bhuvana,stanchen,nussbaum}@us.ibm.com
27th April 2016

Variants of Neural Network Architectures 2 / 72
Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) (unidirectional and bidirectional), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Constraints and Regularization, Attention model.

Training 3 / 72
Observations and labels (x_n, a_n) ∈ R^D × A for n = 1, ..., N.
Training criteria:
Cross-entropy: F_CE(θ) = -1/N Σ_{n=1}^{N} log P(a_n | x_n, θ)
Sequence loss: F_L(θ) = 1/N Σ_{n=1}^{N} Σ_{ω} Σ_{a_1^{T_n} ∈ ω} P(a_1^{T_n} | x_1^{T_n}, θ) L(ω, ω_n), with loss L
Optimization: θ̂ = argmin_θ {F(θ)}
θ, θ̂: free parameters of the model (NN, GMM).
ω, ω_n: word sequences.
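As a concrete illustration, here is a minimal NumPy sketch of the frame-level cross-entropy criterion F_CE only (the sequence-level criterion F_L needs lattices or a full-sum over state sequences and is omitted); the posteriors and labels are made-up toy values.

```python
import numpy as np

def cross_entropy_criterion(posteriors, labels):
    """F_CE(theta) = -1/N * sum_n log P(a_n | x_n, theta).

    posteriors: (N, A) array of P(a | x_n, theta), rows sum to 1.
    labels:     (N,) array of target state indices a_n.
    """
    N = posteriors.shape[0]
    return -np.mean(np.log(posteriors[np.arange(N), labels] + 1e-12))

# Toy example: 3 frames, 4 HMM states.
posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.6, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(cross_entropy_criterion(posteriors, labels))
```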

Recap: Gaussian Mixture Model 4 / 72
Recap Gaussian Mixture Model:
P(x_1^T | ω) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1})
ω: word sequence
x_1^T := x_1, ..., x_T: feature sequence
a_1^T := a_1, ..., a_T: HMM state sequence
Emission probability P(x | a) ~ N(μ_a, Σ_a) Gaussian.
Replace it with a neural network → hybrid model.
Use a neural network for feature extraction → bottleneck features.

Hybrid Model 5 / 72
Gaussian Mixture Model: P(x_1^T | ω) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1}), with emission P(x_t | a_t) and transition P(a_t | a_{t-1}).
Training: the neural network usually models the posterior P(a | x), while the HMM needs the emission P(x | a).
Recognition: use it as a hybrid model for speech recognition:
P(a | x) / P(a) = P(x, a) / (P(x) P(a)) = P(x | a) / P(x)
P(x | a) / P(x) and P(x | a) are proportional, since P(x) does not depend on a.
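A minimal sketch of the hybrid trick described above, with invented names and toy values: the network posterior P(a|x) is divided by the state prior P(a), usually in the log domain, to obtain a quantity proportional to the emission likelihood P(x|a).

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """log [P(a|x) / P(a)] = log P(x|a) - log P(x), proportional to log P(x|a)."""
    return log_posteriors - log_priors

# Toy example: posteriors over 4 states for one frame, priors e.g. from state counts.
posteriors = np.array([0.7, 0.1, 0.1, 0.1])
priors = np.array([0.4, 0.3, 0.2, 0.1])
print(scaled_log_likelihoods(np.log(posteriors), np.log(priors)))
```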

Hybrid Model and Bayes Decision Rule 6 / 72
ω̂ = argmax_ω { P(ω) P(x_1^T | ω) }
  = argmax_ω P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} [ P(x_t | a_t) / P(x_t) ] P(a_t | a_{t-1})
  = argmax_ω [ P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1}) ] / Π_{t=1}^{T} P(x_t)
  = argmax_ω P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1})

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 7 / 72

Recap: Deep Neural Network (DNN) 8 / 72
A feed-forward network: it consists of an input layer, multiple hidden layers, and an output layer. Each hidden and output layer consists of nodes.

Recap: Deep Neural Network (DNN) 9 / 72
Free parameters: weights W and biases b. The output of a layer is the input to the next layer. Each node applies a linear transformation followed by a non-linear activation to its input. The output layer relates the output of the last hidden layer to the target states.

Neural Network Layer 10 / 72
Number of nodes: n_l in layer l.
Input from the previous layer: y^(l-1) ∈ R^{n_{l-1}}
Weight and bias: W^(l) ∈ R^{n_l × n_{l-1}}, b^(l) ∈ R^{n_l}.
Activation: y^(l) = σ(W^(l) y^(l-1) + b^(l)), where the affine part W^(l) y^(l-1) + b^(l) is linear and σ is non-linear.
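A minimal NumPy sketch of one such layer; the weights, dimensions, and the choice of sigmoid are toy assumptions for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def layer_forward(y_prev, W, b, activation=sigmoid):
    """One layer: y^(l) = sigma(W^(l) y^(l-1) + b^(l))."""
    return activation(W @ y_prev + b)

# Toy example: n_{l-1} = 3 inputs, n_l = 2 nodes.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # W^(l) in R^{n_l x n_{l-1}}
b = np.zeros(2)                   # b^(l) in R^{n_l}
y_prev = rng.standard_normal(3)
print(layer_forward(y_prev, W, b))
```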

Deep Neural Network (DNN) 11 / 72

Activation Function Zoo 12 / 72
Sigmoid: σ_sigmoid(y) = 1 / (1 + exp(-y))
Hyperbolic tangent: σ_tanh(y) = tanh(y) = 2 σ_sigmoid(2y) - 1
Rectified Linear Unit (ReLU): σ_relu(y) = y if y > 0, else 0

Activation Function Zoo 13 / 72
Parametric ReLU (PReLU): σ_prelu(y) = y if y > 0, else a·y
Exponential Linear Unit (ELU): σ_elu(y) = y if y > 0, else a·(exp(y) - 1)
Maxout: σ_maxout(y_1, ..., y_I) = max_i { W_i y^(l-1) + b_i }
Softmax: σ_softmax(y) = ( exp(y_1)/Z(y), ..., exp(y_I)/Z(y) ) with Z(y) = Σ_j exp(y_j)
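A small NumPy sketch of these activations (maxout is omitted since it needs several affine pieces); the default slopes a and the test vector are arbitrary toy choices.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def tanh(y):
    return 2.0 * sigmoid(2.0 * y) - 1.0      # identical to np.tanh(y)

def relu(y):
    return np.where(y > 0, y, 0.0)

def prelu(y, a=0.25):
    return np.where(y > 0, y, a * y)

def elu(y, a=1.0):
    return np.where(y > 0, y, a * (np.exp(y) - 1.0))

def softmax(y):
    e = np.exp(y - np.max(y))                # shift for numerical stability
    return e / np.sum(e)

y = np.array([-2.0, -0.5, 0.0, 1.5])
for f in (sigmoid, tanh, relu, prelu, elu, softmax):
    print(f.__name__, f(y))
```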

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 14 / 72

Multilingual Bottleneck 15 / 72

Multilingual Bottleneck 16 / 72
Encoder-decoder architecture: a DNN with a bottleneck. Forces a low-dimensional representation of speech across multiple languages. Several languages are presented to the network in random order.
Training: labels from the different languages.
Recognition: the network is cut off after the bottleneck.

Why Multilingual Bottlenecks? 17 / 72
Train multilingual bottleneck features with lots of data. Later, reuse the bottleneck features on different tasks to train a GMM system: no expensive DNN training, but WER gains similar to a DNN.

Multilingual Bottleneck: Performance 18 / 72
WER [%]
Model                       FR     EN     DE     PL
MFCC                        23.6   28.6   23.3   18.1
MLP BN targets              19.3   23.1   19.0   14.5
MLP BN multi                18.7   21.3   17.9   14.0
deep BN targets             17.4   20.3   17.3   13.0
deep BN multi               17.1   19.7   16.4   12.6
 + lang. dep. hidden layer  16.8   19.7   16.2   12.4

More Fancy Models 19 / 72 Convolutional Neural Networks. Recurrent Neural Networks: Long Short-Term Memory (LSTM) RNNs, Gated Recurrent Unit (GRU) RNNs. Unstable Gradient Problem.

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 20 / 72

Convolutional Neural Networks (CNNs) 21 / 72
Convolution (remember signal analysis?): (x_1 * x_2)[k] = Σ_i x_1[k - i] · x_2[i]


Convolutional Neural Networks (CNNs) 23 / 72

CNNs 24 / 72
Consists of multiple feature maps with channels and kernels. Kernels are convolved across the input. Multidimensional input: 1D (frequency), 2D (time-frequency), 3D (time-frequency-?). Neurons are connected to local receptive fields of the input. Weights are shared across multiple receptive fields.

Formal Definition: Convolutional Neural Networks 25 / 72
Free parameters: feature maps (kernels) W_n ∈ R^{C × k} and bias b_n, for n = 1, ..., N; c = 1, ..., C channels; kernel size k ∈ N.
Activation function:
y_{n,i} = σ(W_n * x_i + b_n) = σ( Σ_{c=1}^{C} Σ_{j=i-k}^{i+k} W_{n,c,i-j} x_{c,j} + b_n )
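A minimal NumPy sketch of a 1D convolutional layer over channels; as in most deep-learning toolkits it computes a cross-correlation, which matches the slide's sum up to reversing the kernel index, and the zero padding, ReLU non-linearity, and toy sizes are assumptions for the example.

```python
import numpy as np

def conv1d(x, W, b):
    """y[n, i] = relu( sum_c sum_j W[n, c, j] * x[c, i + j - k] + b[n] )
    x: (C, T) input, W: (N, C, 2k+1) kernels, b: (N,) biases."""
    N, C, width = W.shape
    k = width // 2
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (k, k)))             # zero-pad the time axis
    y = np.empty((N, T))
    for n in range(N):
        for i in range(T):
            y[n, i] = np.sum(W[n] * xp[:, i:i + width]) + b[n]
    return np.maximum(y, 0.0)                    # ReLU non-linearity

# Toy example: 3 input channels, 40 frames, 5 feature maps, kernel width 2k+1 = 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 40))
W = rng.standard_normal((5, 3, 3)) * 0.1
b = np.zeros(5)
print(conv1d(x, W, b).shape)   # (5, 40)
```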

Pooling 26 / 72
Max-pooling: pool(y_{n,c,i}) = max_{j = i-k, ..., i+k} { y_{n,c,j} }
Average-pooling: average(y_{n,c,i}) = 1/(2k + 1) Σ_{j=i-k}^{i+k} y_{n,c,j}
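A short NumPy sketch of both pooling operations; it uses stride 1 and edge padding so the output keeps the input length, which is an illustrative simplification (real CNNs typically pool with a stride).

```python
import numpy as np

def max_pool(y, k):
    """Max-pooling with window j = i-k, ..., i+k along the last axis."""
    yp = np.pad(y, ((0, 0), (k, k)), mode="edge")
    return np.stack([yp[:, i:i + 2 * k + 1].max(axis=1) for i in range(y.shape[1])], axis=1)

def average_pool(y, k):
    """Average-pooling: mean over the window j = i-k, ..., i+k."""
    yp = np.pad(y, ((0, 0), (k, k)), mode="edge")
    return np.stack([yp[:, i:i + 2 * k + 1].mean(axis=1) for i in range(y.shape[1])], axis=1)

y = np.arange(12.0).reshape(2, 6)
print(max_pool(y, 1))
print(average_pool(y, 1))
```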

CNN vs. DNN: Performance 27 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features. (CE = cross-entropy training, ST = sequence training.)

Table: Broadcast News 50 h, WER [%]
Model      CE     ST
GMM        18.8   n/a
DNN        16.2   14.9
CNN        15.8   13.9
CNN+DNN    15.1   13.2

Table: Broadcast Conversation 2000 h, WER [%]
Model      CE     ST
DNN        11.7   10.3
CNN        12.6   10.4
DNN+CNN    11.3   9.6

VGG 28 / 72
Table: The configurations of our very deep CNNs for LVCSR. In all but the classic convnet, convolutional layers have 3x3 kernels, thus kernel size is omitted. The depth of the networks increases from left to right. The deepest configuration, WDX, has 10 convolutional and 4 fully connected layers. The leftmost column indicates the number of output feature maps in each layer. The optional X means there are four fully connected layers instead of three (output layer included).
[Table not fully recoverable from the transcription: columns Classic [16, 17, 18], VB(X), VC(X), VD(X), WD(X); rows list the conv(in, out) and pooling blocks at 64, 128, 256, and 512 output feature maps, followed by FC 2048 layers, an FC output layer, and a softmax.]

29 / 72
Fig. 1. Multilingual VBX network with the last three layers untied: shared convolutional and pooling layers, with language-dependent FC and softmax layers for KUR, TOK, CEB, KAZ, TEL, LIT. FC stands for Fully Connected layers. Context +/-5.

VGG: Performance 30 / 72
From the paper: even in the monolingual case (3 hours of data), the VBX CNN architecture outperforms both the classical CNN and the baseline DNN.

Switchboard 300:
Model                     WER    # params (M)   # M frames
Classic 512 [17]          13.2   41.2           1200
Classic 256 ReLU (A+S)    13.8   58.7           290
VCX (6 conv) (A+S)        13.1   36.9           290
VDX (8 conv) (A+S)        12.3   38.4           170
WDX (10 conv) (A+S)       12.2   41.3           140
VDX (8 conv) (S)          11.9   38.4           340
WDX (10 conv) (S)         11.8   41.3           320

Table 5. Results on Hub5'00 SWB after training on the 262-hour SWB-1 dataset. We obtain 14.5% relative improvement over our baseline adaptation of the classical CNN and 10.6% relative improvement over [17]. (A+S) means Adadelta + SGD finetuning. (S) means the model was trained from random initialization using SGD. The last column gives the number of frames until convergence.

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 31 / 72

Recurrent Neural Networks (RNNs) 32 / 72 DNNs are deep in layers. RNNs are deep in time (in addition). Shared weights and biases across time steps.

Unfolded RNN 33 / 72

DNN vs. RNN 34 / 72

Formal Definition: RNN 35 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Free parameters:
Input-to-hidden weight: W ∈ R^{n_l × n_{l-1}}
Hidden-to-hidden weight: R ∈ R^{n_l × n_l}
Bias: b ∈ R^{n_l}
Output: iterate for t = 1, ..., T: h_t = σ(W x_t + R h_{t-1} + b)
Compare with the DNN: h_t = σ(W x_t + b)
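A minimal NumPy sketch of the recurrence above; the tanh non-linearity, zero initial state h_0, and toy dimensions are assumptions for the example.

```python
import numpy as np

def rnn_forward(x, W, R, b):
    """Iterate h_t = sigma(W x_t + R h_{t-1} + b) for t = 1, ..., T."""
    T = x.shape[0]
    n = b.shape[0]
    h = np.zeros((T, n))
    h_prev = np.zeros(n)                      # h_0 = 0
    for t in range(T):
        h_prev = np.tanh(W @ x[t] + R @ h_prev + b)
        h[t] = h_prev
    return h

# Toy example: T = 10 frames, D = 4 inputs, n_l = 3 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
W, R, b = rng.standard_normal((3, 4)), rng.standard_normal((3, 3)) * 0.1, np.zeros(3)
print(rnn_forward(x, W, R, b).shape)   # (10, 3)
```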

BackPropagation Through Time (BPTT) 36 / 72
Chain rule through time: the gradient with respect to h_t collects contributions from all later time steps,
dF(θ)/dh_t = Σ_{τ=t+1}^{T} dF(θ)/dh_τ · dh_τ/dh_t

BackPropagation Through Time (BPTT) 37 / 72 Implementation: Unfold RNN over time through t = 1,..., T. Forward propagate RNN. Backpropagate error through unfolded network. Faster than other optimization methods (e.g. evolutionary search). Difficulty with local optima.

Bidirectional RNN (BRNN) 38 / 72 Forward RNN processes data forward left to right. Backward RNN processes data backward right to left. Output joins the output of forward and backward RNN.

Formal Definition: BRNN 39 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Forward and backward hidden outputs: h^→_t, h^←_t, t = 1, ..., T
Forward and backward free parameters:
Input-to-hidden weights: W^→, W^← ∈ R^{n_l × n_{l-1}}
Hidden-to-hidden weights: R^→, R^← ∈ R^{n_l × n_l}
Biases: b^→, b^← ∈ R^{n_l}
Forward output: iterate for t = 1, ..., T: h^→_t = σ(W^→ x_t + R^→ h^→_{t-1} + b^→)
Backward output: iterate for t = T, ..., 1: h^←_t = σ(W^← x_t + R^← h^←_{t+1} + b^←)
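A short NumPy sketch of the bidirectional variant: one RNN reads the sequence left to right, another right to left, and their outputs are joined per time step (here by concatenation, one common choice); parameters and sizes are toy assumptions.

```python
import numpy as np

def rnn(x, W, R, b):
    h, out = np.zeros(b.shape[0]), []
    for t in range(x.shape[0]):
        h = np.tanh(W @ x[t] + R @ h + b)
        out.append(h)
    return np.array(out)

def brnn_forward(x, fwd, bwd):
    """Forward RNN reads t = 1..T, backward RNN reads t = T..1; outputs are joined."""
    h_fwd = rnn(x, *fwd)
    h_bwd = rnn(x[::-1], *bwd)[::-1]          # reverse back to align time steps
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))
make = lambda: (rng.standard_normal((3, 4)), rng.standard_normal((3, 3)) * 0.1, np.zeros(3))
print(brnn_forward(x, make(), make()).shape)   # (10, 6)
```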

RNN using Memory Cells 40 / 72
Equip an RNN with a memory cell that can store information for a long time. Introduce gating units that control: activations going in, activations going out, saving activations, and forgetting activations.

Long Short-Term Memory RNN 41 / 72

Formal Definition: LSTM 42 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)                  (block input)
i_t = σ(W_i x_t + R_i h_{t-1} + P_i ⊙ c_{t-1} + b_i)  (input gate)
f_t = σ(W_f x_t + R_f h_{t-1} + P_f ⊙ c_{t-1} + b_f)  (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}                       (cell state)
o_t = σ(W_o x_t + R_o h_{t-1} + P_o ⊙ c_t + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                                 (block output)
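A minimal NumPy sketch of one LSTM time step with peephole connections, following the equations above; tanh is used for the block input (the slide writes σ generically), the peepholes P are element-wise vectors, and all parameter values and sizes are toy assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step with peephole connections."""
    z = np.tanh(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])                       # block input
    i = sigmoid(p["Wi"] @ x_t + p["Ri"] @ h_prev + p["Pi"] * c_prev + p["bi"])    # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Rf"] @ h_prev + p["Pf"] * c_prev + p["bf"])    # forget gate
    c = i * z + f * c_prev                                                        # cell state
    o = sigmoid(p["Wo"] @ x_t + p["Ro"] @ h_prev + p["Po"] * c + p["bo"])         # output gate
    h = o * np.tanh(c)                                                            # block output
    return h, c

# Toy example: D = 4 inputs, n = 3 cells.
rng = np.random.default_rng(0)
D, n = 4, 3
p = {k: rng.standard_normal((n, D)) for k in ("Wz", "Wi", "Wf", "Wo")}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ("Rz", "Ri", "Rf", "Ro")})
p.update({k: rng.standard_normal(n) for k in ("Pi", "Pf", "Po")})
p.update({k: np.zeros(n) for k in ("bz", "bi", "bf", "bo")})
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, D)):
    h, c = lstm_step(x_t, h, c, p)
print(h)
```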

LSTM: Too many connections? 43 / 72
Some of the connections in the LSTM are not necessary [1]: peepholes do not seem to be necessary, and the input and forget gates can be coupled. Simplified LSTM → Gated Recurrent Unit (GRU).

Gated Recurrent Unit (GRU) 44 / 72 References: [2, 3, 4]

Formal Definition: GRU 45 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
r_t = σ(W_r x_t + R_r h_{t-1} + b_r)           (reset gate)
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)           (update gate)
h̃_t = σ(W_h x_t + R_h (r_t ⊙ h_{t-1}) + b_h)   (candidate gate)
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t          (output gate)
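A minimal NumPy sketch of one GRU time step following these equations; tanh is used for the candidate (the slide writes σ generically), and parameter values and sizes are toy assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def gru_step(x_t, h_prev, p):
    """One GRU time step."""
    r = sigmoid(p["Wr"] @ x_t + p["Rr"] @ h_prev + p["br"])              # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])              # update gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Rh"] @ (r * h_prev) + p["bh"])   # candidate
    return z * h_prev + (1.0 - z) * h_cand                               # output

rng = np.random.default_rng(0)
D, n = 4, 3
p = {k: rng.standard_normal((n, D)) for k in ("Wr", "Wz", "Wh")}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ("Rr", "Rz", "Rh")})
p.update({k: np.zeros(n) for k in ("br", "bz", "bh")})
h = np.zeros(n)
for x_t in rng.standard_normal((5, D)):
    h = gru_step(x_t, h, p)
print(h)
```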

CNN vs. DNN vs. RNN: Performance 46 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast News 50 h, WER [%]
Model            CE     ST
GMM              18.8   n/a
DNN              16.2   14.9
CNN              15.8   13.9
BGRU (fMLLR)     14.9   n/a
BLSTM (fMLLR)    14.8   n/a
BGRU (Log-Mel)   14.1   n/a

DNN vs. CNN vs. RNN: Performance 47 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast Conversation 2000 h, WER [%]
Model          CE     ST
DNN            11.7   10.3
CNN            12.6   10.4
RNN            11.5   9.9
DNN+CNN        11.3   9.6
RNN+CNN        11.2   9.4
DNN+RNN+CNN    11.1   9.4

RNN Black Magic 48 / 72
Unrolling the RNN in training:
- whole utterance [5],
- vs. truncated BPTT with carryover [6]: split the utterance into subsequences of e.g. 21 frames, carry over the last cell state from the previous subsequence to the new subsequence, and compose a minibatch from subsequences,
- vs. truncated BPTT with overlap: split the utterance into subsequences of e.g. 21 frames, overlap subsequences by 10, and compose a minibatch of subsequences from different utterances.
Gradient clipping of the LSTM cell.

RNN Black Magic 49 / 72
Recognition:
- unrolling the RNN over the whole utterance,
- vs. unrolling over subsequences: split the utterance into subsequences of e.g. 21 frames and carry over the last cell state from the previous subsequence to the new subsequence,
- vs. unrolling over a spectral window [7]: for each frame, unroll over the spectral window; the last RNN layer only returns the center/last frame.

Highway Network 50 / 72 References: [2, 3, 4]

Formal Definition: Highway Network 51 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
z_t = σ(W_z x_t + b_z)               (highway gate)
h̃_t = σ(W_h x_t + b_h)               (candidate gate)
h_t = z_t ⊙ x_t + (1 - z_t) ⊙ h̃_t    (output gate)
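A minimal NumPy sketch of one highway layer; note that because the gate mixes the input x with the candidate, the layer must preserve the input dimension. The tanh candidate non-linearity and toy parameters are assumptions; in practice the gate bias is often initialized negative so the layer initially carries the input through.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def highway_layer(x, Wz, bz, Wh, bh):
    """h = z * x + (1 - z) * h_cand: z controls how much of the input is carried through."""
    z = sigmoid(Wz @ x + bz)            # highway (carry) gate
    h_cand = np.tanh(Wh @ x + bh)       # candidate transformation
    return z * x + (1.0 - z) * h_cand

# Toy example with input dimension D = 5.
rng = np.random.default_rng(0)
D = 5
x = rng.standard_normal(D)
print(highway_layer(x, rng.standard_normal((D, D)), np.zeros(D),
                    rng.standard_normal((D, D)), np.zeros(D)))
```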

Formal Definition: Highway GRU 52 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
r_t = σ(W_r x_t + R_r h_{t-1} + b_r)                              (reset gate)
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)                              (update gate)
d_t = σ(W_d x_t + R_d h_{t-1} + b_d)                              (highway gate)
h̃_t = σ(W_h x_t + R_h (r_t ⊙ h_{t-1}) + b_h)                      (candidate gate)
h_t = d_t ⊙ x_t + (1 - d_t) ⊙ (z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t)   (output gate)

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 53 / 72

Unstable Gradient Problem 54 / 72
Happens in deep as well as in recurrent neural networks. If the gradient becomes very small → vanishing gradient. If the gradient becomes very large → exploding gradient.
Simplified neural network (the w_i are just scalars):
F(w_1, ..., w_N) = L(σ(y_N)) = L(σ(w_N σ(y_{N-1}))) = L(σ(w_N σ(w_{N-1} ... σ(w_1 x_t) ...)))

Unstable Gradient Problem, Constraints and Regularization 55 / 72
Gradient:
dF(w_1, ..., w_N)/dw_1 = dL/dσ(y_N) · dσ(w_N σ(w_{N-1} ... σ(w_1 x_t) ...))/dw_1
= dL/dσ(y_N) · σ'(y_N) w_N · σ'(y_{N-1}) w_{N-1} · ... · σ'(y_1) x_t
If |w_i σ'(y_i)| < 1 for i = 2, ..., N → the gradient vanishes.
If |w_i σ'(y_i)| >> 1 for i = 2, ..., N → the gradient explodes.

Solution: Unstable Gradient Problem 56 / 72
Gradient clipping. Weight constraints. Let the network save activations over layers/time steps:
y_new = α y_previous + (1 - α) y_common
→ Long Short-Term Memory RNN, Highway Neural Network (>100 layers)

Gradient Clipping 57 / 72
Keeps the gradient in range. One approach to deal with the exploding gradient problem. Ensure the gradient is in the range [-c, c] for a constant c:
clip(dF/dθ, c) = min(c, max(-c, dF/dθ))
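A one-function NumPy sketch of this element-wise clipping; the gradient values and the threshold c = 1.0 are toy choices.

```python
import numpy as np

def clip_gradient(grad, c):
    """Element-wise clipping of the gradient to the range [-c, c]."""
    return np.minimum(c, np.maximum(-c, grad))

grad = np.array([-12.0, 0.3, 4.7, 0.0])
print(clip_gradient(grad, 1.0))   # [-1.   0.3  1.   0. ]
```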

Constraints (I) 58 / 72
Keep the weights in range (e.g. for ReLU, Maxout). Ignored for gradient backpropagation: the constraints are enforced after the gradient update.

Constraints (II) 59 / 72
Max-norm: force ‖W‖_2 ≤ c for a constant c:
W_max = W · min(‖W‖_2, c) / ‖W‖_2
Unity-norm: force ‖W‖_2 = 1:
W_unity = W / ‖W‖_2
Positivity-norm: force W ≥ 0:
W_+ = max(0, W)
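A small NumPy sketch of these projections, applied after the gradient update as described above; the Frobenius norm of the whole matrix is used here for simplicity (in practice the norm is often constrained per row or column), and the test matrix and threshold are toy values.

```python
import numpy as np

def max_norm(W, c):
    """Rescale W so that ||W|| <= c (Frobenius norm used here)."""
    norm = np.linalg.norm(W)
    return W if norm <= c else W * (c / norm)

def unity_norm(W):
    """Rescale W so that ||W|| = 1."""
    return W / np.linalg.norm(W)

def positivity(W):
    """Project W onto the non-negative orthant: W_+ = max(0, W)."""
    return np.maximum(0.0, W)

W = np.array([[3.0, -4.0], [0.0, 12.0]])
print(max_norm(W, 5.0))
print(unity_norm(W))
print(positivity(W))
```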

Regularization: Dropout 60 / 72
Dropout prevents getting stuck in a local optimum and avoids overfitting.


Regularization: Dropout 62 / 72
Input vector sequence: x_t ∈ R^D
Choose z_t ∈ {0, 1}^D for t = 1, ..., T according to a Bernoulli distribution P(z_{t,d} = i) = p^{1-i} (1-p)^i with dropout probability p ∈ [0, 1]:
Training: x_t := x_t ⊙ z_t / (1 - p) for t = 1, ..., T.
Recognition: x_t := x_t for t = 1, ..., T.
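A small NumPy sketch of this "inverted" dropout scheme: each component is zeroed with probability p and the survivors are rescaled by 1/(1-p), so no rescaling is needed at recognition time; p = 0.5 and the input are toy values.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training: x := x * z / (1 - p), where z is a {0,1} mask with P(z = 0) = p."""
    z = rng.random(x.shape) >= p          # keep mask, P(keep) = 1 - p
    return x * z / (1.0 - p)

def dropout_recognition(x):
    """Recognition: the input is passed through unchanged."""
    return x

rng = np.random.default_rng(0)
x = np.ones(8)
print(dropout_train(x, p=0.5, rng=rng))
print(dropout_recognition(x))
```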

Regularization (II) 63 / 72
L_p norm: ‖θ‖_p = ( Σ_j |θ_j|^p )^{1/p}
Training criterion regularization: F_p(θ) = F(θ) + λ ‖θ‖_p with scalar λ.
Smoothes the training criterion. Pushes the free parameter weights closer to zero.
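A tiny NumPy sketch of the regularized criterion; the criterion value, parameters, and λ are made-up toy numbers.

```python
import numpy as np

def regularized_criterion(F_theta, theta, lam, p=2):
    """F_p(theta) = F(theta) + lambda * ||theta||_p."""
    return F_theta + lam * np.sum(np.abs(theta) ** p) ** (1.0 / p)

print(regularized_criterion(0.42, np.array([0.5, -1.0, 2.0]), lam=0.01))
```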

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 64 / 72

Attention-based End-to-End Architecture 65 / 72

Attention model 66 / 72

Formal Definition: Content Focus 67 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Outputs: h_m, m = 1, ..., M
Scorer: ε_{m,t} = tanh(V_ε x_t + b_ε) for t = 1, ..., T, m = 1, ..., M
Generator: α_{m,t} = σ(w_α ε_{m,t}) / Σ_{τ=1}^{T} σ(w_α ε_{m,τ}) for t = 1, ..., T, m = 1, ..., M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} x_t for m = 1, ..., M
Output: h_m = σ(W_h g_m + b_h) for m = 1, ..., M
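A minimal NumPy sketch of the content-based attention above for a single output position m: score each frame, normalize the scores to weights α_t, and form the glimpse as the weighted sum of frames. The parameter names and toy dimensions are assumptions for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def content_attention(x, V, b_eps, w_alpha, W_h, b_h):
    """Content focus for one output position m."""
    eps = np.tanh(V @ x.T + b_eps[:, None])          # scorer, one column per frame
    scores = sigmoid(w_alpha @ eps)                  # unnormalized weights, shape (T,)
    alpha = scores / scores.sum()                    # generator: normalize over frames
    g = alpha @ x                                    # glimpse: weighted sum of frames
    return sigmoid(W_h @ g + b_h), alpha             # output h_m and attention weights

# Toy example: T = 6 frames of dimension D = 4, scorer dimension 3, output dimension 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
h, alpha = content_attention(x, rng.standard_normal((3, 4)), np.zeros(3),
                             rng.standard_normal(3), rng.standard_normal((2, 4)), np.zeros(2))
print(h, alpha.sum())   # alpha sums to 1
```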

Formal Definition: Recurrent Attention 68 / 72
Scorer: ε_{m,t} = tanh(W_ε x_t + R_ε s_{m-1} + U_ε (F_ε * α_{m-1}) + b_ε)
Generator: α_{m,t} = σ(w_α ε_{m,t}) / Σ_{τ=1}^{T} σ(w_α ε_{m,τ}) for t = 1, ..., T, m = 1, ..., M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} x_t for m = 1, ..., M
GRU state: s_m = GRU(g_m, h_m, s_{m-1}) for m = 1, ..., M
Output: h_m = σ(W_h g_m + R_h s_{m-1} + b_h) for m = 1, ..., M

End-to-End Performance 69 / 72
Table: TIMIT, WER [%]
Model            dev    eval
HMM              13.9   16.7
End-to-end       15.8   17.6
RNN Transducer   n/a    17.7

References

[1] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069

[2] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724-1734. [Online]. Available: http://www.aclweb.org/anthology/d14-1179

[3] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555

[4] R. Józefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in ICML, ser. JMLR Proceedings, vol. 37. JMLR.org, 2015, pp. 2342-2350.

[5] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in ASRU. IEEE, 2013, pp. 273-278.

[6] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH. ISCA, 2014, pp. 338-342.

[7] A.-R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop. IEEE, December 2015, pp. 78-83. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=259236