Lecture 14. Advanced Neural Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom


Lecture 14: Advanced Neural Networks
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom
Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{picheny,bhuvana,stanchen,nussbaum}@us.ibm.com
27th April 2016

Variants of Neural Network Architectures 2 / 72
Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) (unidirectional and bidirectional), Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Constraints and Regularization, Attention model.

Training 3 / 72
Observations and labels (x_n, a_n) ∈ R^D × A for n = 1, ..., N.
Training criteria:
Cross-entropy: F_CE(θ) = -1/N Σ_{n=1}^{N} log P(a_n | x_n, θ)
Sequence loss: F_L(θ) = 1/N Σ_{n=1}^{N} Σ_{ω} Σ_{a_1^{T_n} ∈ ω} P(a_1^{T_n} | x_1^{T_n}, θ) L(ω, ω_n), with loss L
Optimization: θ̂ = argmin_θ {F(θ)}
θ, θ̂: free parameters of the model (NN, GMM).
ω, ω_n: word sequences.
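As a concrete illustration, here is a minimal NumPy sketch of the frame-level cross-entropy criterion F_CE only (the sequence-level criterion F_L needs lattices or a full-sum over state sequences and is omitted); the posteriors and labels are made-up toy values.

```python
import numpy as np

def cross_entropy_criterion(posteriors, labels):
    """F_CE(theta) = -1/N * sum_n log P(a_n | x_n, theta).

    posteriors: (N, A) array of P(a | x_n, theta), rows sum to 1.
    labels:     (N,) array of target state indices a_n.
    """
    N = posteriors.shape[0]
    return -np.mean(np.log(posteriors[np.arange(N), labels] + 1e-12))

# Toy example: 3 frames, 4 HMM states.
posteriors = np.array([[0.7, 0.1, 0.1, 0.1],
                       [0.2, 0.6, 0.1, 0.1],
                       [0.1, 0.1, 0.1, 0.7]])
labels = np.array([0, 1, 3])
print(cross_entropy_criterion(posteriors, labels))
```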

Recap: Gaussian Mixture Model 4 / 72
Recap Gaussian Mixture Model:
P(x_1^T | ω) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1})
ω: word sequence
x_1^T := x_1, ..., x_T: feature sequence
a_1^T := a_1, ..., a_T: HMM state sequence
Emission probability P(x | a) ~ N(μ_a, Σ_a) Gaussian.
Replace it with a neural network → hybrid model.
Use a neural network for feature extraction → bottleneck features.

Hybrid Model 5 / 72
Gaussian Mixture Model: P(x_1^T | ω) = Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1}), with emission P(x_t | a_t) and transition P(a_t | a_{t-1}).
Training: the neural network usually models the posterior P(a | x), while the HMM needs the emission P(x | a).
Recognition: use it as a hybrid model for speech recognition:
P(a | x) / P(a) = P(x, a) / (P(x) P(a)) = P(x | a) / P(x)
P(x | a) / P(x) and P(x | a) are proportional, since P(x) does not depend on a.
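A minimal sketch of the hybrid trick described above, with invented names and toy values: the network posterior P(a|x) is divided by the state prior P(a), usually in the log domain, to obtain a quantity proportional to the emission likelihood P(x|a).

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """log [P(a|x) / P(a)] = log P(x|a) - log P(x), proportional to log P(x|a)."""
    return log_posteriors - log_priors

# Toy example: posteriors over 4 states for one frame, priors e.g. from state counts.
posteriors = np.array([0.7, 0.1, 0.1, 0.1])
priors = np.array([0.4, 0.3, 0.2, 0.1])
print(scaled_log_likelihoods(np.log(posteriors), np.log(priors)))
```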

Hybrid Model and Bayes Decision Rule 6 / 72
ω̂ = argmax_ω { P(ω) P(x_1^T | ω) }
  = argmax_ω P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} [ P(x_t | a_t) / P(x_t) ] P(a_t | a_{t-1})
  = argmax_ω [ P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1}) ] / Π_{t=1}^{T} P(x_t)
  = argmax_ω P(ω) Σ_{a_1^T ∈ ω} Π_{t=1}^{T} P(x_t | a_t) P(a_t | a_{t-1})

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 7 / 72

Recap: Deep Neural Network (DNN) 8 / 72
A feed-forward network: it consists of an input layer, multiple hidden layers, and an output layer. Each hidden and output layer consists of nodes.

Recap: Deep Neural Network (DNN) 9 / 72
Free parameters: weights W and biases b. The output of a layer is the input to the next layer. Each node applies a linear transformation followed by a non-linear activation to its input. The output layer relates the output of the last hidden layer to the target states.

Neural Network Layer 10 / 72
Number of nodes: n_l in layer l.
Input from the previous layer: y^(l-1) ∈ R^{n_{l-1}}
Weight and bias: W^(l) ∈ R^{n_l × n_{l-1}}, b^(l) ∈ R^{n_l}.
Activation: y^(l) = σ(W^(l) y^(l-1) + b^(l)), where the affine part W^(l) y^(l-1) + b^(l) is linear and σ is non-linear.
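A minimal NumPy sketch of one such layer; the weights, dimensions, and the choice of sigmoid are toy assumptions for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def layer_forward(y_prev, W, b, activation=sigmoid):
    """One layer: y^(l) = sigma(W^(l) y^(l-1) + b^(l))."""
    return activation(W @ y_prev + b)

# Toy example: n_{l-1} = 3 inputs, n_l = 2 nodes.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # W^(l) in R^{n_l x n_{l-1}}
b = np.zeros(2)                   # b^(l) in R^{n_l}
y_prev = rng.standard_normal(3)
print(layer_forward(y_prev, W, b))
```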

Deep Neural Network (DNN) 11 / 72

Activation Function Zoo 12 / 72
Sigmoid: σ_sigmoid(y) = 1 / (1 + exp(-y))
Hyperbolic tangent: σ_tanh(y) = tanh(y) = 2 σ_sigmoid(2y) - 1
Rectified Linear Unit (ReLU): σ_relu(y) = y if y > 0, else 0

Activation Function Zoo 13 / 72
Parametric ReLU (PReLU): σ_prelu(y) = y if y > 0, else a·y
Exponential Linear Unit (ELU): σ_elu(y) = y if y > 0, else a·(exp(y) - 1)
Maxout: σ_maxout(y_1, ..., y_I) = max_i { W_i y^(l-1) + b_i }
Softmax: σ_softmax(y) = ( exp(y_1)/Z(y), ..., exp(y_I)/Z(y) ) with Z(y) = Σ_j exp(y_j)
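A small NumPy sketch of these activations (maxout is omitted since it needs several affine pieces); the default slopes a and the test vector are arbitrary toy choices.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def tanh(y):
    return 2.0 * sigmoid(2.0 * y) - 1.0      # identical to np.tanh(y)

def relu(y):
    return np.where(y > 0, y, 0.0)

def prelu(y, a=0.25):
    return np.where(y > 0, y, a * y)

def elu(y, a=1.0):
    return np.where(y > 0, y, a * (np.exp(y) - 1.0))

def softmax(y):
    e = np.exp(y - np.max(y))                # shift for numerical stability
    return e / np.sum(e)

y = np.array([-2.0, -0.5, 0.0, 1.5])
for f in (sigmoid, tanh, relu, prelu, elu, softmax):
    print(f.__name__, f(y))
```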

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 14 / 72

Multilingual Bottleneck 15 / 72

Multilingual Bottleneck 16 / 72
Encoder-decoder architecture: a DNN with a bottleneck. Forces a low-dimensional representation of speech across multiple languages. Several languages are presented to the network in random order.
Training: labels from the different languages.
Recognition: the network is cut off after the bottleneck.

Why Multilingual Bottlenecks? 17 / 72
Train multilingual bottleneck features with lots of data. Later, reuse the bottleneck features on different tasks to train a GMM system: no expensive DNN training, but WER gains similar to a DNN.

Multilingual Bottleneck: Performance 18 / 72
WER [%]
Model                       FR     EN     DE     PL
MFCC                        23.6   28.6   23.3   18.1
MLP BN targets              19.3   23.1   19.0   14.5
MLP BN multi                18.7   21.3   17.9   14.0
deep BN targets             17.4   20.3   17.3   13.0
deep BN multi               17.1   19.7   16.4   12.6
 + lang. dep. hidden layer  16.8   19.7   16.2   12.4

More Fancy Models 19 / 72 Convolutional Neural Networks. Recurrent Neural Networks: Long Short-Term Memory (LSTM) RNNs, Gated Recurrent Unit (GRU) RNNs. Unstable Gradient Problem.

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 20 / 72

Convolutional Neural Networks (CNNs) 21 / 72
Convolution (remember signal analysis?): (x_1 * x_2)[k] = Σ_i x_1[k - i] · x_2[i]


Convolutional Neural Networks (CNNs) 23 / 72

CNNs 24 / 72
Consists of multiple feature maps with channels and kernels. Kernels are convolved across the input. Multidimensional input: 1D (frequency), 2D (time-frequency), 3D (time-frequency-?). Neurons are connected to local receptive fields of the input. Weights are shared across multiple receptive fields.

Formal Definition: Convolutional Neural Networks 25 / 72
Free parameters: feature maps (kernels) W_n ∈ R^{C × k} and bias b_n, for n = 1, ..., N; c = 1, ..., C channels; kernel size k ∈ N.
Activation function:
y_{n,i} = σ(W_n * x_i + b_n) = σ( Σ_{c=1}^{C} Σ_{j=i-k}^{i+k} W_{n,c,i-j} x_{c,j} + b_n )
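A minimal NumPy sketch of a 1D convolutional layer over channels; as in most deep-learning toolkits it computes a cross-correlation, which matches the slide's sum up to reversing the kernel index, and the zero padding, ReLU non-linearity, and toy sizes are assumptions for the example.

```python
import numpy as np

def conv1d(x, W, b):
    """y[n, i] = relu( sum_c sum_j W[n, c, j] * x[c, i + j - k] + b[n] )
    x: (C, T) input, W: (N, C, 2k+1) kernels, b: (N,) biases."""
    N, C, width = W.shape
    k = width // 2
    T = x.shape[1]
    xp = np.pad(x, ((0, 0), (k, k)))             # zero-pad the time axis
    y = np.empty((N, T))
    for n in range(N):
        for i in range(T):
            y[n, i] = np.sum(W[n] * xp[:, i:i + width]) + b[n]
    return np.maximum(y, 0.0)                    # ReLU non-linearity

# Toy example: 3 input channels, 40 frames, 5 feature maps, kernel width 2k+1 = 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 40))
W = rng.standard_normal((5, 3, 3)) * 0.1
b = np.zeros(5)
print(conv1d(x, W, b).shape)   # (5, 40)
```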

Pooling 26 / 72
Max-pooling: pool(y_{n,c,i}) = max_{j = i-k, ..., i+k} { y_{n,c,j} }
Average-pooling: average(y_{n,c,i}) = 1/(2k + 1) Σ_{j=i-k}^{i+k} y_{n,c,j}
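A short NumPy sketch of both pooling operations; it uses stride 1 and edge padding so the output keeps the input length, which is an illustrative simplification (real CNNs typically pool with a stride).

```python
import numpy as np

def max_pool(y, k):
    """Max-pooling with window j = i-k, ..., i+k along the last axis."""
    yp = np.pad(y, ((0, 0), (k, k)), mode="edge")
    return np.stack([yp[:, i:i + 2 * k + 1].max(axis=1) for i in range(y.shape[1])], axis=1)

def average_pool(y, k):
    """Average-pooling: mean over the window j = i-k, ..., i+k."""
    yp = np.pad(y, ((0, 0), (k, k)), mode="edge")
    return np.stack([yp[:, i:i + 2 * k + 1].mean(axis=1) for i in range(y.shape[1])], axis=1)

y = np.arange(12.0).reshape(2, 6)
print(max_pool(y, 1))
print(average_pool(y, 1))
```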

CNN vs. DNN: Performance 27 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features. (CE = cross-entropy training, ST = sequence training.)

Table: Broadcast News 50 h, WER [%]
Model      CE     ST
GMM        18.8   n/a
DNN        16.2   14.9
CNN        15.8   13.9
CNN+DNN    15.1   13.2

Table: Broadcast Conversation 2000 h, WER [%]
Model      CE     ST
DNN        11.7   10.3
CNN        12.6   10.4
DNN+CNN    11.3   9.6

VGG 28 / 72
Table: The configurations of our very deep CNNs for LVCSR. In all but the classic convnet, convolutional layers have 3x3 kernels, thus kernel size is omitted. The depth of the networks increases from left to right. The deepest configuration, WDX, has 10 convolutional and 4 fully connected layers. The leftmost column indicates the number of output feature maps in each layer. The optional X means there are four fully connected layers instead of three (output layer included).
[Table not fully recoverable from the transcription: columns Classic [16, 17, 18], VB(X), VC(X), VD(X), WD(X); rows list the conv(in, out) and pooling blocks at 64, 128, 256, and 512 output feature maps, followed by FC 2048 layers, an FC output layer, and a softmax.]

29 / 72
Fig. 1. Multilingual VBX network with the last three layers untied: shared convolutional and pooling layers, with language-dependent FC and softmax layers for KUR, TOK, CEB, KAZ, TEL, LIT. FC stands for Fully Connected layers. Context +/-5.

VGG: Performance 30 / 72
From the paper: even in the monolingual case (3 hours of data), the VBX CNN architecture outperforms both the classical CNN and the baseline DNN.

Switchboard 300:
Model                     WER    # params (M)   # M frames
Classic 512 [17]          13.2   41.2           1200
Classic 256 ReLU (A+S)    13.8   58.7           290
VCX (6 conv) (A+S)        13.1   36.9           290
VDX (8 conv) (A+S)        12.3   38.4           170
WDX (10 conv) (A+S)       12.2   41.3           140
VDX (8 conv) (S)          11.9   38.4           340
WDX (10 conv) (S)         11.8   41.3           320

Table 5. Results on Hub5'00 SWB after training on the 262-hour SWB-1 dataset. We obtain 14.5% relative improvement over our baseline adaptation of the classical CNN and 10.6% relative improvement over [17]. (A+S) means Adadelta + SGD finetuning. (S) means the model was trained from random initialization using SGD. The last column gives the number of frames until convergence.

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 31 / 72

Recurrent Neural Networks (RNNs) 32 / 72 DNNs are deep in layers. RNNs are deep in time (in addition). Shared weights and biases across time steps.

Unfolded RNN 33 / 72

DNN vs. RNN 34 / 72

Formal Definition: RNN 35 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Free parameters:
Input-to-hidden weight: W ∈ R^{n_l × n_{l-1}}
Hidden-to-hidden weight: R ∈ R^{n_l × n_l}
Bias: b ∈ R^{n_l}
Output: iterate for t = 1, ..., T: h_t = σ(W x_t + R h_{t-1} + b)
Compare with the DNN: h_t = σ(W x_t + b)
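A minimal NumPy sketch of the recurrence above; the tanh non-linearity, zero initial state h_0, and toy dimensions are assumptions for the example.

```python
import numpy as np

def rnn_forward(x, W, R, b):
    """Iterate h_t = sigma(W x_t + R h_{t-1} + b) for t = 1, ..., T."""
    T = x.shape[0]
    n = b.shape[0]
    h = np.zeros((T, n))
    h_prev = np.zeros(n)                      # h_0 = 0
    for t in range(T):
        h_prev = np.tanh(W @ x[t] + R @ h_prev + b)
        h[t] = h_prev
    return h

# Toy example: T = 10 frames, D = 4 inputs, n_l = 3 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))
W, R, b = rng.standard_normal((3, 4)), rng.standard_normal((3, 3)) * 0.1, np.zeros(3)
print(rnn_forward(x, W, R, b).shape)   # (10, 3)
```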

BackPropagation Through Time (BPTT) 36 / 72
Chain rule through time: the gradient with respect to h_t collects contributions from all later time steps,
dF(θ)/dh_t = Σ_{τ=t+1}^{T} dF(θ)/dh_τ · dh_τ/dh_t

BackPropagation Through Time (BPTT) 37 / 72 Implementation: Unfold RNN over time through t = 1,..., T. Forward propagate RNN. Backpropagate error through unfolded network. Faster than other optimization methods (e.g. evolutionary search). Difficulty with local optima.

Bidirectional RNN (BRNN) 38 / 72 Forward RNN processes data forward left to right. Backward RNN processes data backward right to left. Output joins the output of forward and backward RNN.

Formal Definition: BRNN 39 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Forward and backward hidden outputs: h^→_t, h^←_t, t = 1, ..., T
Forward and backward free parameters:
Input-to-hidden weights: W^→, W^← ∈ R^{n_l × n_{l-1}}
Hidden-to-hidden weights: R^→, R^← ∈ R^{n_l × n_l}
Biases: b^→, b^← ∈ R^{n_l}
Forward output: iterate for t = 1, ..., T: h^→_t = σ(W^→ x_t + R^→ h^→_{t-1} + b^→)
Backward output: iterate for t = T, ..., 1: h^←_t = σ(W^← x_t + R^← h^←_{t+1} + b^←)
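A short NumPy sketch of the bidirectional variant: one RNN reads the sequence left to right, another right to left, and their outputs are joined per time step (here by concatenation, one common choice); parameters and sizes are toy assumptions.

```python
import numpy as np

def rnn(x, W, R, b):
    h, out = np.zeros(b.shape[0]), []
    for t in range(x.shape[0]):
        h = np.tanh(W @ x[t] + R @ h + b)
        out.append(h)
    return np.array(out)

def brnn_forward(x, fwd, bwd):
    """Forward RNN reads t = 1..T, backward RNN reads t = T..1; outputs are joined."""
    h_fwd = rnn(x, *fwd)
    h_bwd = rnn(x[::-1], *bwd)[::-1]          # reverse back to align time steps
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(1)
x = rng.standard_normal((10, 4))
make = lambda: (rng.standard_normal((3, 4)), rng.standard_normal((3, 3)) * 0.1, np.zeros(3))
print(brnn_forward(x, make(), make()).shape)   # (10, 6)
```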

RNN using Memory Cells 40 / 72
Equip an RNN with a memory cell that can store information for a long time. Introduce gating units that control: activations going in, activations going out, saving activations, and forgetting activations.

Long Short-Term Memory RNN 41 / 72

Formal Definition: LSTM 42 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)                  (block input)
i_t = σ(W_i x_t + R_i h_{t-1} + P_i ⊙ c_{t-1} + b_i)  (input gate)
f_t = σ(W_f x_t + R_f h_{t-1} + P_f ⊙ c_{t-1} + b_f)  (forget gate)
c_t = i_t ⊙ z_t + f_t ⊙ c_{t-1}                       (cell state)
o_t = σ(W_o x_t + R_o h_{t-1} + P_o ⊙ c_t + b_o)      (output gate)
h_t = o_t ⊙ tanh(c_t)                                 (block output)
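A minimal NumPy sketch of one LSTM time step with peephole connections, following the equations above; tanh is used for the block input (the slide writes σ generically), the peepholes P are element-wise vectors, and all parameter values and sizes are toy assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step with peephole connections."""
    z = np.tanh(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])                       # block input
    i = sigmoid(p["Wi"] @ x_t + p["Ri"] @ h_prev + p["Pi"] * c_prev + p["bi"])    # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Rf"] @ h_prev + p["Pf"] * c_prev + p["bf"])    # forget gate
    c = i * z + f * c_prev                                                        # cell state
    o = sigmoid(p["Wo"] @ x_t + p["Ro"] @ h_prev + p["Po"] * c + p["bo"])         # output gate
    h = o * np.tanh(c)                                                            # block output
    return h, c

# Toy example: D = 4 inputs, n = 3 cells.
rng = np.random.default_rng(0)
D, n = 4, 3
p = {k: rng.standard_normal((n, D)) for k in ("Wz", "Wi", "Wf", "Wo")}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ("Rz", "Ri", "Rf", "Ro")})
p.update({k: rng.standard_normal(n) for k in ("Pi", "Pf", "Po")})
p.update({k: np.zeros(n) for k in ("bz", "bi", "bf", "bo")})
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((5, D)):
    h, c = lstm_step(x_t, h, c, p)
print(h)
```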

LSTM: Too many connections? 43 / 72
Some of the connections in the LSTM are not necessary [1]: peepholes do not seem to be necessary, and the input and forget gates can be coupled. Simplified LSTM → Gated Recurrent Unit (GRU).

Gated Recurrent Unit (GRU) 44 / 72 References: [2, 3, 4]

Formal Definition: GRU 45 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
r_t = σ(W_r x_t + R_r h_{t-1} + b_r)           (reset gate)
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)           (update gate)
h̃_t = σ(W_h x_t + R_h (r_t ⊙ h_{t-1}) + b_h)   (candidate gate)
h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t          (output gate)
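A minimal NumPy sketch of one GRU time step following these equations; tanh is used for the candidate (the slide writes σ generically), and parameter values and sizes are toy assumptions.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def gru_step(x_t, h_prev, p):
    """One GRU time step."""
    r = sigmoid(p["Wr"] @ x_t + p["Rr"] @ h_prev + p["br"])              # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])              # update gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Rh"] @ (r * h_prev) + p["bh"])   # candidate
    return z * h_prev + (1.0 - z) * h_cand                               # output

rng = np.random.default_rng(0)
D, n = 4, 3
p = {k: rng.standard_normal((n, D)) for k in ("Wr", "Wz", "Wh")}
p.update({k: rng.standard_normal((n, n)) * 0.1 for k in ("Rr", "Rz", "Rh")})
p.update({k: np.zeros(n) for k in ("br", "bz", "bh")})
h = np.zeros(n)
for x_t in rng.standard_normal((5, D)):
    h = gru_step(x_t, h, p)
print(h)
```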

CNN vs. DNN vs. RNN: Performance 46 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast News 50 h, WER [%]
Model            CE     ST
GMM              18.8   n/a
DNN              16.2   14.9
CNN              15.8   13.9
BGRU (fMLLR)     14.9   n/a
BLSTM (fMLLR)    14.8   n/a
BGRU (Log-Mel)   14.1   n/a

DNN vs. CNN vs. RNN: Performance 47 / 72
GMM and DNN use fMLLR features. The CNN uses log-mel features, which have local structure, as opposed to speaker-normalized features.

Table: Broadcast Conversation 2000 h, WER [%]
Model          CE     ST
DNN            11.7   10.3
CNN            12.6   10.4
RNN            11.5   9.9
DNN+CNN        11.3   9.6
RNN+CNN        11.2   9.4
DNN+RNN+CNN    11.1   9.4

RNN Black Magic 48 / 72
Unrolling the RNN in training:
- whole utterance [5],
- vs. truncated BPTT with carryover [6]: split the utterance into subsequences of e.g. 21 frames, carry over the last cell state from the previous subsequence to the new subsequence, and compose a minibatch from subsequences,
- vs. truncated BPTT with overlap: split the utterance into subsequences of e.g. 21 frames, overlap subsequences by 10, and compose a minibatch of subsequences from different utterances.
Gradient clipping of the LSTM cell.

RNN Black Magic 49 / 72
Recognition:
- unrolling the RNN over the whole utterance,
- vs. unrolling over subsequences: split the utterance into subsequences of e.g. 21 frames and carry over the last cell state from the previous subsequence to the new subsequence,
- vs. unrolling over a spectral window [7]: for each frame, unroll over the spectral window; the last RNN layer only returns the center/last frame.

Highway Network 50 / 72 References: [2, 3, 4]

Formal Definition: Highway Network 51 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
z_t = σ(W_z x_t + b_z)               (highway gate)
h̃_t = σ(W_h x_t + b_h)               (candidate gate)
h_t = z_t ⊙ x_t + (1 - z_t) ⊙ h̃_t    (output gate)
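A minimal NumPy sketch of one highway layer; note that because the gate mixes the input x with the candidate, the layer must preserve the input dimension. The tanh candidate non-linearity and toy parameters are assumptions; in practice the gate bias is often initialized negative so the layer initially carries the input through.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def highway_layer(x, Wz, bz, Wh, bh):
    """h = z * x + (1 - z) * h_cand: z controls how much of the input is carried through."""
    z = sigmoid(Wz @ x + bz)            # highway (carry) gate
    h_cand = np.tanh(Wh @ x + bh)       # candidate transformation
    return z * x + (1.0 - z) * h_cand

# Toy example with input dimension D = 5.
rng = np.random.default_rng(0)
D = 5
x = rng.standard_normal(D)
print(highway_layer(x, rng.standard_normal((D, D)), np.zeros(D),
                    rng.standard_normal((D, D)), np.zeros(D)))
```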

Formal Definition: Highway GRU 52 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Hidden outputs: h_t, t = 1, ..., T
Iterate for t = 1, ..., T:
r_t = σ(W_r x_t + R_r h_{t-1} + b_r)                              (reset gate)
z_t = σ(W_z x_t + R_z h_{t-1} + b_z)                              (update gate)
d_t = σ(W_d x_t + R_d h_{t-1} + b_d)                              (highway gate)
h̃_t = σ(W_h x_t + R_h (r_t ⊙ h_{t-1}) + b_h)                      (candidate gate)
h_t = d_t ⊙ x_t + (1 - d_t) ⊙ (z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t)   (output gate)

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 53 / 72

Unstable Gradient Problem 54 / 72
Happens in deep as well as in recurrent neural networks. If the gradient becomes very small → vanishing gradient. If the gradient becomes very large → exploding gradient.
Simplified neural network (the w_i are just scalars):
F(w_1, ..., w_N) = L(σ(y_N)) = L(σ(w_N σ(y_{N-1}))) = L(σ(w_N σ(w_{N-1} ... σ(w_1 x_t) ...)))

Unstable Gradient Problem, Constraints and Regularization 55 / 72
Gradient:
dF(w_1, ..., w_N)/dw_1 = dL/dσ(y_N) · dσ(w_N σ(w_{N-1} ... σ(w_1 x_t) ...))/dw_1
= dL/dσ(y_N) · σ'(y_N) w_N · σ'(y_{N-1}) w_{N-1} · ... · σ'(y_1) x_t
If |w_i σ'(y_i)| < 1 for i = 2, ..., N → the gradient vanishes.
If |w_i σ'(y_i)| >> 1 for i = 2, ..., N → the gradient explodes.

Solution: Unstable Gradient Problem 56 / 72
Gradient clipping. Weight constraints. Let the network save activations over layers/time steps:
y_new = α y_previous + (1 - α) y_common
→ Long Short-Term Memory RNN, Highway Neural Network (>100 layers)

Gradient Clipping 57 / 72
Keeps the gradient in range. One approach to deal with the exploding gradient problem. Ensure the gradient is in the range [-c, c] for a constant c:
clip(dF/dθ, c) = min(c, max(-c, dF/dθ))
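A one-function NumPy sketch of this element-wise clipping; the gradient values and the threshold c = 1.0 are toy choices.

```python
import numpy as np

def clip_gradient(grad, c):
    """Element-wise clipping of the gradient to the range [-c, c]."""
    return np.minimum(c, np.maximum(-c, grad))

grad = np.array([-12.0, 0.3, 4.7, 0.0])
print(clip_gradient(grad, 1.0))   # [-1.   0.3  1.   0. ]
```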

Constraints (I) 58 / 72
Keep the weights in range (e.g. for ReLU, Maxout). Ignored for gradient backpropagation: the constraints are enforced after the gradient update.

Constraints (II) 59 / 72
Max-norm: force ‖W‖_2 ≤ c for a constant c:
W_max = W · min(‖W‖_2, c) / ‖W‖_2
Unity-norm: force ‖W‖_2 = 1:
W_unity = W / ‖W‖_2
Positivity-norm: force W ≥ 0:
W_+ = max(0, W)
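A small NumPy sketch of these projections, applied after the gradient update as described above; the Frobenius norm of the whole matrix is used here for simplicity (in practice the norm is often constrained per row or column), and the test matrix and threshold are toy values.

```python
import numpy as np

def max_norm(W, c):
    """Rescale W so that ||W|| <= c (Frobenius norm used here)."""
    norm = np.linalg.norm(W)
    return W if norm <= c else W * (c / norm)

def unity_norm(W):
    """Rescale W so that ||W|| = 1."""
    return W / np.linalg.norm(W)

def positivity(W):
    """Project W onto the non-negative orthant: W_+ = max(0, W)."""
    return np.maximum(0.0, W)

W = np.array([[3.0, -4.0], [0.0, 12.0]])
print(max_norm(W, 5.0))
print(unity_norm(W))
print(positivity(W))
```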

Regularization: Dropout 60 / 72
Dropout prevents getting stuck in a local optimum and avoids overfitting.


Regularization: Dropout 62 / 72
Input vector sequence: x_t ∈ R^D
Choose z_t ∈ {0, 1}^D for t = 1, ..., T according to a Bernoulli distribution P(z_{t,d} = i) = p^{1-i} (1-p)^i with dropout probability p ∈ [0, 1]:
Training: x_t := x_t ⊙ z_t / (1 - p) for t = 1, ..., T.
Recognition: x_t := x_t for t = 1, ..., T.
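A small NumPy sketch of this "inverted" dropout scheme: each component is zeroed with probability p and the survivors are rescaled by 1/(1-p), so no rescaling is needed at recognition time; p = 0.5 and the input are toy values.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Training: x := x * z / (1 - p), where z is a {0,1} mask with P(z = 0) = p."""
    z = rng.random(x.shape) >= p          # keep mask, P(keep) = 1 - p
    return x * z / (1.0 - p)

def dropout_recognition(x):
    """Recognition: the input is passed through unchanged."""
    return x

rng = np.random.default_rng(0)
x = np.ones(8)
print(dropout_train(x, p=0.5, rng=rng))
print(dropout_recognition(x))
```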

Regularization (II) 63 / 72
L_p norm: ‖θ‖_p = ( Σ_j |θ_j|^p )^{1/p}
Training criterion regularization: F_p(θ) = F(θ) + λ ‖θ‖_p with scalar λ.
Smoothes the training criterion. Pushes the free parameter weights closer to zero.
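A tiny NumPy sketch of the regularized criterion; the criterion value, parameters, and λ are made-up toy numbers.

```python
import numpy as np

def regularized_criterion(F_theta, theta, lam, p=2):
    """F_p(theta) = F(theta) + lambda * ||theta||_p."""
    return F_theta + lam * np.sum(np.abs(theta) ** p) ** (1.0 / p)

print(regularized_criterion(0.42, np.array([0.5, -1.0, 2.0]), lam=0.01))
```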

Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 64 / 72

Attention-based End-to-End Architecture 65 / 72

Attention model 66 / 72

Formal Definition: Content Focus 67 / 72
Input vector sequence: x_t ∈ R^D, t = 1, ..., T
Outputs: h_m, m = 1, ..., M
Scorer: ε_{m,t} = tanh(V_ε x_t + b_ε) for t = 1, ..., T, m = 1, ..., M
Generator: α_{m,t} = σ(w_α ε_{m,t}) / Σ_{τ=1}^{T} σ(w_α ε_{m,τ}) for t = 1, ..., T, m = 1, ..., M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} x_t for m = 1, ..., M
Output: h_m = σ(W_h g_m + b_h) for m = 1, ..., M
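A minimal NumPy sketch of the content-based attention above for a single output position m: score each frame, normalize the scores to weights α_t, and form the glimpse as the weighted sum of frames. The parameter names and toy dimensions are assumptions for the example.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def content_attention(x, V, b_eps, w_alpha, W_h, b_h):
    """Content focus for one output position m."""
    eps = np.tanh(V @ x.T + b_eps[:, None])          # scorer, one column per frame
    scores = sigmoid(w_alpha @ eps)                  # unnormalized weights, shape (T,)
    alpha = scores / scores.sum()                    # generator: normalize over frames
    g = alpha @ x                                    # glimpse: weighted sum of frames
    return sigmoid(W_h @ g + b_h), alpha             # output h_m and attention weights

# Toy example: T = 6 frames of dimension D = 4, scorer dimension 3, output dimension 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
h, alpha = content_attention(x, rng.standard_normal((3, 4)), np.zeros(3),
                             rng.standard_normal(3), rng.standard_normal((2, 4)), np.zeros(2))
print(h, alpha.sum())   # alpha sums to 1
```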

Formal Definition: Recurrent Attention 68 / 72
Scorer: ε_{m,t} = tanh(W_ε x_t + R_ε s_{m-1} + U_ε (F_ε * α_{m-1}) + b_ε)
Generator: α_{m,t} = σ(w_α ε_{m,t}) / Σ_{τ=1}^{T} σ(w_α ε_{m,τ}) for t = 1, ..., T, m = 1, ..., M
Glimpse: g_m = Σ_{t=1}^{T} α_{m,t} x_t for m = 1, ..., M
GRU state: s_m = GRU(g_m, h_m, s_{m-1}) for m = 1, ..., M
Output: h_m = σ(W_h g_m + R_h s_{m-1} + b_h) for m = 1, ..., M

End-to-End Performance 69 / 72
Table: TIMIT, WER [%]
Model            dev    eval
HMM              13.9   16.7
End-to-end       15.8   17.6
RNN Transducer   n/a    17.7

References

[1] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, vol. abs/1503.04069, 2015. [Online]. Available: http://arxiv.org/abs/1503.04069

[2] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724-1734. [Online]. Available: http://www.aclweb.org/anthology/d14-1179

[3] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555

[4] R. Józefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in ICML, ser. JMLR Proceedings, vol. 37. JMLR.org, 2015, pp. 2342-2350.

[5] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in ASRU. IEEE, 2013, pp. 273-278.

[6] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH. ISCA, 2014, pp. 338-342.

[7] A.-R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop. IEEE, December 2015, pp. 78-83. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=259236