Lecture 14. Advanced Neural Networks. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom


1 Lecture 14 Advanced Neural Networks Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA 27th April 2016

2 Variants of Neural Network Architectures 2 / 72 Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN): unidirectional, bidirectional, Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), Constraints and Regularization, Attention model.

3 Training 3 / 72 Observations and labels $(x_n, a_n) \in \mathbb{R}^D \times A$ for $n = 1, \dots, N$. Training criteria: cross-entropy $F_{\mathrm{CE}}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log P(a_n \mid x_n, \theta)$ and expected loss $F_{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{\omega} \sum_{a_1^{T_n} \in \omega} P(a_1^{T_n} \mid x_1^{T_n}, \theta)\, L(\omega, \omega_n)$ with loss $L$. Optimization: $\hat{\theta} = \arg\min_{\theta} \{F(\theta)\}$. $\theta, \hat{\theta}$: free parameters of the model (NN, GMM). $\omega, \omega_n$: word sequences.

4 Recap: Gaussian Mixture Model 4 / 72 Recap of the Gaussian mixture model: $P(\omega \mid x_1^T) = \sum_{a_1^T \in \omega} \prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1})$. $\omega$: word sequence. $x_1^T := x_1, \dots, x_T$: feature sequence. $a_1^T := a_1, \dots, a_T$: HMM state sequence. Emission probability $P(x \mid a) \sim \mathcal{N}(\mu_a, \Sigma_a)$ Gaussian. Replace it with a neural network: hybrid model. Or use the neural network for feature extraction: bottleneck features.

5 Hybrid Model 5 / 72 Gaussian mixture model: $P(\omega \mid x_1^T) = \sum_{a_1^T \in \omega} \prod_{t=1}^{T} \underbrace{P(x_t \mid a_t)}_{\text{emission}}\, \underbrace{P(a_t \mid a_{t-1})}_{\text{transition}}$. Training: the neural network models the state posterior $P(a \mid x)$. Recognition: use it as a hybrid model for speech recognition: $\frac{P(a \mid x)}{P(a)} = \frac{P(x, a)}{P(x) P(a)} = \frac{P(x \mid a)}{P(x)} \propto P(x \mid a)$, i.e. the scaled posterior $P(a \mid x)/P(a)$ and the emission likelihood $P(x \mid a)$ are proportional.

6 Hybrid Model and Bayes Decision Rule 6 / 72 $\hat{\omega} = \arg\max_{\omega} \{ P(\omega) P(x_1^T \mid \omega) \} = \arg\max_{\omega} P(\omega) \sum_{a_1^T \in \omega} \prod_{t=1}^{T} \frac{P(x_t \mid a_t)}{P(x_t)}\, P(a_t \mid a_{t-1}) = \arg\max_{\omega} P(\omega) \frac{\sum_{a_1^T \in \omega} \prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1})}{\prod_{t=1}^{T} P(x_t)} = \arg\max_{\omega} P(\omega) \sum_{a_1^T \in \omega} \prod_{t=1}^{T} P(x_t \mid a_t)\, P(a_t \mid a_{t-1})$, since $\prod_t P(x_t)$ does not depend on $\omega$.

7 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 7 / 72

8 Recap: Deep Neural Network (DNN) First: feed-forward networks. Consists of an input layer, multiple hidden layers, and an output layer. Each hidden and output layer consists of nodes. 8 / 72

9 Recap: Deep Neural Network (DNN) 9 / 72 Free parameters: weights W and biases b. The output of a layer is the input to the next layer. Each node performs a linear transformation followed by a non-linear activation on its input. The output layer relates the output of the last hidden layer to the target states.

10 Neural Network Layer 10 / 72 Number of nodes: $n_l$ in layer $l$. Input from the previous layer: $y^{(l-1)} \in \mathbb{R}^{n_{l-1}}$. Weight and bias: $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$, $b^{(l)} \in \mathbb{R}^{n_l}$. Activation: $y^{(l)} = \sigma(W^{(l)} y^{(l-1)} + b^{(l)})$, a linear map followed by a non-linear activation $\sigma$.
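
A minimal numpy sketch of one such layer (not from the lecture; names, shapes, and the (n_cur, n_prev) weight layout are illustrative assumptions so that `W @ y_prev` is well defined):

```python
import numpy as np

def dense_layer(y_prev, W, b, sigma):
    """One feed-forward layer: a linear map followed by a non-linear activation.

    y_prev : (n_prev,) output of the previous layer
    W      : (n_cur, n_prev) weight matrix (illustrative layout)
    b      : (n_cur,) bias
    sigma  : element-wise activation function
    """
    return sigma(W @ y_prev + b)

# Example: a 3-node layer on a 4-dimensional input with a sigmoid activation.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
y = dense_layer(rng.normal(size=4), W, b, lambda v: 1.0 / (1.0 + np.exp(-v)))
```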

11 Deep Neural Network (DNN) 11 / 72

12 Activation Function Zoo 12 / 72 Sigmoid: $\sigma_{\mathrm{sigmoid}}(y) = \frac{1}{1 + \exp(-y)}$. Hyperbolic tangent: $\sigma_{\tanh}(y) = \tanh(y) = 2\,\sigma_{\mathrm{sigmoid}}(2y) - 1$. REctified Linear Unit (RELU): $\sigma_{\mathrm{relu}}(y) = y$ for $y > 0$, and $0$ for $y \le 0$.

13 Activation Function Zoo Parametric RELU (PRELU): $\sigma_{\mathrm{prelu}}(y) = y$ for $y > 0$, and $a\,y$ for $y \le 0$. Exponential Linear Unit (ELU): $\sigma_{\mathrm{elu}}(y) = y$ for $y > 0$, and $a\,(\exp(y) - 1)$ for $y \le 0$. Maxout: $\sigma_{\mathrm{maxout}}(y_1, \dots, y_I) = \max_i \{ W_1 y^{(l-1)} + b_1, \dots, W_I y^{(l-1)} + b_I \}$. Softmax: $\sigma_{\mathrm{softmax}}(y) = \left( \frac{\exp(y_1)}{Z(y)}, \dots, \frac{\exp(y_I)}{Z(y)} \right)^{T}$ with $Z(y) = \sum_j \exp(y_j)$. 13 / 72
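
The activation functions above translate directly into numpy; this is a small illustrative sketch (the slope `a` and the maxout parameter lists are assumptions, not values from the lecture):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def tanh(y):
    return np.tanh(y)                       # = 2 * sigmoid(2 * y) - 1

def relu(y):
    return np.maximum(0.0, y)

def prelu(y, a=0.1):
    return np.where(y > 0, y, a * y)

def elu(y, a=1.0):
    return np.where(y > 0, y, a * (np.exp(y) - 1.0))

def maxout(y_prev, Ws, bs):
    # element-wise max over I linear projections of the previous layer
    return np.max(np.stack([W @ y_prev + b for W, b in zip(Ws, bs)]), axis=0)

def softmax(y):
    e = np.exp(y - np.max(y))               # shift for numerical stability
    return e / e.sum()                      # Z(y) = sum_j exp(y_j)
```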

14 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 14 / 72

15 Multilingual Bottleneck 15 / 72

16 Multilingual Bottleneck 16 / 72 Encoder-decoder architecture: a DNN with a bottleneck. Forces a low-dimensional representation of speech across multiple languages. Several languages are presented to the network in random order. Training: labels from the different languages. Recognition: the network is cut off after the bottleneck.

17 Why Multilingual Bottlenecks? 17 / 72 Train multilingual bottleneck features on lots of data. Future use: bottleneck features on different tasks to train a GMM system. No expensive DNN training, but WER gains similar to a DNN.

18 Multilingual Bottleneck: Performance 18 / 72 Table: WER [%] on FR, EN, DE, and PL for MFCC, MLP BN targets, MLP BN multi, deep BN targets, deep BN multi, and + language-dependent hidden layer.

19 More Fancy Models 19 / 72 Convolutional Neural Networks. Recurrent Neural Networks: Long Short-Term Memory (LSTM) RNNs, Gated Recurrent Unit (GRU) RNNs. Unstable Gradient Problem.

20 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 20 / 72

21 Convolutional Neural Networks (CNNs) 21 / 72 Convolution (remember signal analysis?): $(x_1 * x_2)[k] = \sum_i x_1[k - i]\, x_2[i]$

22 Convolutional Neural Networks (CNNs) 22 / 72 Convolution (remember signal analysis?): $(x_1 * x_2)[k] = \sum_i x_1[k - i]\, x_2[i]$
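
As a concrete illustration of the formula, a direct (and deliberately naive) implementation of the discrete convolution, checked against numpy's built-in np.convolve; the example signals are made up:

```python
import numpy as np

def conv1d(x1, x2):
    """Full discrete convolution: (x1 * x2)[k] = sum_i x1[k - i] * x2[i]."""
    K = len(x1) + len(x2) - 1
    out = np.zeros(K)
    for k in range(K):
        for i in range(len(x2)):
            if 0 <= k - i < len(x1):
                out[k] += x1[k - i] * x2[i]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.25, 0.5, 0.25])
assert np.allclose(conv1d(x, kernel), np.convolve(x, kernel))
```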

23 Convolutional Neural Networks (CNNs) 23 / 72

24 CNNs 24 / 72 Consist of multiple feature maps with channels and kernels. Kernels are convolved across the input. Multidimensional input: 1D (frequency), 2D (time-frequency), 3D (time-frequency-?). Neurons are connected to local receptive fields of the input. Weights are shared across multiple receptive fields.

25 Formal Definition: Convolutional Neural Networks 25 / 72 Free parameters: feature maps $W_n \in \mathbb{R}^{C \times k}$ and biases $b_n \in \mathbb{R}^{k}$ for $n = 1, \dots, N$, with $c = 1, \dots, C$ channels and kernel size $k \in \mathbb{N}$. Activation function: $y_{n,i} = \sigma(W_{n,i} * x_i + b_n) = \sigma\!\left( \sum_{c=1}^{C} \sum_{j=i-k}^{i+k} W_{n,c,i-j}\, x_{c,j} + b_n \right)$

26 Pooling 26 / 72 Max-pooling: $\mathrm{pool}(y_{n,c,i}) = \max_{j=i-k,\dots,i+k} \{ y_{n,c,j} \}$. Average-pooling: $\mathrm{average}(y_{n,c,i}) = \frac{1}{2k+1} \sum_{j=i-k}^{i+k} y_{n,c,j}$
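
A small numpy sketch of both pooling operations over a 1D sequence (truncating the window at the sequence edges is an assumption; the slide does not specify boundary handling):

```python
import numpy as np

def max_pool(y, k):
    """Maximum over a window of +/- k positions around each index."""
    T = len(y)
    return np.array([y[max(0, i - k):min(T, i + k + 1)].max() for i in range(T)])

def average_pool(y, k):
    """Mean over a window of +/- k positions around each index."""
    T = len(y)
    return np.array([y[max(0, i - k):min(T, i + k + 1)].mean() for i in range(T)])
```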

27 CNN vs. DNN: Performance GMM and DNN use fMLLR features; CNNs use log-mel features, which have local structure, as opposed to speaker-normalized features. Table: Broadcast News 50 h, WER [%] under cross-entropy (CE) and sequence training (ST) for GMM (18.8 CE, n/a ST), DNN, CNN, and CNN+DNN. Table: Broadcast Conversation 2000 h, WER [%] (CE, ST) for DNN, CNN, and DNN+CNN. 27 / 72

28 VGG. Table: configurations of the very deep CNNs for LVCSR (Classic [16, 17, 18], VB(X), VC(X), VD(X), WD(X)). In all but the classic convnet, convolutional layers have 3x3 kernels, so the kernel size is omitted; each column lists conv(in, out) layers interleaved with pooling, followed by FC 2048 fully connected layers and a softmax output of the target size. The depth of the networks increases from left to right. The deepest configuration, WDX, has 10 convolutional and 4 fully connected layers. The leftmost column indicates the number of output feature maps in each layer. The optional X means there are four fully connected layers instead of three (output layer included). 28 / 72

29 Multilingual VGG. Fig. 1: Multilingual VBX network with the last three layers untied: shared convolutional and lower fully connected layers, and separate softmax output layers per language (KUR, TOK, CEB, KAZ, TEL, LIT). FC stands for Fully Connected layers. Input context +/-5 frames. 29 / 72

30 VGG: Performance. Even in the monolingual case (3 hours of data), the VBX CNN architecture outperforms both the classical CNN and the baseline DNN. Table 5: results on Hub5'00 SWB after training on the 262-hour SWB-1 dataset, comparing Classic 512 [17], Classic 256 ReLU (A+S), VCX (6 conv) (A+S), VDX (8 conv) (A+S), WDX (10 conv) (A+S), VDX (8 conv) (S), and WDX (10 conv) (S) by WER, number of parameters (M), and number of frames until convergence. We obtain 14.5% relative improvement over our baseline adaptation of the classical CNN and 10.6% relative improvement over [17]. (A+S) means Adadelta + SGD fine-tuning; (S) means the model was trained from random initialization using SGD. 30 / 72

31 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 31 / 72

32 Recurrent Neural Networks (RNNs) 32 / 72 DNNs are deep in layers. RNNs are deep in time (in addition). Shared weights and biases across time steps.

33 Unfolded RNN 33 / 72

34 DNN vs. RNN 34 / 72

35 Formal Definition: RNN 35 / 72 Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Free parameters: input-to-hidden weight $W \in \mathbb{R}^{n_{l-1} \times n_l}$, hidden-to-hidden weight $R \in \mathbb{R}^{n_l \times n_l}$, bias $b \in \mathbb{R}^{n_l}$. Output: iterate for $t = 1, \dots, T$: $h_t = \sigma(W x_t + R h_{t-1} + b)$. Compare with a DNN layer: $h_t = \sigma(W x_t + b)$
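
A minimal numpy sketch of this recurrence (names and the zero initial state $h_0 = 0$ are assumptions, not from the lecture):

```python
import numpy as np

def rnn_forward(xs, W, R, b, sigma=np.tanh):
    """Unidirectional RNN: h_t = sigma(W x_t + R h_{t-1} + b).

    xs : (T, D) input sequence
    W  : (H, D) input-to-hidden weights
    R  : (H, H) hidden-to-hidden weights
    b  : (H,)   bias
    """
    h = np.zeros(b.shape[0])        # assumed initial state h_0 = 0
    hs = []
    for x_t in xs:
        h = sigma(W @ x_t + R @ h + b)
        hs.append(h)
    return np.stack(hs)             # (T, H) hidden outputs h_1 ... h_T
```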

36 BackPropagation Through Time (BPTT) 36 / 72 Chain rule through time: $\frac{d F(\theta)}{d h_t} = \sum_{\tau=1}^{t-1} \frac{d F(\theta)}{d h_\tau} \frac{d h_\tau}{d h_t}$

37 BackPropagation Through Time (BPTT) 37 / 72 Implementation: Unfold RNN over time through t = 1,..., T. Forward propagate RNN. Backpropagate error through unfolded network. Faster than other optimization methods (e.g. evolutionary search). Difficulty with local optima.

38 Bidirectional RNN (BRNN) 38 / 72 Forward RNN processes data forward left to right. Backward RNN processes data backward right to left. Output joins the output of forward and backward RNN.

39 Formal Definition: BRNN Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Forward and backward hidden outputs: $\overrightarrow{h}_t, \overleftarrow{h}_t$, $t = 1, \dots, T$. Forward and backward free parameters: input-to-hidden weights $\overrightarrow{W}, \overleftarrow{W} \in \mathbb{R}^{n_{l-1} \times n_l}$, hidden-to-hidden weights $\overrightarrow{R}, \overleftarrow{R} \in \mathbb{R}^{n_l \times n_l}$, biases $\overrightarrow{b}, \overleftarrow{b} \in \mathbb{R}^{n_l}$. Output: iterate for $t = 1, \dots, T$: $\overrightarrow{h}_t = \sigma(\overrightarrow{W} x_t + \overrightarrow{R}\, \overrightarrow{h}_{t-1} + \overrightarrow{b})$. Output: iterate for $t = T, \dots, 1$: $\overleftarrow{h}_t = \sigma(\overleftarrow{W} x_t + \overleftarrow{R}\, \overleftarrow{h}_{t+1} + \overleftarrow{b})$. 39 / 72

40 RNN using Memory Cells 40 / 72 Equip an RNN with a memory cell that can store information for a long time. Introduce gating units that control: activations going in, activations going out, keeping activations, and forgetting activations.

41 Long Short-Term Memory RNN 41 / 72

42 Formal Definition: LSTM 42 / 72 Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Iterate for $t = 1, \dots, T$: $z_t = \sigma(W_z x_t + R_z h_{t-1} + b_z)$ (block input), $i_t = \sigma(W_i x_t + R_i h_{t-1} + P_i \odot c_{t-1} + b_i)$ (input gate), $f_t = \sigma(W_f x_t + R_f h_{t-1} + P_f \odot c_{t-1} + b_f)$ (forget gate), $c_t = i_t \odot z_t + f_t \odot c_{t-1}$ (cell state), $o_t = \sigma(W_o x_t + R_o h_{t-1} + P_o \odot c_t + b_o)$ (output gate), $h_t = o_t \odot \tanh(c_t)$ (block output)
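
A sketch of a single LSTM step following the slide's equations, including the peephole terms $P_i, P_f, P_o$ (the dictionary-of-weights layout is an assumption; note the slide applies the sigmoid to the block input, whereas many implementations use tanh there):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with peepholes; p maps names to weights (W_*, R_* matrices,
    P_* peephole vectors, b_* bias vectors)."""
    z = sigmoid(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])                     # block input
    i = sigmoid(p["Wi"] @ x_t + p["Ri"] @ h_prev + p["Pi"] * c_prev + p["bi"])  # input gate
    f = sigmoid(p["Wf"] @ x_t + p["Rf"] @ h_prev + p["Pf"] * c_prev + p["bf"])  # forget gate
    c = i * z + f * c_prev                                                      # cell state
    o = sigmoid(p["Wo"] @ x_t + p["Ro"] @ h_prev + p["Po"] * c + p["bo"])       # output gate
    h = o * np.tanh(c)                                                          # block output
    return h, c
```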

43 LSTM: Too many connections? 43 / 72 Some of the connections in the LSTM are not necessary [1]. Peepholes do not seem to be necessary. Input and forget gates can be coupled. A simplified LSTM: the Gated Recurrent Unit (GRU).

44 Gated Recurrent Unit (GRU) 44 / 72 References: [2, 3, 4]

45 Formal Definition: GRU Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Iterate for $t = 1, \dots, T$: $r_t = \sigma(W_r x_t + R_r h_{t-1} + b_r)$ (reset gate), $z_t = \sigma(W_z x_t + R_z h_{t-1} + b_z)$ (update gate), $\tilde{h}_t = \sigma(W_h x_t + R_h (r_t \odot h_{t-1}) + b_h)$ (candidate gate), $h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$ (output gate) 45 / 72
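
The corresponding GRU step as a numpy sketch (again following the slide, which applies the sigmoid to the candidate; tanh is the more common choice in practice):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names to weights (W_*, R_* matrices, b_* vectors)."""
    r = sigmoid(p["Wr"] @ x_t + p["Rr"] @ h_prev + p["br"])             # reset gate
    z = sigmoid(p["Wz"] @ x_t + p["Rz"] @ h_prev + p["bz"])             # update gate
    h_cand = sigmoid(p["Wh"] @ x_t + p["Rh"] @ (r * h_prev) + p["bh"])  # candidate
    return z * h_prev + (1.0 - z) * h_cand                              # new hidden output
```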

46 CNN vs. DNN vs. RNN: Performance 46 / 72 GMM and DNN use fMLLR features; CNNs use log-mel features, which have local structure, as opposed to speaker-normalized features. Table: Broadcast News 50 h, WER [%] (CE / ST): GMM 18.8 / n/a; DNN; CNN; BGRU (fMLLR) 14.9 / n/a; BLSTM (fMLLR) 14.8 / n/a; BGRU (log-mel) 14.1 / n/a.

47 DNN vs. CNN vs. RNN: Performance 47 / 72 GMM and DNN use fMLLR features; CNNs use log-mel features, which have local structure, as opposed to speaker-normalized features. Table: Broadcast Conversation 2000 h, WER [%] (CE, ST) for DNN, CNN, RNN, DNN+CNN, RNN+CNN, and DNN+RNN+CNN.

48 RNN Black Magic Unrolling the RNN in training: over the whole utterance [5], vs. truncated BPTT with carryover [6]: split the utterance into subsequences of e.g. 21 frames, carry the last cell state of the previous subsequence over to the new subsequence, and compose minibatches from subsequences. vs. truncated BPTT with overlap: split the utterance into subsequences of e.g. 21 frames, overlap subsequences by 10 frames, and compose minibatches of subsequences from different utterances (see the sketch below). Plus gradient clipping of the LSTM cell. 48 / 72
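
A tiny sketch of the subsequence splitting described above (chunk length 21 and overlap 10 are the example values from the slide; the function itself is illustrative):

```python
import numpy as np

def split_utterance(frames, chunk_len=21, overlap=0):
    """Split an utterance (T, D) into subsequences for truncated BPTT.

    overlap=0 : contiguous chunks; the caller carries the last LSTM cell state
                of each chunk over as the initial state of the next chunk.
    overlap>0 : overlapping chunks (e.g. overlap=10), mixed into minibatches
                together with chunks from other utterances.
    """
    step = chunk_len - overlap
    return [frames[s:s + chunk_len] for s in range(0, len(frames), step)]

chunks = split_utterance(np.zeros((100, 40)), chunk_len=21, overlap=10)
```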

49 RNN Black Magic 49 / 72 Recognition: unrolling the RNN over the whole utterance, vs. unrolling over subsequences: split the utterance into subsequences of e.g. 21 frames and carry the last cell state of the previous subsequence over to the new subsequence. vs. unrolling over a spectral window [7]: for each frame, unroll over the spectral window; the last RNN layer only returns the center/last frame.

50 Highway Network 50 / 72 References: [2, 3, 4]

51 Formal Definition: Highway Network 51 / 72 Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Iterate for $t = 1, \dots, T$: $z_t = \sigma(W_z x_t + b_z)$ (highway gate), $\tilde{h}_t = \sigma(W_h x_t + b_h)$ (candidate gate), $h_t = z_t \odot x_t + (1 - z_t) \odot \tilde{h}_t$ (output gate)
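
A numpy sketch of one highway layer following these equations (x and the candidate must have the same dimension so that the gated mix is well defined):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def highway_layer(x, Wz, bz, Wh, bh):
    """Highway layer: the gate z mixes the untransformed input with the candidate."""
    z = sigmoid(Wz @ x + bz)           # highway gate
    h_cand = sigmoid(Wh @ x + bh)      # candidate
    return z * x + (1.0 - z) * h_cand  # output
```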

52 Formal Definition: Highway GRU Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Iterate for $t = 1, \dots, T$: $r_t = \sigma(W_r x_t + R_r h_{t-1} + b_r)$ (reset gate), $z_t = \sigma(W_z x_t + R_z h_{t-1} + b_z)$ (update gate), $d_t = \sigma(W_d x_t + R_d h_{t-1} + b_d)$ (highway gate), $\tilde{h}_t = \sigma(W_h x_t + R_h (r_t \odot h_{t-1}) + b_h)$ (candidate gate), $h_t = d_t \odot x_t + (1 - d_t) \odot (z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t)$ (output gate) 52 / 72

53 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 53 / 72

54 Unstable Gradient Problem Happens in deep as well as in recurrent neural networks. If the gradient becomes very small: vanishing gradient. If the gradient becomes very large: exploding gradient. Simplified neural network (the $w_i$ are just scalars): $F(w_1, \dots, w_N) = L(\sigma(y_N)) = L(\sigma(w_N \sigma(y_{N-1}))) = L(\sigma(w_N \sigma(w_{N-1} \cdots \sigma(w_1 x_t) \cdots)))$ 54 / 72

55 Unstable Gradient Problem, Constraints and Regularization 55 / 72 Gradient: $\frac{d F(w_1, \dots, w_N)}{d w_1} = \frac{d L}{d \sigma(y_N)} \cdot \frac{d \sigma(w_N \sigma(w_{N-1} \cdots \sigma(w_1 x_t) \cdots))}{d w_1} = \frac{d L}{d \sigma(y_N)}\, \sigma'(y_N)\, w_N\, \sigma'(y_{N-1})\, w_{N-1} \cdots \sigma'(y_1)\, x_t$. If $|w_i\, \sigma'(y_i)| < 1$ for $i = 2, \dots, N$, the gradient vanishes. If $|w_i\, \sigma'(y_i)| \gg 1$ for $i = 2, \dots, N$, the gradient explodes.

56 Solution: Unstable Gradient Problem 56 / 72 Gradient clipping. Weight constraints. Let the network save activations over layers/time steps: $y_{\mathrm{new}} = \alpha\, y_{\mathrm{previous}} + (1 - \alpha)\, y_{\mathrm{common}}$. Long Short-Term Memory RNNs. Highway neural networks (>100 layers).

57 Gradient Clipping 57 / 72 Keeps gradient values in range. One approach to deal with the exploding gradient problem. Ensure the gradient is in the range $[-c, c]$ for a constant $c$: $\mathrm{clip}\!\left( \frac{d F}{d \theta}, c \right) = \min\!\left( c, \max\!\left( -c, \frac{d F}{d \theta} \right) \right)$
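
As a sketch, element-wise clipping of a gradient array to $[-c, c]$:

```python
import numpy as np

def clip_gradient(grad, c):
    """Clip every gradient component to the range [-c, c]."""
    return np.minimum(c, np.maximum(-c, grad))

print(clip_gradient(np.array([-5.0, 0.3, 12.0]), 1.0))   # [-1.   0.3  1. ]
```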

58 Constraints (I) 58 / 72 Keep weights in range (e.g. for RELU, Maxout). Ignored during gradient backpropagation. Constraints are enforced after the gradient update.

59 Constraints (II) 59 / 72 Max-norm: force $\|W\|_2 \le c$ for a constant $c$: $W_{\max} = \frac{W}{\|W\|_2} \max(\min(\|W\|_2, c), 0)$. Unity-norm: force $\|W\|_2 = 1$: $W_{\mathrm{unity}} = \frac{W}{\|W\|_2}$. Positivity-norm: force $W \ge 0$: $W_+ = \max(0, W)$.
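
A sketch of how such constraints might be enforced after a gradient update (illustrative only; the max-norm variant simply rescales W whenever its norm exceeds c):

```python
import numpy as np

def apply_max_norm(W, c):
    """Rescale W after the gradient update so that ||W||_2 <= c."""
    norm = np.linalg.norm(W)
    return W * (c / norm) if norm > c else W

def apply_unity_norm(W):
    """Rescale W to unit norm."""
    return W / np.linalg.norm(W)

def apply_positivity(W):
    """Clip negative weights to zero."""
    return np.maximum(0.0, W)
```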

60 Regularization: Dropout 60 / 72 Dropout prevents getting stuck in a local optimum and avoids overfitting.

61 Regularization: Dropout 61 / 72 Dropout prevents getting stuck in a local optimum and avoids overfitting.

62 Regularization: Dropout 62 / 72 Input vector sequence: $x_t \in \mathbb{R}^D$. Choose $z_t \in \{0, 1\}^D$ for $t = 1, \dots, T$ according to a Bernoulli distribution $P(z_{t,d} = i) = p^{1-i} (1-p)^i$ with dropout probability $p \in [0, 1]$. Training: $x_t := \frac{x_t \odot z_t}{1 - p}$ for $t = 1, \dots, T$. Recognition: $x_t := x_t$ for $t = 1, \dots, T$.
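
This is the inverted-dropout formulation; a minimal sketch (the random-generator handling is an implementation choice, not from the lecture):

```python
import numpy as np

def dropout(x, p, train=True, rng=None):
    """Inverted dropout with drop probability p: scale kept units by 1/(1-p)
    during training so that recognition can use x unchanged."""
    if not train:
        return x                       # recognition: identity
    rng = np.random.default_rng() if rng is None else rng
    z = rng.random(x.shape) >= p       # z_d = 0 with probability p
    return x * z / (1.0 - p)
```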

63 Regularization (II) 63 / 72 $L_p$ norm: $\|\theta\|_p = \left( \sum_j |\theta_j|^p \right)^{1/p}$. Training criterion with regularization: $F_p(\theta) = F(\theta) + \lambda \|\theta\|_p$ with a scalar $\lambda$. Smooths the training criterion. Pushes the free parameter weights closer to zero.

64 Where Are We? 1 Recap: Deep Neural Network 2 Multilingual Bottleneck Features 3 Convolutional Neural Networks 4 Recurrent Neural Networks 5 Unstable Gradient Problem 6 Attention-based End-to-End ASR 64 / 72

65 Attention-based End-to-End Architecture 65 / 72

66 Attention model 66 / 72

67 Formal Definition: Content Focus 67 / 72 Input vector sequence: $x_t \in \mathbb{R}^D$, $t = 1, \dots, T$. Hidden outputs: $h_t$, $t = 1, \dots, T$. Scorer: $\epsilon_{m,t} = \tanh(v_{m,\epsilon}\, x_t + b_\epsilon)$ for $t = 1, \dots, T$, $m = 1, \dots, M$. Generator: $\alpha_{m,t} = \frac{\sigma(w_\alpha\, \epsilon_{m,t})}{\sum_{\tau=1}^{T} \sigma(w_\alpha\, \epsilon_{m,\tau})}$ for $t = 1, \dots, T$, $m = 1, \dots, M$. Glimpse: $g_m = \sum_{t=1}^{T} \alpha_{m,t}\, x_t$ for $m = 1, \dots, M$. Output: $h_m = \sigma(W_h g_m + b_h)$ for $m = 1, \dots, M$.
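
A numpy sketch of this content-based attention for a single output position m (shapes and parameter names are illustrative assumptions):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def content_attention(xs, V_eps, b_eps, w_alpha, Wh, bh):
    """Score every input frame, normalize to attention weights, form a glimpse.

    xs      : (T, D) input sequence
    V_eps   : (E, D) scorer weights, b_eps : (E,)
    w_alpha : (E,)   generator weights
    Wh, bh  : output projection
    """
    eps = np.tanh(xs @ V_eps.T + b_eps)   # (T, E) scorer
    scores = sigmoid(eps @ w_alpha)       # (T,)  unnormalized weights
    alpha = scores / scores.sum()         # attention weights alpha_{m,t}
    g = alpha @ xs                        # (D,)  glimpse g_m
    return sigmoid(Wh @ g + bh)           # output h_m
```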

68 Formal Definition: Recurrent Attention Scorer: $\epsilon_{m,t} = \tanh(W_\epsilon x_t + R_\epsilon s_{m-1} + U_\epsilon (F_\epsilon * \alpha_{m-1}) + b_\epsilon)$. Generator: $\alpha_{m,t} = \frac{\sigma(w_\alpha\, \epsilon_{m,t})}{\sum_{\tau=1}^{T} \sigma(w_\alpha\, \epsilon_{m,\tau})}$ for $t = 1, \dots, T$, $m = 1, \dots, M$. Glimpse: $g_m = \sum_{t=1}^{T} \alpha_{m,t}\, x_t$ for $m = 1, \dots, M$. GRU state: $s_m = \mathrm{GRU}(g_m, h_m, s_{m-1})$ for $m = 1, \dots, M$. Output: $h_m = \sigma(W_h g_m + R_h s_{m-1} + b_h)$ for $m = 1, \dots, M$. 68 / 72

69 End-to-End Performance 69 / 72 Table: TIMIT, WER [%] (dev / eval) for HMM, End-to-end, and RNN Transducer (n/a / 17.7).

70 70 / 72 [1] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, "LSTM: A search space odyssey," CoRR, 2015. [2] K. Cho, B. van Merriënboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014. [3] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, 2014.

71 71 / 72 [4] R. Józefowicz, W. Zaremba, and I. Sutskever, "An empirical exploration of recurrent network architectures," in ICML, ser. JMLR Proceedings, vol. 37. JMLR.org, 2015. [5] A. Graves, N. Jaitly, and A. Mohamed, "Hybrid speech recognition with deep bidirectional LSTM," in ASRU. IEEE, 2013. [6] H. Sak, A. W. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," in INTERSPEECH. ISCA, 2014. [7] A.-R. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, December 2015. 72 / 72
