Deep Learning for Automatic Speech Recognition Part II

Size: px

Start display at page:

Download "Deep Learning for Automatic Speech Recognition Part II"

Clyde Leonard
5 years ago
Views:

1 Deep Learning for Automatic Speech Recognition Part II Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY Fall, 2018

2 Outline A brief revisit of sampling, pitch/formant and MFCC DNN-HMM (hybrid) acoustic modeling Advanced acoustic modeling Convolutional neural networks (CNNs) Recurrent neural networks (RNNs) and long short-term memory (LSTM) networks Convolutional long short-term memory (CLSTM) networks EECS 6894, Columbia University 2/43

3 Nyquist Sampling EECS 6894, Columbia University 3/43

4 Pitch and Formants EECS 6894, Columbia University 4/43

5 Mel-Frequency Filter Bank and MFCC DFT MEL Log DCT MFCC EECS 6894, Columbia University 5/43

Mel-Frequency Filter Bank and MFCC ( m = 2595 log 10 1 + f ) 100 multi-resolution

energy clustering typical order around 13 for speech recognition increased order

6 Mel-Frequency Filter Bank and MFCC ( m = 2595 log f ) 100 multi-resolution on frequency bins use DCT to replace IDFT for better dimension decorrelation and energy clustering typical order around 13 for speech recognition increased order for speaker recognition/verification (typically 19-22) EECS 6894, Columbia University 6/43

7 Input Speech Features for DNNs Commonly-used hand-crafted features MFCCs Mel-frequency filter bank (FBank) Speaker adapted features (FMLLR) Appended speaker embedding vectors (i-vectors) Let DNNs learn the features Power spectra Raw audio signals EECS 6894, Columbia University 7/43

8 Two Streams of DNN Acoustic Models Hybrid DNNs Commonly referred to as DNN-HMM or CD-DNN-HMM Use GMM-HMM alignments as labels Use dictionary and language model for decoding. End-to-end (E2E) DNNs (will be addressed in next lecture) Directly deal with sequence-to-sequence mapping problem with unequal consequence lengths Do not need alignments, dictionary and language model in principle. Two E2E architectures: Connectionist Temporal Classification (CTC) Encoder-Decoder Attention models EECS 6894, Columbia University 8/43

9 GMM-HMM: Forced Alignment SIL SIL-HH+AW HH-AW+AA AW-AA+R AA-R+Y R-Y+UW Y-UW+SIL SIL How are you Given the text label, how to find the best underlying state sequence? same as decoding except the label is known Viterbi algorithm (dynamic programming) Often referred to as Viterbi alignments in speech community Widely used in deep learning ASR to generate the target labels for the training data EECS 6894, Columbia University 9/43

10 Context-Dependent DNNs AA-B-1 AA-B-2 AA-B-3 AA-M-1 AA-M-2 AA-M-3 AA-M-4 AA-E-1 AA-E-2 SIL-E-7 SIL-E-8 SIL-E-9 P (o s) = p(s o)p(o) p(s) p(s o) p(s) EECS 6894, Columbia University 10/43

11 DNN-HMMs: An Typical Training Recipe Preparation 40-dim FBank features with ±4 adjacent frames (input dim = 40x9) use an existing model to generate alignments which are then converted to 1-hot targets from each frame create training and validation data sets estimate priors p(s) of CD states Training set DNN configuration (multiple hidden layers, softmax output layers and cross-entropy loss function) initialization optimization based on back-prop using SGD on the training set while monitoring loss on the validation set Test push input features from test utterances through the DNN acoustic model to get their posteriors convert posteriors to likelihoods Viterbi decoding on the decoding network measure the word error rate EECS 6894, Columbia University 11/43

12 Training A Hybrid DNN-HMM System Attila/Kaldi Theano/Torch/ Tensorflow Attila/Kaldi feature extraction CD-phone generation GMM-HMM CD-phones (classes) alignments (labels) features DNN-HMM posteriors HMM Viterbi Decoding EECS 6894, Columbia University 12/43

13 Convolutional Neural Networks (CNNs) Bio-inspired feed-forward neural networks* individual cortical neurons in visual cortex respond to stimuli in a restricted region of space known as the receptive field. receptive fields of different neurons partially overlap such that they tile the visual field. response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution operation. LeNet-5 Y. LeCun, L. Bottou, Y. Bengio, P. Haffner (1998). "Gradient-based learning applied to document recognition," Proc. of the IEEE. A 7-layer CNN that outperformed other techniques on a standard handwritten digit recognition task. Advantages: Local (sparse) connectivity Weight sharing translation-invariant *adapted from wikipedia EECS 6894, Columbia University 13/43

14 Convolutional Neural Networks (CNNs) Convolutional layers with nonlinearity z k ij = σ((w k x) ij + b k ) + + W (m, n) x(m, n) = x(u, v)w (u m, v n) u= v= W (m, n) 0 only for 0 m M, 0 n N Pooling layers subsampling typically max-pooling or average-pooling Fully connected layers EECS 6894, Columbia University 14/43

15 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 15/43

16 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 16/43

17 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 17/43

18 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 18/43

19 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 19/43

20 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 20/43

21 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 21/43

22 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 22/43

23 Convolutional Neural Networks (CNNs) EECS 6894, Columbia University 23/43

24 A Brief Summary of Some CNN Terminology feature map local receptive field (filter or kernel) padding stride convolution pooling max-pooling average-pooling pooling along different axes resulting different resolutions EECS 6894, Columbia University 24/43

25 Tensor Representation Layer l-1 Layer l W 1 W0 weights W pq mn connecting each unit of the p-th feature map at layer l 1 and the q-th feature map of layer l with the local receptive filter w(m, n). number of parameters in the weights that connect the two convolutional layers is (ignore biases here) W pq mn = M N P Q where M and N are the dimensions of the local receptive filters and P and Q are the numbers of feature maps in layer l 1 and layer l. relation of input and output dimensionality O = W K + 2P S where W is the input dimensionality, O the output dimensionality, K the filter size, P the padding and S the stride. + 1 EECS 6894, Columbia University 25/43

26 An Example of CNN Acoustic modeling L (3000) (1024) (1024) (1024) (8x1x512) (8x1) (8x1) (8x1) (8x1) (512) (4x3x512x512) (11x3) (11x3) (11x3) (11x3) (512) (4x3) (4x3) (4x3) (4x3) (3x1) (32x3) (32x3) (32x3) (32x3) (3x1) (3x1) (3x1) (9x9x3x512) (40x11) (40x11) (40x11) (9x9) (9x9) (9x9) logmel logmel logmel EECS 6894, Columbia University 26/43

27 AlexNet Notable Development In CNNs Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Networks," NIPS conv layers, 3 fully-connected layers, ReLU, max-pooling, dropout, data augmentation, 2-GPU implementation ZFNet Matthew D. Zeiler and Rob Fergus, "Visualizing and understanding convolutional networks." ECCV, similar architecture to AlexNet, reduced window size and stride step in first conv layer. VGGNet Karen Simonyan and Andrew Zisserman, "Very deep convolutional neural networks for large-scale image recognition," ICLR up to 19 conv layers small 3x3 filters with stride 1, padding, 2x2 max-pooling layers with stride 2 every 2,3 or 4 conv layers ResNet Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep Residual Learning for Image Recognition," CVPR layers with 3x3 filters introduced residual learning by a direct by-pass identity link DenseNet Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, "Densely connected convolutional networks," CVPR do not throw away anything you learn along the way EECS 6894, Columbia University 27/43

28 VGG Acoustic Modeling T. Sercu, C. Puhrsch, B. Kingsbury and Y. LeCun, "Very Deep Multilingual Convolutional Neural Networks for LVCSR," ICASSP EECS 6894, Columbia University 28/43

29 Recurrent Neural Networks (RNNs) Sequential data is ubiquitous speech, video, text, stock index Sequence modeling is crucial how to model the history? h t = g t (x t, x t 1, x t 2,, x 1 ) RNNs model sequences with shared parameters (functions) h t = f(h t 1, x t ; θ) h x f EECS 6894, Columbia University 29/43

30 Compact and Unfolded Representations of RNNs o ot 1 ot ot+1 V V V V h W ht 1 W ht W ht+1 W U W U U U x xt 1 xt xt+1 compact unfolded a t = b + Wh t 1 + Ux t h t = tanh(a t ) o t = c + Vh t ŷ t = softmax(o t ) EECS 6894, Columbia University 30/43

31 Some RNN Configurations one-to-many mapping image captioning. EECS 6894, Columbia University 31/43

32 Some RNN Configurations many-to-one mapping sentiment analysis. EECS 6894, Columbia University 32/43

33 Some RNN Configurations synchronized many-to-many mapping speech recognition, video labeling. EECS 6894, Columbia University 33/43

34 Some RNN Configurations asynchronized many-to-many mapping speech recognition, machine translation. EECS 6894, Columbia University 34/43

35 Back-Propagation Through Time (BPTT) L Forward Propagation: yt 1 Lt 1 Lt Lt+1 yt yt+1 a t = b + Wh t 1 + Ux t h t = tanh(a t ) y t 1 y t y t+1 ot 1 ot ot+1 o t = c + Vh t ŷ t = softmax K (o t ) W V ht 1 U xt 1 W V ht U xt W V ht+1 U xt+1 W K L t = y t,k log ŷ t,k k=1 T L = L t t=1 EECS 6894, Columbia University 35/43

36 Back-Propagation Through Time (BPTT) Parameters to be optimized: W, U, V, b and c L W L U L b L V L c = h i t t i W = t = t ht b = t = t ot c L h i t h i t L i U h i t L h t h i t L i V h i t L o t which involves gradients of two internal nodes o t and h t. L o t, L h t EECS 6894, Columbia University 36/43

37 Back-Propagation Through Time (BPTT) L o t = ŷ t o t L t ŷ t L L t = ŷ t y t evaluate the gradient of h t backward starting from the last time step T: L = o T L T L = V T (ŷ T y T ) h T h T o T L T L = o t L + h t+1 L h t h t o t h t h t+1 = V T (ŷ t y t ) + a t+1 h t+1 L h t a t+1 h t+1 ( ) = V T (ŷ t y t ) + W T diag 1 h 2 L t+1 h t+1 EECS 6894, Columbia University 37/43

38 Back-Propagation Through Time (BPTT) L W = t L U = t L b = t L V = t L c = t ( ) diag 1 h 2 t L h T t 1 h t ( ) diag 1 h 2 t L x T t h t ( ) diag 1 h 2 t L h t L o t h T t L o t EECS 6894, Columbia University 38/43

39 An LSTM Memory Cell xt ht 1 xt ht 1 it ot xt ct ht ht 1 ft xt ht 1 input gate: i t = σ(w ix x t + W ih h t 1 + b i ) forget gate: f t = σ(w fx x t + W fh h t 1 + b f ) cell candidate: g t = tanh(w cx x t + W ch h t 1 + b c ) cell: c t = f t c t 1 + i t g t output gate: o t = σ(w ox x t + W oh h t 1 + b o ) hidden state output: h t = o t tanh(c t ) EECS 6894, Columbia University 39/43

40 Unfolded LSTM Representation L Lt 1 Lt Lt+1 yt 1 yt yt+1 y t 1 y t y t+1 ht 1 ht ht+1 Ct 1 Ct Ct+1 ht 2 xt 1 ht 1 xt ht xt+1 EECS 6894, Columbia University 40/43

41 Why LSTMs Work? c t = c t c t 1 c k+1 = diag(f t f t 1 f k+1 ) c k c t 1 c t 2 c k Sepp Horchreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation 9, 1997 a recurrent self-connection with weight 1.0 constant error carousel (CEC) Felix A. Gers, Jürgen Schmidhuber and Fred Cummins. "Learning to Forget: Continual Prediction with LSTM." Neural Computation 12, 2000 A forget gate added to learn to reset EECS 6894, Columbia University 41/43

42 A Bi-directional Multi-layer LSTM for ASR EECS 6894, Columbia University 42/43

43 A CLSTM for ASR EECS 6894, Columbia University 43/43

Neural Architectures for Image, Language, and Speech Processing

Neural Architectures for Image, Language, and Speech Processing Karl Stratos June 26, 2018 1 / 31 Overview Feedforward Networks Need for Specialized Architectures Convolutional Neural Networks (CNNs) Recurrent