Speech recognition. Lecture 14: Neural Networks. Andrew Senior, Google NYC, December 12, 2013
1 Speech recognition. Lecture 14: Neural Networks. Andrew Senior, Google NYC, December 12, 2013.
2 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
3 The perceptron. [Diagram: inputs $x_1, \dots, x_5$ with weights $w$ feeding a single output unit.] A perceptron is a linear classifier: $f(x) = 1$ if $w \cdot x > 0$ (1); $f(x) = 0$ otherwise (2). Add an extra always-one input to provide an offset or bias. The weights $w$ can be learned for a given task with the perceptron algorithm.
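To make the decision rule concrete, here is a minimal sketch (not from the original slides) in numpy, using an appended always-one input to carry the bias; the example weights are illustrative.

```python
import numpy as np

def perceptron_predict(x, w):
    """Linear classifier: returns 1 if w.x > 0, else 0.

    x is augmented with a constant 1 so that the last weight acts as the bias.
    """
    x_aug = np.append(x, 1.0)          # add the always-one input
    return 1 if np.dot(w, x_aug) > 0 else 0

# Example: a 2-input perceptron implementing a simple threshold.
w = np.array([1.0, 1.0, -1.5])         # weights for x1, x2 and the bias input
print(perceptron_predict(np.array([1.0, 1.0]), w))  # 1, since 1 + 1 - 1.5 > 0
print(perceptron_predict(np.array([1.0, 0.0]), w))  # 0, since 1 + 0 - 1.5 <= 0
```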
4 Perceptron algorithm (Rosenblatt, 1957). Adapt the weights $w$ example by example:
1. Initialise the weights and the threshold.
2. For each example $j$ in our training set $D$, perform the following steps over the input $x_j$ and desired output $\hat{y}_j$:
   (a) Calculate the actual output: $y_j(t) = f[w(t) \cdot x_j] = f[w_0(t) + w_1(t)x_{j,1} + w_2(t)x_{j,2} + \dots + w_n(t)x_{j,n}]$
   (b) Update the weights: $w_i(t+1) = w_i(t) + \alpha(\hat{y}_j - y_j(t))x_{j,i}$, for all nodes $0 \le i \le n$.
3. Repeat step 2 until the iteration error $\frac{1}{s}\sum_j |\hat{y}_j - y_j(t)|$ is less than a user-specified error threshold $\gamma$, or a predetermined number of iterations have been completed.
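A compact sketch of the update loop described above (not from the original slides), assuming numpy arrays and a prepended always-one input column; the learning rate alpha, threshold gamma and the logical-AND example are illustrative.

```python
import numpy as np

def train_perceptron(X, y_hat, alpha=0.1, gamma=0.01, max_iters=100):
    """Rosenblatt perceptron training on inputs X (s x n) and 0/1 targets y_hat (s,)."""
    s, n = X.shape
    X_aug = np.hstack([np.ones((s, 1)), X])      # x_{j,0} = 1 provides the bias
    w = np.zeros(n + 1)                          # initialise the weights
    for _ in range(max_iters):
        for j in range(s):                       # example-by-example updates
            y_j = 1.0 if np.dot(w, X_aug[j]) > 0 else 0.0
            w += alpha * (y_hat[j] - y_j) * X_aug[j]
        preds = (X_aug @ w > 0).astype(float)
        if np.mean(np.abs(y_hat - preds)) < gamma:   # iteration error below threshold
            break
    return w

# Learn a simple linearly separable function (logical AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
print(train_perceptron(X, y))
```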
5 Nonlinear perceptrons. Introduce a nonlinearity: $y_i = \sigma(\sum_j w_{ij} x_j)$. Each unit is a simple nonlinear function of a linear combination of its inputs. Typically the logistic sigmoid $\sigma(z) = \frac{1}{1 + e^{-z}}$ or $\sigma(z) = \tanh z$.
6 Multilayer perceptrons. Extend the network to multiple layers. Now a hidden layer of nodes computes a function of the inputs, and output nodes compute a function of the hidden nodes' activations. [Diagram: input layer $x_1, \dots, x_4$, hidden layer, output layer $y_1, y_2, y_3$.]
7 Cost function. Such networks can be optimized ("trained") to minimize a cost function (loss function or objective function) that is a numerical score of the network's performance with respect to targets $\hat{y}_i(t)$.
Squared error: $L_{SE} = \frac{1}{2}\sum_t \sum_i (y_i(t) - \hat{y}_i(t))^2$
Cross entropy: $L_{CE} = -\sum_t \sum_i \hat{y}_i(t) \log y_i(t)$
This is a frame-based criterion, where $t$ would ideally range over the entire space of decoding frames, but in practice ranges over the training set, and we measure it on a development set.
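As an illustrative sketch (not part of the original slides), the two losses for a single frame can be written directly in numpy; the arrays y and y_hat are hypothetical network outputs and targets.

```python
import numpy as np

def squared_error(y, y_hat):
    """L_SE = 1/2 * sum_i (y_i - y_hat_i)^2 for one frame."""
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy(y, y_hat, eps=1e-12):
    """L_CE = -sum_i y_hat_i * log(y_i) for one frame (eps guards against log(0))."""
    return -np.sum(y_hat * np.log(y + eps))

y = np.array([0.7, 0.2, 0.1])       # network outputs (e.g. softmax posteriors)
y_hat = np.array([1.0, 0.0, 0.0])   # Viterbi one-hot target for this frame
print(squared_error(y, y_hat), cross_entropy(y, y_hat))
```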
8 Targets. We need targets/labels $\hat{y}_i(t)$ for each frame, usually provided by forced alignment (Lecture 8). Viterbi alignment gives one target class for each frame $t$. Baum-Welch soft alignment gives a target distribution over $\hat{y}_i(t)$ for each $t$.
9 Softmax output layer. If the output units are logistic, then they are suitable for representing multivariate Bernoulli random variables $P(\hat{y}_i = 1 \mid x)$. To model a multi-class categorical distribution we use the softmax: $y_i = P(c_i \mid x) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$, which is normalized to sum to one. This reduces to the logistic sigmoid when there are two output classes.
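A minimal, numerically stable softmax sketch (the max subtraction is a standard trick, not mentioned on the slide, and cancels in the ratio).

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([2.0, 1.0, 0.1])
y = softmax(z)
print(y, y.sum())   # posteriors P(c_i|x), summing to one
```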
10 Gradient descent. To minimize the loss $L$, compute a gradient $\frac{\partial L}{\partial w}$ for each parameter $w$ and update it using simple gradient descent: $w' = w - \eta \frac{\partial L}{\partial w}$. Here $\eta$ is a learning rate, which is chosen (typically by cross-validation) but may be set automatically. We can apply the chain rule to compute $\frac{\partial L}{\partial w}$ for parameters deep in the network.
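The update itself is one line per parameter; here is a sketch with a hypothetical gradient function grad_L and a toy quadratic loss, purely for illustration.

```python
import numpy as np

def gradient_descent_step(w, grad_L, eta=0.01):
    """Simple gradient descent: w <- w - eta * dL/dw."""
    return w - eta * grad_L(w)

# Example: minimise L(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0])
for _ in range(100):
    w = gradient_descent_step(w, lambda w: 2 * w, eta=0.1)
print(w)   # approaches the minimiser [0, 0]
```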
11 Back-propagation 0. Derivatives of the loss functions:
$\frac{\partial L_{CE}}{\partial y_i} = \frac{\partial}{\partial y_i}\left(-\sum_j \hat{y}_j(t) \log y_j(t)\right)$ (3)
$= -\frac{\hat{y}_i(t)}{y_i(t)}$ (4)
$\frac{\partial L_{SE}}{\partial y_i} = \frac{\partial}{\partial y_i}\,\frac{1}{2}\sum_j (y_j(t) - \hat{y}_j(t))^2$ (5)
$= y_i(t) - \hat{y}_i(t)$ (6)
12 Back-propagation 0. Derivative of the logistic activation function:
$\frac{\partial y_i}{\partial z_i} = \frac{\partial}{\partial z_i}\,\frac{1}{1 + e^{-z_i}}$ (7)
$= \frac{e^{-z_i}}{(1 + e^{-z_i})^2}$ (8)
$= y_i(1 - y_i)$ (9)
because $1 - y = \frac{(1 + e^{-z}) - 1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}$. (10)-(12)
13 Back-propagation 0. Derivative of the softmax activation function:
$\frac{\partial y_k}{\partial z_i} = \frac{\partial}{\partial z_i}\,\frac{e^{z_k}}{\sum_j e^{z_j}}$ (13)
$= \frac{\delta_{ik}(\sum_j e^{z_j})e^{z_k} - e^{z_k}e^{z_i}}{(\sum_j e^{z_j})^2}$ (14)
$= \frac{e^{z_i}}{\sum_j e^{z_j}} \cdot \frac{(\sum_j e^{z_j})\delta_{ik} - e^{z_k}}{\sum_j e^{z_j}}$ (15)
$= y_i(\delta_{ik} - y_k)$ (16)
14 Back-propagation I. For a weight in the final layer, by the chain rule, for one example:
$\frac{\partial L}{\partial w_{ij}} = \sum_k \frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial z_i}\frac{\partial z_i}{\partial w_{ij}}$ (17)
For softmax and $L_{CE}$:
$\frac{\partial L_{CE}}{\partial y_k} = -\frac{\hat{y}_k}{y_k}$ [Outer gradient.] (18)
$\frac{\partial y_k}{\partial z_i} = y_i(\delta_{ik} - y_k)$ [Derivative of the softmax activation function.] (19)
$\frac{\partial z_i}{\partial w_{ij}} = x_j$ (20)
$\frac{\partial L}{\partial w_{ij}} = -\sum_k \frac{\hat{y}_k}{y_k}\, y_k(\delta_{ik} - y_i)\, x_j$ [using the equivalent form $y_k(\delta_{ik} - y_i)$ of eqn 19] (21)
$= -x_j \sum_k \hat{y}_k(\delta_{ik} - y_i)$ (22)
$= x_j(y_i - \hat{y}_i)$ [since $\sum_k \hat{y}_k = 1$] (23)
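The end result of eqns (17)-(23) is the familiar output-layer gradient. A sketch (not from the slides), assuming a single example with input activations x, a weight matrix W with one row per output, and a one-hot target y_hat; all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def output_layer_gradient(x, W, y_hat):
    """dL_CE/dW for a softmax output layer: dL/dw_ij = x_j * (y_i - y_hat_i), eqn (23)."""
    z = W @ x                      # pre-activations z_i = sum_j w_ij x_j
    y = softmax(z)                 # posteriors y_i
    return np.outer(y - y_hat, x)  # gradient, same shape as W

x = np.array([0.5, -1.0, 2.0])
W = np.zeros((4, 3))
y_hat = np.array([0.0, 1.0, 0.0, 0.0])
print(output_layer_gradient(x, W, y_hat))
```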
15 Back-propagation II. Back-propagating (Rumelhart et al., 1986) to an earlier hidden layer with weights $w_{jk}$, activations $x_j$ and inputs $x_k$:
$x_j = \sigma(z_j) = \sigma(\sum_k w_{jk} x_k)$ (24)
First find the gradient w.r.t. the hidden-layer activation $x_j$:
$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial z_i}\frac{\partial z_i}{\partial x_j}$ (25)
$\frac{\partial z_i}{\partial x_j} = w_{ij}$ (26)
i.e. we pass the vector of gradients $\frac{\partial L}{\partial y_i}$ back through the nonlinearity and then project it back through the weight matrix $W$.
16 Back-propagation III.
$\frac{\partial L}{\partial w_{jk}} = \frac{\partial L}{\partial x_j}\frac{\partial x_j}{\partial z_j}\frac{\partial z_j}{\partial w_{jk}}$ [Same form as eqn. 17.] (27)
$\frac{\partial x_j}{\partial z_j} = x_j(1 - x_j)$ [Derivative of the sigmoid activation function.] (28)
$\frac{\partial z_j}{\partial w_{jk}} = x_k$ (29)
Continue to arbitrary depth: compute activation gradients and then weight gradients for each layer.
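Putting eqns (17)-(29) together for one hidden layer, here is a sketch (not from the slides) of the full forward and backward pass with a sigmoid hidden layer, softmax output and cross-entropy loss; the array names loosely follow the slides and everything else is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def backprop_two_layer(x_in, W1, W2, y_hat):
    """Return (dL/dW1, dL/dW2) for one example under cross-entropy loss."""
    # Forward pass.
    z_hid = W1 @ x_in              # hidden pre-activations z_j
    x_hid = sigmoid(z_hid)         # hidden activations x_j (eqn 24)
    z_out = W2 @ x_hid             # output pre-activations z_i
    y = softmax(z_out)             # outputs y_i
    # Backward pass.
    d_out = y - y_hat              # dL/dz_i for softmax + cross-entropy (eqn 23)
    dW2 = np.outer(d_out, x_hid)   # dL/dw_ij = x_j * (y_i - y_hat_i)
    dx_hid = W2.T @ d_out          # dL/dx_j: project back through W (eqns 25-26)
    d_hid = dx_hid * x_hid * (1 - x_hid)   # through the sigmoid derivative (eqn 28)
    dW1 = np.outer(d_hid, x_in)    # dL/dw_jk = x_k * dL/dz_j (eqns 27, 29)
    return dW1, dW2

rng = np.random.default_rng(0)
x_in = rng.normal(size=5)
W1, W2 = rng.normal(size=(4, 5)), rng.normal(size=(3, 4))
y_hat = np.array([0.0, 1.0, 0.0])
dW1, dW2 = backprop_two_layer(x_in, W1, W2, y_hat)
print(dW1.shape, dW2.shape)
```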
17 Stochastic Gradient Descent. Since $L$ is typically defined on the entire training set, it takes a long time to compute it and its derivatives (summed across all exemplars), and it is only an approximation to the true loss on the theoretical set of all utterances. We can compute a noisy estimate of $\frac{\partial L}{\partial w}$ on a small subset of the training set, and make a Stochastic Gradient Descent (SGD) update very quickly. In the limit, we could update on every frame, but a useful compromise is to use a minibatch of around 200 frames.
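A sketch of the minibatch SGD loop described above (not from the slides); the model is abstracted as a hypothetical per-minibatch gradient function, the least-squares example is an illustrative stand-in for the network gradient, and 200 frames per minibatch follows the slide.

```python
import numpy as np

def sgd(params, grad_minibatch, frames, targets, eta=0.01, batch_size=200, epochs=5):
    """Minibatch SGD: noisy gradient estimates from small random subsets of frames."""
    n = len(frames)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                 # visit frames in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one minibatch of ~200 frames
            g = grad_minibatch(params, frames[idx], targets[idx])
            params = params - eta * g              # noisy but cheap update
    return params

# Example: fit a linear model y = w.x by least squares with minibatch SGD.
def grad_lsq(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

X = np.random.default_rng(1).normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(np.zeros(3), grad_lsq, X, y, eta=0.05))   # approaches [1, -2, 0.5]
```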
18 Second-order optimization. Compute the second derivative and optimize a second-order approximation to the error surface. More computation per step. Requires less-noisy estimates of gradient and curvature (bigger batches). Each step is more effective. Variants: Newton-Raphson, Quickprop, L-BFGS, Hessian-free, conjugate gradient.
19 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
20 Two main paradigms for neural networks for speech. (1) Use neural networks to compute nonlinear feature representations: bottleneck or tandem features (Hermansky et al., 2000); the low-dimensional representation is modelled conventionally with GMMs, which allows all the GMM machinery and tricks to be exploited. (2) Use neural networks to estimate CD (context-dependent) state probabilities.
21 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
22 Neural network features. Train a neural network to discriminate classes. Use the output or a low-dimensional bottleneck-layer representation as features. [Diagram: input layer $x_1, \dots, x_4$, hidden layers, bottleneck layer, output layer $y_1, \dots, y_5$.]
23 Neural network features. TRAP: concatenate PLP-HLDA features and NN features. Bottleneck features outperform posterior features (Grezl et al., 2007). Generally DNN features + GMMs reach about the same performance as hybrid DNN-HMM systems, but are much more complex.
24 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
25 Hybrid networks: decoding (recap). Recall (Lecture 1) that we choose the decoder output as the optimal word sequence $\hat{w}$ for an observation sequence $o$:
$\hat{w} = \arg\max_{w \in \Sigma} \Pr[w \mid o]$ (30)
$= \arg\max_{w \in \Sigma} \Pr[o \mid w]\Pr[w]$ (31)
and
$\Pr(o \mid w) = \sum_{d,c,p} \Pr(o \mid c)\Pr(c \mid p)\Pr(p \mid w)$ (32)
where $p$ is the phone sequence and $c$ is the CD state sequence.
26 Hybrid neural network decoding. Now we model $P(o \mid c)$ with a neural network instead of a Gaussian mixture model. Everything else stays the same.
$P(o \mid c) = \prod_t P(o_t \mid c_t)$ (33)
$P(o_t \mid c_t) = \frac{P(c_t \mid o_t)P(o_t)}{P(c_t)}$ (34)
$\propto \frac{P(c_t \mid o_t)}{P(c_t)}$ (35)
for observations $o_t$ at time $t$ and a CD state sequence $c_t$. We can ignore $P(o_t)$ since it is the same for all decoding paths. The last term is called the "scaled posterior":
$\log P(o_t \mid c_t) = \log P(c_t \mid o_t) - \alpha \log P(c_t)$ (36)
Empirically (by cross-validation) we actually find better results with a prior smoothing term $\alpha \approx 0.8$.
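A sketch of eqn (36), converting network posteriors into scaled (pseudo-)likelihood scores for decoding, with the prior smoothing term alpha = 0.8 from the slide; the posterior and prior arrays are hypothetical, and the priors would typically be estimated from the frame counts of the training alignment.

```python
import numpy as np

def scaled_log_likelihood(log_posteriors, state_priors, alpha=0.8, eps=1e-12):
    """log P(o_t|c_t) ~ log P(c_t|o_t) - alpha * log P(c_t)  (eqn 36).

    log_posteriors: (T x N) frame-level log P(c|o_t) from the network's softmax.
    state_priors:   (N,) CD-state priors P(c).
    """
    return log_posteriors - alpha * np.log(state_priors + eps)

# Toy example with 2 frames and 3 CD states.
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8]]))
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihood(log_post, priors))
```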
27 Input features. Neural networks can handle high-dimensional, correlated features. Use 26 stacked frames of filterbank inputs (40-dimensional mel-spaced filterbanks). [Figure: example filters learned in the first layer.]
28 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
29 Rough history. Multi-layer perceptron: 1986. Speech recognition with neural networks: superseded by GMMs. Neural network features: 2002. Deep networks: 2006 (Hinton, 2002). Deep networks for speech recognition: good results on TIMIT (Mohamed et al., 2009). Results on large vocabulary systems: 2010 (Dahl et al., 2011). Google launches DNN ASR product: 2011. Dominant paradigm for ASR: 2012 (Hinton et al., 2012).
30 What is new? Fast GPU-based training (distributed CPU-based training is even faster). Pretraining (turns out not to be important). Deeper networks, enabled by faster training. Large datasets. Better machine-learning understanding.
31 State of the art. Google's current speech production systems: 26 frames of 40-dimensional filterbank inputs; 8 hidden layers of 2560 hidden units; rectified linear nonlinearity (Zeiler et al., 2013); 14,000 outputs; 85 million parameters, trained on 2,000 hours of speech data; running quantized with 8-bit integer weights. On Android phones we run a smaller model with 2.7M parameters.
32 Outline. 1. Introduction to neural networks. 2. Neural networks for speech recognition: neural network features for speech recognition; hybrid neural networks; history; variations. 3. Language modelling.
33 Sequence training for neural networks. Neural networks are trained with a frame-level discriminative criterion (cross-entropy $L_{CE}$), far from the minimum-WER criterion we care about. GMM-HMMs trained with sequence-level discriminative training (MMI, bMMI (Povey et al., 2008), MPE, MBR, etc.) outperform maximum-likelihood models. Kingsbury (2009) shows how to compute a gradient for back-propagation from the numerator and denominator statistics of the truth / alternative-hypothesis lattices. Given this outer gradient we use back-propagation to compute parameter updates for the neural network.
34 Pretraining. If we have a small amount of supervised data, we can use unlabelled data to get the parameters into reasonable places by modelling the distribution of the inputs, without knowing the labels. Pretraining is done layer by layer, so it is faster than supervised training. There are several methods: contrastive-divergence RBM training; autoencoders; greedy layerwise training [actually supervised]; but none seems necessary for large speech corpora.
35 Alternative nonlinearities.
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$ (37)
Tanh: $\sigma(z) = \tanh(z)$ (38)
ReLU: $\sigma(z) = \max(z, 0)$ (39)
Softsign: $\sigma(z) = \frac{z}{1 + |z|}$ (40)
Softplus: $\sigma(z) = \log(1 + e^z)$ (41)
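The five nonlinearities of eqns (37)-(41), written out as a quick numpy reference sketch (not from the slides).

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))   # (37)
def tanh_(z):    return np.tanh(z)                  # (38)
def relu(z):     return np.maximum(z, 0.0)          # (39)
def softsign(z): return z / (1.0 + np.abs(z))       # (40)
def softplus(z): return np.log1p(np.exp(z))         # (41)

z = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh_, relu, softsign, softplus):
    print(f.__name__, np.round(f(z), 3))
```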
36 Alternative nonlinearities. Note: ReLU gives sparse activations. The ReLU gradient is zero for $z < 0$ and one for $z > 0$, so propagated gradients don't attenuate as much as with other nonlinearities. ReLU and softplus are unbounded; gradients asymptote differently for the other nonlinearities.
37 Neural network variants. Many variations. Convolutional neural networks (Abdel-Hamid et al., 2012): convolve a filter with the input; weight sharing saves parameters and gives invariance to frequency shifts. Recurrent neural networks: take one frame at a time but store a history of the previous frames, so they could theoretically model long-term context. Long Short-Term Memory (Graves et al., 2013): a successful specialization of the recurrent neural network, with complex memory cells.
38 Recurrent neural networks. A recurrent neural network has additional output nodes which are copied back to its inputs with a time delay (Robinson et al., 1993). Training is with back-propagation through time. [Diagram: inputs $x_1, \dots, x_4$, outputs $y_1, \dots, y_5$, recurrent units $r_1, \dots, r_6$.]
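A sketch (not from the slides) of a simple recurrent forward pass. Note it feeds a hidden state back rather than the output nodes the slide describes, so it is an Elman-style illustration of the same idea of carrying context across frames; all sizes and weights are illustrative.

```python
import numpy as np

def rnn_forward(X, W_in, W_rec, W_out):
    """Run a simple recurrent network over a sequence of frames X (T x d).
    The recurrent state from frame t-1 is combined with the input at frame t."""
    T = X.shape[0]
    r = np.zeros(W_rec.shape[0])                   # recurrent state, initially zero
    outputs = []
    for t in range(T):
        r = np.tanh(W_in @ X[t] + W_rec @ r)       # combine frame and delayed state
        outputs.append(W_out @ r)                  # per-frame output y(t)
    return np.array(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # 5 frames, 3 features each
W_in, W_rec, W_out = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 4))
print(rnn_forward(X, W_in, W_rec, W_out).shape)    # (5, 2): one output vector per frame
```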
39 Neural network language modelling. Model $P(w_n \mid w_{n-1}, w_{n-2}, w_{n-3}, \dots)$ with a neural network instead of an n-gram (pure frequency counts with back-off). Simply train a softmax over each $w_n$, using an input representation of $w_{n-1}, w_{n-2}, w_{n-3}, \dots$. Even more effectively, train a recurrent neural network (Mikolov et al., 2010). This leads to word embeddings: a linear projection of sparse word identities (O(millions)) into a lower-dimensional (O(hundreds)) dense vector space. It is easy to add other features (class, part-of-speech). Best performance comes from combination with an n-gram. Real-time decoding is hard, though much of the performance can be retained when the knowledge is extracted and stored in a WFST (Arisoy et al., 2013).
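A minimal sketch (not from the slides) of the feed-forward case: embed the previous three words, concatenate the embeddings, and apply a softmax over the vocabulary. All sizes, names and parameters are illustrative; a real system would train them by back-propagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def nnlm_probs(context_ids, E, W_h, W_out):
    """P(w_n | w_{n-1}, w_{n-2}, w_{n-3}) from a tiny feed-forward NN LM.

    E:     (V x d) word-embedding matrix (dense projection of sparse word ids).
    W_h:   hidden weights over the concatenated context embeddings.
    W_out: (V x h) output weights feeding the softmax over the vocabulary.
    """
    h_in = np.concatenate([E[i] for i in context_ids])   # embed and concatenate
    h = np.tanh(W_h @ h_in)                              # hidden layer
    return softmax(W_out @ h)                            # distribution over next word

V, d, h = 10, 4, 8                                       # toy vocabulary and layer sizes
rng = np.random.default_rng(0)
E, W_h, W_out = rng.normal(size=(V, d)), rng.normal(size=(h, 3 * d)), rng.normal(size=(V, h))
print(nnlm_probs([3, 7, 1], E, W_h, W_out).sum())        # sums to one
```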
40 Bibliography.
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., and Penn, G. (2012). Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In ICASSP. IEEE.
Arisoy, E., Chen, S. F., Ramabhadran, B., and Sethy, A. (2013). Converting neural network language models into back-off language models for efficient decoding in automatic speech recognition. In ICASSP. IEEE.
Dahl, G., Yu, D., Deng, L., and Acero, A. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs. In ICASSP.
Graves, A., Jaitly, N., and Mohamed, A. (2013). Hybrid speech recognition with deep bidirectional LSTM. In ASRU.
Grezl, Karafiat, and Cernocky (2007). Neural network topologies and bottleneck features. Speech Recognition.
Hermansky, H., Ellis, D., and Sharma, S. (2000). Tandem connectionist feature extraction for conventional HMM systems. In ICASSP.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation.
Kingsbury, B. (2009). Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. In ICASSP.
Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., and Khudanpur, S. (2010). Recurrent neural network based language model. In Interspeech.
Mohamed, A., Dahl, G., and Hinton, G. (2009). Deep belief networks for phone recognition. In NIPS.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., and Visweswariah, K. (2008). Boosted MMI for model and feature-space discriminative training. In Proc. ICASSP.
Robinson, A. J., Almeida, L., Boite, J.-M., Bourlard, H., Fallside, F., Hochberg, M., Kershaw, D., Kohn, P., Konig, Y., Morgan, N., Neto, J. P., Renals, S., Saerens, M., and Wooters, C. (1993). A neural network based, speaker independent, large vocabulary, continuous speech recognition system: The Wernicke project. In Proc. Eurospeech '93.
Rosenblatt, F. (1957). The perceptron: a perceiving and recognizing automaton. Technical report, Cornell Aeronautical Laboratory.
Rumelhart, D. E., Hinton, G., and Williams, R. (1986). Learning representations by back-propagating errors. Nature, 323(6088).
Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. (2013). On rectified linear units for speech processing. In ICASSP.