Speech and Language Processing
Lecture 5: Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
08//6
Lecture Plan (Shinozaki's part)
I give the first 6 lectures about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.
1. 0/9 (remote) Speech recognition based on GMM, HMM, and N-gram
2. 0/6 (remote) Maximum likelihood estimation and EM algorithm
3. /5 (@TAIST) Bayesian network and Bayesian inference
4. /5 (@TAIST) Variational inference and sampling
5. /6 (@TAIST) Neural network based acoustic and language models
6. /6 (@TAIST) Weighted finite state transducer (WFST) and speech decoding
Today's Topics
Answers for the previous exercises
Neural network based acoustic and language models
Answers for the Previous Exercises
Exercise 4.1: When p(x) and y = f(x) are given as follows, obtain the distribution q(y).
$p(x) = 1$ for $0 \le x \le 1$, $0$ otherwise; $y = -\log x$.
Solving for x gives $x = \exp(-y)$, so $\frac{dx}{dy} = -\exp(-y)$, and
$q(y) = p(x)\left|\frac{dx}{dy}\right| = \exp(-y)$ for $y \ge 0$.
[Figure: histograms of sampled x and of y.]
Exercise 4.2: When p(x) and y = f(x) are given as follows, obtain the distribution q(y).
$p(x) = N(x|0,1) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$; $y = 4x$.
Then $x = \frac{y}{4}$, $\frac{dx}{dy} = \frac{1}{4}$, and
$q(y) = p(x)\left|\frac{dx}{dy}\right| = \frac{1}{4\sqrt{2\pi}}\exp\left(-\frac{y^2}{32}\right) = N(y|0,16)$.
[Figure: histograms of sampled x and of y.]
Exercise 4.3: Show that $N(A|B,\sigma^2) = N(B|A,\sigma^2)$, where $N(x|m,v)$ is the Gaussian distribution with mean m and variance v.
$N(A|B,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(A-B)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(B-A)^2}{2\sigma^2}\right) = N(B|A,\sigma^2)$
Neural network
Multi-Layer Perceptron (MLP)
Unit of MLP: each unit computes $y = h\left(\sum_{i=1}^{m} w_i x_i + b\right)$, where h is the activation function, w are the weights, and b is the bias.
An MLP consists of multiple layers of these units: an input layer, hidden layers, and an output layer.
Activation Functions
Linear function: $h(x) = x$
Unit step function: $h(x) = 1$ if $x \ge 0$, $0$ otherwise
Hinge function: $h(x) = \max(0, x)$
Sigmoid function: $h(x) = \frac{1}{1 + \exp(-x)}$
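As a minimal sketch, these four activation functions can be written in Python with NumPy; the function names are my own choices, not from the slide:

```python
import numpy as np

def linear(x):
    return x                                # h(x) = x

def unit_step(x):
    return np.where(x >= 0, 1.0, 0.0)       # h(x) = 1 if x >= 0, else 0

def hinge(x):                               # also known as ReLU
    return np.maximum(0.0, x)               # h(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))         # h(x) = 1 / (1 + exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))   # all outputs lie strictly between 0 and 1
```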
Softmax Function
For N variables $x_1, \dots, x_N$, the softmax function is $h_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$.
Properties of softmax: every output is positive and the outputs sum to one, so the result expresses a probability distribution.
Example: $z = (-1, 2, 1)$ gives $h(z) = (0.0351, 0.7054, 0.2595)$. With more widely separated inputs, almost all of the probability mass falls on the largest component.
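A quick numerical check of these properties, as a sketch (the input vector is the slide's first example):

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick;
    # softmax is unchanged by adding a constant to every input.
    e = np.exp(x - np.max(x))
    return e / e.sum()

z = np.array([-1.0, 2.0, 1.0])
h = softmax(z)
print(h)          # approx. [0.0351, 0.7054, 0.2595]: all positive
print(h.sum())    # 1.0: a valid probability distribution
```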
Exercise 5.1: Let h be a softmax function having inputs $x_1, x_2, \dots, x_N$: $h_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$. Prove that $\sum_{i=1}^{N} h_i = 1$ and $0 \le h_i$.
Answer: $\sum_{i=1}^{N} h_i = \sum_{i=1}^{N} \frac{\exp(x_i)}{\sum_j \exp(x_j)} = \frac{\sum_i \exp(x_i)}{\sum_j \exp(x_j)} = 1$; each $h_i$ is a ratio of positive quantities, so $h_i > 0$.
Forward Propagation
Compute the output of the MLP step by step from the input layer to the output layer: the input vector passes through the hidden layers (e.g. sigmoid layers) and finally through the output layer (e.g. a softmax layer). A minimal sketch is given below.
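This sketch implements the layer stack drawn on the slide (two sigmoid layers and a softmax layer); the layer sizes and random weights are my own placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def forward(x, layers):
    """Propagate x layer by layer: affine transform, then activation."""
    y = x
    for W, b, h in layers:
        y = h(W @ y + b)
    return y

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4), sigmoid),  # hidden layer 1
    (rng.normal(size=(4, 4)), np.zeros(4), sigmoid),  # hidden layer 2
    (rng.normal(size=(3, 4)), np.zeros(3), softmax),  # softmax output layer
]
print(forward(np.array([0.5, -1.0, 2.0]), layers))    # a probability vector
```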
Parameters of a Neural Network
The weights and bias of each unit need training before the network is used: $y = h\left(\sum_{i=1}^{N} w_i x_i + b\right) = h(\tilde{w} \cdot \tilde{x})$, where h is the activation function, $\tilde{w} = (w_1, w_2, \dots, w_N, b)$ is the weight vector, and $\tilde{x} = (x_1, x_2, \dots, x_N, 1)$ is the input vector. The bias b can be regarded as one of the weights, whose input takes a constant value 1.0.
Principle of NN Training
Given a training set of input vectors and reference output vectors, adjust the parameters of the MLP so as to minimize the error between the output of the MLP and the reference output.
Definitions of Errors
Sum-of-squares error, used when the output layer uses linear functions: $E(W) = \frac{1}{2}\sum_n \lVert y(x_n, W) - t_n \rVert^2$
Cross entropy, used when the output layer is a softmax: $E(W) = -\sum_n \sum_k t_{nk} \ln y_k(x_n, W)$
Here W is the set of weights in the MLP, $x_n$ is a training sample (input vector), $t_n$ is its reference output vector, n is the index of training samples, k is the index of output units, and $t_{nk}$ takes 1 if unit k corresponds to the correct output and 0 otherwise.
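Both error measures in a short sketch, for a single training sample with a one-hot reference; the numeric values are illustrative, not from the slide:

```python
import numpy as np

def sum_of_squares(y, t):
    # E = 1/2 sum_n ||y(x_n, W) - t_n||^2, for linear output layers
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy(y, t):
    # E = -sum_n sum_k t_nk ln y_nk, for softmax output layers;
    # with one-hot t, only the correct unit contributes per sample
    return -np.sum(t * np.log(y))

y = np.array([[0.7, 0.2, 0.1]])   # network output for one sample
t = np.array([[1.0, 0.0, 0.0]])   # reference: unit 0 is correct
print(sum_of_squares(y, t))       # 0.5 * (0.09 + 0.04 + 0.01) = 0.07
print(cross_entropy(y, t))        # -ln 0.7, about 0.357
```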
Gradient Descent
An iterative optimization method: starting from an initial value $x_0$, repeat $x_{t+1} = x_t - \eta \left.\frac{\partial f}{\partial x}\right|_{x_t}$, where $\eta$ is the learning rate (a small positive value).
MLP Training by Gradient Descent
Define an error measure E(W) for the training samples, e.g. $E(W) = \frac{1}{2}\sum_n \lVert y(x_n, W) - t_n \rVert^2$. Initialize the parameters $W = \{w_1, w_2, \dots, w_M\}$, then repeatedly update the parameter set using gradient descent: $w_i^{t+1} = w_i^t - \eta \left.\frac{\partial E(W)}{\partial w_i}\right|_{w_i^t}$. A sketch appears below.
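The update rule on a toy one-dimensional function, as a sketch; the function, learning rate, and step count are my own choices:

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Iterate x_{t+1} = x_t - eta * grad(x_t)."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# f(x) = (x - 3)^2 has gradient 2(x - 3) and its minimum at x = 3.
print(gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0))  # close to 3.0
```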
Chain Rule of Differentiation
When x, y, z are scalars with $z = f(y)$ and $y = g(x)$: $\frac{dz}{dx} = f'(g(x))\, g'(x)$.
When x, y, z are vectors, the same rule holds using Jacobian matrices: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y}\, \frac{\partial y}{\partial x}$, where $\frac{\partial y}{\partial x}$ is the Jacobian matrix with entries $\frac{\partial y_i}{\partial x_j}$.
When There Are Branches
If $z = f(y_1, y_2)$ with $y_1 = g_1(x)$ and $y_2 = g_2(x)$, the contributions of the branches add: $\frac{dz}{dx} = \frac{\partial f}{\partial y_1}\, g_1'(x) + \frac{\partial f}{\partial y_2}\, g_2'(x)$.
Variations: an additive constant, $y = g(x) + C$, does not change the derivative; a branch that is independent of x contributes nothing.
Back Propagation (BP)
First obtain the value of each node by forward propagation: the input x passes through layers $f_1, f_2, f_3$ (e.g. sigmoid layers with weights $w_1, w_2, w_3$) and an output layer $f_4$ (e.g. softmax with weights $w_4$), and the error Err is computed between the output and the reference r.
Then obtain the derivatives by backward propagation, applying the chain rule from the error back toward the input: first $\frac{\partial \mathrm{Err}}{\partial f_4}$, then $\frac{\partial \mathrm{Err}}{\partial f_3} = \frac{\partial \mathrm{Err}}{\partial f_4}\frac{\partial f_4}{\partial f_3}$, and so on down to $f_1$.
The weight gradients follow as $\frac{\partial \mathrm{Err}}{\partial w_k} = \frac{\partial \mathrm{Err}}{\partial f_k}\frac{\partial f_k}{\partial w_k}$ for each layer k.
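The whole procedure (forward pass, backward pass, weight update) fits in a short sketch. This is a generic two-layer sigmoid/softmax network with cross-entropy error, not the exact network of the slide; the sizes, initial weights, and learning rate are my own assumptions:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def softmax(x): e = np.exp(x - x.max()); return e / e.sum()

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # sigmoid hidden layer
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # softmax output layer
x = np.array([1.0, -0.5, 0.2])                  # one input sample
t = np.array([1.0, 0.0])                        # one-hot reference

for _ in range(200):
    # Forward propagation: obtain the value of each node.
    z1 = sigmoid(W1 @ x + b1)
    y = softmax(W2 @ z1 + b2)
    # Backward propagation: chain rule from the error to each weight.
    d2 = y - t                        # dErr/da2 for softmax + cross entropy
    dW2, db2 = np.outer(d2, z1), d2
    d1 = (W2.T @ d2) * z1 * (1 - z1)  # propagate through the sigmoid layer
    dW1, db1 = np.outer(d1, x), d1
    # Gradient descent update with learning rate eta = 0.5.
    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= 0.5 * g

print(y)   # probability of the correct class approaches 1
```

The key point is that each layer's derivative is computed from the layer above it, so all gradients are obtained in a single backward sweep.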
Feed-Forward Neural Network
When the network structure is a DAG, it is called a feed-forward network. The nodes can be ordered in a line so that all connections have the same direction, and the forward/backward propagation can be applied efficiently.
Exercise 5.2: When h(y) and y(x) are given as follows, obtain $\frac{dh}{dx}$: $h(y) = \frac{1}{1 + \exp(-y)}$, $y = ax + b$.
Answer: $\frac{dh}{dx} = \frac{dh}{dy}\frac{dy}{dx} = \frac{\exp(-(ax+b))}{\left(1 + \exp(-(ax+b))\right)^2} \cdot a = a\, h(y)\left(1 - h(y)\right)$
Recurrent Neural Network (RNN)
A neural network having a feedback connection: the hidden layers receive the input together with a delayed copy of their own previous output. RNNs are expected to have more powerful modeling performance than a feed-forward MLP, but their training is more difficult.
Unfolding of RNN to Time Axis
By unfolding the feedback loop (delay element D) through time, the RNN becomes a feed-forward network over the input feature sequence, with one copy of the network per time step and a reference vector sequence as the training target.
Training of RNN by BP Through Time (BPTT)
Apply BP to the unfolded network: regard the input sequence $x_1, \dots, x_4$ as a single input, the output sequence $y_1, \dots, y_4$ as a single output, and back-propagate the error through the unrolled hidden layers $h_1, \dots, h_4$. A sketch of the unfolded forward pass is given below.
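This sketch unfolds the recurrence as a plain loop; BPTT is then ordinary back propagation applied to this unrolled computation. Sizes and weights are my own placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
Wx = rng.normal(size=(4, 3))   # input-to-hidden weights
Wh = rng.normal(size=(4, 4))   # hidden-to-hidden feedback weights
b = np.zeros(4)

def rnn_forward(xs):
    """Unfold h_t = sigmoid(Wx x_t + Wh h_{t-1} + b) over the sequence."""
    h, hs = np.zeros(4), []
    for x in xs:
        h = sigmoid(Wx @ x + Wh @ h + b)
        hs.append(h)
    return hs    # one hidden state per time step

xs = [rng.normal(size=3) for _ in range(4)]   # input sequence x_1..x_4
print(rnn_forward(xs)[-1])
```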
Long Short-Term Memory (LSTM)
A type of RNN addressing the gradient vanishing problem. A memory cell $c_t$ is carried across time steps and modified only by gated, mostly additive updates: the forget gate (a sigmoid layer with affine transform) scales $c_{t-1}$, the input gate scales a tanh candidate update, and the output gate scales $\tanh(c_t)$ to produce the output $y_t$:
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_g [x_t, y_{t-1}] + b_g)$, $y_t = o_t \odot \tanh(c_t)$,
where $\odot$ is pointwise multiplication and each gate is $\sigma(W [x_t, y_{t-1}] + b)$ with its own affine transform.
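One step of this recurrence as a sketch; the gate equations are the standard LSTM formulation, while the dimensions and initialization are my own choices:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """One step of the standard LSTM recurrence."""
    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        # One affine transform per gate, applied to [x_t, y_{t-1}].
        self.W = {g: rng.normal(scale=0.1, size=(n_hid, n_in + n_hid))
                  for g in ("f", "i", "o", "g")}
        self.b = {g: np.zeros(n_hid) for g in ("f", "i", "o", "g")}

    def step(self, x, y_prev, c_prev):
        v = np.concatenate([x, y_prev])
        f = sigmoid(self.W["f"] @ v + self.b["f"])   # forget gate
        i = sigmoid(self.W["i"] @ v + self.b["i"])   # input gate
        o = sigmoid(self.W["o"] @ v + self.b["o"])   # output gate
        g = np.tanh(self.W["g"] @ v + self.b["g"])   # candidate cell update
        c = f * c_prev + i * g                       # additive cell-state path
        y = o * np.tanh(c)                           # gated output
        return y, c

cell = LSTMCell(n_in=3, n_hid=4)
y, c = np.zeros(4), np.zeros(4)
for x in np.random.default_rng(1).normal(size=(5, 3)):   # length-5 sequence
    y, c = cell.step(x, y, c)
print(y)
```

Because the cell state is updated additively (scaled by the forget gate rather than repeatedly squashed), gradients survive over many time steps better than in a plain RNN.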
Convolutional Neural Network (CNN)
A type of feed-forward neural network with parameter sharing and connection constraints. In the convolution layer, each filter is shifted and applied at different positions of the input, producing one activation map per filter (filter 1 gives activation map 1, ..., filter N gives activation map N). A pooling layer then reduces each activation map (e.g. taking the maximum over small regions) before the next convolution layer.
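A sketch of one convolution plus max-pooling step with a single filter; the image size, filter values, and pooling size are my own placeholders:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D correlation: slide the filter over every position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r+kh, c:c+kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size regions."""
    H, W = x.shape
    return x[:H - H % size, :W - W % size] \
        .reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))     # toy input
kernel = rng.normal(size=(3, 3))    # one shared filter
amap = conv2d(image, kernel)        # 6x6 activation map
print(max_pool(amap).shape)         # (3, 3) after 2x2 pooling
```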
Deep Neural Network
(Just a) neural network with many hidden layers (5 < # of layers). Training was difficult until recently; two things changed this: improvements in training algorithms (pre-training, dropout) and improvements in computer hardware (GPGPU). Around 2011-2012, large performance gains were reported for large vocabulary speech recognition, starting the "Deep Learning Fever".
Cf: G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines, http://www.cs.toronto.edu/~hinton/absps/guidetr.pdf
Neural network based acoustic model
Frame-Level Vowel Recognition Using MLP
An MLP with sigmoid hidden layers and a softmax output layer assigns a probability to each of the five Japanese vowels: p(a), p(i), p(u), p(e), p(o). The input is the speech feature vector (e.g. MFCC) of one frame.
Exercise 5.3: Obtain the recognition result (yes or no). You may use a calculator.
[Figure: a small network with a sigmoid hidden layer and a softmax output layer computing P(yes) and P(no); the input values and connection weights are given in the figure.]
Combination of HMM and MLP
In a GMM-HMM, the output probability $p(x|s)$ of state s is modeled directly by a GMM. In a hybrid MLP-HMM, the MLP's softmax layer gives the state posterior $p(s|x)$, which Bayes' rule converts into a (scaled) likelihood:
$p(x|s) = \frac{p(s|x)\, p(x)}{p(s)} \propto \frac{p(s|x)}{p(s)}$,
since $p(x)$ is common to all states. The HMM topology (states $s_0, s_1, \dots, s_4$) is the same in both cases.
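The conversion is a one-liner once the state priors have been counted on the training data; the numbers below are illustrative only, not from the slide:

```python
import numpy as np

# Hybrid MLP-HMM: p(x|s) = p(s|x) p(x) / p(s). The factor p(x) is common
# to all states, so the scaled likelihood p(s|x) / p(s) can replace the
# GMM output probability when comparing states.
posterior = np.array([0.7, 0.2, 0.1])   # p(s|x) from the softmax layer
prior = np.array([0.5, 0.3, 0.2])       # p(s) counted on the training data
scaled_likelihood = posterior / prior   # used in place of the GMM p(x|s)
print(scaled_likelihood)
```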
MLP-HMM Based Phone Recognizer
Phone HMMs (/a/, /i/, ..., /N/) are connected in parallel between a start and an end node; the state output probabilities come from an MLP with sigmoid hidden layers and a softmax output layer applied to the input speech features.
Neural network based language model
Word Vector
One-of-K representation of a word for a fixed vocabulary:
word / ID / 1-of-K vector
Apple / 1 / <1,0,0,0,0,0,0>
Banana / 2 / <0,1,0,0,0,0,0>
Cherry / 3 / <0,0,1,0,0,0,0>
Durian / 4 / <0,0,0,1,0,0,0>
Orange / 5 / <0,0,0,0,1,0,0>
Pineapple / 6 / <0,0,0,0,0,1,0>
Strawberry / 7 / <0,0,0,0,0,0,1>
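A sketch of the 1-of-K encoding; note the code uses 0-based indices internally while the slide numbers words from 1:

```python
import numpy as np

vocab = ["Apple", "Banana", "Cherry", "Durian",
         "Orange", "Pineapple", "Strawberry"]
word_id = {w: i for i, w in enumerate(vocab)}   # word -> index

def one_of_k(word):
    """1-of-K (one-hot) vector for a word in the fixed vocabulary."""
    v = np.zeros(len(vocab))
    v[word_id[word]] = 1.0
    return v

print(one_of_k("Cherry"))   # [0. 0. 1. 0. 0. 0. 0.]
```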
Word Prediction Using RNN
Given word t as a 1-of-K input vector (e.g. <0,0,0,1,0,0,0>), the RNN outputs a probability distribution over word t+1; the feedback delay D lets the hidden state summarize the whole word history.
RNN Language Model (Unfolded)
Unfolding the RNN over a sentence computes its probability, e.g. P(<s>, Delicious, Big, Red, Apple, </s>): at each step the previous word is fed in, and the probability of the actual next word is read off the softmax output. A sketch is given below.
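This sketch walks through that computation with a toy RNN; the vocabulary is the slide's example sentence, but the weights are untrained random placeholders, so the printed value only illustrates the procedure:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def softmax(x): e = np.exp(x - x.max()); return e / e.sum()

vocab = ["<s>", "Delicious", "Big", "Red", "Apple", "</s>"]
V, H = len(vocab), 8
rng = np.random.default_rng(0)
Wx = rng.normal(scale=0.1, size=(H, V))   # one-hot word -> hidden
Wh = rng.normal(scale=0.1, size=(H, H))   # hidden feedback
Wo = rng.normal(scale=0.1, size=(V, H))   # hidden -> output vocabulary

def sentence_prob(words):
    """P(w_1..w_T | <s>) = prod_t P(w_t | history); h carries the history."""
    h, logp, prev = np.zeros(H), 0.0, "<s>"
    for w in words:
        x = np.zeros(V)
        x[vocab.index(prev)] = 1.0        # one-hot previous word
        h = sigmoid(Wx @ x + Wh @ h)      # update the history state
        p = softmax(Wo @ h)               # distribution over the next word
        logp += np.log(p[vocab.index(w)])
        prev = w
    return np.exp(logp)

print(sentence_prob(["Delicious", "Big", "Red", "Apple", "</s>"]))
```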
Dialogue System Using a Seq2Seq Network
An encoder network reads the input word sequence ("What is your name"), and a decoder network generates the output word by word by sampling from the posterior ("My name is TS 800 </s>").
Evolution of Computer Hardware
2002: Earth Simulator, 40.96 TFLOPS.
2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD.
(Pictures from Wikipedia and Nvidia.com.)