Speech and Language Processing

1 Speech and Language Processing
Lecture 5: Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
08//6

2 Lecture Plan (Shinozaki's part)
I give the first 6 lectures, which are about speech recognition. Through these lectures, the backbone of the latest speech recognition techniques is explained.
1. 0/9 (remote) Speech recognition based on GMM, HMM, and N-gram
2. 0/6 (remote) Maximum likelihood estimation and EM algorithm
3. /5 (@TAIST) Bayesian network and Bayesian inference
4. /5 (@TAIST) Variational inference and sampling
5. /6 (@TAIST) Neural network based acoustic and language models
6. /6 (@TAIST) Weighted finite state transducer (WFST) and speech decoding

3 Today's Topics
Answers for the previous exercises
Neural network based acoustic and language models

4 Answers for the Previous Exercises

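5 Exercise 4.1
When p(x) and y = f(x) are given as follows, obtain the distribution q(y).
$p(x) = 1$ for $0 < x \le 1$ and $0$ otherwise, $y = -\log x$
Answer: $x = \exp(-y)$ and $\frac{dx}{dy} = -\exp(-y)$, so
$q(y) = p(x) \left| \frac{dx}{dy} \right| = \exp(-y)$ for $y \ge 0$
[Figure: histogram of x and histogram of y; vertical axes: # of occurrences]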
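The change-of-variables rule $q(y) = p(x) |dx/dy|$ can be checked numerically by drawing samples, transforming them, and comparing a histogram with the predicted density. The sketch below assumes, purely for illustration, that x is uniform on (0, 1] and y = -log x, for which the rule predicts q(y) = exp(-y). The helper name `histogram_density` and the bin settings are hypothetical.

```python
import math
import random


def histogram_density(samples, lo, hi, bins):
    """Return normalized histogram heights (density estimates), one per bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        if lo <= s < hi:
            counts[int((s - lo) / width)] += 1
    return [c / (len(samples) * width) for c in counts]


random.seed(0)
# Assumed setup: x ~ Uniform(0, 1], y = -log(x).
xs = [1.0 - random.random() for _ in range(200_000)]  # values in (0, 1]
ys = [-math.log(x) for x in xs]

bin_width = 0.5
density = histogram_density(ys, 0.0, 4.0, 8)
for b, d in enumerate(density):
    centre = (b + 0.5) * bin_width
    print(f"y={centre:.2f}  histogram={d:.3f}  exp(-y)={math.exp(-centre):.3f}")
```

The histogram heights should track exp(-y) up to sampling noise and the coarseness of the bins.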

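6 Exercise 4.2
When p(x) and y = f(x) are given as follows, obtain the distribution q(y).
Here p(x) is a Gaussian density N(·), and the answer again follows from $q(y) = p(x) \left| \frac{dx}{dy} \right|$, which is also Gaussian.
[Figure: histogram of x and histogram of y; vertical axes: # of occurrences]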

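7 Exercise 4.3
Show that $N(A \mid B, v) = N(B \mid A, v)$, where $N(x \mid m, v)$ is the Gaussian distribution with mean m and variance v.
Answer: $N(A \mid B, v) = \frac{1}{\sqrt{2\pi v}} \exp\left(-\frac{(A-B)^2}{2v}\right) = \frac{1}{\sqrt{2\pi v}} \exp\left(-\frac{(B-A)^2}{2v}\right) = N(B \mid A, v)$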

8 Neural network

9 Multi-Layer Perceptron (MLP)
Unit of MLP: $y = h\left(\sum_{i=1}^{m} w_i x_i + b\right)$
h: activation function, w: weight, b: bias
MLP consists of multiple layers of the units.
[Figure: input layer, hidden layers (multiple layers), output layer]

10 Activation Functions
Linear function: $h(x) = x$
Unit step function: $h(x) = 1$ if $x \ge 0$, $0$ otherwise
Hinge function: $h(x) = \max(0, x)$
Sigmoid function: $h(x) = \frac{1}{1 + \exp(-x)}$
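The four activation functions can be written as a minimal Python sketch. The function names are illustrative, and the value of the unit step exactly at zero is a convention assumed here (it is set to 1).

```python
import math


def linear(x):
    """Linear (identity) activation."""
    return x


def unit_step(x):
    """Unit step; returning 1 at x == 0 is an assumed convention."""
    return 1.0 if x >= 0 else 0.0


def hinge(x):
    """Hinge activation max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid 1 / (1 + exp(-x)), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


for x in (-2.0, 0.0, 2.0):
    print(x, linear(x), unit_step(x), hinge(x), round(sigmoid(x), 4))
```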

11 Softmax Function
For N variables $x_i$, the softmax function is: $h_i(x) = \frac{\exp(x_i)}{\sum_{j=1}^{N} \exp(x_j)}$
Properties of softmax:
Positive: $h_i(x) > 0$
Sum is one: $\sum_{i=1}^{N} h_i(x) = 1.0$
Expresses a probability distribution
Example: each input vector z is mapped to an output vector h(z) whose elements are positive and sum to one.

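12 Exercise 5.1
Let h be a softmax function having inputs $x_1, x_2, \ldots, x_N$: $h_i(x) = \frac{\exp(x_i)}{\sum_{j} \exp(x_j)}$
Prove that $\sum_{i=1}^{N} h_i(x) = 1.0$.
Answer: $\sum_{i=1}^{N} h_i(x) = \sum_{i=1}^{N} \frac{\exp(x_i)}{\sum_{j} \exp(x_j)} = \frac{\sum_{i} \exp(x_i)}{\sum_{j} \exp(x_j)} = 1$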
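The softmax definition and its sum-to-one property can be illustrated with a short sketch. Subtracting the maximum input before exponentiating is a common numerical-stability trick added here as an assumption; it does not change the result because the common factor cancels in the ratio. The input values are made up for illustration.

```python
import math


def softmax(xs):
    """Softmax of a list of numbers, shifted by max(xs) for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


h = softmax([1.0, 2.0, 3.0])          # illustrative inputs
print([round(v, 3) for v in h])        # → [0.09, 0.245, 0.665]
print(sum(h))                          # very close to 1.0
```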

13 Forward Propagation
Compute the output of the MLP step by step from the input layer to the output layer.
[Figure: input vector → sigmoid layer → sigmoid layer → softmax layer (example layer types)]
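A minimal sketch of forward propagation through a sigmoid, sigmoid, softmax stack follows. The layer sizes and all weight values are hypothetical, chosen only so the example runs; `affine` and `forward` are illustrative names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def affine(weights, biases, inputs):
    """Weighted sum plus bias for every unit of one layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(x, layers):
    """Propagate x through layers given as (weights, biases, kind) tuples."""
    a = x
    for weights, biases, kind in layers:
        z = affine(weights, biases, a)
        a = softmax(z) if kind == "softmax" else [sigmoid(v) for v in z]
    return a


# Hypothetical network: 2 inputs → 3 sigmoid → 2 sigmoid → 2 softmax units.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1], "sigmoid"),
    ([[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]], [0.05, -0.05], "sigmoid"),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], "softmax"),
]
y = forward([1.0, 2.0], layers)
print(y)
```

Because the last layer is a softmax, the printed output is a probability distribution over the two output units.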

14 Parameters of Neural Network
The weights and the bias of each unit need training before the network is used.
$y = h\left(\sum_{i=1}^{N} w_i x_i + b\right) = h(w \cdot x)$
h: activation function
w: weight vector, $w = (w_1, w_2, \ldots, w_N, b)$
x: input vector, $x = (x_1, x_2, \ldots, x_N, 1)$
The bias b can be regarded as one of the weights whose input takes the constant value 1.0.

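15 Principle of NN Training
Adjust the parameters of the MLP so as to minimize the error between the output by the MLP and the reference output vector.
[Figure: training set of input vectors and reference output vectors; each input vector is fed to the MLP and its output is compared with the reference]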

16 Definitions of Errors
Sum of square error (used when the output layer uses linear functions):
$E(W) = \frac{1}{2} \sum_n \| y(X_n, W) - t_n \|^2$
Cross entropy (used when the output layer is a softmax):
$E(W) = -\sum_n \sum_k t_{nk} \ln y_k(X_n, W)$
W: Set of weights in MLP
X_n: Vector of a training sample (input)
t_n: Vector of a training sample (output)
n: Index of training samples
t_nk: Reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
k: Index of output unit
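Both error measures can be computed directly from network outputs and reference vectors. In the sketch below the factor 1/2 in the sum of square error is a common convention taken as an assumption, and the output and target values are invented for illustration.

```python
import math


def sum_of_squares_error(outputs, targets):
    """0.5 * sum over samples n and units k of (y_nk - t_nk)^2."""
    return 0.5 * sum((y - t) ** 2
                     for ys, ts in zip(outputs, targets)
                     for y, t in zip(ys, ts))


def cross_entropy_error(outputs, targets):
    """-sum over n and k of t_nk * ln(y_nk); terms with t_nk == 0 vanish."""
    return -sum(t * math.log(y)
                for ys, ts in zip(outputs, targets)
                for y, t in zip(ys, ts) if t > 0)


# Hypothetical network outputs for two training samples and their references.
outputs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]

sse = sum_of_squares_error(outputs, targets)
ce = cross_entropy_error(outputs, targets)
print(sse, ce)
```

With one-of-K references, the cross entropy reduces to the negative log probability assigned to the correct unit of each sample.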

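17 Gradient Descent
An iterative optimization method:
$x_{t+1} = x_t - \eta \left. \frac{\partial f(x)}{\partial x} \right|_{x = x_t}$
$x_0$: Initial value
$\eta$: Learning rate (small positive value)
[Figure: plot of f(x) with the successive points x_t]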
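The update rule can be tried on a function whose minimum is known. The sketch below assumes the toy objective f(x) = (x - 3)^2, which is not from the lecture; the learning rate and step count are also illustrative choices.

```python
def gradient_descent(grad, x0, learning_rate, steps):
    """Repeat x <- x - learning_rate * grad(x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Assumed toy objective: f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0),
                         x0=0.0, learning_rate=0.1, steps=100)
print(x_min)  # close to 3.0
```

For this quadratic, each step shrinks the distance to the minimum by a factor of 0.8. A learning rate above 1.0 would make the iteration diverge, which is why a small positive value is used.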

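18 MLP Training by Gradient Descent
Define an error measure E(W) for the training samples, e.g. $E(W) = \frac{1}{2} \sum_n \| y(X_n, W) - t_n \|^2$
Initialize parameters $W = \{w_1, w_2, \ldots, w_M\}$
Repeatedly update the parameter set using gradient descent ($\eta$: learning rate):
$w_i(t+1) = w_i(t) - \eta \left. \frac{\partial E(W)}{\partial w_i} \right|_{w_i = w_i(t)}$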

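19 Chain Rule of Differentiation
For $z = f(y)$, $y = g(x)$:
When x, y, z are scalars: $\frac{dz}{dx} = \frac{dz}{dy} \frac{dy}{dx}$
When x, y, z are vectors: $\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x}$, where each factor is a Jacobian matrix, e.g. $\left[ \frac{\partial z}{\partial y} \right]_{ij} = \frac{\partial z_i}{\partial y_j}$
The same rule holds using Jacobian matrices.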

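20 When There Are Branches
For $z = f(g_1(x), g_2(x))$: $\frac{\partial z}{\partial x} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial x}$
Variations:
$g_2(x) = x$: $\frac{\partial z}{\partial x} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f}{\partial g_2}$
$g_2(x) = C$ (independent of x): $\frac{\partial z}{\partial x} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x}$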

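21 Back Propagation (BP)
[Figure: chain network Input → $f_1(\cdot, w_1)$ → $f_2(\cdot, w_2)$ → $f_3(\cdot, w_3)$ → $f_4(\cdot, w_4)$ → Output; the output is compared with the reference r by the error Err; e.g. $f_4$ is a softmax layer and a lower layer is a sigmoid layer]
1. Obtain the value of each node by forward propagation.
2. Obtain the derivatives by backward propagation:
$\frac{\partial Err}{\partial f_3} = \frac{\partial Err}{\partial f_4} \frac{\partial f_4}{\partial f_3}$, $\frac{\partial Err}{\partial f_2} = \frac{\partial Err}{\partial f_3} \frac{\partial f_3}{\partial f_2}$, $\frac{\partial Err}{\partial f_1} = \frac{\partial Err}{\partial f_2} \frac{\partial f_2}{\partial f_1}$
$\frac{\partial Err}{\partial w_4} = \frac{\partial Err}{\partial f_4} \frac{\partial f_4}{\partial w_4}$, $\frac{\partial Err}{\partial w_3} = \frac{\partial Err}{\partial f_3} \frac{\partial f_3}{\partial w_3}$, $\frac{\partial Err}{\partial w_2} = \frac{\partial Err}{\partial f_2} \frac{\partial f_2}{\partial w_2}$, $\frac{\partial Err}{\partial w_1} = \frac{\partial Err}{\partial f_1} \frac{\partial f_1}{\partial w_1}$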
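The two steps of back propagation can be sketched on a chain of four scalar layers. As simplifying assumptions, every layer here is $f_k = \mathrm{sigmoid}(w_k f_{k-1})$ with a single weight, and the error is $Err = \frac{1}{2}(f_4 - r)^2$; the weight values, input, and reference are made up. The result is compared against numerical differentiation as a sanity check.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, ws):
    """Return [f0 = x, f1, ..., fK] with f_k = sigmoid(w_k * f_{k-1})."""
    fs = [x]
    for w in ws:
        fs.append(sigmoid(w * fs[-1]))
    return fs


def error(x, ws, r):
    """Assumed error measure: 0.5 * (output - reference)^2."""
    return 0.5 * (forward(x, ws)[-1] - r) ** 2


def backward(x, ws, r):
    """Return dErr/dw_k for every layer using the chain rule."""
    fs = forward(x, ws)                    # step 1: forward propagation
    d_f = fs[-1] - r                       # dErr/df_K
    grads = [0.0] * len(ws)
    for k in range(len(ws) - 1, -1, -1):   # step 2: go backward
        out, inp = fs[k + 1], fs[k]
        d_pre = d_f * out * (1.0 - out)    # through the sigmoid
        grads[k] = d_pre * inp             # dErr/dw for this layer
        d_f = d_pre * ws[k]                # dErr/d(input of this layer)
    return grads


ws = [0.5, -1.2, 0.8, 1.5]                 # hypothetical weights w1..w4
x, r = 0.7, 1.0                            # hypothetical input and reference
grads = backward(x, ws, r)

eps = 1e-6
numeric = []
for k in range(len(ws)):
    plus = ws[:k] + [ws[k] + eps] + ws[k + 1:]
    minus = ws[:k] + [ws[k] - eps] + ws[k + 1:]
    numeric.append((error(x, plus, r) - error(x, minus, r)) / (2 * eps))

for k, (g, n) in enumerate(zip(grads, numeric), start=1):
    print(f"dErr/dw{k}: backprop={g:.8f}  numeric={n:.8f}")
```

Each backward step reuses the derivative already computed for the layer above, which is what makes the procedure cheaper than differentiating each weight separately.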

22 Feed Forward Neural Network
When the network structure is a DAG (directed acyclic graph), it is called a feed forward network.
The nodes can be ordered in a line so that all connections have the same direction.
Forward/backward propagation can be applied efficiently.

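23 Exercise 5.2
When h(y) and y(x) are given as follows, obtain $\frac{\partial h}{\partial x}$.
$h(y) = \frac{1}{1 + \exp(-y)}$, $y = ax + b$
Answer: $\frac{\partial h}{\partial x} = \frac{\partial h}{\partial y} \frac{\partial y}{\partial x} = \frac{\exp(-y)}{(1 + \exp(-y))^2} \, a = \frac{a \exp(-(ax + b))}{(1 + \exp(-(ax + b)))^2}$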

24 Recurrent Neural Network (RNN)
A neural network having a feedback connection.
Expected to have more powerful modeling performance than a feed forward MLP, but the training is more difficult.
[Figure: Input → input layer → hidden layers (with a delay feedback) → output layer → Output]

25 Unfolding of RNN to Time Axis
[Figure: the RNN with delay (D) is unfolded through time; the input feature sequence runs along the time axis and the reference vector sequence is attached to the outputs]

26 Training of RNN by BP Through Time (BPTT)
Apply BP to the unfolded network.
Regard the input sequence as an input and the output sequence as an output.
[Figure: unfolded network with input sequence $x_1, \ldots, x_4$, hidden states $h_1, \ldots, h_4$ and output sequence $y_1, \ldots, y_4$; back propagation runs from the outputs back through time]

27 Long Short-Term Memory (LSTM)
A type of RNN addressing the vanishing gradient problem.
[Figure: LSTM block with input $x_t$, output $y_t$ and cell state $c_t$; $y_{t-1}$ and $c_{t-1}$ are fed back through delays; forget gate (σ), input gate (σ) and output gate (σ). Legend: tanh = tanh layer with affine transform, σ = sigmoid layer with affine transform, ⊗ = pointwise multiplication, ⊕ = sum]
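One time step of an LSTM can be sketched with scalar signals. The equations below are the commonly used formulation (forget, input and output gates plus a tanh candidate), which is assumed to match the block diagram; the parameter names and values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, y_prev, c_prev, params):
    """One LSTM step on scalars.

    params maps a layer name to (w_x, w_y, b), the weights of its
    affine transform of the current input x and the previous output y_prev.
    """
    def affine(name):
        w_x, w_y, b = params[name]
        return w_x * x + w_y * y_prev + b

    f = sigmoid(affine("forget"))        # forget gate
    i = sigmoid(affine("input"))         # input gate
    o = sigmoid(affine("output"))        # output gate
    g = math.tanh(affine("candidate"))   # candidate cell update
    c = f * c_prev + i * g               # new cell state
    y = o * math.tanh(c)                 # new output
    return y, c


# Hypothetical parameters and input sequence.
params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}
y, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5, 0.0]:
    y, c = lstm_step(x, y, c, params)
    print(f"x={x:+.1f}  y={y:+.4f}  c={c:+.4f}")
```

The cell state is updated additively (`f * c_prev + i * g`), which is the path that lets gradients survive over many time steps.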

28 Convolutional Neural Network (CNN)
A type of feed forward neural network with parameter sharing and connection constraint.
A filter is shifted and applied at different positions.
[Figure: Input → convolution layer: Filter (1) → Activation map (1), Filter (2) → Activation map (2), ..., Filter (N) → Activation map (N) → pooling layer → next convolution layer etc.]
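The idea of shifting one shared filter over the input and then pooling can be sketched in one dimension. The signal, the filter, and the pooling size are illustrative assumptions; as in most CNN implementations, the "convolution" below does not flip the filter (it is a cross-correlation).

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values, size):
    """Non-overlapping max pooling; an incomplete last window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


signal = [0, 1, 2, 3, 2, 1, 0, 1, 2]    # hypothetical input
edge_filter = [-1, 0, 1]                # hypothetical filter (shared weights)

activation = conv1d(signal, edge_filter)
pooled = max_pool1d(activation, 2)
print(activation)  # → [2, 2, 0, -2, -2, 0, 2]
print(pooled)      # → [2, 0, 0]
```

The same three weights are reused at every position, which is the parameter sharing, and each output depends only on three neighbouring inputs, which is the connection constraint.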

29 Deep Neural Network
(Just a) neural network with many hidden layers (5 < # of layers)
Training was difficult until recently
Improvements in training algorithms: Pre-training, Dropout
Improvements in computer hardware: GPGPU
Early 2010s: Large performance gains were reported for large vocabulary speech recognition → Deep Learning Fever!
Cf: G. Hinton, A Practical Guide to Training Restricted Boltzmann Machines

30 Neural network based acoustic model

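31 Frame-Level Vowel Recognition Using MLP
[Figure: Input: speech feature vector (e.g. MFCC) → sigmoid function → sigmoid function → softmax function → p(あ), p(い), p(う), p(え), p(お), i.e. the posterior probabilities of the five Japanese vowels /a/, /i/, /u/, /e/, /o/]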

32 Exercise 5.3
Obtain the recognition result (yes or no). You may use a calculator.
[Figure: small network with a sigmoid layer followed by a softmax layer that outputs P(yes) and P(no)]

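33 Combination of HMM and MLP
GMM-HMM: the output likelihood $p(X \mid s)$ of each HMM state s is computed by a GMM.
MLP-HMM: the softmax layer of an MLP gives $p(s \mid X)$, and the likelihood is obtained as $p(X \mid s) = \frac{p(s \mid X) \, p(X)}{p(s)} \propto \frac{p(s \mid X)}{p(s)}$
[Figure: HMM with states $s_0, \ldots, s_4$, shown once with GMM outputs (GMM-HMM) and once with MLP softmax outputs (MLP-HMM)]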
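In an MLP-HMM hybrid, state posteriors from the softmax layer are commonly converted to scaled likelihoods by dividing by the state priors, since p(X) is the same for every state. The sketch below works in the log domain; the posterior and prior values are hypothetical, and in practice the priors are usually estimated from state frequencies in the training alignments.

```python
import math


def scaled_log_likelihoods(posteriors, priors):
    """Return log p(s|X) - log p(s) for each state s."""
    return [math.log(p_sx) - math.log(p_s)
            for p_sx, p_s in zip(posteriors, priors)]


posteriors = [0.70, 0.20, 0.10]   # hypothetical softmax outputs p(s|X)
priors = [0.50, 0.25, 0.25]       # hypothetical state priors p(s)

scores = scaled_log_likelihoods(posteriors, priors)
for s, score in enumerate(scores):
    print(f"state {s}: scaled log-likelihood = {score:+.4f}")
```

These scores can then stand in for the GMM log-likelihoods during HMM decoding.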

34 MLP-HMM Based Phone Recognizer
[Figure: HMM from Start through the phone states /a/, /i/, /N/ to End; the state scores come from an MLP: input speech feature → sigmoid → sigmoid → softmax]

35 Neural network based language model

36 Word Vector
One-of-K representation of a word for a fixed vocabulary
Word        ID  1-of-K
Apple       1   <1,0,0,0,0,0,0>
Banana      2   <0,1,0,0,0,0,0>
Cherry      3   <0,0,1,0,0,0,0>
Durian      4   <0,0,0,1,0,0,0>
Orange      5   <0,0,0,0,1,0,0>
Pineapple   6   <0,0,0,0,0,1,0>
Strawberry  7   <0,0,0,0,0,0,1>
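The one-of-K encoding of the seven-word vocabulary can be sketched as follows. The function name `one_of_k` is illustrative, and list positions are 0-based in the code while the IDs in the table start at 1.

```python
VOCABULARY = ["Apple", "Banana", "Cherry", "Durian",
              "Orange", "Pineapple", "Strawberry"]


def one_of_k(word, vocabulary=VOCABULARY):
    """Return a vector with a single 1 at the position of the word."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1   # raises ValueError for unknown words
    return vec


print(one_of_k("Durian"))  # → [0, 0, 0, 1, 0, 0, 0]
```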

37 Word Prediction Using RNN
Input: word t-1 as a 1-of-K vector, e.g. <0,0,0,1,0,0,0>
Output: (distribution of) word t, a vector of probabilities over the vocabulary
[Figure: RNN with a delay (D) feedback connection]

38 RNN Language Model (Unfolded)
P(<s>, Delicious, Big, Red, Apple, </s>)
[Figure: unfolded RNN; inputs <s>, Delicious, Big, Red, Apple; predicted outputs Delicious, Big, Red, Apple, </s>]
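The unfolded RNN scores a sentence by multiplying, at each step, the probability it assigns to the actual next word given the words so far. The sketch below only illustrates that product, done as a sum of logs; the per-step probabilities are invented placeholders, not outputs of a trained model.

```python
import math


def sentence_log_prob(step_probs):
    """Sum of log P(next word | history) over all prediction steps."""
    return sum(math.log(p) for p in step_probs)


# (input word, predicted next word, hypothetical probability from the RNN)
steps = [
    ("<s>", "Delicious", 0.10),
    ("Delicious", "Big", 0.20),
    ("Big", "Red", 0.25),
    ("Red", "Apple", 0.50),
    ("Apple", "</s>", 0.40),
]
log_p = sentence_log_prob([p for _, _, p in steps])
print(log_p, math.exp(log_p))
```

Working with log probabilities avoids numerical underflow for long sentences.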

39 Dialogue System Using Seq2Seq Network
Encoder network: reads the input "What is your name <s>"
Decoder network: generates the output "My name is TS 800 </s>" by sampling from the posterior

40 Evolution of Compute Hardware
2002: Earth Simulator, 40.96 TFLOPS (picture from Wikipedia)
2017: GeForce GTX 1080 Ti, 10.609 TFLOPS, 699 USD (picture from Nvidia.com)
