Deep Learning for Speech Recognition. Hung-yi Lee

1 Deep Learning for Speech Recognition Hung-yi Lee

2 Outline
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

3 Conventional Speech Recognition

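4 Machine Learning helps: Speech recognition maps an acoustic input X to a word sequence W (for example, 天氣真好, "The weather is really nice"). This is a structured learning problem.
Evaluation function: $F(X, W) = P(W \mid X)$
Inference: $\tilde{W} = \arg\max_W F(X, W) = \arg\max_W P(W \mid X) = \arg\max_W \frac{P(X \mid W)\, P(W)}{P(X)} = \arg\max_W P(X \mid W)\, P(W)$
$P(X \mid W)$: Acoustic Model, $P(W)$: Language Model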

5 Input Representation: Audio is represented by a vector sequence $X = x_1, x_2, x_3, \ldots$, where each $x_t$ is a 39-dimensional MFCC vector.

6 Input Representation - Splice: To take some temporal information into account, each MFCC frame is spliced (concatenated) with its neighboring frames.
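Splicing can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's implementation: the function name `splice`, the context width of 4 frames per side, and the choice to repeat the edge frames as padding are all assumptions.

```python
import numpy as np

def splice(frames, context=4):
    """Concatenate each frame with `context` frames on each side.

    frames: array of shape (num_frames, dim), e.g. 39-dim MFCC vectors.
    Returns an array of shape (num_frames, (2 * context + 1) * dim).
    """
    num_frames, _ = frames.shape
    # Assumption: repeat the first/last frame so edge frames get full context.
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    # windows[i][t] is the frame at offset (i - context) from frame t.
    windows = [padded[i:i + num_frames] for i in range(2 * context + 1)]
    return np.concatenate(windows, axis=1)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 39))   # stand-in for real MFCC features
spliced = splice(mfcc, context=4)
print(spliced.shape)  # (100, 351)
```

With 4 frames of context on each side, every 39-dimensional frame becomes a 9 x 39 = 351-dimensional input vector.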

7 Phoneme: The phoneme is the basic unit. A lexicon maps each word to a sequence of phonemes, and different words can correspond to the same phonemes. Example: "what do you think" -> hh w aa t | d uw | y uw | th ih ng k

8 State: Each phoneme corresponds to a sequence of states. Example: "what do you think"
Phone: hh w aa t d uw y uw th ih ng k
Tri-phone: ... t-d+uw, d-uw+y, uw-y+uw, y-uw+th ...
State: t-d+uw1, t-d+uw2, t-d+uw3, d-uw+y1, d-uw+y2, d-uw+y3

9 State: Each state has a stationary distribution for acoustic features, modeled by a Gaussian Mixture Model (GMM). For example, state t-d+uw1 has $P(x \mid \text{t-d+uw1})$ and state d-uw+y3 has $P(x \mid \text{d-uw+y3})$.

10 State: Each state has a stationary distribution for acoustic features. Tied-state: the pointers for $P(x \mid \text{d-uw+y3})$ and $P(x \mid \text{t-d+uw1})$ hold the same address, so the two states share one distribution.

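11 Acoustic Model: $\tilde{W} = \arg\max_W P(X \mid W)\, P(W)$. For W = "what do you think?", $P(X \mid W) = P(X \mid S)$, where S is the state sequence of W (a b c d e in the example) and $X = x_1\, x_2\, x_3\, x_4\, x_5$. Assume we also know the alignment $h = s_1 \cdots s_T$ (here $s_1 \cdots s_5$), which assigns one state to each feature vector. Then
$P(X \mid S, h) = \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$
where $P(s_t \mid s_{t-1})$ is the transition probability and $P(x_t \mid s_t)$ is the emission probability.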

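12 Acoustic Model: $\tilde{W} = \arg\max_W P(X \mid W)\, P(W)$. For W = "what do you think?", $P(X \mid W) = P(X \mid S)$, with state sequence S: a b c d e and features $X = x_1 \cdots x_5$. Actually, we don't know the alignment $s_1 \cdots s_T$, so we take the best one:
$P(X \mid S) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$ (Viterbi algorithm)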
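The maximization over alignments can be computed with dynamic programming. The following is a minimal sketch of the Viterbi algorithm in the log domain; the function name, the explicit initial-state distribution `log_init`, and the toy two-state numbers are assumptions for illustration and do not come from the slides.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Best state path and its log score.

    log_init[j]     : log P(s_1 = j)                 (assumed initial distribution)
    log_trans[i, j] : log P(s_t = j | s_{t-1} = i)   (transition)
    log_emit[t, j]  : log P(x_t | s_t = j)           (emission)
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # cand[i, j]: come from i, go to j
        back[t] = np.argmax(cand, axis=0)      # best predecessor of each state j
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score))

# Toy example with 2 states and 3 frames (all numbers are made up).
init = np.log(np.array([0.9, 0.1]))
trans = np.log(np.array([[0.6, 0.4], [0.1, 0.9]]))
emit = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]))
path, score = viterbi(init, trans, emit)
print(path)  # [0, 1, 1]
```

Working with log probabilities turns the product over frames into a sum, which avoids numerical underflow on long utterances.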

13 How to use Deep Learning?

14 People imagine: acoustic features -> DNN -> "Hello". This cannot be true! A DNN can only take fixed-length vectors as input.

15 What DNN can do is: DNN input: one acoustic feature $x_i$. DNN output: the probability of each state, $P(a \mid x_i)$, $P(b \mid x_i)$, $P(c \mid x_i)$, ... Size of output layer = number of states.

16 Low rank approximation: W is the M x N weight matrix of the output layer, where N is the size of the last hidden layer and M is the size of the output layer (the number of states). M can be large if the outputs are the states of tri-phones.

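17 Low rank approximation: $W \approx U V$, where W is M x N, U is M x K, V is K x N, and K < M, N. This gives fewer parameters. The single output layer W is replaced by a linear layer V (from N to K units) followed by the output layer U (from K to M units).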
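The parameter saving is easy to check numerically. The sketch below factors a weight matrix with a truncated SVD, which is one common way to obtain U and V; the slide itself does not say how the factors are obtained, so the SVD choice, the function name, and the sizes M, N, K are assumptions.

```python
import numpy as np

def low_rank_factors(W, k):
    """Factor W (M x N) into U_k (M x k) and V_k (k x N) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k] * s[:k]   # fold the singular values into U
    V_k = Vt[:k, :]
    return U_k, V_k

rng = np.random.default_rng(0)
M, N, K = 1000, 256, 32          # hypothetical layer sizes
W = rng.standard_normal((M, N))  # stand-in for a trained output layer
U_k, V_k = low_rank_factors(W, K)

params_full = M * N              # 256000
params_low = K * (M + N)         # 40192
print(params_full, params_low)
```

The factored layer has K(M + N) parameters instead of MN, so the saving grows as K shrinks relative to M and N.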

18 How we use deep learning: There are three ways to use DNN for acoustic modeling. Way 1: Tandem. Way 2: DNN-HMM hybrid. Way 3: End-to-end. The three ways are arranged along an axis labeled "Efforts for exploiting deep learning".

19 How to use Deep Learning? Way 1: Tandem

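20 Way 1: Tandem system: The DNN takes an acoustic feature $x_i$ and outputs $P(a \mid x_i)$, $P(b \mid x_i)$, $P(c \mid x_i)$, ... (size of output layer = number of states). This output is used as a new feature, which becomes the input of your original speech recognition system. The output of the last hidden layer or of a bottleneck layer can also be used.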

21 How to use Deep Learning? Way 2: DNN-HMM hybrid

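22 Way 2: DNN-HMM Hybrid
$\tilde{W} = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)$
$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$
The DNN takes $x_t$ and gives $P(s_t \mid x_t)$, so the emission term is rewritten as
$P(x_t \mid s_t) = \frac{P(x_t, s_t)}{P(s_t)} = \frac{P(s_t \mid x_t)\, P(x_t)}{P(s_t)}$
where $P(s_t \mid x_t)$ comes from the DNN and $P(s_t)$ is counted from training data.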

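23 Way 2: DNN-HMM Hybrid
$P(X \mid W) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(x_t \mid s_t)$
With the DNN (parameters $\theta$) this becomes
$P(X \mid W; \theta) \approx \max_{s_1 \cdots s_T} \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, \frac{P(s_t \mid x_t)}{P(s_t)}$
where $P(s_t \mid s_{t-1})$ comes from the original HMM, $P(s_t \mid x_t)$ comes from the DNN, and $P(s_t)$ is counted from training data. The factor $P(x_t)$ is dropped because it is the same for every state sequence and does not change the maximization. This assembled vehicle works.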
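The division by the state prior can be sketched as follows. This is a minimal illustration under assumed inputs: `log_posteriors` stands for the log of the DNN outputs $P(s_t \mid x_t)$, `state_counts` for how often each state occurs in the aligned training data, and the function name is hypothetical.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts):
    """log( P(s | x_t) / P(s) ) for every frame t and state s.

    log_posteriors: (num_frames, num_states) log DNN outputs.
    state_counts:   (num_states,) state occurrence counts from training data.
    """
    counts = np.asarray(state_counts, dtype=float)
    log_priors = np.log(counts / counts.sum())
    return log_posteriors - log_priors   # broadcast over frames

# Toy example with 3 states and 2 frames (numbers are made up).
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.1, 0.8]]))
counts = np.array([700, 200, 100])
scaled = scaled_log_likelihoods(log_post, counts)
print(np.exp(scaled[1]))
```

These scaled values take the place of the GMM emission scores $P(x_t \mid s_t)$ in the HMM decoder.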

24 Way 2: DNN-HMM Hybrid - Sequential Training
$\tilde{W} = \arg\max_W P(X \mid W; \theta)\, P(W)$
Given training data $(X_1, W_1), (X_2, W_2), \ldots, (X_r, W_r), \ldots$, fine-tune the DNN parameters $\theta$ such that $P(X_r \mid W_r; \theta)\, P(W_r)$ increases and $P(X_r \mid W; \theta)\, P(W)$ decreases (W is any word sequence different from $W_r$).

25 How to use Deep Learning? Way 3: End-to-end

26 Way 3: End-to-end - Character: Input: acoustic features (spectrograms). Output: characters (and space) + null (~). No phoneme and lexicon (no OOV problem). Ref: A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv, v2, 2014.

27 Way 3: End-to-end - Character: Example output: the characters of HIS FRIEND S appear one at a time, with the null symbol (~) emitted on the frames in between. Ref: Graves, Alex, and Navdeep Jaitly. "Towards end-to-end speech recognition with recurrent neural networks." Proceedings of the 31st International Conference on Machine Learning (ICML-14).
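Turning a per-frame output with null symbols into text can be sketched as below. The slide does not spell out the rule, so this assumes the standard CTC collapsing rule used in the cited work: merge consecutive repeats first, then delete the null symbol. The function name and the sample strings are hypothetical.

```python
from itertools import groupby

def collapse(frame_labels, null="~"):
    """Merge consecutive repeated labels, then drop the null symbol."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != null)

print(collapse("~HH~II~S~"))  # HIS
print(collapse("L~L"))        # LL  (null separates a genuine double letter)
print(collapse("LL"))         # L   (repeats without null are merged)
```

The null symbol is what lets the network output the same character twice in a row, as in the second example.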

28 Way 3: End-to-end - Word? Output layer: ~50k words in the lexicon (apple, bog, cat, ...). Input layer: size = 200 times the size of one acoustic feature, padded with zeros. Other systems are used to get the word boundaries. Ref: Bengio, Samy, and Georg Heigold. "Word embeddings for speech recognition." Interspeech.

29 Why Deep Learning?

31 Deeper is better? Word error rate (WER) is compared between networks with multiple layers and a network with one hidden layer. Deeper is better: for a fixed number of parameters, a deep model is clearly better than a shallow one. Ref: Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech.

32 What does DNN do? Speaker normalization is automatically done in DNN. Figure: input acoustic feature (MFCC) compared with the 1st hidden layer. Ref: A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP.

33 What does DNN do? Speaker normalization is automatically done in DNN. Figure: input acoustic feature (MFCC) compared with the 8th hidden layer. Ref: A. Mohamed, G. Hinton, and G. Penn, "Understanding how Deep Belief Networks Perform Acoustic Modelling," in ICASSP.

34 What does DNN do? In ordinary acoustic models, all the states are modeled independently. This is not an effective way to model the human voice: the sound of a vowel is controlled by only a few factors (vowel chart axes: high/low and front/back).

35 What does DNN do? The output of a hidden layer is reduced to two dimensions; the vowels /i/, /e/, /a/, /o/, /u/ are plotted against the high/low and front/back axes. The lower layers detect the manner of articulation. All the states share the results from the same set of detectors, so the parameters are used effectively. Ref: Vu, Ngoc Thang, Jochen Weiner, and Tanja Schultz. "Investigating the Learning Effect of Multilingual Bottle-Neck Features for ASR." Interspeech. 2014.

36 Speaker Adaptation

37 Speaker Adaptation: Speaker adaptation uses different models to recognize the speech of different speakers: collect the audio data of each speaker and train a DNN model for each speaker. Challenge: limited data for training. There is not enough data to train a DNN model directly, and not enough even to fine-tune a speaker-independent DNN model.

38 Categories of Methods
Conservative training: re-train the whole DNN with some constraints.
Transformation methods: only train the parameters of one layer.
Speaker-aware training: do not really change the DNN parameters.
Moving down the list, the methods need less training data.

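39 Conservative Training: A DNN is first trained on audio data of many speakers. It is used as the initialization of a second DNN, which is then trained on a little data from the target speaker under a constraint: either the outputs of the two networks stay close, or their parameters stay close.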
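The "parameters stay close" constraint can be sketched as a penalty added to the training loss. The squared-distance form, the function names, and the `weight` value are assumptions; the slide only says the parameters should be close.

```python
import numpy as np

def conservative_penalty(params, init_params, weight=0.01):
    """Penalty that grows as parameters move away from their initial values."""
    return weight * sum(float(np.sum((p - p0) ** 2))
                        for p, p0 in zip(params, init_params))

def conservative_penalty_grads(params, init_params, weight=0.01):
    """Gradient of the penalty with respect to each parameter array."""
    return [2.0 * weight * (p - p0) for p, p0 in zip(params, init_params)]

# Toy example: two parameter arrays, one moved away from its initial value.
init = [np.zeros((2, 2)), np.zeros(3)]
current = [np.ones((2, 2)), np.zeros(3)]
print(conservative_penalty(current, init, weight=0.5))  # 2.0
```

During adaptation the penalty would be added to the usual loss on the target speaker's data, so that a small data set cannot pull the network far from the speaker-independent model.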

40 Transformation methods: Add an extra layer with weights $W_a$ between layer i and layer i+1. Fix all the other parameters (including the original weights W) and train only $W_a$ with a little data from the target speaker.
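A minimal sketch of inserting such an extra layer is shown below. It assumes a linear extra layer initialized to the identity matrix, so that the adapted network starts out identical to the original one; the identity initialization, the ReLU activation, and all names and sizes are assumptions rather than details from the slides.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer_forward(h_i, W, b, W_a=None):
    """Compute layer i+1 from layer i, optionally through an extra linear layer."""
    if W_a is not None:
        h_i = W_a @ h_i          # extra layer: the only part trained for the speaker
    return relu(W @ h_i + b)     # original layer: kept fixed during adaptation

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 4))  # hypothetical fixed weights between layers
b = np.zeros(5)
h = rng.standard_normal(4)       # hypothetical output of layer i

W_a = np.eye(4)                  # assumed identity initialization
before = layer_forward(h, W, b)
after = layer_forward(h, W, b, W_a)
print(np.allclose(before, after))  # True
```

Only the entries of `W_a` would be updated on the target speaker's data, which keeps the number of adapted parameters small.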

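41 Transformation methods: Add the extra layer between the input and the first layer. With splicing, there are two options: one extra layer with a single larger $W_a$ over the whole spliced input, or extra layers that apply the same smaller $W_a$ to each spliced frame. A larger $W_a$ needs more data; a smaller $W_a$ needs less data.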

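42 Transformation methods: SVD bottleneck adaptation. The output layer is factored into V (K x N) and U (M x K) with a K-dimensional linear layer in between, and the extra layer $W_a$ is inserted at that linear layer. K is usually small.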

43 Speaker-aware Training: Can also be noise-aware or device-aware training. From lots of mismatched data (data of Speaker 1, data of Speaker 2, data of Speaker 3, ...), speaker information is extracted as fixed-length, low-dimensional vectors. Text transcription is not needed for extracting the vectors.

44 Speaker-aware Training: In the training data, the acoustic features of each speaker (Speaker 1, Speaker 2, ...) are appended with that speaker's information features. In the testing data, the acoustic feature is appended with the speaker information in the same way. All the speakers use the same DNN model; different speakers are augmented by different features.
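Appending the speaker information to every frame can be sketched as follows. The function name, the 39-dimensional frames, and the 10-dimensional speaker vector are assumptions for illustration; how the speaker vector itself is extracted is outside this sketch.

```python
import numpy as np

def append_speaker_info(frames, speaker_vector):
    """Append the same fixed-length speaker vector to every acoustic frame.

    frames:         (num_frames, feat_dim) acoustic features of one speaker.
    speaker_vector: (spk_dim,) fixed-length vector describing that speaker.
    """
    tiled = np.tile(speaker_vector, (frames.shape[0], 1))
    return np.concatenate([frames, tiled], axis=1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 39))   # stand-in acoustic features
spk_vec = rng.standard_normal(10)         # stand-in speaker vector
augmented = append_speaker_info(frames, spk_vec)
print(augmented.shape)  # (100, 49)
```

The same function is applied at training and test time, so a single DNN sees inputs of the same size for every speaker.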

45 Multi-task Learning

46 Multitask Learning: The multi-layer structure makes DNN suitable for multitask learning. Two structures are shown: in one, Task A and Task B share a single input feature; in the other, there is a separate input feature for task A and for task B.

47 Multitask Learning - Multilingual: From shared acoustic features, the network has separate outputs for the states of French, German, Spanish, Italian, and Mandarin. Human languages share some common characteristics.

48 Multitask Learning - Multilingual: Figure: character error rate versus hours of training data for Mandarin, comparing "Mandarin only" with "With European Language". Ref: Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." Acoustics, Speech and Signal Processing (ICASSP), 2013.

49 Multitask Learning - Different units: From shared acoustic features, the network predicts two kinds of units, A and B. Options: A = state, B = phoneme; A = state, B = gender; A = state, B = grapheme (character).

50 Deep Learning for Acoustic Modeling New acoustic features

51 MFCC: Waveform -> DFT -> spectrogram -> filter bank -> log -> DCT -> MFCC -> input of DNN

52 Filter-bank Output: Waveform -> DFT -> spectrogram -> filter bank -> log -> input of DNN. Kind of standard now.

53 Spectrogram: Waveform -> DFT -> spectrogram -> input of DNN. Common today. 5% relative improvement over filter-bank output. Ref: Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B., Learning filter banks within a deep neural network framework, In Automatic Speech Recognition and Understanding (ASRU), 2013

54 Spectrogram

55 Waveform? Waveform -> input of DNN. If this succeeds, there is no need for Signal & Systems. People tried, but it is not better than the spectrogram yet, so you still need to take Signal & Systems. Ref: Tüske, Z., Golik, P., Schlüter, R., & Ney, H., Acoustic modeling with deep neural networks using raw time signal for LVCSR, In INTERSPEECH 2014

56 Waveform?

57 Convolutional Neural Network (CNN)

58 CNN: Speech can be treated as images. A spectrogram is an image with time on one axis and frequency on the other.

59 CNN: Replace the DNN by a CNN: the spectrogram image goes into the CNN, which outputs the probabilities of states.

60 CNN: Max pooling and maxout. Both take the max over a group of values, for example Max($a_1$, $b_1$) and Max($a_2$, $b_2$).
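The grouped max can be sketched in NumPy. This is a minimal illustration that assumes values are grouped in consecutive pairs such as ($a_1$, $b_1$) and ($a_2$, $b_2$); the function name and the group size are hypothetical.

```python
import numpy as np

def grouped_max(values, group_size=2):
    """Take the max over consecutive groups of `group_size` values."""
    values = np.asarray(values, dtype=float)
    return values.reshape(-1, group_size).max(axis=1)

# [a1, b1, a2, b2] -> [max(a1, b1), max(a2, b2)]
print(grouped_max([1.0, 3.0, -2.0, -5.0]))  # [ 3. -2.]
```

In max pooling the groups are neighboring positions of one feature map; in maxout they are the outputs of different linear units at the same position.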

61 CNN: Ref: Tóth, László. "Convolutional Deep Maxout Networks for Phone Recognition." Interspeech, 2014.

62 Applications in Acoustic Signal Processing

63 DNN for Speech Enhancement: Noisy speech -> DNN -> clean speech, for mobile communication or speech recognition. Demo for speech enhancement:

64 DNN for Voice Conversion: Female -> DNN -> Male. Demo for Voice Conversion:

65 Concluding Remarks

66 Concluding Remarks
Conventional Speech Recognition
How to use Deep Learning in acoustic modeling?
Why Deep Learning?
Speaker Adaptation
Multi-task Deep Learning
New acoustic features
Convolutional Neural Network (CNN)
Applications in Acoustic Signal Processing

67 Thank you for your attention!

68 More Research Related to Speech: Speech recognition (e.g. recognizing 你好, "Hello") provides the core techniques. Related research built on it: spoken content retrieval (e.g. "Find the lectures related to deep learning" over lecture recordings), speech summarization (producing a summary), computer-assisted language learning, information extraction (e.g. from "I would like to leave Taipei on November 2nd"), and dialogue (e.g. "Hi" / "Hello").
