An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis


1 ICASSP 2017, New Orleans, USA
An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis
Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI, National Institute of Informatics, Japan
contact: we welcome critical comments, suggestions, and discussion

2 ABBREVIATIONS
GMM: Gaussian mixture model
NN: neural network
RNN: recurrent neural network
MDN: mixture density network
RMDN: recurrent mixture density network
AR: autoregressive model
AR-RMDN: autoregressive RMDN (the proposed model)

3 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments
- Conclusion

4 INTRODUCTION — Text-to-Speech
- Based on parametric speech synthesis: text -> text analyzer -> acoustic model -> vocoder -> speech
- This work: acoustic models based on neural networks

5 INTRODUCTION — Acoustic models based on neural networks
- RNN [1]: input textual features x_1, ..., x_5 along the time (frame) axis; the outputs \hat{o}_1, ..., \hat{o}_5 are the generated acoustic features (spectral features, F0, ...)
[1] Fan, Y., Qian, Y., Xie, F.-L., & Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. INTERSPEECH.

6 INTRODUCTION — Acoustic models based on neural networks
- RMDN [2]: input textual features x_1, ..., x_5; the output at each frame is the parameter set M_t of a distribution p(o_t; M_t)
[2] Schuster, M. (1999). Better generative models for sequential data problems: bidirectional recurrent mixture density networks. In Proc. NIPS.

7 INTRODUCTION — Acoustic models based on neural networks
- RMDN using a GMM [3,4]: M_t = \{\omega_t^1, ..., \omega_t^M, \mu_t^1, ..., \mu_t^M, \Sigma_t^1, ..., \Sigma_t^M\}
  p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)
- Generate \hat{o}_t from the GMM using MLPG [5]
[3] Bishop, C. M. (2004). Mixture Density Networks.
[4] Schuster, M. (1999). Better generative models for sequential data problems: bidirectional recurrent mixture density networks. In Proc. NIPS.
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP.
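As a minimal sketch of the per-frame mixture likelihood above (assuming a scalar observation and diagonal covariances; the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def gmm_logpdf(o, w, mu, sigma):
    """log p(o; M_t) for a scalar observation under an M-component
    Gaussian mixture with weights w, means mu, and std devs sigma."""
    w, mu, sigma = map(np.asarray, (w, mu, sigma))
    # per-component log N(o; mu_m, sigma_m^2)
    log_comp = -0.5 * ((o - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    # log sum_m w_m N(o; mu_m, sigma_m^2), via log-sum-exp for stability
    a = np.log(w) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# single standard Gaussian at its mean: log N(0; 0, 1) = -0.5 * log(2*pi)
print(gmm_logpdf(0.0, [1.0], [0.0], [1.0]))
```

Training an (R)MDN minimizes the negative of this log-likelihood summed over frames and dimensions.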

8 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

9 MOTIVATION — Conventional RMDN
- Independence assumption across frames: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t)
- Alternative? Borrow the idea of the autoregressive HMM (for speech synthesis [6])
[6] Shannon, M., Zen, H., & Byrne, W. (2013). Autoregressive models for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 21(3).

10 DEFINITION — Proposed model: AR + RMDN
- AR assumption: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t)

11 DEFINITION — AR-RMDN using a GMM
- Baseline RMDN: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)
- AR-RMDN: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m + f(o_{t-K:t-1}), \Sigma_t^m)
  with f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b (element-wise)
- The AR parameters \{a_1, ..., a_K, b\} are time-invariant (context-independent) and trained jointly with the RMDN using back-propagation
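The time-invariant mean shift f can be sketched as follows (array shapes and the function name are illustrative, not from the slides):

```python
import numpy as np

def ar_mean_shift(o_past, a, b):
    """AR mean shift f(o_{t-K:t-1}) = sum_k a_k * o_{t-k} + b (element-wise).
    o_past[k-1] holds o_{t-k}; a has matching shape (K, D); b is (D,)."""
    o_past, a, b = np.asarray(o_past), np.asarray(a), np.asarray(b)
    return (a * o_past).sum(axis=0) + b

# 1st-order example with D = 2: f(o_{t-1}) = a_1 * o_{t-1} + b
shift = ar_mean_shift([[2.0, -1.0]], [[0.5, 0.5]], [0.1, 0.1])
print(shift)  # [ 1.1 -0.4]
```

At each frame this shift is simply added to every mixture mean; the rest of the GMM likelihood is unchanged.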

12 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

13 BEFORE INTERPRETATION — A glimpse of the results
Figure: generated trajectory of the 30th dimension of the mel-generalized cepstral coefficients (MGC), natural data vs AR-RMDN (utterance BC0_nancy_APDC-66-00). Is the generated trajectory smooth?

14 BEFORE INTERPRETATION — A glimpse of the results
Figure: the same 30th-dimension MGC trajectory, now also with RNN and RMDN (utterance BC0_nancy_APDC-66-00). Smooth? Does AR-RMDN have a larger dynamic range?

15 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: f(o_{t-1}) = a \odot o_{t-1} + b, where a = [a_1, ..., a_D]^T
- For the GMM in AR-RMDN:
  p(o_t | o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2\sigma_{t,d}^{m2}}), with f(o_{t-1,d}) = a_d o_{t-1,d} + b_d
- Define the residual c_{t,d} = o_{t,d} - a_d o_{t-1,d} - b_d (the bias b_d is omitted below for simplicity)
- d \in [1, D]: dimension index of the feature vector; t \in [1, T]: frame index (T frames)

16 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- If T = 2: c_{1,d} = o_{1,d} and c_{2,d} = o_{2,d} - a_d o_{1,d}, i.e.
  [[1, 0], [-a_d, 1]] [o_{1,d}; o_{2,d}] = [c_{1,d}; c_{2,d}]

17 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- If T = 3: c_{1,d} = o_{1,d}; c_{2,d} = o_{2,d} - a_d o_{1,d}; c_{3,d} = o_{3,d} - a_d o_{2,d}, i.e.
  [[1, 0, 0], [-a_d, 1, 0], [0, -a_d, 1]] [o_{1,d}; o_{2,d}; o_{3,d}] = [c_{1,d}; c_{2,d}; c_{3,d}]

18 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- In general, a T x T lower-bidiagonal matrix (1 on the diagonal, -a_d on the first subdiagonal):
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
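The bidiagonal matrix above can be built and checked against the per-frame recursion directly (a toy sketch; the helper name is illustrative):

```python
import numpy as np

def analysis_matrix(a_d, T):
    """T x T lower-bidiagonal matrix: 1 on the diagonal, -a_d below it,
    so that (A @ o)[t] = o_t - a_d * o_{t-1}, with c_1 = o_1."""
    return np.eye(T) - a_d * np.eye(T, k=-1)

a_d, T = 0.9, 5
o = np.arange(1.0, T + 1)          # toy trajectory o_{1:T,d}
c = analysis_matrix(a_d, T) @ o    # residual c_{1:T,d}
```

The matrix form makes the next two slides immediate: the mapping o -> c is linear, causal, and invertible (determinant 1).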

19 INTERPRETATION — Signals and filters in AR-RMDN
- The 1st-order AR model is a filtering process: o_{1:T,d} -> A_d(z) = 1 - a_d z^{-1} -> c_{1:T,d}, equivalent to the matrix form
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
- A_d(z) denotes the filter in the z-domain (finite impulse response)

20 INTERPRETATION — Signals and filters in AR-RMDN
- The inverse direction is also a filtering process: c_{1:T,d} -> H_d(z) = 1 / (1 - a_d z^{-1}) -> o_{1:T,d}
- In matrix form, the inverse is lower-triangular with entries a_d^{i-j}:
  [[1, 0, ..., 0], [a_d, 1, ..., 0], [a_d^2, a_d, ..., 0], ...] [c_{1,d}; ...; c_{T,d}] = [o_{1,d}; ...; o_{T,d}]
- H_d(z) denotes the filter in the z-domain
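The analysis/synthesis pair can be verified as a round trip (a toy sketch using the plain recursions rather than any filtering library):

```python
import numpy as np

a_d = 0.9
o = np.array([1.0, 2.0, 0.5, -1.0])            # toy trajectory o_{1:T,d}

# analysis filter A_d(z) = 1 - a_d z^{-1} (FIR): c_t = o_t - a_d o_{t-1}
c = o.copy()
c[1:] -= a_d * o[:-1]

# synthesis filter H_d(z) = 1 / (1 - a_d z^{-1}) (IIR): o_t = c_t + a_d o_{t-1}
o_rec = np.zeros_like(c)
prev = 0.0
for t, ct in enumerate(c):
    prev = ct + a_d * prev
    o_rec[t] = prev
```

o_rec recovers o exactly, which is why the deck can treat A_d(z) and H_d(z) as an analysis/synthesis filter pair.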

21 INTERPRETATION — Signals and filters in AR-RMDN
- With a high-order AR model, the AR filter and the RMDN are trained jointly using back-propagation:
  training: o_{1:T,d} -> AR analysis filter A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k} -> c_{1:T,d}; the RMDN models \prod_{t=1}^{T} p(c_t; M_t) given the textual features
  generation: the RMDN outputs \hat{c}_{1:T,d}, and the AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) produces \hat{o}_{1:T,d}
- d \in [1, D]: dimension index of the feature vector

22 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

23 IMPLEMENTATION — Stability of the filter
- The synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) must be stable: it is an infinite impulse response filter [7]
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall.

24 IMPLEMENTATION — Stability of the filter
- If H_d(z) is unstable, the output \hat{o}_{1:T,d} will grow infinitely large
- Figure: F0 output generated through an unstable synthesis filter

25 IMPLEMENTATION — Stability of the filter
- General requirement: the poles of H_d(z) must lie inside the unit circle [7]
- A simple trick (with a strong constraint):
  H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1}), \rho_k \in \mathbb{R}
  requirement: \rho_k \in (-1, 1); how to: \rho_k = \tanh(\hat{\rho}_k)
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall.
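The tanh trick can be sketched by expanding the product of first-order factors back into AR coefficients and checking the poles (the helper name and the test values are illustrative):

```python
import numpy as np

def stable_ar_coefs(rho_hat):
    """AR coefficients a_1..a_K whose synthesis filter
    H(z) = 1 / (1 - sum_k a_k z^{-k}) = prod_k 1 / (1 - rho_k z^{-1})
    is stable by construction, with rho_k = tanh(rho_hat_k) in (-1, 1)."""
    rho = np.tanh(np.asarray(rho_hat, dtype=float))
    poly = np.array([1.0])            # coefficients of prod_k (1 - rho_k z^{-1})
    for r in rho:
        poly = np.convolve(poly, [1.0, -r])
    return -poly[1:]                  # since 1 - sum_k a_k z^{-k} = poly(z^{-1})

a = stable_ar_coefs([2.0, -0.5, 0.3])
# poles are the roots of z^K - a_1 z^{K-1} - ... - a_K
poles = np.roots(np.concatenate([[1.0], -a]))
```

Whatever values the unconstrained parameters \hat{\rho}_k take during back-propagation, every pole stays inside the unit circle.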

26 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

27 DATA
Corpus: the Blizzard Challenge 2011 corpus [8], about 12,000 utterances (16 hours). Validation set: 500 utterances; test set: 500 utterances; training set: the rest.
Features:
- Input features x_t: phoneme sequence, stress, pitch accent, ... (extracted using Flite [9]); dimension 382
- Target features o_t: mel-generalized cepstral coefficients (MGC), dimension 60; continuous F0 trajectory + unvoiced/voiced flag + band aperiodicities (BAP)
[8] King, S., & Karaiskos, V. (2011). The Blizzard Challenge 2011.
[9] HTS Working Group. (2014). The English TTS System Flite+hts_engine.

28 SYSTEMS
- RNN (no AR, no dynamic features; targets o_t): (1) feedforward tanh [512], (2) feedforward tanh [512], (3) Bi-LSTM [256], (4) Bi-LSTM [256], (5) feedforward linear output
- RMDN (no AR; targets o_t): the same RNN plus an MDN output layer: 1. GMM for MGC; 2. GMM for F0; 3. GMM for BAP; 4. binary distribution for U/V
- RNN+MLPG [5] (with dynamic features): same network as RNN, generating with MLPG
- AR-RMDN (with AR; targets o_t): same as RMDN, with AR: 1. GMM (1st-order AR) for MGC; 2. GMM (2nd-order AR) for F0; 3. GMM for BAP; 4. binary distribution for U/V
- Number of AR parameters: 60 x 2 + 1 x 2 + 1 = 123
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP.

29 EXPERIMENTS — Results
- Trajectories of the generated MGC (utterance BC0_nancy_APDC-66-00):
Figure: 1st and 2nd dimensions of MGC for NAT (natural), RNN, RNN+MLPG, RMDN, and AR-RMDN.

30 EXPERIMENTS — Results
- Trajectories of the generated MGC (utterance BC0_nancy_APDC-66-00):
Figure: 5th and 30th dimensions of MGC for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.

31 EXPERIMENTS — Results
- Trajectories of the generated MGC and F0 (utterance BC0_nancy_APDC-66-00):
Figure: 45th dimension of MGC, and F0 (after unvoiced/voiced classification), for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.

32 EXPERIMENTS — Analysis
- Global variance (GV) [10] of the generated MGC and F0 trajectories
Figure: GV of MGC (per MGC order) and GV of F0, for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.
[10] Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, E90-D(5).

33 EXPERIMENTS — Analysis
- Why a larger dynamic range (higher GV)? Generation passes the RMDN output \hat{c}_{1:T,d} through the AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k})
Figure: frequency response of H_d(z) for MGC, d \in [1, D].

34 EXPERIMENTS — Results
- Samples: RNN, RNN+MLPG, RMDN, AR-RMDN, natural; with and without formant enhancement (formant enhancement was used in the subjective evaluation)
- Other samples: tonywangx.github.io

35 CONTENTS
- Introduction
- Method of this work
- Experiments and results
- Conclusion

36 CONCLUSION
- Model: AR-RMDN, a pair of an AR filter and an RMDN
- Results: the AR synthesis filter acts as a 'low-pass' filter and the AR analysis filter as a 'high-pass' filter (see pp. 45-46); the generated trajectories have a larger dynamic range and better perceived quality
- Subjective evaluation: MUSHRA test, rates from 0 to 100

37 RECENT WORK
- Inverse AR filter with complex poles, parameterized using sigmoid & tanh functions (please see the attached slides, p. 58)
- Generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}, with the RMDN part modeling \prod_{t=1}^{T} p(c_t; M_t)

38 FUTURE WORK
- Sampling from the model
Figure: F0 (Hz) for NAT, AR-RMDN, and AR-RMDN (sampling) (utterance BC0_nancy_APDC-66-00).
- Is AR still too weak to capture temporal correlation?

39 Thank you for your attention. Q & A
- Toolkit, scripts, slides, and samples: tonywangx.github.io

40 ARGUMENT — AR vs RNN as the output layer (cf. Zen & Sak, 2015)
To simplify the equations, suppose the GMM has one mixture component with \Sigma_t = I, so M_t = \{\mu_t\} (the RMDN is then equivalent to an RNN trained with MSE; note that \mu_t, not \hat{o}_t, is the output of the RMDN), and consider only two frames.
For example, the baseline (no RNN output layer, no AR) gives:
  p(o_{1:2}; M_{1:2}) = p(o_1; M_1) p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2)^2}{2})
Zen, H., & Sak, H. (2015). Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP.

41 ARGUMENT — AR vs RNN as the output layer
RNN output layer: suppose the weight on the RNN output layer is a, with no bias:
  p(o_{1:2}; M_{1:2}) = p(o_1; M_1) p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2})

42 ARGUMENT — AR vs RNN as the output layer
AR: suppose the AR weight is a, with no bias:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a o_1)^2}{2})

43 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, RNN's case:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2')^2}{2}) = p(o_1; M_1) p(o_2; M_2')
- This just changes the distribution's parameter, \mu_2' = \mu_2 + a\mu_1; the frames are still independent
- The RNN output layer builds dependency between the distribution parameters

44 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, AR's case:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}) = \frac{1}{2\pi} \exp(-\frac{1}{2}(o - \mu)^T \Phi (o - \mu))
  where o = [o_1, o_2]^T, \mu = [\mu_1, \mu_2 + a\mu_1]^T, and the precision matrix is \Phi = [[1 + a^2, -a], [-a, 1]] (covariance \Phi^{-1} = [[1, a], [a, 1 + a^2]])
- AR builds dependency between the random variables: the 2-dim Gaussian distribution has a non-diagonal covariance

45 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, AR's case. Let's compare (o - \mu)^T \Phi (o - \mu) and (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2.
Let m = [m_1, m_2]^T = o - \mu. Then
  m^T \Phi m = [m_1, m_2] [[1 + a^2, -a], [-a, 1]] [m_1; m_2] = m_1^2 (1 + a^2) - 2a m_1 m_2 + m_2^2 = m_1^2 + (m_2 - a m_1)^2
As m_1 = o_1 - \mu_1 and m_2 = o_2 - \mu_2 - a\mu_1,
  m^T \Phi m = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a\mu_1 - a o_1 + a\mu_1)^2 = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2
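The identity above can be confirmed numerically for arbitrary values (a toy check, not from the slides):

```python
import numpy as np

# check m^T Phi m = (o1 - mu1)^2 + (o2 - mu2 - a*o1)^2 for random values
rng = np.random.default_rng(0)
o1, o2, mu1, mu2, a = rng.standard_normal(5)

m = np.array([o1 - mu1, o2 - mu2 - a * mu1])   # m = o - mu
Phi = np.array([[1 + a**2, -a], [-a, 1.0]])    # precision matrix
quad_joint = m @ Phi @ m                        # joint-Gaussian quadratic form
quad_ar = (o1 - mu1)**2 + (o2 - mu2 - a * o1)**2  # factorized AR form
```

Note also that det(\Phi) = 1, which is why the joint density keeps the simple 1/(2\pi) normalizer.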

46 ARGUMENT — AR vs RNN as the output layer
Difference in random sampling, RNN's case:
1. Calculate M_1 by the neural network; we then have p(o_1; M_1)
2. Calculate M_2 by the neural network and M_1; we then have p(o_2; M_2)
3. Draw samples \hat{o}_1 and \hat{o}_2 from p(o_1; M_1) and p(o_2; M_2); the sample \hat{o}_1 doesn't influence p(o_2; M_2)

47 ARGUMENT — AR vs RNN as the output layer
Difference in random sampling, AR's case:
1. Calculate M_1 by the neural network; we then have p(o_1; M_1)
2. Draw a sample \hat{o}_1 from p(o_1; M_1)
3. Calculate M_2 by the neural network and \hat{o}_1
4. Draw a sample \hat{o}_2 from p(o_2; M_2)
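The AR sampling procedure above is ancestral sampling; a minimal sketch (the network-predicted means are replaced by a fixed array, and the function name is illustrative):

```python
import numpy as np

def sample_ar(mu, a, sigma, rng=None):
    """Ancestral sampling for a 1st-order scalar AR model:
    o_t ~ N(mu_t + a * o_{t-1}, sigma^2), with o_0 = 0."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = np.asarray(mu, dtype=float)
    o = np.zeros_like(mu)
    prev = 0.0
    for t, m in enumerate(mu):
        o[t] = m + a * prev + sigma * rng.standard_normal()
        prev = o[t]               # each sample feeds the next frame's mean
    return o

# with sigma = 0 the recursion is deterministic: o = [1, 1.5, 1.75]
print(sample_ar([1.0, 1.0, 1.0], 0.5, 0.0))
```

In the RNN's case the loop would use the predicted mean, not the drawn sample, as the feedback, so the samples stay mutually independent.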

48 ARGUMENT — AR vs RNN as the output layer
Difference in training using back-propagation:
- RNN's case: the gradient w.r.t. a is (o_2 - \mu_2 - a\mu_1)\mu_1
- AR's case: the gradient w.r.t. a is (o_2 - \mu_2 - a o_1) o_1

49 ARGUMENT — Sampling from the model
Figure: F0 (Hz) for NAT, AR-RMDN, and AR-RMDN (sampling) (utterance BC0_nancy_APDC-66-00).
- Is AR still too weak for temporal correlation? YES! Why? Consider the feature transformation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}

50 ARGUMENT — Sampling from the model
- AR is still weak for temporal correlation. Why? In matrix form, the synthesis filter is the lower-triangular transformation
  [[1, 0, ..., 0], [a_d, 1, ..., 0], [a_d^2, a_d, ..., 0], ...] [c_{1,d}; ...; c_{T,d}] = [o_{1,d}; ...; o_{T,d}]

51 ARGUMENT — Sampling from the model
Call the lower-triangular matrix above the feature transformation matrix H.
1. Vector c contains samples from i.i.d. Gaussian distributions
2. Vector o then follows a Gaussian distribution with covariance Cov(o) = H Cov(c) H^T
3. Unfortunately, the off-diagonal elements of Cov(o) decay too quickly (exponentially) because of H
4. Similar results hold for high-order AR
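The exponential decay of Cov(o) can be checked directly on a small 1st-order example (a toy sketch; T and a_d are illustrative):

```python
import numpy as np

T, a_d = 8, 0.8
A = np.eye(T) - a_d * np.eye(T, k=-1)   # analysis matrix of a 1st-order AR
H = np.linalg.inv(A)                    # feature transformation matrix H
cov_o = H @ H.T                         # Cov(o) = H Cov(c) H^T with Cov(c) = I

# H is lower-triangular with H[i, j] = a_d^(i-j), so covariances between
# frames k steps apart shrink geometrically like a_d^k
```

This is exactly the slide's point: however the RMDN sets the per-frame means, sampling through H can only add correlations that die off geometrically with the lag.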

52 ARGUMENT — A simple experiment
- Set T = 100 frames and let c_t = \sin(\omega t) + \epsilon, where \epsilon \sim \mathcal{N}(0, I)
- Use the matrix H from the trained AR model to transform c into o (red line)
- Define a transformation matrix based on the Gaussian-process (GP) kernel \Sigma_{i,j} = \exp(-\frac{1}{2}(\frac{i-j}{l})^2) and transform c into another vector (green line)
Figure: the covariance matrix based on AR vs the covariance matrix based on the GP kernel, and the corresponding generated trajectories.
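The contrast between the two covariance structures can be sketched without the trained model (the length scale l and the AR coefficient a_d below are illustrative; the slide does not give their values):

```python
import numpy as np

T, l, a_d = 100, 5.0, 0.8
idx = np.arange(T)

# squared-exponential (GP) kernel: Sigma_ij = exp(-0.5 * ((i - j) / l)^2)
Sigma = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / l) ** 2)

# stationary correlation of a 1st-order AR process at lag k is a_d^k
lag = np.arange(10)
gp_corr = np.exp(-0.5 * (lag / l) ** 2)
ar_corr = a_d ** lag
# at moderate lags the AR correlation has already collapsed while the
# GP kernel still couples the frames strongly
```

This is the sense in which the AR transformation matrix is "too simple": its correlation shape is fixed to a geometric decay, whereas a GP kernel can hold correlation over a chosen length scale.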

53 ARGUMENT — A simple experiment
- So: the transformation matrix of the AR model is too simple
Figure: covariance matrices based on AR and on the GP kernel, with the trajectories generated from each.

54 EXPERIMENTS — Objective results
Table: objective measures (MGC RMSE, F0 RMSE, F0 correlation) compared across RNN, RMDN, RNN+MLPG, and AR-RMDN.

55 EXPERIMENTS — Analysis
- A larger dynamic range? Modulation spectrum [11,12] of the 30th dimension of MGC
Figure: MS of MGC (dB) for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.
- Generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}
[11] Hermansky, H. (1997). The modulation spectrum in the automatic recognition of speech. In Proc. ASRU.
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP.

56 EXPERIMENTS — Analysis (the training process)
- The AR model in the training stage, for the 1st-order AR on MGC: A_d(z) = 1 - \tanh(\hat{\rho}_d) z^{-1}
Figure: feature trajectory and modulation spectrum of the 30th MGC, original vs filtered data; frequency responses of A_1(z), A_5(z), A_30(z), and A_60(z).
- Training: o_{1:T,d} -> AR analysis filter A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k} -> c_{1:T,d}; the RMDN part models \prod_{t=1}^{T} p(c_t; M_t)

57 EXPERIMENTS — Analysis (the training process)
- The AR model in the training stage: the 1st-order AR on F0
Figure: feature trajectory and modulation spectrum of the interpolated F0, original vs filtered data; frequency response of A(z).

58 EXPERIMENTS — Analysis (the training process)
- The 2nd-order AR filter A_d(z) = (1 - a_1 z^{-1})(1 - a_2 z^{-1}) for MGC is also high-pass
Figure: values of a_1 and a_2 per order of the MGC; frequency responses of A_1(z), A_5(z), A_30(z), and A_60(z).

59 EXPERIMENTS — Analysis (the training process)
- The learned AR analysis filter is high-pass
Figure: frequency response of A(z) for AR orders 1 to 6 (the model is re-trained for each order).

60 EXPERIMENTS II — Comparison with a post-filtering method [12]
- AR-RMDN: RMDN part \prod_{t=1}^{T} p(c_t; M_t) + AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k})
- RMDN-MS: RMDN + modulation-spectrum-based post-filter applied to \hat{o}'_{1:T,d}
Systems: RMDN (recurrent mixture density network); RMDN-MS (RMDN with a modulation-spectrum-based post-filter); AR-RMDN (the proposed model)
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP.

61 EXPERIMENTS II — Results
Figure: 30th dimension of MGC (utterance BC0_nancy_APDC-66-00) and the modulation spectrum of the 30th dimension of MGC, for NAT, RMDN, RMDN-MS, and AR-RMDN.

62 EXPERIMENTS II — Compare MGC only
- Samples: RMDN, RMDN-MS, AR-RMDN, natural; with and without formant enhancement
- All systems used the same F0 generated by the RMDN

63 EXPERIMENTS II — Compare MGC + F0
- Samples: RMDN, RMDN-MS, AR-RMDN, natural; with and without formant enhancement

64 EXPERIMENTS II — Wrap up
Figure: rated quality (from 0 (min) to 100 (max)) for RMDN; RMDN-MS and AR-RMDN enhancing only F0; RMDN-MS and AR-RMDN enhancing only MGC; and the full RMDN-MS and AR-RMDN.
- Why? AR-RMDN avoids FFT/iFFT on the acoustic trajectories
- Note: RMDN-MS could be made better [13]
[13] Takamichi, S. (2016). Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis. PhD thesis, Nara Institute of Science and Technology.

65 INTERPRETATION — Signals and filters in AR-RMDN
- Signal: T frames, D dimensions; o_t \in \mathbb{R}^D is the feature vector at frame t
- Define the acoustic feature trajectories o_{1:T,d} = [o_{1,d}, ..., o_{T,d}]^T, d \in [1, D]

66 INTERPRETATION — Signals and filters in AR-RMDN
- Signal c_{1:T,d}:
1. Consider a 1st-order AR model f(o_{t-1}) = a \odot o_{t-1} + b, where a = [a_1, ..., a_D]^T
2. Look into the GMM with AR:
  p(o_t | o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2\sigma_{t,d}^{m2}}) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - a_d o_{t-1,d} - \mu_{t,d}^m - b_d)^2}{2\sigma_{t,d}^{m2}})
3. Define c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D]
4. Define c_{1:T,d} = [c_{1,d}, ..., c_{T,d}]^T, d \in [1, D]

67 INTERPRETATION — Signal and filter in AR-RMDN
- Filters. Still consider a 1st-order AR, where c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]. Then o_{1:T,d} and c_{1:T,d} are related by
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
- A_d(z) = 1 - a_d z^{-1}: o_{1:T,d} (input signal) -> c_{1:T,d} (filtered signal), a filter with finite impulse response

68 INTERPRETATION — Signal and filter in AR-RMDN
- Filters, in general, from o_{1:T,d} to c_{1:T,d}: with f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b, the banded lower-triangular matrix with 1 on the diagonal and -a_{1,d}, ..., -a_{K,d} on the first K subdiagonals maps [o_{1,d}; ...; o_{T,d}] to [c_{1,d}; ...; c_{T,d}]
- A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}: o_{1:T,d} -> c_{1:T,d}, a filter with finite impulse response

69 INTERPRETATION — Signal and filter in AR-RMDN
- Filters, in general, from c_{1:T,d} to o_{1:T,d}: the inverse of the banded matrix above maps [c_{1,d}; ...; c_{T,d}] to [o_{1,d}; ...; o_{T,d}]
- H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}): c_{1:T,d} -> o_{1:T,d}, a filter with infinite impulse response

70 IMPLEMENTATION — Stability of the filter
- How to make H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}), d \in [1, D], stable? Constraints and trainable parameters:
  1. No constraint: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}); trainable parameters a_k
  2. Real poles inside the unit circle: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1}), with \rho_k = \tanh(\hat{\rho}_k)
  3. One or no real pole & pairs of complex poles inside the unit circle: each conjugate pole pair contributes a factor 1 / (1 - p_k z^{-1} + q_k z^{-2}) (plus one real-pole factor when K is odd), with the pole radius and angle parameterized through sigmoid and tanh functions so that the poles stay inside the unit circle (please read the report)
- Please find the technical report SLP-5 Kagawa on tonywangx.github.io

71 EXPERIMENTS II — High-order AR filter
- Learned AR filter on F0 data; generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) -> \hat{o}_{1:T,d}
Systems and constraints on the AR filter H_d(z):
- U6: 6th-order, unconstrained: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k})
- R6: 6th-order, stable, with real poles: H(z) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1})
- C6: 6th-order, stable, with complex poles: H(z) = \prod_{k=1}^{K/2} 1 / (1 - p_k z^{-1} + q_k z^{-2})
- Note: the constraints are used in both the training and generation stages

72 EXPERIMENTS II — High-order AR filter
- Learned AR filter on F0 data
Figure: pole plots (real vs imaginary axis) for U6 (unconstrained), R6 (stable, real poles), and C6 (stable, complex poles), and the magnitude responses (dB) of the three filters.

73 EXPERIMENTS II — High-order AR filter
- Generated F0 (utterance BC0_nancy_APDC-66-00), for NAT, U, R6, and C6
- U6 generated very large F0 values (unstable IIR filter), so the lower-order unconstrained system U is plotted instead
- Note the vibration of the F0 in C6
Figure: magnitude responses (dB) of U6, R6, and C6.


More information

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya

More information

RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS

RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS Hyungui Lim 1, Jeongsoo Park 1,2, Kyogu Lee 2, Yoonchang Han 1 1 Cochlear.ai, Seoul, Korea 2 Music and Audio Research Group,

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why

More information

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

Nonparametric and Parametric Defined This text distinguishes between systems and the sequences (processes) that result when a WN input is applied

Nonparametric and Parametric Defined This text distinguishes between systems and the sequences (processes) that result when a WN input is applied Linear Signal Models Overview Introduction Linear nonparametric vs. parametric models Equivalent representations Spectral flatness measure PZ vs. ARMA models Wold decomposition Introduction Many researchers

More information

Linear Prediction 1 / 41

Linear Prediction 1 / 41 Linear Prediction 1 / 41 A map of speech signal processing Natural signals Models Artificial signals Inference Speech synthesis Hidden Markov Inference Homomorphic processing Dereverberation, Deconvolution

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models Statistical NLP Spring 2010 The Noisy Channel Model Lecture 9: Acoustic Models Dan Klein UC Berkeley Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions Language model: Distributions

More information

Speech and Language Processing

Speech and Language Processing Speech and Language Processing Lecture 5 Neural network based acoustic and language models Information and Communications Engineering Course Takahiro Shinoaki 08//6 Lecture Plan (Shinoaki s part) I gives

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Recurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta

Recurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta Recurrent Autoregressive Networks for Online Multi-Object Tracking Presented By: Ishan Gupta Outline Multi Object Tracking Recurrent Autoregressive Networks (RANs) RANs for Online Tracking Other State

More information

EE-559 Deep learning Recurrent Neural Networks

EE-559 Deep learning Recurrent Neural Networks EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

arxiv: v4 [cs.cl] 5 Jun 2017

arxiv: v4 [cs.cl] 5 Jun 2017 Multitask Learning with CTC and Segmental CRF for Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith Toyota Technological Institute at Chicago, USA School of Computer Science, Carnegie

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,

More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Tuomas Virtanen and Anssi Klapuri Tampere University of Technology, Institute of Signal Processing Korkeakoulunkatu

More information

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING Liang Lu and Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu, s.renals}@ed.ac.uk ABSTRACT

More information

Signal representations: Cepstrum

Signal representations: Cepstrum Signal representations: Cepstrum Source-filter separation for sound production For speech, source corresponds to excitation by a pulse train for voiced phonemes and to turbulence (noise) for unvoiced phonemes,

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Statistical NLP Spring Digitizing Speech

Statistical NLP Spring Digitizing Speech Statistical NLP Spring 2008 Lecture 10: Acoustic Models Dan Klein UC Berkeley Digitizing Speech 1 Frame Extraction A frame (25 ms wide) extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon

More information

Digitizing Speech. Statistical NLP Spring Frame Extraction. Gaussian Emissions. Vector Quantization. HMMs for Continuous Observations? ...

Digitizing Speech. Statistical NLP Spring Frame Extraction. Gaussian Emissions. Vector Quantization. HMMs for Continuous Observations? ... Statistical NLP Spring 2008 Digitizing Speech Lecture 10: Acoustic Models Dan Klein UC Berkeley Frame Extraction A frame (25 ms wide extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon Arnfield

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fourth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada Front ice Hall PRENTICE HALL Upper Saddle River, New Jersey 07458 Preface

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

L7: Linear prediction of speech

L7: Linear prediction of speech L7: Linear prediction of speech Introduction Linear prediction Finding the linear prediction coefficients Alternative representations This lecture is based on [Dutoit and Marques, 2009, ch1; Taylor, 2009,

More information

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries Yu-Hsuan Wang, Cheng-Tao Chung, Hung-yi

More information

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech

More information

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor niversity of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

NEURAL LANGUAGE MODELS

NEURAL LANGUAGE MODELS COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,

More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10 Introduction In

More information

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Nuntanut Raksasri Jitkomut Songsiri Department of Electrical Engineering, Faculty of Engineering,

More information

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language

Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language INTERSPEECH 2014 Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language Sankar Mukherjee, Shyamal Kumar Das Mandal Centre for Educational Technology

More information

Recurrent and Recursive Networks

Recurrent and Recursive Networks Neural Networks with Applications to Vision and Language Recurrent and Recursive Networks Marco Kuhlmann Introduction Applications of sequence modelling Map unsegmented connected handwriting to strings.

More information

Very Deep Convolutional Neural Networks for LVCSR

Very Deep Convolutional Neural Networks for LVCSR INTERSPEECH 2015 Very Deep Convolutional Neural Networks for LVCSR Mengxiao Bi, Yanmin Qian, Kai Yu Key Lab. of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering SpeechLab,

More information

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1 GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS Mitsuru Kawamoto,2 and Yuiro Inouye. Dept. of Electronic and Control Systems Engineering, Shimane University,

More information

Deep Learning for Speech Recognition. Hung-yi Lee

Deep Learning for Speech Recognition. Hung-yi Lee Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New

More information

Gaussian Processes for Audio Feature Extraction

Gaussian Processes for Audio Feature Extraction Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Proc. of NCC 2010, Chennai, India

Proc. of NCC 2010, Chennai, India Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay

More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Highway-LSTM and Recurrent Highway Networks for Speech Recognition

Highway-LSTM and Recurrent Highway Networks for Speech Recognition Highway-LSTM and Recurrent Highway Networks for Speech Recognition Golan Pundak, Tara N. Sainath Google Inc., New York, NY, USA {golan, tsainath}@google.com Abstract Recently, very deep networks, with

More information

An exploration of dropout with LSTMs

An exploration of dropout with LSTMs An exploration of out with LSTMs Gaofeng Cheng 1,3, Vijayaditya Peddinti 4,5, Daniel Povey 4,5, Vimal Manohar 4,5, Sanjeev Khudanpur 4,5,Yonghong Yan 1,2,3 1 Key Laboratory of Speech Acoustics and Content

More information

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks Tara N Sainath, Bo Li Google, Inc New York, NY, USA {tsainath, boboli}@googlecom Abstract Various neural network

More information

Recurrent Neural Network

Recurrent Neural Network Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 Chapter 9 Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 1 LPC Methods LPC methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification

More information

Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network

Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network AvinashNath [1], Abhay Katiyar [2],SrajanSahu [3], Prof. Sanjeev Kumar [4] [1][2][3] Department of Information Technology,

More information

Modeling the creaky excitation for parametric speech synthesis.

Modeling the creaky excitation for parametric speech synthesis. Modeling the creaky excitation for parametric speech synthesis. 1 Thomas Drugman, 2 John Kane, 2 Christer Gobl September 11th, 2012 Interspeech Portland, Oregon, USA 1 University of Mons, Belgium 2 Trinity

More information

On Moving Average Parameter Estimation

On Moving Average Parameter Estimation On Moving Average Parameter Estimation Niclas Sandgren and Petre Stoica Contact information: niclas.sandgren@it.uu.se, tel: +46 8 473392 Abstract Estimation of the autoregressive moving average (ARMA)

More information