An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis


1 ICASSP 2017, New Orleans, USA
An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis
Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI, National Institute of Informatics, Japan
contact: we welcome critical comments, suggestions, and discussion

2 ABBREVIATIONS
GMM: Gaussian mixture model
NN: neural network
RNN: recurrent neural network
MDN: mixture density network
RMDN: recurrent mixture density network
AR: autoregressive model
AR-RMDN: autoregressive RMDN (the proposed model)

3 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments
- Conclusion

4 INTRODUCTION — Text-to-Speech
- Based on parametric speech synthesis: text -> text analyzer -> acoustic model -> vocoder -> speech
- This work: acoustic models based on neural networks

5 INTRODUCTION — Acoustic models based on neural networks
- RNN [1]: input textual features x_1, ..., x_5 along the time (frame) axis; the outputs \hat{o}_1, ..., \hat{o}_5 are the generated acoustic features (spectral features, F0, ...)
[1] Fan, Y., Qian, Y., Xie, F.-L., & Soong, F. K. (2014). TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. INTERSPEECH.

6 INTRODUCTION — Acoustic models based on neural networks
- RMDN [2]: input textual features x_1, ..., x_5; the output at each frame is the parameter set M_t of a distribution p(o_t; M_t)
[2] Schuster, M. (1999). Better generative models for sequential data problems: bidirectional recurrent mixture density networks. In Proc. NIPS.

7 INTRODUCTION — Acoustic models based on neural networks
- RMDN using a GMM [3,4]: M_t = \{\omega_t^1, ..., \omega_t^M, \mu_t^1, ..., \mu_t^M, \Sigma_t^1, ..., \Sigma_t^M\}
  p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)
- Generate \hat{o}_t from the GMM using MLPG [5]
[3] Bishop, C. M. (2004). Mixture Density Networks.
[4] Schuster, M. (1999). Better generative models for sequential data problems: bidirectional recurrent mixture density networks. In Proc. NIPS.
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP.
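As a minimal sketch of the per-frame mixture likelihood above (assuming a scalar observation and diagonal covariances; the function name and shapes are illustrative, not from the slides):

```python
import numpy as np

def gmm_logpdf(o, w, mu, sigma):
    """log p(o; M_t) for a scalar observation under an M-component
    Gaussian mixture with weights w, means mu, and std devs sigma."""
    w, mu, sigma = map(np.asarray, (w, mu, sigma))
    # per-component log N(o; mu_m, sigma_m^2)
    log_comp = -0.5 * ((o - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    # log sum_m w_m N(o; mu_m, sigma_m^2), via log-sum-exp for stability
    a = np.log(w) + log_comp
    m = a.max()
    return m + np.log(np.exp(a - m).sum())

# single standard Gaussian at its mean: log N(0; 0, 1) = -0.5 * log(2*pi)
print(gmm_logpdf(0.0, [1.0], [0.0], [1.0]))
```

Training an (R)MDN minimizes the negative of this log-likelihood summed over frames and dimensions.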

8 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

9 MOTIVATION — Conventional RMDN
- Independence assumption across frames: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t; M_t)
- Alternative? Borrow the idea of the autoregressive HMM (for speech synthesis [6])
[6] Shannon, M., Zen, H., & Byrne, W. (2013). Autoregressive models for statistical parametric speech synthesis. IEEE Transactions on Audio, Speech, and Language Processing, 21(3).

10 DEFINITION — Proposed model: AR + RMDN
- AR assumption: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t)

11 DEFINITION — AR-RMDN using a GMM
- Baseline RMDN: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m, \Sigma_t^m)
- AR-RMDN: p(o_{1:T}; M_{1:T}) = \prod_{t=1}^{T} \sum_{m=1}^{M} \omega_t^m \mathcal{N}(o_t; \mu_t^m + f(o_{t-K:t-1}), \Sigma_t^m)
  with f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b (element-wise)
- The AR parameters \{a_1, ..., a_K, b\} are time-invariant (context-independent) and trained jointly with the RMDN using back-propagation
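The time-invariant mean shift f can be sketched as follows (array shapes and the function name are illustrative, not from the slides):

```python
import numpy as np

def ar_mean_shift(o_past, a, b):
    """AR mean shift f(o_{t-K:t-1}) = sum_k a_k * o_{t-k} + b (element-wise).
    o_past[k-1] holds o_{t-k}; a has matching shape (K, D); b is (D,)."""
    o_past, a, b = np.asarray(o_past), np.asarray(a), np.asarray(b)
    return (a * o_past).sum(axis=0) + b

# 1st-order example with D = 2: f(o_{t-1}) = a_1 * o_{t-1} + b
shift = ar_mean_shift([[2.0, -1.0]], [[0.5, 0.5]], [0.1, 0.1])
print(shift)  # [ 1.1 -0.4]
```

At each frame this shift is simply added to every mixture mean; the rest of the GMM likelihood is unchanged.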

12 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

13 BEFORE INTERPRETATION — A glimpse of the results
Figure: generated trajectory of the 30th dimension of the mel-generalized cepstral coefficients (MGC), natural data vs AR-RMDN (utterance BC0_nancy_APDC-66-00). Is the generated trajectory smooth?

14 BEFORE INTERPRETATION — A glimpse of the results
Figure: the same 30th-dimension MGC trajectory, now also with RNN and RMDN (utterance BC0_nancy_APDC-66-00). Smooth? Does AR-RMDN have a larger dynamic range?

15 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: f(o_{t-1}) = a \odot o_{t-1} + b, where a = [a_1, ..., a_D]^T
- For the GMM in AR-RMDN:
  p(o_t | o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2\sigma_{t,d}^{m2}}), with f(o_{t-1,d}) = a_d o_{t-1,d} + b_d
- Define the residual c_{t,d} = o_{t,d} - a_d o_{t-1,d} - b_d (the bias b_d is omitted below for simplicity)
- d \in [1, D]: dimension index of the feature vector; t \in [1, T]: frame index (T frames)

16 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- If T = 2: c_{1,d} = o_{1,d} and c_{2,d} = o_{2,d} - a_d o_{1,d}, i.e.
  [[1, 0], [-a_d, 1]] [o_{1,d}; o_{2,d}] = [c_{1,d}; c_{2,d}]

17 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- If T = 3: c_{1,d} = o_{1,d}; c_{2,d} = o_{2,d} - a_d o_{1,d}; c_{3,d} = o_{3,d} - a_d o_{2,d}, i.e.
  [[1, 0, 0], [-a_d, 1, 0], [0, -a_d, 1]] [o_{1,d}; o_{2,d}; o_{3,d}] = [c_{1,d}; c_{2,d}; c_{3,d}]

18 INTERPRETATION — Signals and filters in AR-RMDN
- Simple case, 1st-order AR: c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]
- In general, a T x T lower-bidiagonal matrix (1 on the diagonal, -a_d on the first subdiagonal):
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
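The bidiagonal matrix above can be built and checked against the per-frame recursion directly (a toy sketch; the helper name is illustrative):

```python
import numpy as np

def analysis_matrix(a_d, T):
    """T x T lower-bidiagonal matrix: 1 on the diagonal, -a_d below it,
    so that (A @ o)[t] = o_t - a_d * o_{t-1}, with c_1 = o_1."""
    return np.eye(T) - a_d * np.eye(T, k=-1)

a_d, T = 0.9, 5
o = np.arange(1.0, T + 1)          # toy trajectory o_{1:T,d}
c = analysis_matrix(a_d, T) @ o    # residual c_{1:T,d}
```

The matrix form makes the next two slides immediate: the mapping o -> c is linear, causal, and invertible (determinant 1).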

19 INTERPRETATION — Signals and filters in AR-RMDN
- The 1st-order AR model is a filtering process: o_{1:T,d} -> A_d(z) = 1 - a_d z^{-1} -> c_{1:T,d}, equivalent to the matrix form
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
- A_d(z) denotes the filter in the z-domain (finite impulse response)

20 INTERPRETATION — Signals and filters in AR-RMDN
- The inverse direction is also a filtering process: c_{1:T,d} -> H_d(z) = 1 / (1 - a_d z^{-1}) -> o_{1:T,d}
- In matrix form, the inverse is lower-triangular with entries a_d^{i-j}:
  [[1, 0, ..., 0], [a_d, 1, ..., 0], [a_d^2, a_d, ..., 0], ...] [c_{1,d}; ...; c_{T,d}] = [o_{1,d}; ...; o_{T,d}]
- H_d(z) denotes the filter in the z-domain
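The analysis/synthesis pair can be verified as a round trip (a toy sketch using the plain recursions rather than any filtering library):

```python
import numpy as np

a_d = 0.9
o = np.array([1.0, 2.0, 0.5, -1.0])            # toy trajectory o_{1:T,d}

# analysis filter A_d(z) = 1 - a_d z^{-1} (FIR): c_t = o_t - a_d o_{t-1}
c = o.copy()
c[1:] -= a_d * o[:-1]

# synthesis filter H_d(z) = 1 / (1 - a_d z^{-1}) (IIR): o_t = c_t + a_d o_{t-1}
o_rec = np.zeros_like(c)
prev = 0.0
for t, ct in enumerate(c):
    prev = ct + a_d * prev
    o_rec[t] = prev
```

o_rec recovers o exactly, which is why the deck can treat A_d(z) and H_d(z) as an analysis/synthesis filter pair.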

21 INTERPRETATION — Signals and filters in AR-RMDN
- With a high-order AR model, the AR filter and the RMDN are trained jointly using back-propagation:
  training: o_{1:T,d} -> AR analysis filter A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k} -> c_{1:T,d}; the RMDN models \prod_{t=1}^{T} p(c_t; M_t) given the textual features
  generation: the RMDN outputs \hat{c}_{1:T,d}, and the AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) produces \hat{o}_{1:T,d}
- d \in [1, D]: dimension index of the feature vector

22 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

23 IMPLEMENTATION — Stability of the filter
- The synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) must be stable: it is an infinite impulse response filter [7]
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall.

24 IMPLEMENTATION — Stability of the filter
- If H_d(z) is unstable, the output \hat{o}_{1:T,d} will grow infinitely large
- Figure: F0 output generated through an unstable synthesis filter

25 IMPLEMENTATION — Stability of the filter
- General requirement: the poles of H_d(z) must lie inside the unit circle [7]
- A simple trick (with a strong constraint):
  H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1}), \rho_k \in \mathbb{R}
  requirement: \rho_k \in (-1, 1); how to: \rho_k = \tanh(\hat{\rho}_k)
[7] Oppenheim, A. V., Schafer, R. W., & Buck, J. R. (1999). Discrete-time Signal Processing (2nd ed.). Upper Saddle River, NJ, USA: Prentice-Hall.
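The tanh trick can be sketched by expanding the product of first-order factors back into AR coefficients and checking the poles (the helper name and the test values are illustrative):

```python
import numpy as np

def stable_ar_coefs(rho_hat):
    """AR coefficients a_1..a_K whose synthesis filter
    H(z) = 1 / (1 - sum_k a_k z^{-k}) = prod_k 1 / (1 - rho_k z^{-1})
    is stable by construction, with rho_k = tanh(rho_hat_k) in (-1, 1)."""
    rho = np.tanh(np.asarray(rho_hat, dtype=float))
    poly = np.array([1.0])            # coefficients of prod_k (1 - rho_k z^{-1})
    for r in rho:
        poly = np.convolve(poly, [1.0, -r])
    return -poly[1:]                  # since 1 - sum_k a_k z^{-k} = poly(z^{-1})

a = stable_ar_coefs([2.0, -0.5, 0.3])
# poles are the roots of z^K - a_1 z^{K-1} - ... - a_K
poles = np.roots(np.concatenate([[1.0], -a]))
```

Whatever values the unconstrained parameters \hat{\rho}_k take during back-propagation, every pole stays inside the unit circle.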

26 CONTENTS
- Introduction
- Model definition, interpretation, implementation
- Experiments and results
- Conclusion

27 DATA
Corpus: the Blizzard Challenge 2011 corpus [8], about 12,000 utterances (16 hours). Validation set: 500 utterances; test set: 500 utterances; training set: the rest.
Features:
- Input features x_t: phoneme sequence, stress, pitch accent, ... (extracted using Flite [9]); dimension 382
- Target features o_t: mel-generalized cepstral coefficients (MGC), dimension 60; continuous F0 trajectory + unvoiced/voiced flag + band aperiodicities (BAP)
[8] King, S., & Karaiskos, V. (2011). The Blizzard Challenge 2011.
[9] HTS Working Group. (2014). The English TTS System Flite+hts_engine.

28 SYSTEMS
- RNN (no AR, no dynamic features; targets o_t): (1) feedforward tanh [512], (2) feedforward tanh [512], (3) Bi-LSTM [256], (4) Bi-LSTM [256], (5) feedforward linear output
- RMDN (no AR; targets o_t): the same RNN plus an MDN output layer: 1. GMM for MGC; 2. GMM for F0; 3. GMM for BAP; 4. binary distribution for U/V
- RNN+MLPG [5] (with dynamic features): same network as RNN, generating with MLPG
- AR-RMDN (with AR; targets o_t): same as RMDN, with AR: 1. GMM (1st-order AR) for MGC; 2. GMM (2nd-order AR) for F0; 3. GMM for BAP; 4. binary distribution for U/V
- Number of AR parameters: 60 x 2 + 1 x 2 + 1 = 123
[5] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T., & Kitamura, T. (2000). Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP.

29 EXPERIMENTS — Results
- Trajectories of the generated MGC (utterance BC0_nancy_APDC-66-00):
Figure: 1st and 2nd dimensions of MGC for NAT (natural), RNN, RNN+MLPG, RMDN, and AR-RMDN.

30 EXPERIMENTS — Results
- Trajectories of the generated MGC (utterance BC0_nancy_APDC-66-00):
Figure: 5th and 30th dimensions of MGC for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.

31 EXPERIMENTS — Results
- Trajectories of the generated MGC and F0 (utterance BC0_nancy_APDC-66-00):
Figure: 45th dimension of MGC, and F0 (after unvoiced/voiced classification), for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.

32 EXPERIMENTS — Analysis
- Global variance (GV) [10] of the generated MGC and F0 trajectories
Figure: GV of MGC (per MGC order) and GV of F0, for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.
[10] Toda, T., & Tokuda, K. (2007). A speech parameter generation algorithm considering global variance for HMM-based speech synthesis. IEICE Transactions on Information and Systems, E90-D(5).

33 EXPERIMENTS — Analysis
- Why a larger dynamic range (higher GV)? Generation passes the RMDN output \hat{c}_{1:T,d} through the AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k})
Figure: frequency response of H_d(z) for MGC, d \in [1, D].

34 EXPERIMENTS — Results
- Samples: RNN, RNN+MLPG, RMDN, AR-RMDN, natural; with and without formant enhancement (formant enhancement was used in the subjective evaluation)
- Other samples: tonywangx.github.io

35 CONTENTS
- Introduction
- Method of this work
- Experiments and results
- Conclusion

36 CONCLUSION
- Model: AR-RMDN, a pair of an AR filter and an RMDN
- Results: the AR synthesis filter acts as a 'low-pass' filter and the AR analysis filter as a 'high-pass' filter (see pp. 45-46); the generated trajectories have a larger dynamic range and better perceived quality
- Subjective evaluation: MUSHRA test, rates from 0 to 100

37 RECENT WORK
- Inverse AR filter with complex poles, parameterized using sigmoid & tanh functions (please see the attached slides, p. 58)
- Generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}, with the RMDN part modeling \prod_{t=1}^{T} p(c_t; M_t)

38 FUTURE WORK
- Sampling from the model
Figure: F0 (Hz) for NAT, AR-RMDN, and AR-RMDN (sampling) (utterance BC0_nancy_APDC-66-00).
- Is AR still too weak to capture temporal correlation?

39 Thank you for your attention. Q & A
- Toolkit, scripts, slides, and samples: tonywangx.github.io

40 ARGUMENT — AR vs RNN as the output layer (cf. Zen & Sak, 2015)
To simplify the equations, suppose the GMM has one mixture component with \Sigma_t = I, so M_t = \{\mu_t\} (the RMDN is then equivalent to an RNN trained with MSE; note that \mu_t, not \hat{o}_t, is the output of the RMDN), and consider only two frames.
For example, the baseline (no RNN output layer, no AR) gives:
  p(o_{1:2}; M_{1:2}) = p(o_1; M_1) p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2)^2}{2})
Zen, H., & Sak, H. (2015). Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP.

41 ARGUMENT — AR vs RNN as the output layer
RNN output layer: suppose the weight on the RNN output layer is a, with no bias:
  p(o_{1:2}; M_{1:2}) = p(o_1; M_1) p(o_2; M_2) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2})

42 ARGUMENT — AR vs RNN as the output layer
AR: suppose the AR weight is a, with no bias:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a o_1)^2}{2})

43 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, RNN's case:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a\mu_1)^2}{2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2')^2}{2}) = p(o_1; M_1) p(o_2; M_2')
- This just changes the distribution's parameter, \mu_2' = \mu_2 + a\mu_1; the frames are still independent
- The RNN output layer builds dependency between the distribution parameters

44 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, AR's case:
  p(o_{1:2}; M_{1:2}) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_1 - \mu_1)^2}{2}) \cdot \frac{1}{\sqrt{2\pi}} \exp(-\frac{(o_2 - \mu_2 - a o_1)^2}{2}) = \frac{1}{2\pi} \exp(-\frac{1}{2}(o - \mu)^T \Phi (o - \mu))
  where o = [o_1, o_2]^T, \mu = [\mu_1, \mu_2 + a\mu_1]^T, and the precision matrix is \Phi = [[1 + a^2, -a], [-a, 1]] (covariance \Phi^{-1} = [[1, a], [a, 1 + a^2]])
- AR builds dependency between the random variables: the 2-dim Gaussian distribution has a non-diagonal covariance

45 ARGUMENT — AR vs RNN as the output layer
Difference in the probability density function, AR's case. Let's compare (o - \mu)^T \Phi (o - \mu) and (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2.
Let m = [m_1, m_2]^T = o - \mu. Then
  m^T \Phi m = [m_1, m_2] [[1 + a^2, -a], [-a, 1]] [m_1; m_2] = m_1^2 (1 + a^2) - 2a m_1 m_2 + m_2^2 = m_1^2 + (m_2 - a m_1)^2
As m_1 = o_1 - \mu_1 and m_2 = o_2 - \mu_2 - a\mu_1,
  m^T \Phi m = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a\mu_1 - a o_1 + a\mu_1)^2 = (o_1 - \mu_1)^2 + (o_2 - \mu_2 - a o_1)^2
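The identity above can be confirmed numerically for arbitrary values (a toy check, not from the slides):

```python
import numpy as np

# check m^T Phi m = (o1 - mu1)^2 + (o2 - mu2 - a*o1)^2 for random values
rng = np.random.default_rng(0)
o1, o2, mu1, mu2, a = rng.standard_normal(5)

m = np.array([o1 - mu1, o2 - mu2 - a * mu1])   # m = o - mu
Phi = np.array([[1 + a**2, -a], [-a, 1.0]])    # precision matrix
quad_joint = m @ Phi @ m                        # joint-Gaussian quadratic form
quad_ar = (o1 - mu1)**2 + (o2 - mu2 - a * o1)**2  # factorized AR form
```

Note also that det(\Phi) = 1, which is why the joint density keeps the simple 1/(2\pi) normalizer.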

46 ARGUMENT — AR vs RNN as the output layer
Difference in random sampling, RNN's case:
1. Calculate M_1 by the neural network; we then have p(o_1; M_1)
2. Calculate M_2 by the neural network and M_1; we then have p(o_2; M_2)
3. Draw samples \hat{o}_1 and \hat{o}_2 from p(o_1; M_1) and p(o_2; M_2); the sample \hat{o}_1 doesn't influence p(o_2; M_2)

47 ARGUMENT — AR vs RNN as the output layer
Difference in random sampling, AR's case:
1. Calculate M_1 by the neural network; we then have p(o_1; M_1)
2. Draw a sample \hat{o}_1 from p(o_1; M_1)
3. Calculate M_2 by the neural network and \hat{o}_1
4. Draw a sample \hat{o}_2 from p(o_2; M_2)
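The AR sampling procedure above is ancestral sampling; a minimal sketch (the network-predicted means are replaced by a fixed array, and the function name is illustrative):

```python
import numpy as np

def sample_ar(mu, a, sigma, rng=None):
    """Ancestral sampling for a 1st-order scalar AR model:
    o_t ~ N(mu_t + a * o_{t-1}, sigma^2), with o_0 = 0."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mu = np.asarray(mu, dtype=float)
    o = np.zeros_like(mu)
    prev = 0.0
    for t, m in enumerate(mu):
        o[t] = m + a * prev + sigma * rng.standard_normal()
        prev = o[t]               # each sample feeds the next frame's mean
    return o

# with sigma = 0 the recursion is deterministic: o = [1, 1.5, 1.75]
print(sample_ar([1.0, 1.0, 1.0], 0.5, 0.0))
```

In the RNN's case the loop would use the predicted mean, not the drawn sample, as the feedback, so the samples stay mutually independent.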

48 ARGUMENT — AR vs RNN as the output layer
Difference in training using back-propagation:
- RNN's case: the gradient w.r.t. a is (o_2 - \mu_2 - a\mu_1)\mu_1
- AR's case: the gradient w.r.t. a is (o_2 - \mu_2 - a o_1) o_1

49 ARGUMENT — Sampling from the model
Figure: F0 (Hz) for NAT, AR-RMDN, and AR-RMDN (sampling) (utterance BC0_nancy_APDC-66-00).
- Is AR still too weak for temporal correlation? YES! Why? Consider the feature transformation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}

50 ARGUMENT — Sampling from the model
- AR is still weak for temporal correlation. Why? In matrix form, the synthesis filter is the lower-triangular transformation
  [[1, 0, ..., 0], [a_d, 1, ..., 0], [a_d^2, a_d, ..., 0], ...] [c_{1,d}; ...; c_{T,d}] = [o_{1,d}; ...; o_{T,d}]

51 ARGUMENT — Sampling from the model
Call the lower-triangular matrix above the feature transformation matrix H.
1. Vector c contains samples from i.i.d. Gaussian distributions
2. Vector o then follows a Gaussian distribution with covariance Cov(o) = H Cov(c) H^T
3. Unfortunately, the off-diagonal elements of Cov(o) decay too quickly (exponentially) because of H
4. Similar results hold for high-order AR
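The exponential decay of Cov(o) can be checked directly on a small 1st-order example (a toy sketch; T and a_d are illustrative):

```python
import numpy as np

T, a_d = 8, 0.8
A = np.eye(T) - a_d * np.eye(T, k=-1)   # analysis matrix of a 1st-order AR
H = np.linalg.inv(A)                    # feature transformation matrix H
cov_o = H @ H.T                         # Cov(o) = H Cov(c) H^T with Cov(c) = I

# H is lower-triangular with H[i, j] = a_d^(i-j), so covariances between
# frames k steps apart shrink geometrically like a_d^k
```

This is exactly the slide's point: however the RMDN sets the per-frame means, sampling through H can only add correlations that die off geometrically with the lag.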

52 ARGUMENT — A simple experiment
- Set T = 100 frames and let c_t = \sin(\omega t) + \epsilon, where \epsilon \sim \mathcal{N}(0, I)
- Use the matrix H from the trained AR model to transform c into o (red line)
- Define a transformation matrix based on the Gaussian-process (GP) kernel \Sigma_{i,j} = \exp(-\frac{1}{2}(\frac{i-j}{l})^2) and transform c into another vector (green line)
Figure: the covariance matrix based on AR vs the covariance matrix based on the GP kernel, and the corresponding generated trajectories.
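The contrast between the two covariance structures can be sketched without the trained model (the length scale l and the AR coefficient a_d below are illustrative; the slide does not give their values):

```python
import numpy as np

T, l, a_d = 100, 5.0, 0.8
idx = np.arange(T)

# squared-exponential (GP) kernel: Sigma_ij = exp(-0.5 * ((i - j) / l)^2)
Sigma = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / l) ** 2)

# stationary correlation of a 1st-order AR process at lag k is a_d^k
lag = np.arange(10)
gp_corr = np.exp(-0.5 * (lag / l) ** 2)
ar_corr = a_d ** lag
# at moderate lags the AR correlation has already collapsed while the
# GP kernel still couples the frames strongly
```

This is the sense in which the AR transformation matrix is "too simple": its correlation shape is fixed to a geometric decay, whereas a GP kernel can hold correlation over a chosen length scale.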

53 ARGUMENT — A simple experiment
- So: the transformation matrix of the AR model is too simple
Figure: covariance matrices based on AR and on the GP kernel, with the trajectories generated from each.

54 EXPERIMENTS — Objective results
Table: objective measures (MGC RMSE, F0 RMSE, F0 correlation) compared across RNN, RMDN, RNN+MLPG, and AR-RMDN.

55 EXPERIMENTS — Analysis
- A larger dynamic range? Modulation spectrum [11,12] of the 30th dimension of MGC
Figure: MS of MGC (dB) for NAT, RNN, RNN+MLPG, RMDN, and AR-RMDN.
- Generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}) -> \hat{o}_{1:T,d}
[11] Hermansky, H. (1997). The modulation spectrum in the automatic recognition of speech. In Proc. ASRU.
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP.

56 EXPERIMENTS — Analysis (the training process)
- The AR model in the training stage, for the 1st-order AR on MGC: A_d(z) = 1 - \tanh(\hat{\rho}_d) z^{-1}
Figure: feature trajectory and modulation spectrum of the 30th MGC, original vs filtered data; frequency responses of A_1(z), A_5(z), A_30(z), and A_60(z).
- Training: o_{1:T,d} -> AR analysis filter A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k} -> c_{1:T,d}; the RMDN part models \prod_{t=1}^{T} p(c_t; M_t)

57 EXPERIMENTS — Analysis (the training process)
- The AR model in the training stage: the 1st-order AR on F0
Figure: feature trajectory and modulation spectrum of the interpolated F0, original vs filtered data; frequency response of A(z).

58 EXPERIMENTS — Analysis (the training process)
- The 2nd-order AR filter A_d(z) = (1 - a_1 z^{-1})(1 - a_2 z^{-1}) for MGC is also high-pass
Figure: values of a_1 and a_2 per order of the MGC; frequency responses of A_1(z), A_5(z), A_30(z), and A_60(z).

59 EXPERIMENTS — Analysis (the training process)
- The learned AR analysis filter is high-pass
Figure: frequency response of A(z) for AR orders 1 to 6 (the model is re-trained for each order).

60 EXPERIMENTS II — Comparison with a post-filtering method [12]
- AR-RMDN: RMDN part \prod_{t=1}^{T} p(c_t; M_t) + AR synthesis filter H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k})
- RMDN-MS: RMDN + modulation-spectrum-based post-filter applied to \hat{o}'_{1:T,d}
Systems: RMDN (recurrent mixture density network); RMDN-MS (RMDN with a modulation-spectrum-based post-filter); AR-RMDN (the proposed model)
[12] Takamichi, S., Toda, T., Neubig, G., Sakti, S., & Nakamura, S. (2014). A postfilter to modify the modulation spectrum in HMM-based speech synthesis. In Proc. ICASSP.

61 EXPERIMENTS II — Results
Figure: 30th dimension of MGC (utterance BC0_nancy_APDC-66-00) and the modulation spectrum of the 30th dimension of MGC, for NAT, RMDN, RMDN-MS, and AR-RMDN.

62 EXPERIMENTS II — Compare MGC only
- Samples: RMDN, RMDN-MS, AR-RMDN, natural; with and without formant enhancement
- All systems used the same F0 generated by the RMDN

63 EXPERIMENTS II — Compare MGC + F0
- Samples: RMDN, RMDN-MS, AR-RMDN, natural; with and without formant enhancement

64 EXPERIMENTS II — Wrap up
Figure: rated quality (from 0 (min) to 100 (max)) for RMDN; RMDN-MS and AR-RMDN enhancing only F0; RMDN-MS and AR-RMDN enhancing only MGC; and the full RMDN-MS and AR-RMDN.
- Why? AR-RMDN avoids FFT/iFFT on the acoustic trajectories
- Note: RMDN-MS could be made better [13]
[13] Takamichi, S. (2016). Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis. PhD thesis, Nara Institute of Science and Technology.

65 INTERPRETATION — Signals and filters in AR-RMDN
- Signal: T frames, D dimensions; o_t \in \mathbb{R}^D is the feature vector at frame t
- Define the acoustic feature trajectories o_{1:T,d} = [o_{1,d}, ..., o_{T,d}]^T, d \in [1, D]

66 INTERPRETATION — Signals and filters in AR-RMDN
- Signal c_{1:T,d}:
1. Consider a 1st-order AR model f(o_{t-1}) = a \odot o_{t-1} + b, where a = [a_1, ..., a_D]^T
2. Look into the GMM with AR:
  p(o_t | o_{t-1}; M_t) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - f(o_{t-1,d}) - \mu_{t,d}^m)^2}{2\sigma_{t,d}^{m2}}) = \sum_{m=1}^{M} w_t^m \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi \sigma_{t,d}^{m2}}} \exp(-\frac{(o_{t,d} - a_d o_{t-1,d} - \mu_{t,d}^m - b_d)^2}{2\sigma_{t,d}^{m2}})
3. Define c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D]
4. Define c_{1:T,d} = [c_{1,d}, ..., c_{T,d}]^T, d \in [1, D]

67 INTERPRETATION — Signal and filter in AR-RMDN
- Filters. Still consider a 1st-order AR, where c_{t,d} = o_{t,d} - a_d o_{t-1,d}, d \in [1, D], t \in [1, T]. Then o_{1:T,d} and c_{1:T,d} are related by
  [[1, 0, ..., 0], [-a_d, 1, ..., 0], ..., [0, ..., -a_d, 1]] [o_{1,d}; ...; o_{T,d}] = [c_{1,d}; ...; c_{T,d}]
- A_d(z) = 1 - a_d z^{-1}: o_{1:T,d} (input signal) -> c_{1:T,d} (filtered signal), a filter with finite impulse response

68 INTERPRETATION — Signal and filter in AR-RMDN
- Filters, in general, from o_{1:T,d} to c_{1:T,d}: with f(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b, the banded lower-triangular matrix with 1 on the diagonal and -a_{1,d}, ..., -a_{K,d} on the first K subdiagonals maps [o_{1,d}; ...; o_{T,d}] to [c_{1,d}; ...; c_{T,d}]
- A_d(z) = 1 - \sum_{k=1}^{K} a_{k,d} z^{-k}: o_{1:T,d} -> c_{1:T,d}, a filter with finite impulse response

69 INTERPRETATION — Signal and filter in AR-RMDN
- Filters, in general, from c_{1:T,d} to o_{1:T,d}: the inverse of the banded matrix above maps [c_{1,d}; ...; c_{T,d}] to [o_{1,d}; ...; o_{T,d}]
- H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}): c_{1:T,d} -> o_{1:T,d}, a filter with infinite impulse response

70 IMPLEMENTATION — Stability of the filter
- How to make H_d(z) = 1 / (1 - \sum_{k=1}^{K} a_{k,d} z^{-k}), d \in [1, D], stable? Constraints and trainable parameters:
  1. No constraint: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}); trainable parameters a_k
  2. Real poles inside the unit circle: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k}) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1}), with \rho_k = \tanh(\hat{\rho}_k)
  3. One or no real pole & pairs of complex poles inside the unit circle: each conjugate pole pair contributes a factor 1 / (1 - p_k z^{-1} + q_k z^{-2}) (plus one real-pole factor when K is odd), with the pole radius and angle parameterized through sigmoid and tanh functions so that the poles stay inside the unit circle (please read the report)
- Please find the technical report SLP-5 Kagawa on tonywangx.github.io

71 EXPERIMENTS II — High-order AR filter
- Learned AR filter on F0 data; generation: \hat{c}_{1:T,d} -> AR synthesis filter H_d(z) -> \hat{o}_{1:T,d}
Systems and constraints on the AR filter H_d(z):
- U6: 6th-order, unconstrained: H(z) = 1 / (1 - \sum_{k=1}^{K} a_k z^{-k})
- R6: 6th-order, stable, with real poles: H(z) = \prod_{k=1}^{K} 1 / (1 - \rho_k z^{-1})
- C6: 6th-order, stable, with complex poles: H(z) = \prod_{k=1}^{K/2} 1 / (1 - p_k z^{-1} + q_k z^{-2})
- Note: the constraints are used in both the training and generation stages

72 EXPERIMENTS II — High-order AR filter
- Learned AR filter on F0 data
Figure: pole plots (real vs imaginary axis) for U6 (unconstrained), R6 (stable, real poles), and C6 (stable, complex poles), and the magnitude responses (dB) of the three filters.

73 EXPERIMENTS II — High-order AR filter
- Generated F0 (utterance BC0_nancy_APDC-66-00), for NAT, U, R6, and C6
- U6 generated very large F0 values (unstable IIR filter), so the lower-order unconstrained system U is plotted instead
- Note the vibration of the F0 in C6
Figure: magnitude responses (dB) of U6, R6, and C6.


More information

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 10: Acoustic Models Statistical NLP Spring 2009 The Noisy Channel Model Lecture 10: Acoustic Models Dan Klein UC Berkeley Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Statistical NLP Spring The Noisy Channel Model

Statistical NLP Spring The Noisy Channel Model Statistical NLP Spring 2009 Lecture 10: Acoustic Models Dan Klein UC Berkeley The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.

More information

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018

Deep Learning Sequence to Sequence models: Attention Models. 17 March 2018 Deep Learning Sequence to Sequence models: Attention Models 17 March 2018 1 Sequence-to-sequence modelling Problem: E.g. A sequence X 1 X N goes in A different sequence Y 1 Y M comes out Speech recognition:

More information

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya

More information

RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS

RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS RARE SOUND EVENT DETECTION USING 1D CONVOLUTIONAL RECURRENT NEURAL NETWORKS Hyungui Lim 1, Jeongsoo Park 1,2, Kyogu Lee 2, Yoonchang Han 1 1 Cochlear.ai, Seoul, Korea 2 Music and Audio Research Group,

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Recurrent Neural Networks

Recurrent Neural Networks Recurrent Neural Networks Datamining Seminar Kaspar Märtens Karl-Oskar Masing Today's Topics Modeling sequences: a brief overview Training RNNs with back propagation A toy example of training an RNN Why

More information

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer.

Mark Gales October y (x) x 1. x 2 y (x) Inputs. Outputs. x d. y (x) Second Output layer layer. layer. University of Cambridge Engineering Part IIB & EIST Part II Paper I0: Advanced Pattern Processing Handouts 4 & 5: Multi-Layer Perceptron: Introduction and Training x y (x) Inputs x 2 y (x) 2 Outputs x

More information

Nonparametric and Parametric Defined This text distinguishes between systems and the sequences (processes) that result when a WN input is applied

Nonparametric and Parametric Defined This text distinguishes between systems and the sequences (processes) that result when a WN input is applied Linear Signal Models Overview Introduction Linear nonparametric vs. parametric models Equivalent representations Spectral flatness measure PZ vs. ARMA models Wold decomposition Introduction Many researchers

More information

Linear Prediction 1 / 41

Linear Prediction 1 / 41 Linear Prediction 1 / 41 A map of speech signal processing Natural signals Models Artificial signals Inference Speech synthesis Hidden Markov Inference Homomorphic processing Dereverberation, Deconvolution

More information

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models

The Noisy Channel Model. Statistical NLP Spring Mel Freq. Cepstral Coefficients. Frame Extraction ... Lecture 9: Acoustic Models Statistical NLP Spring 2010 The Noisy Channel Model Lecture 9: Acoustic Models Dan Klein UC Berkeley Acoustic model: HMMs over word positions with mixtures of Gaussians as emissions Language model: Distributions

More information

Speech and Language Processing

Speech and Language Processing Speech and Language Processing Lecture 5 Neural network based acoustic and language models Information and Communications Engineering Course Takahiro Shinoaki 08//6 Lecture Plan (Shinoaki s part) I gives

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Recurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta

Recurrent Autoregressive Networks for Online Multi-Object Tracking. Presented By: Ishan Gupta Recurrent Autoregressive Networks for Online Multi-Object Tracking Presented By: Ishan Gupta Outline Multi Object Tracking Recurrent Autoregressive Networks (RANs) RANs for Online Tracking Other State

More information

EE-559 Deep learning Recurrent Neural Networks

EE-559 Deep learning Recurrent Neural Networks EE-559 Deep learning 11.1. Recurrent Neural Networks François Fleuret https://fleuret.org/ee559/ Sun Feb 24 20:33:31 UTC 2019 Inference from sequences François Fleuret EE-559 Deep learning / 11.1. Recurrent

More information

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017

Modelling Time Series with Neural Networks. Volker Tresp Summer 2017 Modelling Time Series with Neural Networks Volker Tresp Summer 2017 1 Modelling of Time Series The next figure shows a time series (DAX) Other interesting time-series: energy prize, energy consumption,

More information

arxiv: v4 [cs.cl] 5 Jun 2017

arxiv: v4 [cs.cl] 5 Jun 2017 Multitask Learning with CTC and Segmental CRF for Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, and Noah A Smith Toyota Technological Institute at Chicago, USA School of Computer Science, Carnegie

More information

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.)

Computer Vision Group Prof. Daniel Cremers. 2. Regression (cont.) Prof. Daniel Cremers 2. Regression (cont.) Regression with MLE (Rep.) Assume that y is affected by Gaussian noise : t = f(x, w)+ where Thus, we have p(t x, w, )=N (t; f(x, w), 2 ) 2 Maximum A-Posteriori

More information

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,

More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Analysis of polyphonic audio using source-filter model and non-negative matrix factorization Tuomas Virtanen and Anssi Klapuri Tampere University of Technology, Institute of Signal Processing Korkeakoulunkatu

More information

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING Liang Lu and Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu, s.renals}@ed.ac.uk ABSTRACT

More information

Signal representations: Cepstrum

Signal representations: Cepstrum Signal representations: Cepstrum Source-filter separation for sound production For speech, source corresponds to excitation by a pulse train for voiced phonemes and to turbulence (noise) for unvoiced phonemes,

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Statistical NLP Spring Digitizing Speech

Statistical NLP Spring Digitizing Speech Statistical NLP Spring 2008 Lecture 10: Acoustic Models Dan Klein UC Berkeley Digitizing Speech 1 Frame Extraction A frame (25 ms wide) extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon

More information

Digitizing Speech. Statistical NLP Spring Frame Extraction. Gaussian Emissions. Vector Quantization. HMMs for Continuous Observations? ...

Digitizing Speech. Statistical NLP Spring Frame Extraction. Gaussian Emissions. Vector Quantization. HMMs for Continuous Observations? ... Statistical NLP Spring 2008 Digitizing Speech Lecture 10: Acoustic Models Dan Klein UC Berkeley Frame Extraction A frame (25 ms wide extracted every 10 ms 25 ms 10ms... a 1 a 2 a 3 Figure from Simon Arnfield

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien

Independent Component Analysis and Unsupervised Learning. Jen-Tzung Chien Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood

More information

ADAPTIVE FILTER THEORY

ADAPTIVE FILTER THEORY ADAPTIVE FILTER THEORY Fourth Edition Simon Haykin Communications Research Laboratory McMaster University Hamilton, Ontario, Canada Front ice Hall PRENTICE HALL Upper Saddle River, New Jersey 07458 Preface

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

L7: Linear prediction of speech

L7: Linear prediction of speech L7: Linear prediction of speech Introduction Linear prediction Finding the linear prediction coefficients Alternative representations This lecture is based on [Dutoit and Marques, 2009, ch1; Taylor, 2009,

More information

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries

Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Gate Activation Signal Analysis for Gated Recurrent Neural Networks and Its Correlation with Phoneme Boundaries Yu-Hsuan Wang, Cheng-Tao Chung, Hung-yi

More information

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech

More information

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates

Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Spectral and Textural Feature-Based System for Automatic Detection of Fricatives and Affricates Dima Ruinskiy Niv Dadush Yizhar Lavner Department of Computer Science, Tel-Hai College, Israel Outline Phoneme

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Lecture 11 Recurrent Neural Networks I

Lecture 11 Recurrent Neural Networks I Lecture 11 Recurrent Neural Networks I CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor niversity of Chicago May 01, 2017 Introduction Sequence Learning with Neural Networks Some Sequence Tasks

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

NEURAL LANGUAGE MODELS

NEURAL LANGUAGE MODELS COMP90042 LECTURE 14 NEURAL LANGUAGE MODELS LANGUAGE MODELS Assign a probability to a sequence of words Framed as sliding a window over the sentence, predicting each word from finite context to left E.g.,

More information

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10 Introduction In

More information

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability

Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Exploring Granger Causality for Time series via Wald Test on Estimated Models with Guaranteed Stability Nuntanut Raksasri Jitkomut Songsiri Department of Electrical Engineering, Faculty of Engineering,

More information

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation

More information

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 4. Gaussian Processes - Regression Group Prof. Daniel Cremers 4. Gaussian Processes - Regression Definition (Rep.) Definition: A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression

Computer Vision Group Prof. Daniel Cremers. 9. Gaussian Processes - Regression Group Prof. Daniel Cremers 9. Gaussian Processes - Regression Repetition: Regularized Regression Before, we solved for w using the pseudoinverse. But: we can kernelize this problem as well! First step:

More information

Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language

Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language INTERSPEECH 2014 Generation of F 0 Contour Using Deep Boltzmann Machine and Twin Gaussian Process Hybrid Model for Bengali Language Sankar Mukherjee, Shyamal Kumar Das Mandal Centre for Educational Technology

More information

Recurrent and Recursive Networks

Recurrent and Recursive Networks Neural Networks with Applications to Vision and Language Recurrent and Recursive Networks Marco Kuhlmann Introduction Applications of sequence modelling Map unsegmented connected handwriting to strings.

More information

Very Deep Convolutional Neural Networks for LVCSR

Very Deep Convolutional Neural Networks for LVCSR INTERSPEECH 2015 Very Deep Convolutional Neural Networks for LVCSR Mengxiao Bi, Yanmin Qian, Kai Yu Key Lab. of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering SpeechLab,

More information

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1

GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS. Mitsuru Kawamoto 1,2 and Yujiro Inouye 1 GENERALIZED DEFLATION ALGORITHMS FOR THE BLIND SOURCE-FACTOR SEPARATION OF MIMO-FIR CHANNELS Mitsuru Kawamoto,2 and Yuiro Inouye. Dept. of Electronic and Control Systems Engineering, Shimane University,

More information

Deep Learning for Speech Recognition. Hung-yi Lee

Deep Learning for Speech Recognition. Hung-yi Lee Deep Learning for Speech Recognition Hung-yi Lee Outline Conventional Speech Recognition How to use Deep Learning in acoustic modeling? Why Deep Learning? Speaker Adaptation Multi-task Deep Learning New

More information

Gaussian Processes for Audio Feature Extraction

Gaussian Processes for Audio Feature Extraction Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline

More information

Independent Component Analysis and Unsupervised Learning

Independent Component Analysis and Unsupervised Learning Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent

More information

Proc. of NCC 2010, Chennai, India

Proc. of NCC 2010, Chennai, India Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay

More information

arxiv: v2 [cs.ne] 7 Apr 2015

arxiv: v2 [cs.ne] 7 Apr 2015 A Simple Way to Initialize Recurrent Networks of Rectified Linear Units arxiv:154.941v2 [cs.ne] 7 Apr 215 Quoc V. Le, Navdeep Jaitly, Geoffrey E. Hinton Google Abstract Learning long term dependencies

More information

Long-Short Term Memory and Other Gated RNNs

Long-Short Term Memory and Other Gated RNNs Long-Short Term Memory and Other Gated RNNs Sargur Srihari srihari@buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Sequence Modeling

More information

Highway-LSTM and Recurrent Highway Networks for Speech Recognition

Highway-LSTM and Recurrent Highway Networks for Speech Recognition Highway-LSTM and Recurrent Highway Networks for Speech Recognition Golan Pundak, Tara N. Sainath Google Inc., New York, NY, USA {golan, tsainath}@google.com Abstract Recently, very deep networks, with

More information

An exploration of dropout with LSTMs

An exploration of dropout with LSTMs An exploration of out with LSTMs Gaofeng Cheng 1,3, Vijayaditya Peddinti 4,5, Daniel Povey 4,5, Vimal Manohar 4,5, Sanjeev Khudanpur 4,5,Yonghong Yan 1,2,3 1 Key Laboratory of Speech Acoustics and Content

More information

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks Tara N Sainath, Bo Li Google, Inc New York, NY, USA {tsainath, boboli}@googlecom Abstract Various neural network

More information

Recurrent Neural Network

Recurrent Neural Network Recurrent Neural Network Xiaogang Wang xgwang@ee..edu.hk March 2, 2017 Xiaogang Wang (linux) Recurrent Neural Network March 2, 2017 1 / 48 Outline 1 Recurrent neural networks Recurrent neural networks

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析

Chapter 9. Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 Chapter 9 Linear Predictive Analysis of Speech Signals 语音信号的线性预测分析 1 LPC Methods LPC methods are the most widely used in speech coding, speech synthesis, speech recognition, speaker recognition and verification

More information

Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network

Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network Time series forecasting using ARIMA and Recurrent Neural Net with LSTM network AvinashNath [1], Abhay Katiyar [2],SrajanSahu [3], Prof. Sanjeev Kumar [4] [1][2][3] Department of Information Technology,

More information

Modeling the creaky excitation for parametric speech synthesis.

Modeling the creaky excitation for parametric speech synthesis. Modeling the creaky excitation for parametric speech synthesis. 1 Thomas Drugman, 2 John Kane, 2 Christer Gobl September 11th, 2012 Interspeech Portland, Oregon, USA 1 University of Mons, Belgium 2 Trinity

More information

On Moving Average Parameter Estimation

On Moving Average Parameter Estimation On Moving Average Parameter Estimation Niclas Sandgren and Petre Stoica Contact information: niclas.sandgren@it.uu.se, tel: +46 8 473392 Abstract Estimation of the autoregressive moving average (ARMA)

More information