Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis


1 ICASSP 2018, Calgary, Canada
Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis
Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi YAMAGISHI
National Institute of Informatics, Japan & Aalto University, Finland
Contact: we welcome critical comments, suggestions, and discussion

2 OVERVIEW
• Motivation: better modules for the statistical parametric speech synthesis (SPSS) framework?
• Method: plug and test new acoustic models and waveform generators
• Results: best combination = WaveNet-based vocoder + autoregressive (AR) acoustic models; quality as good as vocoded speech (at 16 kHz)

3 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

4 INTRODUCTION: Background
• Conventional TTS pipeline [1]: Text → Front-end → Linguistic features → Back-end → Speech
• SPSS back-end [2,3]: Linguistic features → Acoustic models → spectral features (MGC & BAP) and F0 → Waveform generators → Speech
MGC: mel-generalized cepstral coefficients [4]; BAP: band aperiodicity
[1] T. Dutoit. An Introduction to Text-to-Speech Synthesis. Kluwer Academic Publishers, Norwell, MA, USA, 1997.
[2] K. Tokuda et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5), 2013.
[3] H. Zen et al. Statistical parametric speech synthesis. Speech Communication, 51, 2009.
[4] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, 1994.

5 INTRODUCTION: Topic of this work
• Better modules for the SPSS back-end?
  - Acoustic models: recurrent neural networks (RNNs), autoregressive (AR) models, generative adversarial networks (GANs)
  - Waveform generators: WORLD vocoder (+ phase recovery), WaveNet-based vocoder
MGC: mel-generalized cepstral coefficients; BAP: band aperiodicity

6 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

7 MODELS & METHODS
Pipeline: Linguistic features → Acoustic models (RNN) → MGC & BAP & F0 → Waveform generators (WORLD) → Speech waveforms

8 MODELS & METHODS: Acoustic models
• Baseline RNN: maps the sequence of linguistic features $\{x_1, \dots, x_t, \dots\}$ to the sequence of generated acoustic features $\{\hat{o}_1, \dots, \hat{o}_t, \dots\}$

9 MODELS & METHODS: Acoustic models
• Baseline RNN [5]: a probabilistic model $\mathcal{M}_t$ per frame, computed by a neural network $\mathcal{H}^{(RNN)}$:
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$
where $\mathcal{M}_t = \{\mu_t\}$, $\mu_t = \mathcal{H}^{(RNN)}(x_{1:T}, t)$, and generation uses $\hat{o}_t = \mu_t$ (see the sketch below).
[5] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
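Since the covariance is fixed to $I$, the maximum-likelihood criterion above reduces to a frame-wise mean squared error. A minimal NumPy sketch of that reduction (an illustration, not the authors' implementation):

```python
import numpy as np

def neg_log_likelihood(o, mu):
    """o, mu: (T, D) natural features and RNN-predicted means."""
    T, D = o.shape
    const = 0.5 * T * D * np.log(2.0 * np.pi)
    # With identity covariance, -log prod_t N(o_t; mu_t, I) is a sum of
    # squared errors plus a constant, i.e. plain MSE training.
    return 0.5 * np.sum((o - mu) ** 2) + const

# Generation simply copies the means: o_hat_t = mu_t.
```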

10 MODELS & METHODS: Acoustic models
• Baseline RNN limitations:
  1. Conditional independence across frames
  2. Maximum-likelihood training
→ addressed by AR models and GANs, respectively
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, I)$$

11 MODELS & METHODS: Acoustic models
• Shallow AR (SAR) [6], e.g. with $K = 2$ (sketch below):
$$p(o_{1:T} \mid x_{1:T}; \Theta, \Phi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}\bigl(o_t; \mu_t + f_{\Phi}(o_{t-K:t-1}), I\bigr)$$
• Alternative interpretation: trainable filter + RNN
[6] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, 2017.
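A hedged sketch of the SAR mean shift, assuming the simple linear form $f_\Phi(o_{t-K:t-1}) = \sum_k a_k \cdot o_{t-k}$ per feature dimension; the names are illustrative, not from the released code:

```python
import numpy as np

def sar_shifted_means(mu, o_prev, a):
    """mu: (T, D) RNN means; o_prev: (T, D) past targets (teacher forcing);
    a: (K, D) trainable filter coefficients, one set per dimension."""
    T, D = mu.shape
    K = a.shape[0]
    shifted = mu.copy()
    for t in range(T):
        for k in range(1, K + 1):
            if t - k >= 0:
                shifted[t] += a[k - 1] * o_prev[t - k]
    return shifted  # training: p(o_t | ...) = N(o_t; shifted[t], I)

# At generation time the same shift is applied, but feeding back the
# *generated* o_hat_{t-k} instead of natural frames.
```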

12 MODELS & METHODS: Acoustic models
• Deep AR (DAR) [7]:
$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$
• Used only for (quantized) F0 modeling
[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Processing (accepted).
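A sketch of DAR-style generation for quantized F0; `rnn_step` is a hypothetical stand-in for the network in [7], returning a new recurrent state and class logits, so the model conditions on the full history $o_{1:t-1}$ through the feedback link:

```python
import numpy as np

def dar_generate(x, rnn_step, h0, n_classes, rng):
    """x: (T, Dx) linguistic features; returns quantized F0 class indices."""
    h, prev = h0, 0                          # dummy initial class
    out = []
    for t in range(x.shape[0]):
        h, logits = rnn_step(h, x[t], prev)  # feedback of o_{t-1}
        p = np.exp(logits - logits.max())
        p /= p.sum()                         # softmax
        prev = int(rng.choice(n_classes, p=p))   # or np.argmax(p)
        out.append(prev)
    return np.array(out)
```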

13 MODELS & METHODS: Acoustic models
• GAN: GAN-based post-filter [8]
  - A residual generator takes the acoustic model's generated features plus noise and adds a residual
  - A discriminator classifies natural vs. refined acoustic features (0/1)
[8] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino. Generative adversarial network-based postfilter for statistical parametric speech synthesis. In Proc. ICASSP, 2017.
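A rough sketch of the adversarial objective behind such a post-filter, assuming hypothetical callables `G` (residual generator) and `D` (discriminator) that map feature arrays to arrays; this is the standard non-saturating GAN loss, not necessarily the exact recipe of [8]:

```python
import numpy as np

def gan_postfilter_losses(G, D, gen_feat, nat_feat, noise):
    """gen_feat: features from the acoustic model; nat_feat: natural ones."""
    refined = gen_feat + G(gen_feat, noise)        # residual generator
    eps = 1e-7                                     # numerical safety
    d_loss = -np.mean(np.log(D(nat_feat) + eps)
                      + np.log(1.0 - D(refined) + eps))
    g_loss = -np.mean(np.log(D(refined) + eps))    # non-saturating G loss
    return d_loss, g_loss
```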

14 MODELS & METHODS: Acoustic models
Overview: Linguistic features → Acoustic models (RNN, SAR, DAR, GAN) → F0 & MGC → Waveform generators (WORLD) → Speech waveforms (BAP is not shown)

15 MODELS & METHODS: Acoustic models
WORLD-based systems: RNN-Wo, SAR-Wo, RGA-Wo (RNN+GAN), SGA-Wo (SAR+GAN); F0 is modeled by DAR (BAP is not shown)

16 MODELS & METHODS: Waveform generators
• Deterministic approaches
  - WORLD [9]: binary voicing decision; minimum phase
  - A log-domain pulse model (PML) [10]: source-filter model, additive in the log domain; binary noise mask
  - WORLD + phase recovery: generated waveform → STFT amplitude → phase recovery [11] → inverse STFT → phase-recovered waveform (see the sketch below)
[9] M. Morise et al. WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. on Information and Systems, 99(7), 2016.
[10] G. Degottex et al. A log domain pulse model for parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[11] D. Griffin and J. Lim. Signal estimation from modified short-time Fourier transform. IEEE Trans. ASSP, 32(2), 1984.
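A minimal Griffin-Lim loop [11] with SciPy, illustrating the "WORLD + phase recovery" idea; the window sizes and iteration count are placeholder values, not the paper's configuration:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=50, fs=16000, nperseg=1024, noverlap=768):
    """mag: target |STFT|, shaped like scipy.signal.stft output."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, y = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
        _, _, spec = stft(y, fs=fs, nperseg=nperseg, noverlap=noverlap)
        phase = np.exp(1j * np.angle(spec))  # keep phase, re-impose magnitude
    _, y = istft(mag * phase, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return y
```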

17 MODELS & METHODS: Waveform generators
• WaveNet-based vocoder [12, 13]
  How to generate (search) a waveform:
  1. Exploration: random sampling
  2. Exploitation: picking the one-best
[12] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[13] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda. Speaker-dependent WaveNet vocoder. In Proc. Interspeech, 2017.

18 MODELS & METHODS: Waveform generators
• WaveNet-based vocoder: mixed generation (sketch below)
  - Random sampling in unvoiced regions; picking the one-best in voiced regions
  - Less distortion of harmonics (see appendix & paper)
[Figure: waveform level vs. sampling point, with the unvoiced-segment probability and the voiced region marked]
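A sketch of the mixed generation strategy, assuming a hypothetical `wavenet_step` that returns the softmax distribution over the μ-law levels for the next sample given the history and conditioning:

```python
import numpy as np

def generate_mixed(wavenet_step, cond, voiced_mask, n_levels=1024, seed=0):
    """cond: per-sample conditioning; voiced_mask: bool flag per sample."""
    rng = np.random.default_rng(seed)
    samples = []
    for t in range(len(voiced_mask)):
        p = wavenet_step(samples, cond[t])   # softmax over mu-law levels
        if voiced_mask[t]:
            samples.append(int(np.argmax(p)))               # exploitation
        else:
            samples.append(int(rng.choice(n_levels, p=p)))  # exploration
    return np.array(samples)
```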

19 MODELS & METHODS: All combinations
• Acoustic models: RNN, SAR, DAR (F0), GAN, driven by linguistic features
• Waveform generators: WORLD, PML (minimum phase), phase recovery, WaveNet
• Systems: RNN-Wo, RGA-Wo, SGA-Wo, SAR-Wo, SAR-Pm, SAR-Pr, SAR-Wa

20 CONTENTS
• Introduction
• Models and modules
• Experiments
• Summary

21 EXPERIMENTS: Configuration
• Data: ATR Ximera F009 voice [14]
  - ~30,000 utterances, 48 hours; recording period: over 1 year
  - Sampling rate: 48 kHz; Japanese, neutral style, reading
• Features
  - Linguistic: phone identity, prosodic tags, ... (~390 dimensions)
  - Acoustic: MGC (60), BAP (25), F0 (1)
• Front-end: Open JTalk [15]
[14] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda. Ximera: A new TTS from ATR based on corpus-based technologies. In Proc. SSW5, 2004.
[15] HTS Working Group. The Japanese TTS System Open JTalk, 2015.

22 EXPERIMENTS: Configuration
• Listening test
  - Quality: MOS (1-5); similarity: rated 1-5 against a natural 48 kHz reference
  - Participants: 235 native Japanese listeners; 1,500 sets of results
• Systems
  - Common network configuration (cf. the paper)
  - Without Δ, ΔΔ, nor formant enhancement
  - Sampling rate: 48 kHz & 16 kHz, except SAR-Wa at 16 kHz (10-bit μ-law)

  System   | Acoustic model   | Waveform generator
  ---------|------------------|-----------------------
  Natural  | (natural speech) |
  Abs-Pm   | natural features | PML
  Abs-Wo   | natural features | WORLD
  SAR-Wa   | SAR              | WaveNet
  SAR-Pr   | SAR              | WORLD + phase recovery
  SAR-Pm   | SAR              | PML
  SAR-Wo   | SAR              | WORLD
  SGA-Wo   | SAR + GAN        | WORLD
  RGA-Wo   | RNN + GAN        | WORLD
  RNN-Wo   | RNN              | WORLD

23 EXPERIMENTS
[Figure: the ten systems (Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, RNN-Wo) evaluated at 48 kHz and 16 kHz]

24 EXPERIMENTS: Quality scores
[Figure: MOS quality scores for Natural, Abs-Pm, Abs-Wo, SAR-Wa, SAR-Pr, SAR-Pm, SAR-Wo, SGA-Wo, RGA-Wo, and RNN-Wo; the scores are not recoverable from the extracted text]

25 EXPERIMENTS: Similarity scores
[Figure: similarity scores for the same ten systems; the scores are not recoverable from the extracted text]

26 CONTENTS
• Introduction
• Models and methods
• Experiments
• Summary

27 SUMMARY: Plug and test
• SAR
  - Avoids the conditional-independence assumption
  - Alleviates the over-smoothing effect (appendix & paper)
• WaveNet
  - Generation method: one-best generation + random sampling
  - Less distortion of harmonics (appendix & paper)
• WaveNet vocoder + SAR & DAR
  - Better than the other combinations
  - Worse than natural speech, but close to vocoded speech

28 FURTHER IMPROVEMENT?
Recent work (appendix)
• SAR: a special case of a volume-preserving normalizing flow [16]; extended SAR = time-variant filter + RNN
• WaveNet vocoders: training based on generated conditional features
• Annotated linguistic features: reduce the gap between natural & synthetic speech
Future work?
• Reduce the variability of the recordings
• Use complex-valued neural models
[16] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.

29 MESSAGE
Code, recipes, slides:
• Acoustic models & WaveNet (CUDA/C++)
• A simple explanation of WaveNet and the acoustic models
tonywangx.github.io  wangxin@nii.ac.jp

30 Thank you for your attention. Q & A
tonywangx.github.io  wangxin@nii.ac.jp

31 APPENDIX - WAVENET
Contents: conditional network; WaveNet back-end; learning curve; generation method; generation with other acoustic models; training based on generated features
No cherry picking: all samples are based on generated acoustic features or automatically inferred linguistic features.
Tutorial slides:

32 APPENDIX - WAVENET: Structure
• WaveNet part (clock rate: 16 kHz): models $P(o_t \mid o_{t-R:t-1}, c_{1:N})$; a linear input layer feeds WaveNet blocks 1..M, whose skip outputs $s_t^{(1)}, \dots, s_t^{(M)}$ are summed and passed through a post-processing network (linear + tanh, linear + tanh, softmax)
• Conditional feature network (clock rate: 200 Hz, frame shift = 5 ms): F0 and $c_{1:N}$ → Bi-LSTM → CNN → concatenation → linear → up-sampling to the waveform clock, producing $l_t$
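The clock-rate conversion at the bottom of this diagram amounts to up-sampling frame-level features (200 Hz) to the waveform rate (16 kHz); a minimal sketch, assuming simple repetition as the up-sampling operator:

```python
import numpy as np

def upsample_conditions(frames, frame_shift_ms=5, sample_rate=16000):
    """frames: (N, D) frame-level features -> (N * 80, D) at 16 kHz."""
    samples_per_frame = sample_rate * frame_shift_ms // 1000  # 80
    return np.repeat(frames, samples_per_frame, axis=0)
```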

33 APPENDIX - WAVENET: Conditional network
• Architecture trials for the conditional network (inputs: F0 and MGC $c_{1:N}$; output $l_{1:N}$):
  - Trial 1: linear layer only
  - Trial 2: Bi-LSTM → CNN → linear
  - Trial 3: Bi-LSTM → CNN → concatenation with a skip F0 connection → linear

34 APPENDIX - WAVENET: Conditional network
• Experiments on the WaveNet vocoder, given generated MGC/F0 (audio samples omitted):
  - Trial 1: none (linear only)
  - Trial 2: LSTM+CNN
  - Trial 3: LSTM+CNN+skip-F0

35 APPENDIX - WAVENET: WaveNet back-end
• Architecture: text → text analyzer → textual features → F0 model (DAR) → WaveNet back-end → waveform
• The WaveNet back-end only uses random sampling (audio samples comparing natural, WaveNet vocoder, and WaveNet back-end omitted)

36 APPENDIX - WAVENET: Learning curves
[Figure: negative log-likelihood on the training and validation sets per epoch, for the WaveNet vocoder and the WaveNet back-end]
We trained the WaveNet back-end for more than 100 epochs.

37 APPENDIX - WAVENET: Generation method
[Figure: per-sample output probability over the 1024 μ-law waveform levels vs. sampling point]

38 APPENDIX - WAVENET: Generation method
• Experiments on the WaveNet vocoder
[Figure: waveform level vs. sampling point for the two generation strategies]

39 APPENDIX - WAVENET: Generation method
• One-best + random sampling, given generated MGC/F0 (audio samples omitted):
  - Mix: one-best in voiced regions; random sampling in unvoiced regions
  - Random: random sampling all the time

40 [Audio demo: natural speech vs. "voiced: one-best / unvoiced: sampling" vs. "voiced: sampling / unvoiced: sampling"]

41 APPENDIX - WAVENET: Generation method
• One-best + random sampling:
  - Mix: voiced regions one-best; unvoiced regions random sampling
  - Mix2: one-best for 75% of voiced frames; random sampling elsewhere
  - Mix4: one-best for 25% of voiced frames; random sampling elsewhere

42 INVESTIGATION: Generation method
• One-best + random sampling (audio samples omitted): Mix (100%), Mix2 (75%), Mix3 (50%), Mix4 (20%), Random (0% one-best in voiced frames)

43 APPENDIX - WAVENET: Generation method
• One-best + random sampling (audio samples omitted): natural, WaveNet back-end (Mix, Random), WaveNet vocoder (Mix)

44 [Audio demo: natural vs. WaveNet back-end (Mix) vs. WaveNet back-end (Random)]

45 APPENDIX - WAVENET: WaveNet vocoder + other acoustic models
• Audio samples (omitted) for: SAR + DAR, SGA + DAR, RGA + DAR, RNN + DAR, extended SAR + DAR

46 APPENDIX - WAVENET: WaveNet vocoder trained on generated MGC
• Audio samples (omitted): natural; WaveNet back-end; WaveNet vocoder trained on natural MGC; WaveNet vocoder trained on generated MGC at epochs 35, 45, and 55

47 APPENDIX - ACOUSTIC MODELS
Contents: general comparison; SAR & DAR; SAR extension
More details:

48 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (1st and 5th dimensions) vs. frame index for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA); utterance ATR Ximera F009 AOZORAR T01]

49 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (15th and 31st dimensions) vs. frame index for Natural, RNN, and SAR; same utterance]

50 APPENDIX - ACOUSTIC MODELS: General comparison
• Generated trajectories
[Figure: generated MGC trajectories (15th and 31st dimensions) vs. frame index for Natural, RNN, RNN+GAN (RGA), SAR, and SAR+GAN (SGA); same utterance]

51 APPENDIX - ACOUSTIC MODELS: General comparison
• Global variance (GV)
[Figure: utterance-level GV of MGC per dimension for Natural, SAR, SAR+GAN (SGA), RNN+GAN (RGA), and RNN; and the modulation spectrum of the 31st MGC dimension]

52 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example: SAR versus an RMDN with a recurrent output layer [11]
  - Both have a hidden layer $h_t$ driven by $x_t$ and an output layer producing $\mu_t$; the RMDN feeds back the previous mean $\mu_{t-1}$ through a weight $w_\mu$, while SAR feeds back the previous output $o_{t-1}$ through a coefficient $a$
  - Assume $o_t \in \mathbb{R}$, $\sigma_t = 1$, and linear activation functions
[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, 2015.

53 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example
  - RMDN: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + w_\mu \mu_1, 1)$, with $\mu_1 = w^\top h_1 + b$ and $\mu_2 = w^\top h_2 + b$ (so the full frame-2 mean is $w^\top h_2 + b + w_\mu \mu_1 = \mu_2 + w_\mu \mu_1$)
  - SAR: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1)$, with $\mu_1 = w^\top h_1 + b$ and $\mu_2 = w^\top h_2 + b$

54 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example: dependency between the $\mu_t$ or between the $o_t$?
  - RMDN: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + w_\mu \mu_1, 1) = \frac{1}{2\pi}\exp\bigl(-\tfrac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\bigr)$, with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \mu_2 + w_\mu \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  - SAR: $p(o_{1:2}) = \mathcal{N}(o_1; \mu_1, 1)\,\mathcal{N}(o_2; \mu_2 + a o_1, 1) = \frac{1}{2\pi}\exp\bigl(-\tfrac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\bigr)$, with $o = [o_1, o_2]^\top$, $\mu = [\mu_1, \mu_2 + a \mu_1]^\top$, $\Sigma = \begin{bmatrix} 1 & a \\ a & 1+a^2 \end{bmatrix}$

55 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• Toy SAR example, via the change of variables $c = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 - a o_1 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ -a & 1 \end{bmatrix}\begin{bmatrix} o_1 \\ o_2 \end{bmatrix} = A o$:
  - RMDN view: $p(o) = p(c) = \mathcal{N}(c; \mu_c, \Sigma_c)$, with $\mu_c = [\mu_1, \mu_2]^\top$ and $\Sigma_c = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$
  - SAR view: $p(o) = \mathcal{N}(o; \mu_o, \Sigma_o)$, with $o = [o_1, o_2]^\top$, $\mu_o = [\mu_1, \mu_2 + a \mu_1]^\top$, $\Sigma_o = \begin{bmatrix} 1 & a \\ a & 1+a^2 \end{bmatrix}$
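A quick numeric check of the claim above: with $c = Ao$ and $c \sim \mathcal{N}(\mu_c, I)$, the implied covariance of $o$ is $A^{-1}A^{-\top}$, which equals the SAR covariance on this slide:

```python
import numpy as np

a = 0.7
A = np.array([[1.0, 0.0], [-a, 1.0]])            # c = A o
Sigma_o = np.linalg.inv(A) @ np.linalg.inv(A).T  # covariance of o
print(Sigma_o)                                   # [[1.   0.7 ] [0.7  1.49]]
assert np.allclose(Sigma_o, [[1.0, a], [a, 1.0 + a * a]])
```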

56 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• SAR as an invertible linear feature/model transformation. For $o_{1:T} \in \mathbb{R}^{D \times T}$, SAR is equivalent to transforming each feature dimension with its own matrix: $c_{1:T,d} = A^{(d)} o_{1:T,d}$ for $d = 1, \dots, D$
  - Training: transform $o_{1:T}$ to $c_{1:T}$ with $A^{(1)}, \dots, A^{(D)}$ and maximize $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
  - Generation: draw $\hat{c}_{1:T}$ from $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ given $x_{1:T}$, then de-transform with $A^{(1)^{-1}}, \dots, A^{(D)^{-1}}$ to get $\hat{o}_{1:T}$

57 APPENDIX - ACOUSTIC MODELS: SAR & DAR
• SAR as an invertible linear feature/model transformation, in filter form: dimension $d$ of $o_{1:T}$ is filtered by $A_d(z)$ (sketch below)
  - Training: $o_{1:T}$ → filters $A_1(z), \dots, A_D(z)$ → $c_{1:T}$, maximizing $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$
  - Generation: $\hat{c}_{1:T}$ → inverse filters $1/A_1(z), \dots, 1/A_D(z)$ → $\hat{o}_{1:T}$
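The filter view can be checked directly with `scipy.signal.lfilter`: applying $A(z)$ as an FIR filter and then $1/A(z)$ as the matching IIR filter is an exact round trip. A sketch with arbitrary example coefficients:

```python
import numpy as np
from scipy.signal import lfilter

a = np.array([0.5, -0.2])           # example AR coefficients, K = 2
b = np.concatenate(([1.0], -a))     # A(z) = 1 - a1 z^-1 - a2 z^-2

o = np.random.default_rng(0).standard_normal(100)
c = lfilter(b, [1.0], o)            # analysis:  c = A(z) o   (training)
o_rec = lfilter([1.0], b, c)        # synthesis: o = c / A(z) (generation)
assert np.allclose(o, o_rec)        # exact round trip
```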

58 APPENDIX - ACOUSTIC MODELS: SAR & DAR
[Figure: magnitude responses (dB) of $A_1(z)$ and of the inverse filter $1/A_1(z)$ vs. frequency bin]
• Is the effect only due to $1/A_d(z)$? No: it is due to the pair $\{A_d(z), 1/A_d(z)\}$, which gives less mismatch between $c_{1:T}$ and the RMDN

59 APPENDIX - ACOUSTIC MODELS: SAR extension: normalizing flow
• Basic idea [13, 14]: transform $c_{1:T} = f_\Phi(o_{1:T})$, train by maximizing $\prod_{t=1}^{T} p(c_t; \mathcal{M}_t)$ given $x_{1:T}$, and generate via $\hat{o}_{1:T} = f_\Phi^{-1}(\hat{c}_{1:T})$
  - $f_\Phi(\cdot)$ must be invertible, and the Jacobian must be simple:
$$p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$$
[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, 2016.

60 APPENDIX - ACOUSTIC MODELS: SAR extension: normalizing flow
• Basic idea: $p_o(o_{1:T} \mid x_{1:T}) = p_c(c_{1:T} \mid x_{1:T}) \left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right|$
  - SAR transform: $c_t = o_t - \sum_{k=1}^{K} a_k\, o_{t-k}$; de-transform: $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k\, \hat{o}_{t-k}$
  - AR flow transform: $c_t = o_t - \sum_{k=1}^{K} f^{(k)}_\Phi(o_{1:t-k})\, o_{t-k}$; de-transform: $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} f^{(k)}_\Phi(\hat{o}_{1:t-k})\, \hat{o}_{t-k}$ (see the sketch below)
  - For SAR the Jacobian is simple: $\left| \det \frac{\partial c_{1:T}}{\partial o_{1:T}} \right| = 1$ (volume preserving); in the extension, the time-variant coefficients $f^{(k)}_\Phi$ are computed by an RNN from the past outputs (extended SAR = time-variant filter + RNN)
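A sketch of the time-variant AR transform and de-transform above; `f` is a hypothetical stand-in for the coefficient network $f^{(k)}_\Phi$, and the de-transform must run sequentially because each output feeds back into later coefficients:

```python
import numpy as np

def ar_flow_transform(o, f, K):
    """o -> c; f(k, history) returns the k-th time-variant coefficient."""
    c = o.astype(float).copy()
    for t in range(len(o)):
        c[t] -= sum(f(k, o[:t - k + 1]) * o[t - k]
                    for k in range(1, K + 1) if t - k >= 0)
    return c

def ar_flow_detransform(c, f, K):
    """c -> o_hat; exact inverse of the transform, run sequentially."""
    o = np.zeros_like(c, dtype=float)
    for t in range(len(c)):
        o[t] = c[t] + sum(f(k, o[:t - k + 1]) * o[t - k]
                          for k in range(1, K + 1) if t - k >= 0)
    return o
```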

61 APPENDIX - ACOUSTIC MODELS: SAR extension
• SAR can be extended. More details:

62 APPENDIX - ACOUSTIC MODELS: DAR
• Same autoregressive principle as SAR, but with a non-invertible, nonlinear feedback
[Figure: MOS scores for NAT, DAR, SAR, RMDN, and RNN, with a table of pairwise p-values (all differences from NAT are significant, p < 1e-30); and utterance-level GV of F0 (Hz) for the same systems. Exact values are not recoverable from the extracted text.]
[7] X. Wang, S. Takaki, and J. Yamagishi. Autoregressive neural F0 model for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech and Language Processing (accepted).
