Autoregressive Neural Models for Statistical Parametric Speech Synthesis
1 Autoregressive Neural Models for Statistical Parametric Speech Synthesis
Xin WANG (シンワン)
We welcome critical comments, suggestions, and discussion.
3 CONTENTS
- Introduction
- Models and methods
- Summary
4 INTRODUCTION
Text-to-speech (TTS) pipeline [1,2]: Text -> Front-end -> Linguistic features -> Back-end -> Speech.
Statistical parametric speech synthesis (SPSS) [3,4]: acoustic models predict acoustic features (spectral features and the fundamental frequency, F0), which a vocoder converts to speech.
[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-Speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5).
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51.
5 INTRODUCTION — Topic of this talk
The acoustic model maps linguistic features x_{1:T} = {x_1, ..., x_T} to acoustic features o_{1:T} = {o_1, ..., o_T}; that is, it models p(o_{1:T} | x_{1:T}; Θ), where x_t ∈ R^{D_x}, o_t ∈ R^{D_o}, and T is the number of frames.
6 INTRODUCTION — Roadmap
From naïve models toward better time-dependency modeling: feedforward network (FNN) -> recurrent network (RNN) -> autoregressive (AR) neural models -> (ideally) a perfect model.
7 CONTENTS
- Introduction
- Models and methods
  - Baseline models
- Summary
8 BASELINE MODELS — FNN
Computation flow: each output frame depends only on the input frame at the same time step:
ô_t = H^(FNN)(x_t), with x_t ∈ R^{D_x}, o_t ∈ R^{D_o}.
9 BASELINE MODELS — FNN as a probabilistic model [6]
Mixture density network (MDN) [6]: the network outputs distribution parameters M_t for every frame:
p(o_{1:T} | x_{1:T}; Θ) = ∏_{t=1}^{T} p(o_t; M_t) = ∏_{t=1}^{T} N(o_t; μ_t, I)
where M_t = {μ_t} with μ_t = H^(FNN)(x_t), and mean-based generation uses ô_t = μ_t.
[6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, New York, 2006.
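As a minimal sketch of this factorized Gaussian likelihood (function and variable names are illustrative, not from the authors' toolkit), assuming identity covariance:

```python
import numpy as np

def mdn_neg_log_likelihood(o, mu):
    """Per-sequence negative log-likelihood of the frame-wise Gaussian MDN:
    p(o_{1:T} | x_{1:T}) = prod_t N(o_t; mu_t, I).
    o, mu: arrays of shape (T, D)."""
    T, D = o.shape
    # log N(o_t; mu_t, I) = -D/2 * log(2*pi) - 1/2 * ||o_t - mu_t||^2
    log_probs = -0.5 * D * np.log(2 * np.pi) - 0.5 * np.sum((o - mu) ** 2, axis=1)
    return -np.sum(log_probs)

rng = np.random.default_rng(0)
mu = rng.standard_normal((5, 3))      # toy network outputs, T=5, D=3
# Mean-based generation (o_hat = mu) minimizes the residual term.
nll_at_mean = mdn_neg_log_likelihood(mu, mu)
```

Training the FNN/MDN amounts to minimizing this quantity with respect to the network parameters that produce μ_t.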
10 BASELINE MODELS — RNN as a recurrent MDN (RMDN) [7]
p(o_{1:T} | x_{1:T}; Θ) = ∏_{t=1}^{T} p(o_t; M_t) = ∏_{t=1}^{T} N(o_t; μ_t, I)
where M_t = {μ_t} with μ_t = H^(RNN)(x_{1:T}, t), and mean-based generation uses ô_t = μ_t.
[7] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, 1999.
11 BASELINE MODELS — Model assumption
(Conditional) independence of o_{1:T}:
p(o_{1:T} | x_{1:T}; Θ) = p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t; M_t)
To verify the assumption [8], compare mean-based generation (ô_t = μ_t) against random sampling (ô_t ~ p(o_t; M_t)).
[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
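The diagnostic can be reproduced on synthetic data; the smooth sine below is only a stand-in for a real mean trajectory. Under frame-wise independence, samples are white noise around the mean, which is easy to detect from frame-to-frame roughness:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
mu = np.sin(np.linspace(0, 6 * np.pi, T))   # toy smooth "mean trajectory"

o_mean = mu                                  # mean-based generation
o_samp = mu + rng.standard_normal(T)         # sampling from N(mu_t, 1)

# Frame-to-frame differences: the sample is far rougher than the mean,
# because the frames are conditionally independent given x_{1:T}.
rough_mean = np.std(np.diff(o_mean))
rough_samp = np.std(np.diff(o_samp))
```

A natural F0 contour is smooth, so a sampled trajectory this rough signals that the independence assumption is wrong.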
12 BASELINE MODELS — Model assumption: F0 example
[Figure: (top) p(o_{1:T}; M_{1:T}) from the RMDN: mean trajectory with ±1 std range vs. frame index; (bottom) natural F0 (NAT) vs. a random RMDN sample, F0 in Hz (utterance BC2011 nancy APDC). The sampled F0 fluctuates frame by frame instead of following a natural contour.]
13 CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - AR models
- Summary
14 AR MODELS — Improve the baseline models
Directed graphical models — AR models [8]:
p(o_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1})
(FNN and RNN are the special case with no dependency on past o.)
Undirected graphical models — trajectory model [9] (static + dynamic features (o, Δo, Δ²o) with MLPG [10]):
p(o_{1:T}) = (1/Z) ∏_k f_k(o_{1:T})
[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
[9] H. Zen, K. Tokuda, and T. Kitamura. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech & Language, 21(1), 2007.
[10] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, 2000.
15 AR MODELS — Improve the baseline models
AR models [8]: p(o_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1})
Three variants covered in this talk:
1. Shallow AR model (SAR)
2. AR flow
3. Deep AR model (DAR)
16 SHALLOW AR MODEL — Definition
Each frame depends on the previous K frames:
p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t)
17 SHALLOW AR MODEL — Implementation (example: K = 2)
p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t) = ∏_{t=1}^{T} N(o_t; μ_t + F(o_{t-K:t-1}), Σ_t)
F(o_{t-K:t-1}) = Σ_{k=1}^{K} a_k o_{t-k} + b
The AR function F is time-invariant; K is a hyper-parameter; {a_k, b} are trainable parameters. This is the shallow AR (SAR) model.
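A minimal sketch of mean-based SAR generation for one feature dimension (the coefficient values here are hypothetical; in the real model {a_k, b} are learned jointly with the network):

```python
import numpy as np

def sar_generate(mu, a, b=0.0):
    """Mean-based SAR generation for one dimension:
    o_t = mu_t + sum_{k=1}^{K} a[k-1] * o_{t-k} + b, with o_t = 0 for t < 1."""
    K = len(a)
    o = []
    for t, m in enumerate(mu):
        shift = b
        for k in range(1, K + 1):
            if t - k >= 0:                 # zero history before the sequence
                shift += a[k - 1] * o[t - k]
        o.append(m + shift)
    return np.array(o)

# With K = 1, a_1 = 0.9 and a constant mu, the trajectory converges to
# mu / (1 - a_1): the AR term acts as a low-pass (smoothing) filter.
traj = sar_generate(np.full(200, 1.0), a=[0.9])
```

The smoothing behaviour is exactly why SAR trajectories look less over-averaged than frame-independent RMDN samples.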
18 SHALLOW AR MODEL — Theory
SAR versus an RMDN with a recurrent output layer [11]. Assume o_t ∈ R, Σ_t = 1, and linear activation functions.
[Diagram: the RMDN feeds back the previous mean μ_{t-1} through weight w_μ; SAR feeds back the previous observation o_{t-1} through weight a.]
[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, 2015.
19 SHALLOW AR MODEL — Theory
RMDN (recurrent output layer):
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1)
with μ_1 = w^T h_1 + b and μ_2 = w^T h_2 + b.
SAR:
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1)
with μ_1 = w^T h_1 + b and μ_2 = w^T h_2 + b.
20 SHALLOW AR MODEL — Theory
Writing both joint distributions as one bivariate Gaussian
p(o_{1:2}) = (1 / (2π |Σ|^{1/2})) exp(-(1/2)(o - μ)^T Σ^{-1} (o - μ)), with o = [o_1, o_2]^T:
RMDN: μ = [μ_1, μ_2 + w_μ μ_1]^T, Σ = [[1, 0], [0, 1]]
SAR:  μ = [μ_1, μ_2 + a μ_1]^T,  Σ = [[1, a], [a, 1 + a^2]]
Key question: should the dependency be between the means μ_t or between the observations o_t? Only SAR induces correlation between o_1 and o_2.
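The SAR covariance above can be checked numerically: writing the AR dependency as a linear transform c = A o (so c_2 = o_2 - a o_1) and drawing c with identity covariance, the covariance of o = A^{-1} c is A^{-1} A^{-T}:

```python
import numpy as np

a = 0.7
A = np.array([[1.0, 0.0],
              [-a,  1.0]])        # c = A o, i.e. c_2 = o_2 - a * o_1

# If c ~ N(mu_c, I), then o = A^{-1} c has covariance A^{-1} A^{-T}.
A_inv = np.linalg.inv(A)
Sigma_o = A_inv @ A_inv.T

# The closed form from the slide.
expected = np.array([[1.0, a],
                     [a,   1.0 + a * a]])
```

The unit diagonal of A also means |det A| = 1, so the transform costs nothing in the likelihood.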
21 SHALLOW AR MODEL — Experiments
Data: Blizzard Challenge 2011 corpus; x_t similar to HTS English [12]; o_t: MGC, interpolated F0, U/V, and BAP.
Networks:
- RNN: feedforward -> feedforward -> Bi-LSTM -> Bi-LSTM
- RMDN: the same architecture with an MDN output layer
- SAR: K = 1, 2, 0 for MGC, F0, and BAP, respectively
- RNN+MLPG: RNN predicting static + dynamic features (o, Δo, Δ²o) with MLPG
[12] K. Tokuda, H. Zen, and A. W. Black. An HMM-based speech synthesis system applied to English. In Proc. SSW, 2002.
22 SHALLOW AR MODEL — Experiments: results (1/12/18)
[Figure: generated trajectories of the 1st and 15th MGC dimensions vs. frame index for NAT, RNN, RNN+MLPG, RMDN, and SAR (utterance BC2011_nancy_APDC).]
23 SHALLOW AR MODEL — Experiments: results
[Figure: generated trajectories of the 30th MGC dimension and of F0 in Hz (after U/V classification) vs. frame index for NAT, RNN, RNN+MLPG, RMDN, and SAR (utterance BC2011_nancy_APDC).]
24 SHALLOW AR MODEL — Experiments: results
[Figure: global variance (GV) of the generated MGC per coefficient order for NAT, RNN, RNN+MLPG, RMDN, and SAR.]
25 SHALLOW AR MODEL — Experiments: samples
Audio samples: RNN, RNN+MLPG, RMDN, SAR, and natural speech, with and without formant enhancement.
26 SHALLOW AR MODEL — Interpretation: feature transformation
SAR is equivalent to modeling a transformed variable c = A o. For the two-frame case:
c = [c_1, c_2]^T = [o_1, o_2 - a o_1]^T = [[1, 0], [-a, 1]] [o_1, o_2]^T = A o
RMDN on c: p(o) = p(c) = N(c; μ_c, Σ_c), with μ_c = [μ_1, μ_2]^T, Σ_c = [[1, 0], [0, 1]]
Equivalent SAR on o: p(o) = N(o; μ_o, Σ_o), with μ_o = [μ_1, μ_2 + a μ_1]^T, Σ_o = [[1, a], [a, 1 + a^2]]
27 SHALLOW AR MODEL — Interpretation: feature transformation
For o_{1:T} ∈ R^{D×T}, each dimension d gets its own transform: c_{1:T,d} = A^(d) o_{1:T,d}.
SAR is therefore equivalent to:
- Training: transform o_{1:T} to c_{1:T} via A^(1), ..., A^(D), then fit ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
- Generation: draw ĉ_{1:T} from ∏_{t=1}^{T} p(c_t; M_t), then invert: ô_{1:T,d} = (A^(d))^{-1} ĉ_{1:T,d}.
28 SHALLOW AR MODEL — Interpretation: signals and filters
Equivalently, each A^(d) is an FIR analysis filter A_d(z), and its inverse is the IIR synthesis filter 1/A_d(z):
- Training: filter each dimension of o_{1:T} with A_d(z) to obtain c_{1:T}, then model ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
- Generation: sample ĉ_{1:T} and pass it through 1/A_d(z) to obtain ô_{1:T}.
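The analysis/synthesis pair can be sketched directly in NumPy for one dimension (coefficient values are hypothetical). Filtering with A(z) = 1 - Σ_k a_k z^{-k} and then with 1/A(z) reconstructs the signal exactly:

```python
import numpy as np

def analysis_filter(o, a):
    """FIR filter A(z): c_t = o_t - sum_k a[k-1] * o_{t-k}."""
    K = len(a)
    c = o.copy()
    for k in range(1, K + 1):
        c[k:] -= a[k - 1] * o[:-k]
    return c

def synthesis_filter(c, a):
    """IIR filter 1/A(z): o_t = c_t + sum_k a[k-1] * o_{t-k},
    computed recursively on the reconstructed output."""
    K = len(a)
    o = np.zeros_like(c)
    for t in range(len(c)):
        o[t] = c[t]
        for k in range(1, K + 1):
            if t - k >= 0:
                o[t] += a[k - 1] * o[t - k]
    return o

rng = np.random.default_rng(2)
a = [1.2, -0.5]                      # example K = 2 coefficients
o = rng.standard_normal(100)
o_rec = synthesis_filter(analysis_filter(o, a), a)
```

The round trip being lossless is the point of the interpretation: SAR only changes which signal (c instead of o) the RMDN has to model.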
29 SHALLOW AR MODEL — Secret of SAR
[Figure: magnitude responses (dB) of the learned analysis filter A_1(z) and synthesis filter 1/A_1(z), compared with H_1(z), vs. frequency bin (π/1024).]
Is the gain only due to the synthesis filter 1/A_d(z)? No: it comes from the analysis/synthesis pair {A_d(z), 1/A_d(z)}, which reduces the mismatch between c_{1:T} and what the RMDN can model.
30 CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
- Summary
31 AR FLOW — Is SAR good enough?
[Figure: randomly sampled F0 (Hz) vs. frame index: natural F0, RMDN sample, and SAR sample (utterance BC2011 nancy APDC). The SAR sample is still far from natural.]
Reason: the AR filter with coefficients a_k is linear and time-invariant.
32 AR FLOW — Theory: change of random variable
Training: transform o_{1:T} with c = A o, then model ∏_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
Generation: sample ĉ_{1:T} from ∏_{t=1}^{T} p(c_t; M_t), then invert: ô = A^{-1} ĉ.
33 AR FLOW — Theory: change of random variable
Generalize the linear transform to c_{1:T} = f(o_{1:T}), with generation ô_{1:T} = f^{-1}(ĉ_{1:T}). Requirements: f must be invertible, and its Jacobian must be simple to compute:
p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det(∂c_{1:T}/∂o_{1:T})|
This is the idea of normalizing flows [13] and inverse autoregressive flow [14].
[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, 2016.
34 AR FLOW — Basic idea
p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det(∂c_{1:T}/∂o_{1:T})|
SAR transform:     c_t = o_t - Σ_{k=1}^{K} a_k o_{t-k};           de-transform: ô_t = ĉ_t + Σ_{k=1}^{K} a_k ô_{t-k}
AR flow transform: c_t = o_t - Σ_{k=1}^{K} f^(k)(o_{1:t-k}) o_{t-k}; de-transform: ô_t = ĉ_t + Σ_{k=1}^{K} f^(k)(ô_{1:t-k}) ô_{t-k}
A very simple instance: the Jacobian determinant equals 1, with μ_t = RNN(o_{1:t-1}).
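A toy version of the AR-flow transform with K = 1, where a simple tanh stands in for the RNN-computed, data-dependent coefficient (the function `f_coef` is hypothetical, not the talk's architecture). Because the transform is autoregressive with a unit-diagonal Jacobian, |det J| = 1 and it can be inverted sequentially:

```python
import numpy as np

def f_coef(history):
    """Toy data-dependent coefficient standing in for the RNN
    (the real model computes it from o_{1:t-1})."""
    return 0.5 * np.tanh(history[-1]) if len(history) else 0.0

def flow_transform(o):
    """c_t = o_t - f(o_{1:t-1}) * o_{t-1}  (K = 1 AR flow)."""
    c = np.empty_like(o)
    for t in range(len(o)):
        c[t] = o[t] - f_coef(o[:t]) * (o[t - 1] if t > 0 else 0.0)
    return c

def flow_inverse(c):
    """o_t = c_t + f(o_{1:t-1}) * o_{t-1}, filled in sequentially:
    at step t, the needed history o_{1:t-1} is already reconstructed."""
    o = np.empty_like(c)
    for t in range(len(c)):
        o[t] = c[t] + f_coef(o[:t]) * (o[t - 1] if t > 0 else 0.0)
    return o

rng = np.random.default_rng(3)
o = rng.standard_normal(50)
o_rec = flow_inverse(flow_transform(o))
```

Unlike SAR's fixed a_k, the coefficient here varies with the data, which is what makes the flow non-linear while keeping it invertible.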
35 AR FLOW — Implementation: training
Compute μ_t = RNN(o_{1:t-1}) (with the first μ set to 0) and transform c_t = o_t + μ_t; the c_t are then modeled by the output distributions M_t given x_{1:T}.
36 AR FLOW — Implementation: generation
Sample ĉ_t from M_t, compute μ_t = RNN(ô_{1:t-1}), and de-transform: ô_t = ĉ_t - μ_t.
37 AR FLOW — Experiments on MGC
Data: Japanese corpus; neither MLPG nor formant enhancement is used.
Networks: RNN, SAR, and AR flow (RNN + AR flow: uni-directional LSTM + feedforward).
38 AR FLOW — Experiments on MGC: results
[Figure: generated trajectories of the 1st and 30th MGC dimensions.]
39 AR FLOW — Experiments on MGC: results
[Figure: GV of the generated MGC and modulation spectrum of the 30th MGC dimension for natural speech, RNN, AR flow, and SAR.]
40 AR FLOW — Experiments on MGC: samples
Audio samples: RNN, SAR, AR flow, and natural speech. All systems use natural duration and F0 generated by the RNN. Experiments on F0 remain to be done.
41 CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary
42 DEEP AR MODEL — Random sampling from the AR flow?
Random sampling from the AR flow (ô_{1:T} = f^{-1}(ĉ_{1:T}) with ĉ_{1:T} ~ ∏_{t=1}^{T} p(c_t; M_t)) still does not work well.
One way forward: build the AR dependency directly into the network that maps x_{1:T} to o_{1:T}.
43 DEEP AR MODEL — Definition
[Diagram: the network input at step t includes the previous output o_{t-1}, so each distribution M_t depends on all previous outputs.]
44 DEEP AR MODEL — Definition
p(o_{1:T}; M_{1:T}) = ∏_{t=1}^{T} p(o_t | o_{1:t-1}; M_t)
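A toy sketch of generation from a deep AR model: the sampled output is fed back as a network input, so every step conditions on past outputs. The recurrence below is a hypothetical stand-in, not the actual DAR architecture:

```python
import numpy as np

def toy_dar_step(x_t, o_prev, h_prev):
    """One recurrent step of a toy deep AR model: the previous output
    o_{t-1} is a network input, so the predicted distribution
    p(o_t | o_{1:t-1}; M_t) depends on past outputs."""
    h_t = np.tanh(0.8 * h_prev + 0.5 * o_prev + x_t)   # toy recurrence
    mu_t = 2.0 * h_t                                    # M_t = {mu_t}
    return mu_t, h_t

def dar_generate(x, rng=None):
    """Sample a trajectory (or mean-generate if rng is None)."""
    o_prev, h_prev, out = 0.0, 0.0, []
    for x_t in x:
        mu_t, h_prev = toy_dar_step(x_t, o_prev, h_prev)
        o_t = mu_t + (rng.standard_normal() if rng is not None else 0.0)
        out.append(o_t)
        o_prev = o_t          # feedback: the sample conditions the future
    return np.array(out)

x = np.zeros(20)
mean_traj = dar_generate(x)
samp_traj = dar_generate(x, np.random.default_rng(4))
```

Because the noise injected at each step is propagated through the recurrence, sampled trajectories stay temporally coherent instead of being white around the mean.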
45 DEEP AR MODEL — Experiments
On MGC: no better results yet.
On F0: audio samples for RNN, RMDN, SAR, DAR, and natural speech. The network models only F0, given natural MGC (and duration).
46 DEEP AR MODEL — Experiments on F0
[Figure: MOS scores for NAT, DAR, SAR, RMDN, and RNN; NAT scores highest, followed by DAR.]
[Table: p-values of pairwise MOS differences; NAT differs from every other system with p < 1.0e-30, and DAR differs significantly from SAR, RMDN, and RNN.]
[Figure: utterance-level GV of F0 (Hz) for NAT, RNN, RMDN, SAR, and DAR.]
47 DEEP AR MODEL — Random sampling on F0: Japanese data
[Figure: (top) natural F0 vs. RMDN and SAR samples; (bottom) natural F0 vs. a DAR sample, F0 in Hz vs. frame index (utterance ATR Ximera F009 NIKKEIR T01). The DAR sample shows natural-looking F0 movements.]
48 DEEP AR MODEL — Random sampling on F0: Japanese data
[Figure: three different DAR samples plotted against natural F0 for the same utterance (ATR Ximera F009 NIKKEIR T01).]
49 DEEP AR MODEL — Random sampling on F0: English data
[Figure: three DAR samples plotted against natural F0 on English data.]
50 CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary
51 SUMMARY
- FNN, RNN: no dependency across output frames
- SAR: linear and invertible transform, c_{1:T} = A o_{1:T}
- AR flow: non-linear and invertible transform, c_{1:T} = f(o_{1:T})
- DAR: non-linear and non-invertible AR dependency in the network
Random sampling is a useful diagnostic tool for checking model assumptions.
52 SUMMARY — Message
Should the dependency be between the means μ_t (RMDN with a recurrent output layer) or between the observations o_t (SAR)? Modeling the dependency between observations is what matters.
53 Thank you for your attention. Q & A
Toolkit, scripts, slides: tonywangx.github.io
Masked Autoregressive Flow for Density Estimation George Papamakarios University of Edinburgh g.papamakarios@ed.ac.uk Theo Pavlakou University of Edinburgh theo.pavlakou@ed.ac.uk Iain Murray University
More informationExpectation Propagation in Dynamical Systems
Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex
More informationRecurrent and Recursive Networks
Neural Networks with Applications to Vision and Language Recurrent and Recursive Networks Marco Kuhlmann Introduction Applications of sequence modelling Map unsegmented connected handwriting to strings.
More informationCS230: Lecture 10 Sequence models II
CS23: Lecture 1 Sequence models II Today s outline We will learn how to: - Automatically score an NLP model I. BLEU score - Improve Machine II. Beam Search Translation results with Beam search III. Speech
More informationConditional Random Fields: An Introduction
University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania
More informationDeep Learning & Neural Networks Lecture 4
Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly
More informationMULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals
MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING Liang Lu and Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu, s.renals}@ed.ac.uk ABSTRACT
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework
More informationTowards Maximum Geometric Margin Minimum Error Classification
THE SCIENCE AND ENGINEERING REVIEW OF DOSHISHA UNIVERSITY, VOL. 50, NO. 3 October 2009 Towards Maximum Geometric Margin Minimum Error Classification Kouta YAMADA*, Shigeru KATAGIRI*, Erik MCDERMOTT**,
More informationCSC321 Lecture 16: ResNets and Attention
CSC321 Lecture 16: ResNets and Attention Roger Grosse Roger Grosse CSC321 Lecture 16: ResNets and Attention 1 / 24 Overview Two topics for today: Topic 1: Deep Residual Networks (ResNets) This is the state-of-the
More informationComparing linear and non-linear transformation of speech
Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,
More informationTask-Oriented Dialogue System (Young, 2000)
2 Review Task-Oriented Dialogue System (Young, 2000) 3 http://rsta.royalsocietypublishing.org/content/358/1769/1389.short Speech Signal Speech Recognition Hypothesis are there any action movies to see
More informationRandom Field Models for Applications in Computer Vision
Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 24. Hidden Markov Models & message passing Looking back Representation of joint distributions Conditional/marginal independence
More informationNecessary Corrections in Intransitive Likelihood-Ratio Classifiers
Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Gang Ji and Jeff Bilmes SSLI-Lab, Department of Electrical Engineering University of Washington Seattle, WA 9895-500 {gang,bilmes}@ee.washington.edu
More informationHidden Markov Models. Aarti Singh Slides courtesy: Eric Xing. Machine Learning / Nov 8, 2010
Hidden Markov Models Aarti Singh Slides courtesy: Eric Xing Machine Learning 10-701/15-781 Nov 8, 2010 i.i.d to sequential data So far we assumed independent, identically distributed data Sequential data
More informationSequence Transduction with Recurrent Neural Networks
Alex Graves graves@cs.toronto.edu Department of Computer Science, University of Toronto, Toronto, ON M5S 3G4 Abstract Many machine learning tasks can be expressed as the transformation or transduction
More informationAttention Based Joint Model with Negative Sampling for New Slot Values Recognition. By: Mulan Hou
Attention Based Joint Model with Negative Sampling for New Slot Values Recognition By: Mulan Hou houmulan@bupt.edu.cn CONTE NTS 1 2 3 4 5 6 Introduction Related work Motivation Proposed model Experiments
More informationImproved Learning through Augmenting the Loss
Improved Learning through Augmenting the Loss Hakan Inan inanh@stanford.edu Khashayar Khosravi khosravi@stanford.edu Abstract We present two improvements to the well-known Recurrent Neural Network Language
More informationResidual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition Jaeyoung Kim 1, Mostafa El-Khamy 1, Jungwon Lee 1 1 Samsung Semiconductor,
More informationHidden Markov Models in Language Processing
Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications
More informationWhy DNN Works for Acoustic Modeling in Speech Recognition?
Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,
More informationHIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU
April 4-7, 2016 Silicon Valley HIGH PERFORMANCE CTC TRAINING FOR END-TO-END SPEECH RECOGNITION ON GPU Minmin Sun, NVIDIA minmins@nvidia.com April 5th Brief Introduction of CTC AGENDA Alpha/Beta Matrix
More informationarxiv: v1 [cs.ne] 14 Nov 2012
Alex Graves Department of Computer Science, University of Toronto, Canada graves@cs.toronto.edu arxiv:1211.3711v1 [cs.ne] 14 Nov 2012 Abstract Many machine learning tasks can be expressed as the transformation
More information