Autoregressive Neural Models for Statistical Parametric Speech Synthesis
Xin WANG
2018-01-11
Contact: wangxin@nii.ac.jp
We welcome critical comments, suggestions, and discussion.
CONTENTS
- Introduction
- Models and methods
- Summary
INTRODUCTION
Text-to-speech (TTS) pipeline [1,2]: Text -> Front-end -> Linguistic features -> Back-end -> Speech.
In statistical parametric speech synthesis (SPSS) [3,4], the back-end consists of acoustic models that predict acoustic features (spectral features and the fundamental frequency, F0) and a vocoder that converts these features into speech.
[1] P. Taylor. Text-to-Speech Synthesis. 2009.
[2] T. Dutoit. An Introduction to Text-to-Speech Synthesis. 1997.
[3] K. Tokuda et al. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234-1252, 2013.
[4] H. Zen et al. Statistical parametric speech synthesis. Speech Communication, 51:1039-1064, 2009.
INTRODUCTION
Topic of this talk: the acoustic model that maps linguistic features x_{1:T} = {x_1, ..., x_T} to acoustic features ô_{1:T} = {ô_1, ..., ô_T}, i.e., a model of p(o_{1:T} | x_{1:T}; Θ), where x_t ∈ R^{D_x}, o_t ∈ R^{D_o}, and T is the number of frames.
INTRODUCTION
Roadmap, ordered by the strength of time-dependency modeling: naive feedforward network (FNN) -> recurrent network (RNN) -> autoregressive (AR) neural models -> (ideally) a perfect model.
CONTENTS
- Introduction
- Models and methods
  - Baseline models
- Summary
BASELINE MODELS: FNN
Computation flow: ô_t = H^(FNN)(x_t), applied independently at each frame t, with x_t ∈ R^{D_x} and o_t ∈ R^{D_o}.
BASELINE MODELS: FNN
As a probabilistic model, the FNN with a Gaussian output layer is a mixture density network (MDN) [6]:
p(o_{1:T} | x_{1:T}; Θ) = prod_{t=1}^{T} p(o_t; M_t) = prod_{t=1}^{T} N(o_t; μ_t, I)
M_t = {μ_t}, where μ_t = H^(FNN)(x_t); mean-based generation uses ô_t = μ_t.
[6] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
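As a concrete sketch of the likelihood above, the following computes log prod_t N(o_t; μ_t, I) for a (T, D) sequence; the function name and the toy data are illustrative, not from the talk:

```python
import numpy as np

def gaussian_log_likelihood(o, mu):
    """Log of prod_t N(o_t; mu_t, I) for a (T, D) feature sequence.

    Identity covariance per frame, so frames are conditionally
    independent given the network outputs mu_{1:T}.
    """
    T, D = o.shape
    diff = o - mu
    # sum over frames of log N(o_t; mu_t, I)
    return -0.5 * np.sum(diff ** 2) - 0.5 * T * D * np.log(2.0 * np.pi)

# The log-likelihood is maximal at o = mu, which is why mean-based
# generation sets o_hat_t = mu_t.
mu = np.zeros((3, 2))
o = np.zeros((3, 2))
print(gaussian_log_likelihood(o, mu))  # -3*log(2*pi), about -5.51
```

Training the MDN amounts to maximizing this quantity over the network parameters.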
BASELINE MODELS: RNN
As a recurrent MDN (RMDN) [7]:
p(o_{1:T} | x_{1:T}; Θ) = prod_{t=1}^{T} p(o_t; M_t) = prod_{t=1}^{T} N(o_t; μ_t, I)
M_t = {μ_t}, where μ_t = H^(RNN)(x_{1:T}, t); mean-based generation uses ô_t = μ_t.
[7] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589-595, 1999.
BASELINE MODELS
Model assumption: (conditional) independence of o_{1:T}:
p(o_{1:T} | x_{1:T}; Θ) = p(o_{1:T}; M_{1:T}) = prod_{t=1}^{T} p(o_t; M_t)
To verify this assumption [8], compare mean-based generation (ô_t = μ_t) with sampling (ô_t ~ p(o_t; M_t)).
[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
BASELINE MODELS
Model assumption, example: F0 modeling with an RMDN (utterance BC2011 nancy APDC2-166-00).
[Figure: top panel shows the RMDN mean trajectory and its ±1 std range; bottom panel overlays natural F0 with an RMDN sample. The frame-by-frame sample is noisy and unlike a natural F0 contour, exposing the independence assumption.]
CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - AR models
- Summary
AR MODELS
Improving the baseline models:
- Directed graphical models -> AR models [8]: p(o_{1:T}) = prod_{t=1}^{T} p(o_t | o_{1:t-1})
- Undirected graphical models -> trajectory model [9], i.e., static plus dynamic features (o, Δo, Δ²o) with MLPG [10]: p(o_{1:T}) = prod_k f_k(o_{1:T}) / Z
[8] M. Shannon. Probabilistic acoustic modelling for parametric speech synthesis. PhD thesis, 2014.
[9] H. Zen, K. Tokuda, and T. Kitamura. Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech & Language, 21(1):153-173, 2007.
[10] T. Keiichi, Y. Takayoshi, M. Takashi, K. Takao, and K. Tadashi. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 936-939, 2000.
AR MODELS
This talk covers three directed AR approaches to p(o_{1:T}) = prod_{t=1}^{T} p(o_t | o_{1:t-1}):
1. Shallow AR model (SAR)
2. AR flow
3. Deep AR model (DAR)
SHALLOW AR MODEL
Definition: each frame depends on the previous K frames:
p(o_{1:T}; M_{1:T}) = prod_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t)
SHALLOW AR MODEL
Implementation (illustrated with K = 2):
p(o_{1:T}; M_{1:T}) = prod_{t=1}^{T} p(o_t | o_{t-K:t-1}; M_t) = prod_{t=1}^{T} N(o_t; μ_t + F(o_{t-K:t-1}), Σ_t)
F(o_{t-K:t-1}) = sum_{k=1}^{K} a_k ⊙ o_{t-k} + b
F is time-invariant; K is a hyper-parameter; {a_k, b} are trainable parameters. This is the shallow AR (SAR) model.
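The shifted mean above can be sketched in a few lines; here it is used for mean-based generation, where the past "outputs" are the already-generated values (function name and toy inputs are illustrative):

```python
import numpy as np

def sar_mean(mu, a, b, K):
    """Mean-based generation with the SAR shift F:
    o_hat_t = mu_t + sum_{k=1}^{K} a_k * o_hat_{t-k} + b (element-wise).

    mu: (T, D) network means; a: (K, D) AR coefficients; b: (D,) bias.
    """
    T, D = mu.shape
    o = np.zeros((T, D))
    for t in range(T):
        f = b.copy()
        for k in range(1, K + 1):
            if t - k >= 0:
                f += a[k - 1] * o[t - k]
        o[t] = mu[t] + f
    return o

# K = 1, a = 0.5: each generated frame pulls in half of the previous one.
out = sar_mean(np.ones((3, 1)), np.array([[0.5]]), np.zeros(1), 1)
print(out[:, 0])  # [1.   1.5  1.75]
```

Because F is shared across time, the extra cost over a plain MDN is a handful of parameters per feature dimension.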
SHALLOW AR MODEL
Theory: SAR versus an RMDN with a recurrent output layer [11]. Assume a scalar output o_t ∈ R, Σ_t = 1, and a linear activation in the output layer. In the RMDN, a feedback weight w_μ connects μ_1 to μ_2; in SAR, a feedback weight a connects o_1 to o_2.
[11] H. Zen and H. Sak. Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis. In Proc. ICASSP, pages 4470-4474, 2015.
SHALLOW AR MODEL
RMDN (feedback between means):
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + w_μ μ_1, 1)
μ_1 = w^T h_1 + b; the output layer computes w^T h_2 + b + w_μ μ_1 = μ_2 + w_μ μ_1, where μ_2 = w^T h_2 + b.
SAR (feedback between observations):
p(o_{1:2}) = N(o_1; μ_1, 1) N(o_2; μ_2 + a o_1, 1)
μ_1 = w^T h_1 + b, μ_2 = w^T h_2 + b.
SHALLOW AR MODEL
Dependency between the μ_t or between the o_t? Write both as joint Gaussians N(o; μ, Σ) over o = [o_1, o_2]^T:
RMDN: μ = [μ_1, μ_2 + w_μ μ_1]^T, Σ = [[1, 0], [0, 1]] -- the recurrence only links the means; o_1 and o_2 stay independent.
SAR: μ = [μ_1, μ_2 + a μ_1]^T, Σ = [[1, a], [a, 1 + a²]] -- the observations themselves become correlated.
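The SAR covariance can be checked numerically by simulating the two-frame generative process (unit noise variance assumed, means set to zero for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.8
n = 200000

# SAR generative process for two frames:
#   o_1 = mu_1 + e_1,  o_2 = mu_2 + a * o_1 + e_2,  e ~ N(0, 1)
mu1, mu2 = 0.0, 0.0
e = rng.standard_normal((n, 2))
o1 = mu1 + e[:, 0]
o2 = mu2 + a * o1 + e[:, 1]

# Expected joint covariance: [[1, a], [a, 1 + a^2]]
cov = np.cov(o1, o2)
print(np.round(cov, 2))
```

The off-diagonal term a is exactly the cross-frame dependency that the RMDN with a recurrent output layer cannot produce, since its covariance stays the identity.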
SHALLOW AR MODEL
Experiments
- Data: Blizzard Challenge 2011 corpus; x_t similar to the HTS English features [12]; o_t: MGC, interpolated F0, U/V, and BAP.
- Networks: RNN (feedforward -> feedforward -> Bi-LSTM -> Bi-LSTM), RMDN, SAR (K = 1, 2, 0 for MGC, F0, and BAP, respectively), and RNN+MLPG (RNN with static and dynamic (Δ, Δ²) features plus MLPG).
http://tonywangx.github.io/#sar
[12] K. Tokuda, H. Zen, and A. W. Black. An HMM-based speech synthesis system applied to English. In Proc. SSW, pages 227-230, Sept 2002.
SHALLOW AR MODEL
Experiments: results (utterance BC2011_nancy_APDC2-166-00).
[Figure: generated trajectories of the 1st and 15th MGC dimensions for NAT, RNN, RNN+MLPG, RMDN, and SAR (AR-RMDN).]
SHALLOW AR MODEL
Experiments: results (utterance BC2011_nancy_APDC2-166-00).
[Figure: generated trajectories of the 30th MGC dimension and of F0 (after U/V classification) for NAT, RNN, RNN+MLPG, RMDN, and SAR (AR-RMDN).]
SHALLOW AR MODEL
Experiments: results.
[Figure: global variance (GV) of the generated MGC across MGC order (1-60) for NAT, RNN, RNN+MLPG, RMDN, and SAR (AR-RMDN).]
SHALLOW AR MODEL
Experiments: speech samples of RNN, RNN+MLPG, RMDN, SAR, and natural speech, with and without formant enhancement.
SHALLOW AR MODEL
Interpretation: feature transformation. SAR is equivalent to an RMDN on linearly transformed features c = A o:
c = [c_1, c_2]^T = [o_1, o_2 - a o_1]^T = [[1, 0], [-a, 1]] [o_1, o_2]^T = A o
Since det A = 1, p(o) = p(c), and the two views coincide:
RMDN view (on c): p(c) = N(c; μ_c, Σ_c), with μ_c = [μ_1, μ_2]^T, Σ_c = I
SAR view (on o): p(o) = N(o; μ_o, Σ_o), with μ_o = [μ_1, μ_2 + a μ_1]^T, Σ_o = [[1, a], [a, 1 + a²]]
SHALLOW AR MODEL
Interpretation: feature transformation. For o_{1:T} ∈ R^{D×T}, each dimension d has its own transform matrix A^(d) mapping o_{1:T,d} to c_{1:T,d}. SAR is then equivalent to:
- Training: transform o_{1:T} to c_{1:T} with A^(1), ..., A^(D) and model prod_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
- Generation: draw ĉ_{1:T} from prod_t p(c_t; M_t) and de-transform with (A^(1))^{-1}, ..., (A^(D))^{-1} to obtain ô_{1:T}.
SHALLOW AR MODEL
Interpretation: signals and filters. Equivalently, each A^(d) is a causal filter A_d(z) applied to dimension d:
- Training: filter o_{1:T} with A_1(z), ..., A_D(z) to get c_{1:T}, and model prod_t p(c_t; M_t) given x_{1:T}.
- Generation: sample ĉ_{1:T} and pass it through the inverse filters 1/A_1(z), ..., 1/A_D(z) to obtain ô_{1:T}.
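The filter pair is easy to make concrete: A(z) = 1 - sum_k a_k z^{-k} is a causal FIR analysis filter, and 1/A(z) is the IIR synthesis filter that undoes it exactly. A minimal one-dimensional sketch (coefficient values are illustrative):

```python
import numpy as np

def analysis_filter(o, a):
    """c_t = o_t - sum_k a_k * o_{t-k}: FIR filter A(z) = 1 - sum a_k z^-k."""
    c = o.copy()
    for t in range(len(o)):
        for k in range(1, len(a) + 1):
            if t - k >= 0:
                c[t] -= a[k - 1] * o[t - k]
    return c

def synthesis_filter(c, a):
    """o_t = c_t + sum_k a_k * o_{t-k}: the inverse IIR filter 1/A(z)."""
    o = np.zeros(len(c))
    for t in range(len(c)):
        o[t] = c[t]
        for k in range(1, len(a) + 1):
            if t - k >= 0:
                o[t] += a[k - 1] * o[t - k]
    return o

x = np.array([1.0, 2.0, 3.0, 2.0, 1.0])
a = np.array([0.5, -0.1])  # K = 2, illustrative coefficients
# Round trip: synthesis(analysis(x)) recovers x exactly.
print(np.allclose(synthesis_filter(analysis_filter(x, a), a), x))  # True
```

In SAR the coefficients a_k are learned jointly with the network, so the analysis filter whitens the residual that the RMDN has to model.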
SHALLOW AR MODEL
Interpretation: the secret of SAR.
[Figure: magnitude responses (dB) of A_1(z) and of the inverse filter 1/A_1(z) across frequency bins.]
Is the improvement only due to the synthesis filters 1/A_d(z)? No: it comes from the analysis/synthesis pair {A_d(z), 1/A_d(z)}, which reduces the mismatch between c_{1:T} and the RMDN.
http://tonywangx.github.io/#sar
CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
- Summary
AR FLOW
Is SAR good enough? Random sampling from SAR:
[Figure: natural F0 overlaid with RMDN and SAR samples (utterance BC2011 nancy APDC2-166-00); the SAR sample is still unlike a natural contour.]
Reason: the transform, with fixed coefficients a_k, is linear and time-invariant.
AR FLOW
Theory: change of random variables. SAR uses the linear transform c = A o for training and ô = A^{-1} ĉ for generation, modeling prod_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
AR FLOW
Generalize to a non-linear transform: c_{1:T} = f(o_{1:T}) for training and ô_{1:T} = f^{-1}(ĉ_{1:T}) for generation [13,14]. Requirements: f(.) must be invertible, and the Jacobian in
p_o(o_{1:T} | x_{1:T}) = p_c(c_{1:T} | x_{1:T}) |det(∂c_{1:T}/∂o_{1:T})|
must be simple to compute.
[13] D. Rezende and S. Mohamed. Variational inference with normalizing flows. In Proc. ICML, pages 1530-1538, 2015.
[14] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Proc. NIPS, pages 4743-4751, 2016.
AR FLOW
Basic idea: replace SAR's fixed coefficients with data-dependent ones.
- SAR transform: c_t = o_t - sum_{k=1}^{K} a_k ⊙ o_{t-k}; de-transform: ô_t = ĉ_t + sum_{k=1}^{K} a_k ⊙ ô_{t-k}
- AR flow transform: c_t = o_t - sum_{k=1}^{K} f^(k)(o_{1:t-k}) ⊙ o_{t-k}; de-transform: ô_t = ĉ_t + sum_{k=1}^{K} f^(k)(ô_{1:t-k}) ⊙ ô_{t-k}
In both cases the Jacobian is triangular with a unit diagonal, so det(∂c_{1:T}/∂o_{1:T}) = 1 -- very simple. In the implementation, the shift is computed by a recurrent network: μ_t = RNN(o_{1:t-1}).
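The two directions of the flow can be sketched with a stand-in for the RNN shift (the function mu_fn below is a hypothetical causal function, not the talk's trained network; the sign convention c_t = o_t - μ_t follows the transform above):

```python
import numpy as np

def mu_fn(history):
    """Stand-in for mu_t = RNN(o_{1:t-1}); any causal function keeps the
    Jacobian of the transform triangular with a unit diagonal."""
    if len(history) == 0:
        return 0.0
    return np.tanh(0.9 * history[-1] + 0.1 * np.sum(history))

def forward(o):
    """Training direction: c_t = o_t - mu_t, computable for all t at once
    because the full natural sequence o_{1:T} is available."""
    return np.array([o[t] - mu_fn(o[:t]) for t in range(len(o))])

def inverse(c):
    """Generation direction: o_t = c_t + mu_t, necessarily sequential
    because mu_t depends on the already generated o_{1:t-1}."""
    o = []
    for t in range(len(c)):
        o.append(c[t] + mu_fn(np.array(o)))
    return np.array(o)

o = np.array([0.3, -1.2, 0.7, 0.0])
print(np.allclose(inverse(forward(o)), o))  # True: exact round trip, |det J| = 1
```

Since the Jacobian determinant is 1, the log-likelihood of o_{1:T} is just the log-likelihood of c_{1:T} under prod_t p(c_t; M_t), with no correction term.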
AR FLOW
Implementation: training. Compute μ_t = RNN(o_{1:t-1}) from the natural output sequence (with μ_1 = 0) and transform c_t = o_t - μ_t; the model then maximizes prod_{t=1}^{T} p(c_t; M_t) given x_{1:T}.
AR FLOW
Implementation: generation. Draw ĉ_t from p(c_t; M_t), then de-transform sequentially: ô_t = ĉ_t + μ_t, where μ_t = RNN(ô_{1:t-1}).
AR FLOW
Experiments on MGC
- Data: Japanese corpus; neither MLPG nor formant enhancement is used.
- Networks: RNN, SAR, and AR flow (RNN + AR flow, where the flow uses a uni-directional LSTM plus a feedforward layer).
AR FLOW
Experiments on MGC: results.
[Figure: generated trajectories of the 1st and 30th MGC dimensions.]
AR FLOW
Experiments on MGC: results.
[Figure: GV of the generated MGC and the modulation spectrum of the 30th MGC dimension for natural speech, RNN, AR flow, and SAR.]
AR FLOW
Experiments on MGC: speech samples of RNN, SAR, AR flow, and natural speech (natural duration; F0 generated by the RNN). Experiments on F0 are still to be done.
CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary
DEEP AR MODEL
Can we randomly sample from the AR flow? Not well: the transform c_{1:T} = f(o_{1:T}) must keep the Jacobian ∂c_{1:T}/∂o_{1:T} simple, which limits the model. One way to go: move the AR dependency into the network itself, i.e., a network mapping x_{1:T} to o_{1:T} that conditions on its own past outputs.
DEEP AR MODEL
Definition: each frame depends on the entire output history, with the distribution parameters M_t computed by a network that receives both x_t and the previous outputs o_{1:t-1}:
p(o_{1:T}; M_{1:T}) = prod_{t=1}^{T} p(o_t | o_{1:t-1}; M_t)
http://tonywangx.github.io/#dar
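A minimal sketch of deep-AR generation, assuming a single Gaussian output per frame; all weight matrices and the scalar input feature are hypothetical stand-ins for a trained network and real linguistic features:

```python
import numpy as np

def dar_sample(x, Wx, Wh, Wo, rng, sigma=1.0):
    """Sequential sampling from p(o_t | o_{1:t-1}; M_t): a recurrent state
    consumes the input x_t together with the previously *sampled* output
    o_{t-1}, making the model genuinely autoregressive in o.
    """
    h = np.zeros(Wh.shape[0])
    o_prev = 0.0
    samples = []
    for t in range(len(x)):
        h = np.tanh(Wx @ np.array([x[t], o_prev]) + Wh @ h)
        mu_t = float(Wo @ h)
        o_t = rng.normal(mu_t, sigma)  # o_t ~ N(mu_t, sigma^2)
        samples.append(o_t)
        o_prev = o_t                   # feed the sample back in
    return np.array(samples)

rng = np.random.default_rng(1)
Wx = rng.standard_normal((4, 2)) * 0.3
Wh = rng.standard_normal((4, 4)) * 0.3
Wo = rng.standard_normal(4) * 0.3
print(dar_sample(np.linspace(-1, 1, 5), Wx, Wh, Wo, rng).shape)  # (5,)
```

Training would use teacher forcing (feed the natural o_{t-1} instead of the sample), so the likelihood factorizes and can be evaluated in one pass.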
DEEP AR MODEL
Experiments on MGC: no better results yet.
Experiments on F0: speech samples of RNN, RMDN, SAR, DAR, and natural speech. The network models only F0, given natural MGC (and natural duration).
DEEP AR MODEL
Experiments on F0: MOS scores (NAT highest; DAR significantly better than SAR, RMDN, and RNN) with pairwise p-values:

         NAT       DAR       SAR       RMDN      RNN
NAT      -         <1e-30    <1e-30    <1e-30    <1e-30
DAR      <1e-30    -         1.6e-28   6.3e-19   2.4e-30
SAR      <1e-30    1.6e-28   -         0.015     0.949
RMDN     <1e-30    6.3e-19   0.015     -         0.014
RNN      <1e-30    2.4e-30   0.949     0.014     -

[Figure: utterance-level GV of F0 (Hz) for NAT, RNN, RMDN, SAR, and DAR.]
DEEP AR MODEL
Random sampling on F0: Japanese data (utterance ATR Ximera F009 NIKKEIR 03362 T01).
[Figure: top panel overlays natural F0 with RMDN and SAR samples; bottom panel overlays natural F0 with a DAR sample, which is much closer to a natural contour.]
DEEP AR MODEL
Random sampling on F0: Japanese data (utterance ATR Ximera F009 NIKKEIR 03362 T01).
[Figure: three different DAR samples, each overlaid with natural F0.]
DEEP AR MODEL
Random sampling on F0: English data.
[Figure: three DAR samples (SAMPLE 1-3), each overlaid with natural F0.]
http://tonywangx.github.io/#dar
CONTENTS
- Introduction
- Models and methods
  - Baseline models
  - Shallow AR model
  - AR flow
  - Deep AR model
- Summary
SUMMARY
FNN -> RNN -> SAR -> AR flow -> DAR, using random sampling as a diagnostic tool:
- SAR: linear and invertible transform, c_{1:T} = A o_{1:T}
- AR flow: non-linear and invertible transform, c_{1:T} = f(o_{1:T})
- DAR: non-linear and non-invertible, AR dependency inside the network
SUMMARY
Message: should the dependency be between the means μ_t (RMDN with a recurrent output layer) or between the observations o_t (SAR)? Only the latter makes the observations themselves correlated.
Thank you for your attention. Q & A.
Toolkit, scripts, and slides: tonywangx.github.io