Generative Model-Based Text-to-Speech Synthesis. Heiga Zen (Google London) February


1 Generative Model-Based Text-to-Speech Synthesis Heiga Zen (Google London) February

2 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

4 Text-to-speech as sequence-to-sequence mapping. Automatic speech recognition (ASR): speech → "Hello my name is Heiga Zen". Machine translation (MT): "Hello my name is Heiga Zen" → "Ich heiße Heiga Zen". Text-to-speech synthesis (TTS): "Hello my name is Heiga Zen" → speech.

5 Speech production process: text (concept) → modulation of a carrier wave by speech information → speech. [Diagram: air flow → sound source (voiced: pulse, unvoiced: noise; fundamental frequency, voiced/unvoiced) → frequency transfer characteristics of the vocal tract → speech.]

6 Typical flow of a TTS system: TEXT → Text analysis (NLP frontend; discrete → discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation → Speech synthesis (speech backend; discrete → continuous): prosody prediction, waveform generation → SYNTHESIZED SPEECH.

10 Speech synthesis approaches: Rule-based, formant synthesis [1]. Sample-based, concatenative synthesis [2]. Model-based, generative synthesis: p(speech | text = "Hello, my name is Heiga Zen.").

12 Probabilistic formulation of TTS

Random variables:
X: speech waveforms (data), observed
W: transcriptions (data), observed
w: given text, observed
x: synthesized speech, unobserved

Synthesis: estimate the posterior predictive distribution p(x | w, X, W), then sample x from it.

13 Probabilistic formulation. Introduce auxiliary variables (representations) + factorize dependencies:

p(x | w, X, W) = Σ_∀l Σ_∀L ∫∫∫ p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W) / p(X) dO do dλ

where O, o: acoustic features; L, l: linguistic features; λ: model.

14 Approximation (1). Approximate the sums & integrals by best point estimates (like MAP) [3]:

p(x | w, X, W) ≈ p(x | ô)

where {ô, l̂, Ô, L̂, λ̂} = argmax_{o,l,O,L,λ} p(x | o) p(o | l, λ) p(l | w) p(X | O) p(O | L, λ) p(λ) p(L | W)

22 Approximation (2). Joint → step-by-step maximization [3]:

Ô = argmax_O p(X | O)   Extract acoustic features
L̂ = argmax_L p(L | W)   Extract linguistic features
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Learn mapping
l̂ = argmax_l p(l | w)   Predict linguistic features
ô = argmax_o p(o | l̂, λ̂)   Predict acoustic features
x ∼ f_x(ô) = p(x | ô)   Synthesize waveform

Representations: acoustic features, linguistic features, mapping
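
To make the shape of this pipeline concrete, here is a deliberately crude, runnable Python sketch (an editorial illustration, not from the talk): frame log-energy stands in for the acoustic features O, character codes for the linguistic features L, a least-squares linear map for the learned mapping λ, and noise shaping for the vocoder. All function names are hypothetical.

    import numpy as np

    def extract_acoustic_features(x, frame=160):          # Ô = argmax_O p(X | O)
        n = len(x) // frame
        return np.log(1e-6 + np.square(x[:n * frame]).reshape(n, frame).mean(1))[:, None]

    def extract_linguistic_features(text, n_frames):      # L̂ = argmax_L p(L | W)
        codes = np.frombuffer(text.encode(), np.uint8).astype(float)
        idx = np.linspace(0, len(codes) - 1, n_frames).astype(int)  # crude "duration model"
        return np.stack([codes[idx], np.ones(n_frames)], 1)

    def learn_mapping(L, O):                              # λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)
        return np.linalg.lstsq(L, O, rcond=None)[0]       # linear-Gaussian maximum likelihood

    def synthesize_waveform(O, frame=160):                # x ∼ p(x | ô): toy "vocoder"
        rng = np.random.default_rng(0)
        return np.concatenate([np.exp(0.5 * o) * rng.standard_normal(frame) for o in O[:, 0]])

    X = np.sin(np.linspace(0, 200, 16000)) * np.hanning(16000)   # one "recorded" utterance
    O = extract_acoustic_features(X)
    L = extract_linguistic_features("hello my name is heiga zen", len(O))
    lam = learn_mapping(L, O)                             # training: learn the mapping

    l_new = extract_linguistic_features("hello world", 50)
    x_new = synthesize_waveform(l_new @ lam)              # synthesis: text → features → waveform
    print(x_new.shape)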

24 Representation Linguistic features

Hierarchical description of "Hello, world.":
Sentence (length, ...) → Phrase (intonation, ...) → Word: hello world (POS, grammar, ...) → Syllable: h-e2 l-ou1 w-er1-l-d (stress, tone, ...) → Phone: h e l ou w er l d (voicing, manner, ...)

→ Based on knowledge about spoken language: lexicon, letter-to-sound rules; tokenizer, tagger, parser; phonology rules

26 Representation Acoustic features

Piece-wise stationary, source-filter generative model p(x | o):
Vocal source: pulse train (voiced) / white noise (unvoiced); fundamental frequency, aperiodicity, voicing, ...
Vocal tract filter: cepstrum, LPC, ...
Speech: x(n) = h(n) * e(n), reassembled frame by frame with overlap/shift windowing

→ Needs to solve an inverse problem: estimate the parameters from the signal, then use the estimated parameters (e.g., cepstrum) as acoustic features
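
As a runnable illustration of this source-filter model (an editorial sketch, not from the talk), the following Python generates e(n) as a pulse train or white noise and filters it with a made-up all-pole vocal-tract filter; real systems estimate the filter from speech via cepstral or LPC analysis.

    import numpy as np
    from scipy.signal import lfilter

    fs = 16000
    f0 = 120.0                                    # fundamental frequency [Hz]

    def excitation(n_samples, voiced, rng=np.random.default_rng(0)):
        if voiced:                                # pulse train with period fs/f0
            e = np.zeros(n_samples)
            e[::int(fs / f0)] = 1.0
            return e
        return rng.standard_normal(n_samples)     # white noise for unvoiced frames

    # Toy all-pole filter 1/A(z) standing in for the vocal tract: one resonance near 800 Hz.
    a = [1.0, -1.8 * np.cos(2 * np.pi * 800 / fs), 0.81]

    voiced = lfilter([1.0], a, excitation(4000, voiced=True))     # x(n) = h(n) * e(n)
    unvoiced = lfilter([1.0], a, excitation(4000, voiced=False))
    print(voiced[:5], unvoiced[:5])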

28 Representation Mapping. Rule-based, formant synthesis [1]:

Ô = argmax_O p(X | O)   Vocoder analysis
L̂ = argmax_L p(L | W)   Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Extract rules
l̂ = argmax_l p(l | w)   Text analysis
ô = argmax_o p(o | l̂, λ̂)   Apply rules
x ∼ f_x(ô) = p(x | ô)   Vocoder synthesis

→ Hand-crafted rules on knowledge-based features

30 Representation Mapping. HMM-based [4], statistical parametric synthesis [5]:

Ô = argmax_O p(X | O)   Vocoder analysis
L̂ = argmax_L p(L | W)   Text analysis
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)   Train HMMs
l̂ = argmax_l p(l | w)   Text analysis
ô = argmax_o p(o | l̂, λ̂)   Parameter generation
x ∼ f_x(ô) = p(x | ô)   Vocoder synthesis

→ Replace the rules by an HMM-based generative acoustic model

31 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

32 HMM-based generative acoustic model for TTS

Context-dependent subword HMMs; decision trees to cluster & tie HMM states → interpretable
l_1 ... l_N: linguistic features (discrete); o_1 ... o_T: acoustic features (continuous); q_t: hidden state at t

p(o | l, λ) = Σ_∀q Π_{t=1}^{T} p(o_t | q_t, λ) P(q | l, λ)
            = Σ_∀q Π_{t=1}^{T} N(o_t; μ_{q_t}, Σ_{q_t}) P(q | l, λ)
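
The state-sequence sum above is exactly what the forward algorithm computes. A minimal numpy sketch with a toy 3-state left-to-right HMM and 1-D Gaussian emissions (all parameter values invented for illustration):

    import numpy as np

    mu = np.array([0.0, 2.0, -1.0])          # per-state emission means
    var = np.array([1.0, 0.5, 1.0])          # per-state emission variances
    A = np.array([[0.6, 0.4, 0.0],           # left-to-right transition matrix
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
    pi = np.array([1.0, 0.0, 0.0])           # always start in state 0

    def gauss(o, mu, var):
        return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

    def likelihood(o_seq):
        alpha = pi * gauss(o_seq[0], mu, var)         # forward recursion over states
        for o in o_seq[1:]:
            alpha = (alpha @ A) * gauss(o, mu, var)
        return alpha.sum()                            # p(o | l, λ)

    print(likelihood(np.array([0.1, 0.3, 1.8, 2.2, -0.9])))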

33 HMM-based generative acoustic model for TTS

Non-smooth, step-wise statistics → smoothing is essential
Difficult to use high-dimensional acoustic features (e.g., raw spectra) → use low-dimensional features (e.g., cepstra)
Data fragmentation → ineffective, local representation

A lot of research work has been done to address these issues.

35 Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic; a regression tree maps linguistic features l to statistics of acoustic features o

Replace the tree with a general-purpose regression model → artificial neural network

36 FFNN-based acoustic model for TTS [6]

Input: frame-level linguistic features l_t; target: frame-level acoustic features o_t

h_t = g(W_hl l_t + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ||o_t − ô_t||²
ô_t ≈ E[o_t | l_t]

→ Replaces the decision trees & Gaussian distributions
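
A direct numpy transcription of this model and its training criterion, on synthetic data (dimensions and learning rate are arbitrary choices for the sketch):

    import numpy as np

    rng = np.random.default_rng(0)
    L = rng.standard_normal((200, 10))            # frame-level linguistic features l_t
    O = L @ rng.standard_normal((10, 3))          # synthetic acoustic feature targets o_t

    W_hl, b_h = 0.1 * rng.standard_normal((10, 16)), np.zeros(16)
    W_oh, b_o = 0.1 * rng.standard_normal((16, 3)), np.zeros(3)
    lr = 1e-3

    for step in range(500):
        H = np.tanh(L @ W_hl + b_h)               # h_t = g(W_hl l_t + b_h)
        O_hat = H @ W_oh + b_o                    # ô_t = W_oh h_t + b_o
        err = O_hat - O                           # gradient of (1/2) Σ_t ||o_t − ô_t||²
        dH = (err @ W_oh.T) * (1 - H ** 2)        # backprop through tanh
        W_oh -= lr * H.T @ err;  b_o -= lr * err.sum(0)
        W_hl -= lr * L.T @ dH;   b_h -= lr * dH.sum(0)

    print(float(np.mean(err ** 2)))               # falls as ô_t approaches E[o_t | l_t]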

37 RNN-based acoustic model for TTS [7]

Input: frame-level linguistic features l_t; target: frame-level acoustic features o_t; recurrent connections

h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
ô_t = W_oh h_t + b_o
λ = {W_hl, W_hh, W_oh, b_h, b_o}
λ̂ = argmin_λ Σ_t ||o_t − ô_t||²

FFNN: ô_t ≈ E[o_t | l_t]; RNN: ô_t ≈ E[o_t | l_1, ..., l_t]
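
The only change from the FFNN sketch is the W_hh h_{t−1} term, which makes ô_t depend on the whole left context l_1, ..., l_t. A forward-pass-only sketch with made-up dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    W_hl = 0.1 * rng.standard_normal((10, 16))
    W_hh = 0.1 * rng.standard_normal((16, 16))
    W_oh = 0.1 * rng.standard_normal((16, 3))
    b_h, b_o = np.zeros(16), np.zeros(3)

    def rnn_forward(l_seq):
        h = np.zeros(16)
        outputs = []
        for l_t in l_seq:                              # one step per frame
            h = np.tanh(l_t @ W_hl + h @ W_hh + b_h)   # h_t = g(W_hl l_t + W_hh h_{t−1} + b_h)
            outputs.append(h @ W_oh + b_o)             # ô_t = W_oh h_t + b_o
        return np.stack(outputs)

    print(rnn_forward(rng.standard_normal((5, 10))).shape)   # (5, 3)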

39 NN-based generative acoustic model for TTS

Non-smooth, step-wise statistics → RNN predicts smoothly varying acoustic features [7, 8]
Difficult to use high-dimensional acoustic features (e.g., raw spectra) → layered architecture can handle high-dimensional features [9]
Data fragmentation → distributed representation [10]

The NN-based approach is now mainstream in research & products
Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN [13]
Products: e.g., Google [14]

40 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

41 NN-based generative model for TTS

[Speech production diagram, as on slide 5.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform

42 Knowledge-based features → learned features

Unsupervised feature learning:
Speech: auto-encoder on raw FFT spectra x(t) → acoustic features o(t) [9, 15] → positive results
Text: 1-hot word representations w(n) → linguistic features l(n); word [16], phone & syllable [17] → less positive
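
As a toy version of the speech side (not the cited systems), a tied-weight autoencoder that compresses a stand-in FFT magnitude spectrum into a low-dimensional learned feature and reconstructs it:

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.abs(rng.standard_normal((500, 257)))      # stand-in FFT magnitude spectra
    W = 0.01 * rng.standard_normal((257, 32))        # 257-dim spectrum → 32-dim code

    for step in range(200):
        H = np.tanh(X @ W)                           # encoder: learned acoustic feature o(t)
        X_hat = H @ W.T                              # tied-weight decoder
        err = X_hat - X
        dH = (err @ W) * (1 - H ** 2)
        W -= 1e-4 * (X.T @ dH + err.T @ H)           # gradient of ||X − X_hat||² w.r.t. tied W

    print(float(np.mean(err ** 2)))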

43 Relax approximation. Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

Ô = argmax_O p(X | O)
λ̂ = argmax_λ p(Ô | L̂, λ) p(λ)
↓
{λ̂, Ô} = argmax_{λ,O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization: HMM [18, 19, 20], NN [21, 22]

44 Relax approximation. Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [21]
[Diagram: a pulse train and the speech signal pass through cepstrum-based filters; the resulting derivatives of the voiced/unvoiced targets couple the vocoder analysis with forward and back propagation through an LSTM-RNN driven by linguistic features l_t.]

45 Relax approximation. Direct mapping from linguistic features to waveform

No explicit acoustic features:

{λ̂, Ô} = argmax_{λ,O} p(X | O) p(O | L̂, λ) p(λ)
↓
λ̂ = argmax_λ p(X | L̂, λ) p(λ)

Generative models for raw audio: LPC [23], WaveNet [24], SampleRNN [25]

48 WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals, x = {x_0, x_1, ..., x_{N−1}}: raw waveform

p(x | λ) = p(x_0, x_1, ..., x_{N−1} | λ) = Π_{n=0}^{N−1} p(x_n | x_0, ..., x_{n−1}, λ)

WaveNet [24] → p(x_n | x_0, ..., x_{n−1}, λ) is modeled by a convolutional NN

Key components:
Causal dilated convolution: captures long-term dependency
Gated convolution + residual + skip connections: powerful non-linearity
Softmax at output: classification rather than regression
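
The autoregressive factorization can be exercised independently of the network. In the sketch below, a toy stand-in distribution (here just preferring values near the previous sample) plays the role of WaveNet's deep convolutional net; generation is the same sample-by-sample loop either way:

    import numpy as np

    rng = np.random.default_rng(0)
    K = 256                                       # quantization levels (8-bit)

    def toy_model(history):                       # placeholder for the conv net
        prev = history[-1] if history else K // 2
        logits = -0.1 * (np.arange(K) - prev) ** 2
        p = np.exp(logits - logits.max())
        return p / p.sum()                        # p(x_n | x_0, ..., x_{n−1})

    x = []
    for n in range(100):                          # sample-by-sample generation
        x.append(rng.choice(K, p=toy_model(x)))
    print(x[:10])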

49 WaveNet Causal dilated convolution

100ms at 16kHz sampling = 1,600 time steps → too long to be captured by a normal RNN/LSTM
Dilated convolution: exponentially increases the receptive field size w.r.t. the number of layers

[Figure: inputs x_{n−16}, ..., x_{n−1} pass through three hidden layers of causal dilated convolutions to produce the output p(x_n | x_{n−16}, ..., x_{n−1}).]
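
A numpy sketch of causal dilated convolution with kernel size 2 and dilations 1, 2, 4, 8: four layers already give a receptive field of 16 samples, and the real model stacks many more layers to cover thousands:

    import numpy as np

    def causal_dilated_conv(x, w, dilation):
        # y[n] = w[0] * x[n - dilation] + w[1] * x[n], zero-padded on the left (causal)
        pad = np.concatenate([np.zeros(dilation), x])
        return w[0] * pad[:len(x)] + w[1] * x

    x = np.random.default_rng(0).standard_normal(32)
    for d in (1, 2, 4, 8):
        x = np.tanh(causal_dilated_conv(x, np.array([0.5, 0.5]), d))

    receptive_field = 1 + sum((2 - 1) * d for d in (1, 2, 4, 8))
    print(receptive_field)    # 16: output x_n depends on inputs x_{n−15}, ..., x_n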

50 WaveNet Non-linearity

[Figure: a stack of residual blocks (labels in the figure: 30 residual blocks, 512 residual channels, 256 skip channels). Each block: 2x1 dilated convolution → gated activation → 1x1 convolution, with a residual connection back to the block input and a skip connection out; the summed skips pass through ReLU → 1x1 → ReLU → 1x1 → softmax to give p(x_n | x_0, ..., x_{n−1}).]
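
One residual block following this structure, sketched in numpy with illustrative channel sizes (the 1x1 projections are left as identities for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    C, T = 8, 64                                       # channels, time steps

    def dilated(x, w, d):                              # causal dilated conv, kernel 2
        pad = np.pad(x, ((0, 0), (d, 0)))[:, :T]
        return w[0] @ pad + w[1] @ x

    W_f, W_g = rng.standard_normal((2, 2, C, C)) * 0.1 # filter & gate conv weights
    W_res, W_skip = np.eye(C), np.eye(C)               # 1x1 convs (toy: identity)

    def residual_block(x, d):
        gate = 1 / (1 + np.exp(-dilated(x, W_g, d)))   # sigmoid gate
        z = np.tanh(dilated(x, W_f, d)) * gate         # gated activation unit
        return x + W_res @ z, W_skip @ z               # (residual out, skip out)

    x = rng.standard_normal((C, T))
    skips = 0
    for d in (1, 2, 4, 8):
        x, s = residual_block(x, d)
        skips = skips + s                              # summed skips feed ReLU → 1x1 → softmax
    print(skips.shape)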

53 WaveNet Softmax

[Figures: an analog audio signal (amplitude vs. time) is sampled & quantized, and each sample becomes a category index; the network's softmax output is a categorical distribution over these amplitude levels.]
A histogram over quantized amplitudes can represent unimodal, multimodal, or skewed distributions.
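
The WaveNet paper quantizes amplitudes to 256 categories with 8-bit µ-law companding; a runnable encode/decode pair:

    import numpy as np

    MU = 255

    def mulaw_encode(x):                     # x in [-1, 1] → integer in [0, 255]
        y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
        return ((y + 1) / 2 * MU + 0.5).astype(np.int32)

    def mulaw_decode(q):                     # integer in [0, 255] → x in [-1, 1]
        y = 2 * (q.astype(np.float64) / MU) - 1
        return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU

    x = np.sin(np.linspace(0, 20, 1000))
    print(np.max(np.abs(x - mulaw_decode(mulaw_encode(x)))))   # small quantization error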

54 WaveNet Conditional modelling

[Figure: linguistic features l are embedded as h_n at each time step and added, through 1x1 convolutions, into the gated activation of every residual block, so that the network models p(x_n | x_0, ..., x_{n−1}, l).]

55 WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models [23, 26, 27]:
Stationary process w/ fixed-length analysis window → model estimated within a windowed frame w/ shift
Linear, time-invariant filter within a frame → but the relationship between samples can be non-linear
Gaussian process → assumes speech signals are normally distributed

WaveNet:
Sample-by-sample, non-linear, capable of taking additional inputs
Arbitrary-shaped signal distribution

State-of-the-art subjective naturalness with WaveNet-based TTS [24]
[MOS bar chart: HMM, LSTM, concatenative, WaveNet — WaveNet highest.]

56 Relax approximation. Towards Bayesian end-to-end TTS

Integrated end-to-end:

L̂ = argmax_L p(L | W)
λ̂ = argmax_λ p(X | L̂, λ) p(λ)
↓
λ̂ = argmax_λ p(X | W, λ) p(λ)

Text analysis is integrated into the model

57 Relax approximation. Towards Bayesian end-to-end TTS

Bayesian end-to-end:

λ̂ = argmax_λ p(X | W, λ) p(λ),  x ∼ f_x(w, λ̂) = p(x | w, λ̂)
↓
x ∼ f_x(w, X, W) = p(x | w, X, W) = ∫ p(x | w, λ) p(λ | X, W) dλ ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂_k)   (ensemble)

Marginalize model parameters & architecture
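
A sketch of the ensemble approximation in the last line: averaging the predictive distributions of K independently trained models (toy categorical predictors stand in for p(x | w, λ̂_k)):

    import numpy as np

    def predict_k(k, levels=256):                # stand-in for p(x | w, λ̂_k)
        rng = np.random.default_rng(k)           # each "model" k gives a different distribution
        logits = rng.standard_normal(levels)
        p = np.exp(logits - logits.max())
        return p / p.sum()

    K = 4
    p = np.mean([predict_k(k) for k in range(K)], axis=0)   # (1/K) Σ_k p(x | w, λ̂_k)
    print(float(p.sum()))                        # still a proper distribution: 1.0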

58 Outline Generative TTS Generative acoustic models for parametric TTS Hidden Markov models (HMMs) Neural networks Beyond parametric TTS Learned features WaveNet End-to-end Conclusion & future topics

59 Generative model-based text-to-speech synthesis

Bayes formulation + factorization + approximations
Representation: acoustic features, linguistic features, mapping
Mapping: rules → HMM → NN
Features: engineered → unsupervised, learned
Fewer approximations: joint training, direct waveform modelling; moving towards integrated & Bayesian end-to-end TTS
Naturalness: concatenative ≤ generative
Flexibility: concatenative < generative (e.g., multiple speakers)

61 Beyond text-to-speech synthesis

TTS on conversational assistants: texts aren't fully self-contained → need more context
Location, to resolve homographs
User query, to put the right emphasis

We need a representation that can organize the world's information & make it accessible & useful from TTS generative models.

63 Beyond generative TTS

Generative model-based TTS: the model represents the process behind speech production, trained to minimize error against human-produced speech; learned model → speaker

Speech is for communication: the goal is to maximize the amount of information received
The listener is missing → include a listener in training / in the model itself?

64 Thanks!

65 References I
[1] D. Klatt. Real-time speech synthesis by rule. Journal of the Acoustical Society of America, 68(S1):S18, 1980.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373-376, 1996.
[3] K. Tokuda. Speech synthesis as a statistical machine learning problem. Invited talk given at ASRU, 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. IEICE Trans. Inf. Syst., J83-D-II, 2000. (in Japanese).
[5] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039-1064, 2009.
[6] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962-7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964-1968, 2014.

66 References II
[8] H. Zen. Acoustic modeling for speech synthesis: from HMM to RNN. Invited talk given at ASRU, 2015.
[9] S. Takaki and J. Yamagishi. A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric speech synthesis. In Proc. ICASSP, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart. Distributed representations. In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi. Investigating very deep highway networks for parametric speech synthesis. In Proc. ISCA SSW, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari. Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis. In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, 2016.

67 References III
[15] P. Muthukumar and A. Black. A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis. arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao. Word embedding for recurrent neural network based TTS synthesis. In Proc. ICASSP, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi. Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech synthesis. IEICE Trans. Inf. Syst., E99-D, 2016.
[18] T. Toda and K. Tokuda. Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. ICASSP, 2008.
[19] Y.-J. Wu and K. Tokuda. Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis. In Proc. Interspeech, 2008.
[20] R. Maia, H. Zen, and M. Gales. Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters. In Proc. ISCA SSW, 2010.
[21] K. Tokuda and H. Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In Proc. ICASSP, 2015.
[22] K. Tokuda and H. Zen. Directly modeling voiced and unvoiced components in speech waveforms by neural networks. In Proc. ICASSP, 2016.

68 References IV
[23] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi. Unbiased estimation of log spectrum. In Proc. EURASIP, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux. Speech analysis with multi-kernel linear prediction. In Proc. Spring Conference of ASJ. (in Japanese).

69 [Summary diagram: graphical models over X, W (data), w (text), O, L, o, l (auxiliary representations), λ (model), and x (synthesized speech) for each formulation:]
(1) Bayesian
(2) Auxiliary variables + factorization
(3) Joint maximization
(4) Step-by-step maximization (e.g., statistical parametric TTS)
(5) Joint acoustic feature extraction + model training
(6) Conditional WaveNet-based TTS
(7) Integrated end-to-end
(8) Bayesian end-to-end
