MULTI-DISTRIBUTION DEEP BELIEF NETWORK FOR SPEECH SYNTHESIS. Shiyin Kang, Xiaojun Qian and Helen Meng


Human Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong SAR, China

ABSTRACT

Deep belief network (DBN) has been shown to be a good generative model in tasks such as hand-written digit image generation. Previous work on DBN in the speech community mainly focuses on using the generatively pre-trained DBN to initialize a discriminative model for better acoustic modeling in speech recognition (SR). To fully utilize its generative nature, we propose to model the speech parameters, including spectrum and F0, simultaneously and to generate these parameters from a DBN for speech synthesis. Compared with the predominant HMM-based approach, objective evaluation shows that the spectrum generated from the DBN has less distortion. Subjective results also confirm the advantage of the spectrum from the DBN, and the overall quality is comparable to that of context-independent HMM synthesis.

Index Terms: Speech synthesis, Deep belief network

1. INTRODUCTION

The past decade has witnessed the success of HMM-based text-to-speech (TTS) synthesis [1]. The core underpinning techniques include: (1) the adoption of multi-space distribution HMMs in estimating the statistical behavior of speech parameters [2]; and (2) the parameter generation algorithm, which uses dynamic features to smooth the originally piece-wise constant speech parameters drawn from the HMM states [3].

This work is our first attempt at using a deep belief network (DBN) to synthesize speech. A DBN is a probabilistic generative model composed of multiple layers of stochastic hidden variables [4]. Previously, DBNs have been shown to learn very good generative and discriminative models on high-dimensional data such as handwritten digits [4], facial pictures [5], human motion [6] and large-vocabulary speech recognition [7]. This motivates us to use a DBN to model the wide-band spectrogram and F0 contour for speech synthesis.
In this paper, the basic linguistic unit is the Mandarin tonal syllable. We keep the speech vocoding framework as in conventional statistical parametric speech synthesis [8]. A DBN is used to model the joint probability of syllables and speech parameters. The major difference between our approach and the prevalent HMM-based approach is that, instead of using a variable sequence of states to represent the time dynamics of each syllable, we evenly sample a fixed number of frames within the delimited syllable boundary at high resolution. Besides, in contrast to previous applications of DBN in acoustic modeling for SR, we model the continuous spectrum, the discrete voiced/unvoiced decision and the multi-space F0 pattern simultaneously, rather than only the continuous spectrum [9].

The rest of the paper is organized as follows: we introduce the DBN in the context of speech synthesis in Section 2. The procedure to synthesize speech using a DBN is described in Section 3. The experiments and results are shown in Section 4. Finally, we present the conclusions and future directions.

2. MULTI-DISTRIBUTION DEEP BELIEF NETWORK (MD-DBN)

Speech production theory describes how the linguistic message undergoes a series of neuromuscular and articulatory processes before acoustic realization. To mimic this sequential process, we attempt to model speech production with a directed belief network. However, it is difficult to learn a multi-layered belief network layer by layer, as this involves inferring the posterior of the hidden units immediately above. The insight of [4] is that this inference can be greatly simplified, i.e., the posterior can be derived from a single up-pass, if we assume the factorial prior coming from the upper layers is defined by a restricted Boltzmann machine (RBM) which shares the same connection weights with the current layer. This insight also enables a layer-wise greedy construction of a DBN from the bottom up, using RBMs as the building blocks.
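The fixed-frame sampling idea above, replacing a variable-length state sequence with a fixed number of uniformly spaced frames per syllable, can be sketched as follows. This is a minimal NumPy illustration with hypothetical names; the actual frame counts used in the paper are given in Section 3.1.

```python
import numpy as np

def resample_syllable(frames, n_out):
    """Pick n_out uniformly spaced frames from a (T, D) feature matrix.

    Hypothetical helper: the paper fixes n_out per syllable instead of
    representing time dynamics with a variable HMM state sequence.
    """
    T = frames.shape[0]
    # Uniformly spaced fractional positions, mapped to the nearest frame.
    idx = np.linspace(0, T - 1, n_out).round().astype(int)
    return frames[idx]

# A 37-frame syllable reduced to a fixed 10-frame representation:
syl = np.arange(37, dtype=float).reshape(37, 1)
fixed = resample_syllable(syl, 10)
```

The first and last frames of the syllable are always retained, so the delimited syllable boundary is respected regardless of the syllable's duration.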
To build an MD-DBN for speech synthesis, we use three types of RBMs: (1) mixed Gaussian-Bernoulli RBMs, for the spectrum, log-F0 and voiced/unvoiced representation with assumed Gaussian or Bernoulli distributions; (2) mixed Categorical-Bernoulli RBMs, for capturing the correspondence between syllable identities and the binary data derived from the speech representation; and (3) Bernoulli RBMs, which are used to encode binary data.
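As a concrete (hypothetical) illustration, the three visible-unit distributions correspond to three different sampling rules given the visible-unit activations. The function names below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bernoulli(act):
    """Binary visible units: Pr(v = 1) = sigmoid(activation)."""
    p = 1.0 / (1.0 + np.exp(-act))
    return (rng.random(act.shape) < p) * 1.0

def sample_gaussian(act, mu):
    """Gaussian visible units: unit-variance normal around act + mu."""
    return rng.normal(act + mu, 1.0)

def sample_categorical(act):
    """A 1-out-of-K categorical unit, drawn via the soft-max."""
    p = np.exp(act - act.max())   # subtract max for numerical stability
    p /= p.sum()
    return np.eye(len(act))[rng.choice(len(act), p=p)]

v_b = sample_bernoulli(np.zeros(4))                        # V/UV-style units
v_g = sample_gaussian(np.zeros(3), mu=np.array([1.0, -1.0, 0.0]))
l_c = sample_categorical(np.array([2.0, 0.0, -1.0]))       # syllable label
```

The categorical rule always yields exactly one active unit, matching the 1-out-of-K syllable coding used by the CB-RBM below.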

2.1. Bernoulli RBM (B-RBM)

A B-RBM is an undirected graphical model with one layer of stochastic visible binary units v and one layer of stochastic hidden binary units h. There is no interaction between units in the same layer, hence the model is "restricted". It defines the energy of a visible-hidden configuration (v, h) as follows:

E(v, h; Θ) = -h^T W v - b^T v - a^T h,   (1)

where Θ = {W, a, b} is the set of parameters of an RBM (Θ will be omitted for clarity hereafter). w_ij is the weight of the symmetric connection between hidden unit i and visible unit j, while a_i and b_j are their bias terms. The distribution of the (v, h) configuration is Pr(v, h) = exp(-E(v, h))/Z, where Z is the normalization term; i.e., the higher the energy, the lower the probability. The nice property of this setting is that the two conditionals Pr(h_i = 1 | v) and Pr(v_j = 1 | h) can be obtained easily, using the fact that h_i and v_j can only be either 0 or 1:

Pr(h_i = 1 | v) = σ(Σ_j w_ij v_j + a_i),   (2)
Pr(v_j = 1 | h) = σ(Σ_i w_ij h_i + b_j),   (3)

where σ(x) = (1 + e^{-x})^{-1}. To optimize the log-likelihood of v in a first-order approach, we need the gradient of log Pr(v) with respect to any θ in Θ:

∂ log Pr(v)/∂θ = -Σ_h Pr(h | v) ∂E(v, h)/∂θ + Σ_{v,h} Pr(v, h) ∂E(v, h)/∂θ.   (4)

Given the instantiated observation v, the expectation of derivatives in the first term of Eqn. (4) can be easily computed. Unfortunately, the second term of Eqn. (4) involves a summation over all possible v and h, and is intractable. A widely applied method that approximates this summation is the Gibbs sampler, which (optionally starting from v) proceeds in a Markov chain as follows:

v^(0) ← v,   h^(0) ~ Pr(h | v^(0));   (5a)
v^(1) ~ Pr(v | h^(0)),   h^(1) ~ Pr(h | v^(1));   (5b)
...

Contrastive divergence (CD) training [4] makes two further approximations: (1) the chain starts from the clamped v and is run for only k steps (CD-k); (2) the summation is replaced by a single sample. In particular, CD-1 measures the discrepancy between v and its reconstruction to provide a direction for optimization. Starting from a training frame v^(0), we only sample h^(0) in Eqn. (5a), and use the expectations from Eqn. (2) and Eqn. (3) to replace the random samples v^(1) and h^(1) in Eqn. (5b) for stability.

2.2. Mixed Gaussian-Bernoulli RBM (GB-RBM)

Speech parameters for synthesis include the spectrum and the log-F0 with assumed Gaussian distributions, and the voiced/unvoiced switches, which are essentially binary. To model these parameters simultaneously, we design the following energy function for the GB-RBM:

E(v^g, v^b, h) = (v^g - µ)^T (v^g - µ)/2 - h^T W^g v^g - h^T W^b v^b - b^T v^b - a^T h,   (6)

where v^g and v^b are the Gaussian units and the Bernoulli units in the visible layer, W^g and W^b are the respective weight matrices, and µ is the mean of v^g. The conditional Pr(h | v^g, v^b) can be similarly derived as:

Pr(h_i = 1 | v^g, v^b) = σ(Σ_j w^g_ij v^g_j + Σ_j w^b_ij v^b_j + a_i),   (7)

and Pr(v^b_j = 1 | h) follows Eqn. (3). The conditional Pr(v^g_j | h) involves an integral over the continuous v^g. We can show that:

Pr(v^g_j | h) = N(v^g_j; Σ_i w^g_ij h_i + µ_j, 1).   (8)

Here we have assumed that the data is normalized to have unit variance. Given these defined conditional probabilities, CD-1 training is the same as described in Section 2.1.

2.3. Mixed Categorical-Bernoulli RBM (CB-RBM)

As mentioned previously, the posterior of the hidden units can be inferred directly from an up-pass. To associate the inferred posteriors v^b with their corresponding indexed syllable label l^c, we define the energy of the CB-RBM as follows:

E(l^c, v^b, h) = -h^T W^c l^c - b^{cT} l^c - h^T W^b v^b - b^{bT} v^b - a^T h.   (9)

To make l^c ∈ {0, 1}^K follow a categorical distribution, i.e., represent the syllable identity using the 1-out-of-K code (K is the number of tonal syllables in the TTS system), we restrict l^c to take only the K 1-out-of-K codes. It can be shown that Pr(l^c_j = 1 | h) is defined by the soft-max:

Pr(l^c_j = 1 | h) = exp(Σ_i w^c_ij h_i + b^c_j) / Σ_k exp(Σ_i w^c_ik h_i + b^c_k).   (10)

The conditionals Pr(v^b_j = 1 | h) and Pr(h_i = 1 | l^c, v^b) take the same forms as Eqn. (3) and Eqn. (7), respectively, in the CD-k training procedure.
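The CD-1 procedure of Section 2.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it samples only h^(0) and uses expectations for the reconstruction, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.1):
    """One CD-1 update for a Bernoulli RBM.

    v0: (batch, n_vis) binary data; W: (n_hid, n_vis) weights;
    a: hidden biases; b: visible biases.
    """
    p_h0 = sigmoid(v0 @ W.T + a)                 # Eqn (2) on the data
    h0 = (rng.random(p_h0.shape) < p_h0) * 1.0   # the only random sample
    p_v1 = sigmoid(h0 @ W + b)                   # Eqn (3), expectation
    p_h1 = sigmoid(p_v1 @ W.T + a)               # Eqn (2) on reconstruction
    # Positive-phase minus negative-phase statistics, batch-averaged:
    W += lr * (p_h0.T @ v0 - p_h1.T @ p_v1) / v0.shape[0]
    a += lr * (p_h0 - p_h1).mean(axis=0)
    b += lr * (v0 - p_v1).mean(axis=0)
    return W, a, b

v = (rng.random((200, 6)) < 0.5) * 1.0           # a toy mini-batch
W = 0.01 * rng.standard_normal((4, 6))
a, b = np.zeros(4), np.zeros(6)
W, a, b = cd1_step(v, W, a, b)
```

The same update applies to the GB-RBM and CB-RBM once their conditionals (Eqns (7), (8) and (10)) are substituted for the Bernoulli ones.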

3. DBN-BASED SPEECH SYNTHESIS

3.1. Training Stage

Given a syllable's start and end times, we extract 50 uniformly-spaced frames of 24th-order Mel-Generalized Cepstrum coefficients (MGCs) [10] plus log-energy, and 200 uniformly-spaced frames of voiced/unvoiced (V/UV) decisions with the corresponding log-F0 values, within the syllable's boundary. The voiced and unvoiced frames are assigned 1s and 0s in their V/UV units, respectively. The log-F0 values for the unvoiced frames are all set to 0, a dummy value for log-F0. Both the MGCs and log-F0 are normalized to have zero mean and unit variance. The MGCs, log-F0 and V/UV units are concatenated to form a 1650-dimensional super-vector for each syllable, which is used as the visible layer v of the GB-RBM. The posteriors of the hidden layer, Pr(h^(1)_i | v^g, v^b) for all i, yielded by a simple up-pass, can be used as the visible data for training the immediate upper-layer B-RBM. Likewise, we stack up as many layers of B-RBMs as we want in a similar fashion. The joint distribution of the top hidden layer's posteriors (recursively propagated from below) and the associated syllable label is modeled by a CB-RBM.

3.2. Synthesis Stage

For an arbitrary text prompt, we look up the characters in a dictionary to find their tonal syllable pronunciations. Starting from a clamped 1-out-of-K coded syllable label l^c and an initial all-zero h^(N-1) in the top-layer CB-RBM, we calculate the conditionals Pr(h^(N)_i = 1 | l^c, h^(N-1)) for all i in layer (N) and Pr(h^(N-1)_j = 1 | h^(N)) for all j in layer (N-1), in an alternating fashion. This procedure continues until convergence or until a maximum number of iterations is reached. Pr(h^(N-1)_j | h^(N), l^c) is then recursively passed down the DBN to give Pr(v^g_j | h^(1)) and Pr(v^b_j | h^(1)) for the Gaussian units and the Bernoulli units, respectively, in the GB-RBM at the bottom. The diagram of parameter generation from the DBN is shown in Fig. (1). For each V/UV frame, we make a voicing decision depending on whether Pr(v^b_j = 1 | h^(1)) > 0.5.
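The alternating top-layer procedure and the recursive down-pass can be sketched in mean-field form as follows. All parameter names are hypothetical, and sigmoid expectations are used throughout; the bottom Gaussian units would use the linear mean of Eqn. (8) instead of a sigmoid.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def generate(label, Wc, Wb, a, bb, layers, n_iter=50):
    """Mean-field sketch of the synthesis pass (Section 3.2).

    label: clamped 1-out-of-K syllable code.  Wc, Wb, a, bb are the
    CB-RBM parameters of Eqn (9); layers is [(W, b), ...] for the
    stack below the top RBM, ordered top-down.
    """
    h_low = np.zeros(Wb.shape[1])                    # all-zero h(N-1)
    for _ in range(n_iter):                          # alternating updates
        h_top = sigmoid(Wc @ label + Wb @ h_low + a)  # Pr(h(N) | l, h(N-1))
        h_low = sigmoid(Wb.T @ h_top + bb)            # Pr(h(N-1) | h(N))
    x = h_low
    for W, b in layers:                              # recursive down-pass
        x = sigmoid(W.T @ x + b)
    return x                                         # bottom-layer expectations

# Toy dimensions: K syllables, H hidden units per layer, V visible units.
K, H, V = 5, 8, 12
rng = np.random.default_rng(1)
label = np.eye(K)[2]
out = generate(label, rng.standard_normal((H, K)),
               rng.standard_normal((H, H)), np.zeros(H), np.zeros(H),
               [(rng.standard_normal((H, V)), np.zeros(V))])
```

The returned expectations play the role of Pr(v^b_j | h^(1)) above: thresholding them at 0.5 gives the V/UV decisions, and the analogous Gaussian means give the MGC and log-F0 values.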
The log-F0 (if the corresponding V/UV decision is voiced) and MGC parameters are recovered by scaling with the standard deviation and offsetting by the mean. The duration of each syllable is the average estimated from the training data. We apply a 25-point median filter on the log-F0 trajectory to reduce noise. Then both the MGC and log-F0 parameter sequences are interpolated with cubic splines at the utterance level and decimated to yield parameter sequences with a constant 5-ms frame shift. The resulting speech parameter sequences are then fed to the Mel Log Spectral Approximation (MLSA) filter [11] for the final output signal.

Fig. 1. Architecture of the MD-DBN for speech synthesis. (The figure shows the tonal syllable ID l^c and the MGC, log-F0 and V/UV visible units beneath the stack of hidden layers h^(1) ... h^(N).)

4. EXPERIMENTS

4.1. Experiment Setup

A manually transcribed Mandarin corpus recorded from a female speaker is used for the experiments. The training set contains 1,000 utterances with a total length of 80.9 minutes, including 23,727 syllable samples. All these samples are partitioned into 1,364 classes of tonal syllables. Another test set with 100 utterances is used for model architecture determination and the objective evaluation.

The RBMs are trained using stochastic gradient descent with a mini-batch size of 200 training samples. For the GB-RBM, 400 epochs are executed with a learning rate of 0.01, while for the B-RBMs and CB-RBM, 200 epochs are executed with a learning rate of 0.1. During the weight updates, we apply a 0.9 momentum and a weight decay. The training procedure is accelerated by an NVIDIA Tesla M2090 GPU system. For an MD-DBN with 4 hidden layers and 2,000 units per layer, the training takes about 1.1 hours. Each epoch of training the GB-RBM, B-RBM and CB-RBM takes 3.5 s, 3.7 s and 5.7 s, respectively. A single GPU system runs about 8 times faster than an 8-core 2.4 GHz Intel Xeon E CPU system.

4.2. HMM-based Synthesis Baseline

We build a Mandarin HMM-based speech synthesis system on the same training set using a standard recipe [12].
The MGCs and log-F0, together with their Δ and Δ² dynamic features, are modeled by multi-stream HMMs. Each syllable HMM has a left-to-right topology with 10 states. Initially, 416 mono-syllable HMMs are estimated as the seed for the 1,364 tonal-syllable HMMs. In the synthesis stage, speech parameters including MGCs and log-F0 are obtained by the maximum-likelihood parameter generation algorithm [3], and are later used to produce the speech waveform through the MLSA filter. The syllable durations are the same as those in the DBN approach. For a

fair comparison, no contextual information or post-processing voice enhancement techniques are incorporated.

4.3. Optimizing the MD-DBN Architecture

Finding an adequate architecture for the MD-DBN is important in DBN-based speech synthesis. However, tweaking based on subjective evaluation is not practical due to the vast number of combinations of MD-DBN depth and layer width. One commonly-used objective method in voice conversion as well as speech synthesis is to employ the spectral distortion [13] between generated speech and target speech. Here we use the Mel-Generalized Cepstral Distortion (MGCD) to determine the MD-DBN architecture. MGCD is the Euclidean distance between the MGCs of synthesized speech and those of the original speech recording. All the test-set prompts are synthesized to compute the average MGCDs for the HMM baseline and for MD-DBNs with different numbers of hidden layers and different numbers of units per hidden layer. As the syllable durations of the test-set speech and of the HMM and MD-DBN output can differ, we align the MGCs according to the syllable boundaries.

Fig. 2. MGCD as a function of the width of the hidden layers. 3H-7H: number of hidden layers.

The MGCD of the HMM baseline is 0.223, while the DBN approach achieves a better (lower) minimum MGCD. As shown in Fig. (2), the MGC distortions are high when the layer width is 1,000, and too many units (4,000) in the hidden layer causes a large variance of the MGCD. The MD-DBN with 4 hidden layers and a hidden-layer width of 2,000 units consistently gives the best MGCD. Hence we use these parameters for the remainder of the evaluation.

4.4. Subjective Evaluation

A Mean Opinion Score (MOS) test is conducted to compare the subjective perception of the DBN approach and the HMM baseline, together with a hybrid approach (MIX) using MGCs from the DBN and log-F0 from the HMM.
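The MGCD objective of Section 4.3 reduces to an average frame-wise Euclidean distance once the two MGC sequences are aligned. A minimal sketch, assuming the sequences have already been aligned at syllable boundaries (e.g. by resampling each syllable to the same number of frames):

```python
import numpy as np

def mgcd(mgc_a, mgc_b):
    """Average Euclidean distance between two aligned MGC sequences.

    mgc_a, mgc_b: (T, D) arrays of MGC frames, frame-aligned.
    """
    return float(np.mean(np.linalg.norm(mgc_a - mgc_b, axis=1)))

# Each of the 4 frames differs by sqrt(3) in this toy example:
a = np.zeros((4, 3))
b = np.ones((4, 3))
d = mgcd(a, b)
```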
In the MOS test, each of the 10 experienced listeners is asked to rate 10 utterances synthesized by the DBN and the HMM using a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). The MOS results are shown in Table (1).

System                          MOS
HMM                             2.86
DBN                             2.88
MIX: DBN MGCs + HMM log-F0

Table 1. MOS test result.

Although the overall MOS scores show a draw between the DBN and the HMM, the two sets of synthesized speech sound different.¹ Without post-processing, the HMM voice sounds quite muffled, but its prosody remains smooth and stable. The DBN voice sounds much clearer than the HMM baseline, and its prosody is lively. However, tonal and V/UV decision errors seem to have a relatively higher chance of occurring in the DBN approach, which may be the main reason its MOS score is lowered. The MIX approach gets the highest score, which probably suggests that the spectrum can be captured appropriately by the DBN, but the HMM does a better job of F0 modeling.

The cross-comparison reveals more details. With the same log-F0 curve (HMM vs. MIX), the MGCs generated by the DBN lead to higher MOS scores. This result agrees with the objective MGCD measure (MGCD_HMM > MGCD_DBN). With the same MGCs (DBN vs. MIX), the log-F0 curve generated by the HMM results in higher MOS scores. It can be seen that listeners prefer smoother F0 patterns when there is no higher-level prosody control of the synthesized speech.

5. CONCLUSIONS AND FUTURE WORK

We have described a DBN model with a multi-distribution visible layer for statistical parametric speech synthesis, in which the spectrum and F0 are modeled simultaneously in a unified framework. It is shown that the DBN models the spectrum better than the HMM and achieves an overall performance that is comparable to context-independent HMM synthesis. Future directions for improvement include: (1) introducing context information to improve the prosody; and (2) better modeling of the F0 contour with a higher resolution of frame sampling.

6. ACKNOWLEDGMENT

The authors would like to thank Dr. Li Deng from MSR and Prof. Fei Sha from USC for fruitful exchanges.
¹ The synthesized speech samples can be downloaded from http:// sykang/icassp2013/

7. REFERENCES

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Eurospeech, 1999.

[2] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-space probability distribution HMM," IEICE Trans. Inf. & Syst., vol. E85-D.

[3] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP, 2000.

[4] G. E. Hinton, S. Osindero, and Y. W. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7.

[5] J. M. Susskind, G. E. Hinton, J. R. Movellan, and A. K. Anderson, "Generating facial expressions with deep belief nets," Affective Computing, Emotion Modelling, Synthesis and Recognition.

[6] G. W. Taylor, G. E. Hinton, and S. T. Roweis, "Two distributed-state models for generating high-dimensional time series," Journal of Machine Learning Research, vol. 12.

[7] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1.

[8] A. W. Black, H. Zen, and K. Tokuda, "Statistical parametric speech synthesis," in ICASSP, 2007.

[9] A. Mohamed, G. E. Dahl, and G. E. Hinton, "Acoustic modeling using deep belief networks," IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1.

[10] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis: a unified approach to speech spectral estimation," in ICSLP.

[11] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in ICASSP, 1992.

[12] Z. Shuang, S. Kang, Q. Shi, Y. Qin, and L. Cai, "Syllable HMM based Mandarin TTS and comparison with concatenative TTS," in INTERSPEECH, 2009.

[13] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," in ICASSP, 2009.


Chapter 5 FINITE DIFFERENCE METHOD (FDM) MEE7 Computer Modeling Tecniques in Engineering Capter 5 FINITE DIFFERENCE METHOD (FDM) 5. Introduction to FDM Te finite difference tecniques are based upon approximations wic permit replacing differential

More information

Homework 1 Due: Wednesday, September 28, 2016

Homework 1 Due: Wednesday, September 28, 2016 0-704 Information Processing and Learning Fall 06 Homework Due: Wednesday, September 8, 06 Notes: For positive integers k, [k] := {,..., k} denotes te set of te first k positive integers. Wen p and Y q

More information

Autoregressive Neural Models for Statistical Parametric Speech Synthesis

Autoregressive Neural Models for Statistical Parametric Speech Synthesis Autoregressive Neural Models for Statistical Parametric Speech Synthesis シンワン Xin WANG 2018-01-11 contact: wangxin@nii.ac.jp we welcome critical comments, suggestions, and discussion 1 https://www.slideshare.net/kotarotanahashi/deep-learning-library-coyotecnn

More information

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein

THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS. L. Trautmann, R. Rabenstein Worksop on Transforms and Filter Banks (WTFB),Brandenburg, Germany, Marc 999 THE STURM-LIOUVILLE-TRANSFORMATION FOR THE SOLUTION OF VECTOR PARTIAL DIFFERENTIAL EQUATIONS L. Trautmann, R. Rabenstein Lerstul

More information

Haplotyping. Biostatistics 666

Haplotyping. Biostatistics 666 Haplotyping Biostatistics 666 Previously Introduction to te E-M algoritm Approac for likeliood optimization Examples related to gene counting Allele frequency estimation recessive disorder Allele frequency

More information

Pre-Calculus Review Preemptive Strike

Pre-Calculus Review Preemptive Strike Pre-Calculus Review Preemptive Strike Attaced are some notes and one assignment wit tree parts. Tese are due on te day tat we start te pre-calculus review. I strongly suggest reading troug te notes torougly

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

Sin, Cos and All That

Sin, Cos and All That Sin, Cos and All Tat James K. Peterson Department of Biological Sciences and Department of Matematical Sciences Clemson University Marc 9, 2017 Outline Sin, Cos and all tat! A New Power Rule Derivatives

More information

Math 312 Lecture Notes Modeling

Math 312 Lecture Notes Modeling Mat 3 Lecture Notes Modeling Warren Weckesser Department of Matematics Colgate University 5 7 January 006 Classifying Matematical Models An Example We consider te following scenario. During a storm, a

More information

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks A Teoretically Grounded Application of Dropout in Recurrent Neural Networks Yarin Gal University of Cambridge {yg279,zg201}@cam.ac.uk oubin Garamani Abstract Recurrent neural networks (RNNs) stand at te

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

arxiv: v3 [cs.lg] 14 Oct 2015

arxiv: v3 [cs.lg] 14 Oct 2015 Learning Deep Generative Models wit Doubly Stocastic MCMC Cao Du Jun Zu Bo Zang Dept. of Comp. Sci. & Tec., State Key Lab of Intell. Tec. & Sys., TNList Lab, Center for Bio-Inspired Computing Researc,

More information

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds.

Flavius Guiaş. X(t + h) = X(t) + F (X(s)) ds. Numerical solvers for large systems of ordinary differential equations based on te stocastic direct simulation metod improved by te and Runge Kutta principles Flavius Guiaş Abstract We present a numerical

More information

Regularized Regression

Regularized Regression Regularized Regression David M. Blei Columbia University December 5, 205 Modern regression problems are ig dimensional, wic means tat te number of covariates p is large. In practice statisticians regularize

More information

Model development for the beveling of quartz crystal blanks

Model development for the beveling of quartz crystal blanks 9t International Congress on Modelling and Simulation, Pert, Australia, 6 December 0 ttp://mssanz.org.au/modsim0 Model development for te beveling of quartz crystal blanks C. Dong a a Department of Mecanical

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

LAPLACIAN MATRIX LEARNING FOR SMOOTH GRAPH SIGNAL REPRESENTATION

LAPLACIAN MATRIX LEARNING FOR SMOOTH GRAPH SIGNAL REPRESENTATION LAPLACIAN MATRIX LEARNING FOR SMOOTH GRAPH SIGNAL REPRESENTATION Xiaowen Dong, Dorina Tanou, Pascal Frossard and Pierre Vandergeynst Media Lab, MIT, USA xdong@mit.edu Signal Processing Laboratories, EPFL,

More information

The derivative function

The derivative function Roberto s Notes on Differential Calculus Capter : Definition of derivative Section Te derivative function Wat you need to know already: f is at a point on its grap and ow to compute it. Wat te derivative

More information

Introduction to Derivatives

Introduction to Derivatives Introduction to Derivatives 5-Minute Review: Instantaneous Rates and Tangent Slope Recall te analogy tat we developed earlier First we saw tat te secant slope of te line troug te two points (a, f (a))

More information

arxiv: v1 [cs.cl] 23 Sep 2013

arxiv: v1 [cs.cl] 23 Sep 2013 Feature Learning with Gaussian Restricted Boltzmann Machine for Robust Speech Recognition Xin Zheng 1,2, Zhiyong Wu 1,2,3, Helen Meng 1,3, Weifeng Li 1, Lianhong Cai 1,2 arxiv:1309.6176v1 [cs.cl] 23 Sep

More information

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender

Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis. Thomas Ewender Nearly Perfect Detection of Continuous F 0 Contour and Frame Classification for TTS Synthesis Thomas Ewender Outline Motivation Detection algorithm of continuous F 0 contour Frame classification algorithm

More information

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006

Math 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 Mat 102 TEST CHAPTERS 3 & 4 Solutions & Comments Fall 2006 f(x+) f(x) 10 1. For f(x) = x 2 + 2x 5, find ))))))))) and simplify completely. NOTE: **f(x+) is NOT f(x)+! f(x+) f(x) (x+) 2 + 2(x+) 5 ( x 2

More information

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning

Long Term Time Series Prediction with Multi-Input Multi-Output Local Learning Long Term Time Series Prediction wit Multi-Input Multi-Output Local Learning Gianluca Bontempi Macine Learning Group, Département d Informatique Faculté des Sciences, ULB, Université Libre de Bruxelles

More information

Online Learning: Bandit Setting

Online Learning: Bandit Setting Online Learning: Bandit Setting Daniel asabi Summer 04 Last Update: October 0, 06 Introduction [TODO Bandits. Stocastic setting Suppose tere exists unknown distributions ν,..., ν, suc tat te loss at eac

More information

GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES

GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES GRID CONVERGENCE ERROR ANALYSIS FOR MIXED-ORDER NUMERICAL SCHEMES Cristoper J. Roy Sandia National Laboratories* P. O. Box 5800, MS 085 Albuquerque, NM 8785-085 AIAA Paper 00-606 Abstract New developments

More information

3.1 Extreme Values of a Function

3.1 Extreme Values of a Function .1 Etreme Values of a Function Section.1 Notes Page 1 One application of te derivative is finding minimum and maimum values off a grap. In precalculus we were only able to do tis wit quadratics by find

More information

Chapter 2 Ising Model for Ferromagnetism

Chapter 2 Ising Model for Ferromagnetism Capter Ising Model for Ferromagnetism Abstract Tis capter presents te Ising model for ferromagnetism, wic is a standard simple model of a pase transition. Using te approximation of mean-field teory, te

More information

5.1 We will begin this section with the definition of a rational expression. We

5.1 We will begin this section with the definition of a rational expression. We Basic Properties and Reducing to Lowest Terms 5.1 We will begin tis section wit te definition of a rational epression. We will ten state te two basic properties associated wit rational epressions and go

More information

Artificial Neural Network Model Based Estimation of Finite Population Total

Artificial Neural Network Model Based Estimation of Finite Population Total International Journal of Science and Researc (IJSR), India Online ISSN: 2319-7064 Artificial Neural Network Model Based Estimation of Finite Population Total Robert Kasisi 1, Romanus O. Odiambo 2, Antony

More information

Quantum Numbers and Rules

Quantum Numbers and Rules OpenStax-CNX module: m42614 1 Quantum Numbers and Rules OpenStax College Tis work is produced by OpenStax-CNX and licensed under te Creative Commons Attribution License 3.0 Abstract Dene quantum number.

More information

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example,

NUMERICAL DIFFERENTIATION. James T. Smith San Francisco State University. In calculus classes, you compute derivatives algebraically: for example, NUMERICAL DIFFERENTIATION James T Smit San Francisco State University In calculus classes, you compute derivatives algebraically: for example, f( x) = x + x f ( x) = x x Tis tecnique requires your knowing

More information

An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis

An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis ICASSP 07 New Orleans, USA An Autoregressive Recurrent Mixture Density Network for Parametric Speech Synthesis Xin WANG, Shinji TAKAKI, Junichi YAMAGISHI National Institute of Informatics, Japan 07-03-07

More information

INTRODUCTION AND MATHEMATICAL CONCEPTS

INTRODUCTION AND MATHEMATICAL CONCEPTS Capter 1 INTRODUCTION ND MTHEMTICL CONCEPTS PREVIEW Tis capter introduces you to te basic matematical tools for doing pysics. You will study units and converting between units, te trigonometric relationsips

More information

arxiv: v3 [cs.ds] 4 Aug 2017

arxiv: v3 [cs.ds] 4 Aug 2017 Non-preemptive Sceduling in a Smart Grid Model and its Implications on Macine Minimization Fu-Hong Liu 1, Hsiang-Hsuan Liu 1,2, and Prudence W.H. Wong 2 1 Department of Computer Science, National Tsing

More information

Exact Maximum Margin Structure Learning of Bayesian Networks

Exact Maximum Margin Structure Learning of Bayesian Networks Robert Pearz robert.pearz@tugraz.at Franz Pernkopf pernkopf@tugraz.at Signal Processing and Speec Communication Laboratory, Graz University of Tecnology, Austria Abstract Recently, tere as been muc interest

More information

Section 15.6 Directional Derivatives and the Gradient Vector

Section 15.6 Directional Derivatives and the Gradient Vector Section 15.6 Directional Derivatives and te Gradient Vector Finding rates of cange in different directions Recall tat wen we first started considering derivatives of functions of more tan one variable,

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

COMPARISON OF FUZZY LOGIC CONTROLLERS FOR A MULTIVARIABLE PROCESS

COMPARISON OF FUZZY LOGIC CONTROLLERS FOR A MULTIVARIABLE PROCESS COMPARISON OF FUZZY LOGIC CONTROLLERS FOR A MULTIVARIABLE PROCESS KARTHICK S, LAKSHMI P, DEEPA T 3 PG Student, DEEE, College of Engineering, Guindy, Anna University, Cennai Associate Professor, DEEE, College

More information

1. AB Calculus Introduction

1. AB Calculus Introduction 1. AB Calculus Introduction Before we get into wat calculus is, ere are several eamples of wat you could do BC (before calculus) and wat you will be able to do at te end of tis course. Eample 1: On April

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

AN IMPROVED WEIGHTED TOTAL HARMONIC DISTORTION INDEX FOR INDUCTION MOTOR DRIVES

AN IMPROVED WEIGHTED TOTAL HARMONIC DISTORTION INDEX FOR INDUCTION MOTOR DRIVES AN IMPROVED WEIGHTED TOTA HARMONIC DISTORTION INDEX FOR INDUCTION MOTOR DRIVES Tomas A. IPO University of Wisconsin, 45 Engineering Drive, Madison WI, USA P: -(608)-6-087, Fax: -(608)-6-5559, lipo@engr.wisc.edu

More information

Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis

Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis ICASSP 2018 Calgary, Canada Comparing Recent Waveform Generation and Acoustic Modeling Methods for Neural-network-based Speech Synthesis Xin WANG, Jaime Lorenzo-Trueba, Shinji TAKAKI, Lauri Juvela, Junichi

More information

Financial Econometrics Prof. Massimo Guidolin

Financial Econometrics Prof. Massimo Guidolin CLEFIN A.A. 2010/2011 Financial Econometrics Prof. Massimo Guidolin A Quick Review of Basic Estimation Metods 1. Were te OLS World Ends... Consider two time series 1: = { 1 2 } and 1: = { 1 2 }. At tis

More information

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set

WYSE Academic Challenge 2004 Sectional Mathematics Solution Set WYSE Academic Callenge 00 Sectional Matematics Solution Set. Answer: B. Since te equation can be written in te form x + y, we ave a major 5 semi-axis of lengt 5 and minor semi-axis of lengt. Tis means

More information

IEOR 165 Lecture 10 Distribution Estimation

IEOR 165 Lecture 10 Distribution Estimation IEOR 165 Lecture 10 Distribution Estimation 1 Motivating Problem Consider a situation were we ave iid data x i from some unknown distribution. One problem of interest is estimating te distribution tat

More information

Termination Problems in Chemical Kinetics

Termination Problems in Chemical Kinetics Termination Problems in Cemical Kinetics Gianluigi Zavattaro and Luca Cardelli 2 Dip. Scienze dell Informazione, Università di Bologna, Italy 2 Microsoft Researc, Cambridge, UK Abstract. We consider nondeterministic

More information

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c)

= 0 and states ''hence there is a stationary point'' All aspects of the proof dx must be correct (c) Paper 1: Pure Matematics 1 Mark Sceme 1(a) (i) (ii) d d y 3 1x 4x x M1 A1 d y dx 1.1b 1.1b 36x 48x A1ft 1.1b Substitutes x = into teir dx (3) 3 1 4 Sows d y 0 and states ''ence tere is a stationary point''

More information

Why gravity is not an entropic force

Why gravity is not an entropic force Wy gravity is not an entropic force San Gao Unit for History and Pilosopy of Science & Centre for Time, SOPHI, University of Sydney Email: sgao7319@uni.sydney.edu.au Te remarkable connections between gravity

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Deep Generative Models

Deep Generative Models Deep Generative Models Durk Kingma Max Welling Deep Probabilistic Models Worksop Wednesday, 1st of Oct, 2014 D.P. Kingma Deep generative models Transformations between Bayes nets and Neural nets Transformation

More information

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk

Bob Brown Math 251 Calculus 1 Chapter 3, Section 1 Completed 1 CCBC Dundalk Bob Brown Mat 251 Calculus 1 Capter 3, Section 1 Completed 1 Te Tangent Line Problem Te idea of a tangent line first arises in geometry in te context of a circle. But before we jump into a discussion of

More information

HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS

HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS HARMONIC ALLOCATION TO MV CUSTOMERS IN RURAL DISTRIBUTION SYSTEMS V Gosbell University of Wollongong Department of Electrical, Computer & Telecommunications Engineering, Wollongong, NSW 2522, Australia

More information

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5

THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION MODULE 5 THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA EXAMINATION NEW MODULAR SCHEME introduced from te examinations in 009 MODULE 5 SOLUTIONS FOR SPECIMEN PAPER B THE QUESTIONS ARE CONTAINED IN A SEPARATE FILE

More information

Journal of Computational and Applied Mathematics

Journal of Computational and Applied Mathematics Journal of Computational and Applied Matematics 94 (6) 75 96 Contents lists available at ScienceDirect Journal of Computational and Applied Matematics journal omepage: www.elsevier.com/locate/cam Smootness-Increasing

More information

Excluded Volume Effects in Gene Stretching. Pui-Man Lam Physics Department, Southern University Baton Rouge, Louisiana

Excluded Volume Effects in Gene Stretching. Pui-Man Lam Physics Department, Southern University Baton Rouge, Louisiana Excluded Volume Effects in Gene Stretcing Pui-Man Lam Pysics Department, Soutern University Baton Rouge, Louisiana 7083 Abstract We investigate te effects excluded volume on te stretcing of a single DNA

More information

Handling Missing Data on Asymmetric Distribution

Handling Missing Data on Asymmetric Distribution International Matematical Forum, Vol. 8, 03, no. 4, 53-65 Handling Missing Data on Asymmetric Distribution Amad M. H. Al-Kazale Department of Matematics, Faculty of Science Al-albayt University, Al-Mafraq-Jordan

More information