JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

Modeling Spectral Envelopes Using Restricted Boltzmann Machines and Deep Belief Networks for Statistical Parametric Speech Synthesis

Zhen-Hua Ling, Member, IEEE, Li Deng, Fellow, IEEE, and Dong Yu, Senior Member, IEEE

Abstract—This paper presents a new spectral modeling method for statistical parametric speech synthesis. In the conventional methods, high-level spectral parameters, such as mel-cepstra or line spectral pairs, are adopted as the features for hidden Markov model (HMM) based parametric speech synthesis. Our proposed method described in this paper improves the conventional method in two ways. First, distributions of low-level, un-transformed spectral envelopes (extracted by the STRAIGHT vocoder) are used as the parameters for synthesis. Second, instead of using single Gaussian distributions, we adopt graphical models with multiple hidden variables, including restricted Boltzmann machines (RBMs) and deep belief networks (DBNs), to represent the distribution of the low-level spectral envelopes at each HMM state. At synthesis time, the spectral envelopes are predicted from the RBM-HMMs or the DBN-HMMs of the input sentence following the maximum output probability parameter generation criterion with the constraints of the dynamic features. A Gaussian approximation is applied to the marginal distribution of the visible stochastic variables in the RBM or DBN at each HMM state in order to achieve a closed-form solution to the parameter generation problem. Our experimental results show that both RBM-HMM and DBN-HMM are able to generate spectral envelope parameter sequences better than the conventional Gaussian-HMM, with superior generalization capabilities, and that DBN-HMM and RBM-HMM perform similarly, possibly due to the use of the Gaussian approximation. As a result, our proposed method can significantly alleviate the over-smoothing effect and improve the naturalness of the conventional HMM-based speech synthesis system using mel-cepstra.
Index Terms—Speech synthesis, hidden Markov model, restricted Boltzmann machine, deep belief network, spectral envelope

I. INTRODUCTION

THE hidden Markov model (HMM)-based parametric speech synthesis method has become a mainstream speech synthesis method in recent years [2], [3]. In this method, the spectrum, F0 and segment durations are modeled simultaneously within a unified HMM framework [2]. At synthesis time, these parameters are predicted so as to maximize their output probabilities from the HMM of the input sentence. The constraints of the dynamic features are considered during parameter generation in order to guarantee the smoothness of the generated spectral and F0 trajectories [4]. Finally, the predicted parameters are sent to a speech synthesizer to reconstruct the speech waveforms. This method is able to synthesize highly intelligible and smooth speech sounds [5], [6]. However, the quality of the synthetic speech is degraded by three main factors: limitations of the parametric synthesizer itself, inadequacy of the acoustic modeling used in the synthesizer, and the over-smoothing effect of parameter generation [7]. Many improved approaches have been proposed to overcome the disadvantages associated with these three factors.

This work was partially funded by the National Nature Science Foundation of China (Grant No. 61273032) and the China Scholarship Council Young Teacher Study Abroad Project. This paper is the expanded version of the conference paper published in ICASSP-2013 [1]. Z.-H. Ling is with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China (e-mail: zling@ustc.edu.cn). This work was carried out when he was a visiting scholar at the Department of Electrical Engineering, University of Washington, WA, USA. L. Deng and D. Yu are with Microsoft Research, One Microsoft Way, Redmond, WA, 98052, USA (e-mail: deng@microsoft.com; dongyu@microsoft.com).
In terms of the speech synthesizer, STRAIGHT [8], as a high-performance speech vocoder, has been widely used in current HMM-based speech synthesis systems. It follows the source-filter model of speech production. In order to represent the excitation and vocal tract characteristics separately, F0 and a smooth spectral envelope without periodicity interference are extracted at each frame. Then, mel-cepstra [5] or line spectral pairs [6] can be derived from the spectral envelopes of the training data for the subsequent HMM modeling. During synthesis, the generated spectral parameters are used either to reconstruct the speech waveforms directly or to recover the spectral envelopes for further speech reconstruction by STRAIGHT.

Acoustic modeling is another key component of HMM-based parametric speech synthesis. In the common spectral modeling methods, the probability density function (PDF) of each HMM state is represented by a single Gaussian distribution with a diagonal covariance matrix, and the distribution parameters are estimated under the maximum likelihood (ML) criterion [2]. Because single Gaussian distributions are used as the state PDFs, the outputs of maximum output probability parameter generation tend to distribute near the modes (also the means) of the Gaussians, which are estimated by averaging observations with similar context descriptions in the ML training. Although this averaging process improves the robustness of parameter generation, the detailed characteristics of the spectral parameters are lost. Therefore, the reconstructed spectral envelopes are over-smoothed, which leads to a muffled voice quality in the synthetic speech. Existing refinements of the acoustic modeling include increasing the number of Gaussians for each HMM state [4], reformulating the HMM as a trajectory model [9], and improving the model training criterion by minimizing the generation error [10].

In order to alleviate the over-smoothing effect, many improved parameter generation methods have also been proposed, such as modifying the parameter generation criterion by integrating a global variance model [11] or minimizing model divergences [12], post-filtering after parameter generation [6], [13], using real speech parameters or segments to generate the speech waveform [14], [15], or sampling trajectories from the predictive distribution [16], [17], and so on.

In this paper, we propose a new spectral modeling method which copes with the first two factors mentioned above. First, the raw spectral envelopes extracted by the STRAIGHT vocoder are utilized directly, without further deriving spectral parameters from them during feature extraction. Compared with the high-level¹ spectral parameters, such as mel-cepstra or line spectral pairs, the low-level spectral envelopes are more physically meaningful and more directly related to the subjective perception of speech quality. Thus, the influence of spectral parameter extraction on the spectral modeling can be avoided. A similar approach can be found in [18], where the spectral envelopes derived from the harmonic amplitudes were adopted to replace the mel-cepstra for HMM-based Arabic speech synthesis and a naturalness improvement was achieved. Second, graphical models with multiple hidden layers, such as restricted Boltzmann machines (RBMs) [19] and deep belief networks (DBNs) [20], are introduced to represent the distribution of the spectral envelopes at each HMM state instead of a single Gaussian distribution. An RBM is a bipartite undirected graphical model with a two-layer architecture, and a DBN contains more hidden layers, which can be estimated using a stack of RBMs. Both of these models are better at describing the distribution of high-dimensional observations with cross-dimension correlations (i.e., the spectral envelopes) than the single Gaussian distribution and the Gaussian mixture model (GMM).
The acoustic modeling method, which describes the production, perception and distribution of speech signals, has always been an important research topic in speech signal processing [21]. In recent years, RBMs and DBNs have been successfully applied to modeling speech signals, for example in spectrogram coding [22], speech recognition [23], [24], and acoustic-articulatory inversion mapping [25], where they mainly act as pre-training methods for a deep autoencoder or a deep neural network (DNN). The architectures used in deep learning as applied to speech processing have been motivated by the multi-layered structures in both speech production and perception, involving phonological features, motor control, articulatory dynamics, and acoustic and auditory parameters [26], [27]. Approaches applying RBMs, DBNs, and other deep learning methods to statistical parametric speech synthesis have also been studied very recently [1], [28]-[30]. In [28], a DNN-based statistical parametric speech synthesis method is presented, which maps the input context information to the acoustic features using a neural network with a deep structure. In [29], a DNN which is pre-trained by DBN learning is adopted as a feature extractor for Gaussian-process-based F0 contour prediction. Furthermore, RBMs and DBNs can be used as density models instead of as DNN initialization methods for the speech synthesis application. In [30], a single DBN model is trained to represent the joint distribution between the tonal syllable ID and the acoustic features. In [1], a set of RBMs is estimated to describe the distributions of the spectral envelopes in the context-dependent HMM states.

¹ Here, the level refers to the number of signal processing steps involved in the spectral feature extraction. The high-level spectral parameters are commonly derived from the low-level ones by functional representation and parameterization.
In this paper, we extend our previous work in [1] by incorporating the dynamic features of spectral envelopes into the RBM modeling and by developing the RBMs into DBNs, which have more layers of hidden units. This paper is organized as follows. In Section II, we briefly review the basic techniques of RBMs and DBNs. In Section III, we describe the details of our proposed method. Section IV reports our experimental results. Section V concludes the paper and discusses our future work.

II. RESTRICTED BOLTZMANN MACHINES AND DEEP BELIEF NETWORKS

A. Restricted Boltzmann machines

An RBM is a kind of bipartite undirected graphical model (i.e., Markov random field) which describes the dependency among a set of random variables using a two-layer architecture [19]. In this model, the visible stochastic units $\mathbf{v} = [v_1, \ldots, v_V]^\top$ are connected to the hidden stochastic units $\mathbf{h} = [h_1, \ldots, h_H]^\top$ as shown in Fig. 1(a), where $V$ and $H$ are the numbers of units in the visible and hidden layers respectively, and $(\cdot)^\top$ denotes matrix transposition. Assuming $\mathbf{v} \in \{0,1\}^V$ and $\mathbf{h} \in \{0,1\}^H$ are both binary stochastic variables, the energy function of the state $\{\mathbf{v}, \mathbf{h}\}$ is defined as

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{V} a_i v_i - \sum_{j=1}^{H} b_j h_j - \sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} v_i h_j, \quad (1)$$

where $w_{ij}$ represents the symmetric interaction between $v_i$ and $h_j$, and $a_i$ and $b_j$ are bias terms. The model parameters are composed of $\mathbf{a} = [a_1, \ldots, a_V]^\top$, $\mathbf{b} = [b_1, \ldots, b_H]^\top$, and $\mathbf{W} = \{w_{ij}\}_{V \times H}$. The joint distribution over the visible and hidden units is defined as

$$P(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} \exp\left(-E(\mathbf{v}, \mathbf{h})\right), \quad (2)$$

where

$$Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right) \quad (3)$$

is the partition function, which can be estimated using the annealed importance sampling (AIS) method [31]. Therefore, the probability density function over the visible vector $\mathbf{v}$ can be calculated as

$$P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right). \quad (4)$$

Given a training set, the RBM model parameters $\{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$ can be estimated by maximum likelihood learning using the contrastive divergence (CD) algorithm [32]. An RBM can also be applied to model the distribution of real-valued data (e.g., the speech parameters) by adopting its

Gaussian-Bernoulli form, which means $\mathbf{v} \in \mathbb{R}^V$ is real-valued and $\mathbf{h} \in \{0,1\}^H$ is binary. Thus, the energy function of the state $\{\mathbf{v}, \mathbf{h}\}$ is defined as

$$E(\mathbf{v}, \mathbf{h}) = \sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j=1}^{H} b_j h_j - \sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} \frac{v_i}{\sigma_i} h_j, \quad (5)$$

where the variance parameters $\sigma_i^2$ are commonly fixed to a predetermined value instead of being learned from the training data [33].

Fig. 1. The graphical model representations for a) an RBM and b) a three-hidden-layer DBN.

B. Deep belief networks

A deep belief network (DBN) is a probabilistic generative model which is composed of many layers of hidden units [20]. The graphical model representation for a three-hidden-layer DBN is shown in Fig. 1(b). In this model, each layer captures the correlations among the activities of hidden features in the layer below. The top two layers of the DBN form an undirected bipartite graph. The lower layers form a directed graph with a top-down direction to generate the visible units. Mathematically, the joint distribution over the visible and all hidden units can be written as

$$P(\mathbf{v}, \mathbf{h}^1, \ldots, \mathbf{h}^L) = P(\mathbf{v} \mid \mathbf{h}^1) P(\mathbf{h}^1 \mid \mathbf{h}^2) \cdots P(\mathbf{h}^{L-2} \mid \mathbf{h}^{L-1}) P(\mathbf{h}^{L-1}, \mathbf{h}^L), \quad (6)$$

where $\mathbf{h}^l = [h^l_1, \ldots, h^l_{H_l}]^\top$ is the hidden stochastic vector of the $l$-th hidden layer, $H_l$ is the dimensionality of $\mathbf{h}^l$, and $L$ is the number of hidden layers. The joint distribution $P(\mathbf{h}^{L-1}, \mathbf{h}^L)$ is represented by an RBM as in (2), with weight matrix $\mathbf{W}^L$ and bias vectors $\mathbf{a}^L$ and $\mathbf{b}^L$. $P(\mathbf{v} \mid \mathbf{h}^1)$ and $P(\mathbf{h}^{l-1} \mid \mathbf{h}^l)$, $l \in \{2, 3, \ldots, L-1\}$, are represented by sigmoid belief networks [34]. Each sigmoid belief network is described by a weight matrix $\mathbf{W}^l$ and a bias vector $\mathbf{a}^l$. Assuming $\mathbf{v}$ is real-valued and $\mathbf{h}^l$, $l \in \{1, 2, \ldots, L\}$, are binary, the dependency between $\mathbf{v}$ and $\mathbf{h}^1$ in the sigmoid belief network is described by

$$P(\mathbf{v} \mid \mathbf{h}^1) = \mathcal{N}(\mathbf{v}; \mathbf{W}^1 \mathbf{h}^1 + \mathbf{a}^1, \boldsymbol{\Sigma}), \quad (7)$$

where $\mathcal{N}(\cdot)$ denotes a Gaussian distribution, and $\boldsymbol{\Sigma} = \mathrm{diag}\{\sigma_i^2\}$ becomes an identity matrix when the $\sigma_i^2$ are fixed to 1 during model training.
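As a concrete illustration of CD training for a Gaussian-Bernoulli RBM of the form (5) (with the $\sigma_i$ fixed to 1), the following NumPy sketch performs CD-1 updates. It is a minimal toy example under assumed shapes and hyperparameters, not the training configuration used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, a, b, W, lr=1e-4):
    """One CD-1 update for a Gaussian-Bernoulli RBM (sigma_i fixed to 1).

    v0 : (batch, V) real-valued visible data
    a  : (V,) visible biases, b : (H,) hidden biases, W : (V, H) weights
    """
    # Positive phase: P(h_j = 1 | v) = sigmoid(b_j + v^T w_j)
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
    # Negative phase: one Gibbs step back to the visibles; for a
    # Gaussian-Bernoulli RBM the conditional mean of v given h is W h + a
    v1 = h0 @ W.T + a
    ph1 = sigmoid(v1 @ W + b)
    # Gradients: difference between data and reconstruction statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return a, b, W

# Toy usage: V=6 visible units, H=4 hidden units, random "training" data
V, H = 6, 4
a, b, W = np.zeros(V), np.zeros(H), 0.01 * rng.standard_normal((V, H))
data = rng.standard_normal((10, V))
for _ in range(200):
    a, b, W = cd1_step(data, a, b, W)
```

The CD-1 gradient replaces the intractable model expectation in the exact likelihood gradient with a one-step Gibbs reconstruction, which is why no partition function is needed during training.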
For $l \in \{2, 3, \ldots, L-1\}$, the dependency between two adjacent hidden layers is represented by

$$P(h^{l-1}_i = 1 \mid \mathbf{h}^l) = g\Big(a^l_i + \sum_j w^l_{ij} h^l_j\Big), \quad (8)$$

where $g(x) = 1/(1 + \exp(-x))$ is the sigmoid function. For an $L$-hidden-layer DBN, the model parameters are composed of $\{\mathbf{a}^1, \mathbf{W}^1, \ldots, \mathbf{a}^{L-1}, \mathbf{W}^{L-1}, \mathbf{a}^L, \mathbf{b}^L, \mathbf{W}^L\}$. Further, the marginal distribution of the visible variables of a DBN can be written as

$$P(\mathbf{v}) = \sum_{\mathbf{h}^1} \cdots \sum_{\mathbf{h}^L} P(\mathbf{v}, \mathbf{h}^1, \ldots, \mathbf{h}^L). \quad (9)$$

Given training samples of the visible units, it is difficult to estimate the model parameters of a DBN directly under the maximum likelihood criterion due to the complex model structure with multiple hidden layers. Therefore, a greedy learning algorithm has been proposed and popularly applied to train the DBN in a layer-by-layer manner [20]. A stack of RBMs is used in this algorithm. First, it estimates the parameters $\{\mathbf{a}^1, \mathbf{b}^1, \mathbf{W}^1\}$ of the first-layer RBM to model the visible training data. Then, it freezes the parameters $\{\mathbf{a}^1, \mathbf{W}^1\}$ of the first layer and draws samples from $P(\mathbf{h}^1 \mid \mathbf{v})$ to train the next-layer RBM $\{\mathbf{a}^2, \mathbf{b}^2, \mathbf{W}^2\}$, where

$$P(h^1_j = 1 \mid \mathbf{v}) = g\Big(b^1_j + \sum_i w^1_{ij} v_i\Big). \quad (10)$$

This training procedure is conducted recursively until it reaches the top layer and obtains $\{\mathbf{a}^L, \mathbf{b}^L, \mathbf{W}^L\}$. It has been proved that this greedy learning algorithm improves the lower bound on the log-likelihood of the training samples with each new hidden layer added [20], [31]. Once the model parameters are estimated, directly calculating the log-probability that a DBN assigns to training or test data by (9) is also computationally intractable. A lower bound on the log-probability can be estimated by combining the AIS-based partition function estimation with approximate inference [31].

III. SPECTRAL ENVELOPE MODELING USING RBMS AND DBNS

A. HMM-based parametric speech synthesis

First, the conventional HMM-based parametric speech synthesis method is briefly reviewed. It consists of a training stage and a synthesis stage.
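The greedy layer-by-layer DBN training reviewed in Section II-B can be sketched as follows. This is a toy NumPy illustration under assumed shapes; `train_rbm` here is a minimal Bernoulli-Bernoulli CD-1 trainer standing in for whatever RBM trainer is actually used:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, H, epochs=50, lr=0.05):
    """Minimal CD-1 trainer for a Bernoulli-Bernoulli RBM; returns (a, b, W)."""
    V = data.shape[1]
    a, b, W = np.zeros(V), np.zeros(H), 0.01 * rng.standard_normal((V, H))
    for _ in range(epochs):
        ph0 = sigmoid(data @ W + b)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(h0 @ W.T + a)
        ph1 = sigmoid(v1 @ W + b)
        n = data.shape[0]
        W += lr * (data.T @ ph0 - v1.T @ ph1) / n
        a += lr * (data - v1).mean(axis=0)
        b += lr * (ph0 - ph1).mean(axis=0)
    return a, b, W

def train_dbn(v, layer_sizes):
    """Greedy layer-by-layer DBN training: fit an RBM, freeze it, then
    propagate samples of P(h^l | h^{l-1}) upward as data for the next RBM."""
    params, data = [], v
    for H in layer_sizes:
        a, b, W = train_rbm(data, H)
        params.append((a, b, W))
        ph = sigmoid(data @ W + b)             # eq. (10): P(h_j = 1 | v)
        data = (rng.random(ph.shape) < ph).astype(float)
    return params

# Toy usage: binary data with 8 visible units, three hidden layers
params = train_dbn(rng.random((20, 8)).round(), [6, 4, 3])
```

Each `(a, b, W)` tuple becomes one layer of the DBN; only the top pair of layers retains the hidden biases `b` as an undirected RBM, matching the structure of (6).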
During training, the F0 and spectral parameters are extracted from the waveforms in the training set. Then a set of context-dependent HMMs is estimated to maximize the likelihood function of the training acoustic features. Here $\mathbf{o} = [\mathbf{o}_1^\top, \mathbf{o}_2^\top, \ldots, \mathbf{o}_T^\top]^\top$ is the observation feature sequence and $T$ is the length of the sequence. The observation feature vector $\mathbf{o}_t \in \mathbb{R}^{3D}$ for the $t$-th frame typically consists of static acoustic parameters $\mathbf{c}_t \in \mathbb{R}^D$ and their delta and acceleration components as

$$\mathbf{o}_t = \left[\mathbf{c}_t^\top, \Delta\mathbf{c}_t^\top, \Delta^2\mathbf{c}_t^\top\right]^\top, \quad (11)$$

where $D$ is the dimension of the static component; the dynamic components are commonly calculated as

$$\Delta\mathbf{c}_t = 0.5\,\mathbf{c}_{t+1} - 0.5\,\mathbf{c}_{t-1}, \quad t \in [2, T-1], \quad (12)$$
$$\Delta\mathbf{c}_1 = \Delta\mathbf{c}_2, \quad \Delta\mathbf{c}_T = \Delta\mathbf{c}_{T-1}, \quad (13)$$

and

$$\Delta^2\mathbf{c}_t = \mathbf{c}_{t+1} - 2\mathbf{c}_t + \mathbf{c}_{t-1}, \quad t \in [2, T-1], \quad (14)$$
$$\Delta^2\mathbf{c}_1 = \Delta^2\mathbf{c}_2, \quad \Delta^2\mathbf{c}_T = \Delta^2\mathbf{c}_{T-1}. \quad (15)$$

Therefore, the complete feature sequence $\mathbf{o}$ can be considered a linear transform of the static feature sequence $\mathbf{c} = [\mathbf{c}_1^\top, \mathbf{c}_2^\top, \ldots, \mathbf{c}_T^\top]^\top$ as

$$\mathbf{o} = \mathbf{M}\mathbf{c}, \quad (16)$$

where $\mathbf{M} \in \mathbb{R}^{3TD \times TD}$ is determined by the delta and acceleration calculation functions in (12)-(15) [4]. A multi-space probability distribution (MSD) [35] is applied to incorporate a distribution for F0 into the probabilistic framework of the HMM, considering that F0 is only defined for voiced speech frames. In order to deal with the data-sparsity problem of context-dependent model training with extensive context features, a decision-tree-based model clustering technique that uses a minimum description length (MDL) criterion [36] to guide the tree construction is adopted after initial training of the context-dependent HMMs. Next, a state alignment is conducted using the trained HMMs to train context-dependent state duration probabilities [2] for state duration prediction. A single-mixture Gaussian distribution is used to model the duration probability of each state. A decision-tree-based model clustering technique is similarly applied to these duration distributions.

At the synthesis stage, the maximum output probability parameter generation algorithm is used to generate the acoustic parameters [4]. The result of front-end linguistic analysis of the input text is used to determine the sentence HMM $\lambda$. The state sequence $\mathbf{q} = \{q_1, q_2, \ldots, q_T\}$ is predicted using the trained state duration probabilities [2]. Then, the sequence of speech features is predicted by maximizing $P(\mathbf{o} \mid \lambda, \mathbf{q})$. Considering the constraints between the static and dynamic features in (16), the parameter generation criterion can be rewritten as

$$\mathbf{c}^* = \arg\max_{\mathbf{c}} P(\mathbf{M}\mathbf{c} \mid \lambda, \mathbf{q}), \quad (17)$$

where $\mathbf{c}^*$ is the generated static feature sequence.
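The delta and acceleration computations of (12)-(15) can be sketched directly in NumPy; this is a minimal illustration, with shapes and data chosen only for the example:

```python
import numpy as np

def append_dynamics(c):
    """Append delta and acceleration components to a static feature
    sequence c of shape (T, D), following eqs. (12)-(15)."""
    T = c.shape[0]
    delta = np.empty_like(c)
    delta[1:T-1] = 0.5 * (c[2:] - c[:-2])          # eq. (12)
    delta[0], delta[T-1] = delta[1], delta[T-2]    # eq. (13)
    accel = np.empty_like(c)
    accel[1:T-1] = c[2:] - 2 * c[1:T-1] + c[:-2]   # eq. (14)
    accel[0], accel[T-1] = accel[1], accel[T-2]    # eq. (15)
    return np.hstack([c, delta, accel])            # o_t = [c, Δc, Δ²c]

# Toy usage: a linear ramp gives constant deltas and zero acceleration
c = np.arange(12.0).reshape(6, 2)   # T=6 frames, D=2
o = append_dynamics(c)              # shape (6, 6)
```

For a linearly increasing static sequence, every delta column equals the constant slope and every acceleration column is zero, which is a quick sanity check on the window coefficients.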
If the emission distribution of each HMM state is represented by a single Gaussian distribution, the closed-form solution to (17) can be derived. By setting

$$\frac{\partial P(\mathbf{M}\mathbf{c} \mid \mathbf{q}, \lambda)}{\partial \mathbf{c}} = \mathbf{0}, \quad (18)$$

we obtain

$$\mathbf{c}^* = \left(\mathbf{M}^\top \mathbf{U}_q^{-1} \mathbf{M}\right)^{-1} \mathbf{M}^\top \mathbf{U}_q^{-1} \mathbf{m}_q, \quad (19)$$

where $\mathbf{m}_q = [\boldsymbol{\mu}_{q_1}^\top, \ldots, \boldsymbol{\mu}_{q_T}^\top]^\top$ and $\mathbf{U}_q = \mathrm{diag}(\boldsymbol{\Sigma}_{q_1}, \ldots, \boldsymbol{\Sigma}_{q_T})$ are the mean vector and covariance matrix of the sentence as decided by the state sequence $\mathbf{q}$ [4].

B. Spectral envelope modeling and generation using RBMs and DBNs

In this paper, we improve the conventional spectral modeling method in HMM-based parametric speech synthesis in two aspects. First, the raw spectral envelopes extracted by the STRAIGHT vocoder are modeled directly, without further deriving high-level spectral parameters. Second, RBM and DBN models are adopted to replace the single Gaussian distribution at each HMM state. In order to simplify model training with high-dimensional spectral features, the decision trees for model clustering and the state alignment results are assumed to be given when the spectral envelopes are modeled. Thus, we can focus on comparing the performance of different models on the clustered model estimation.

Fig. 2. Flowchart of our proposed method. The modules in solid lines represent the procedures of the conventional HMM-based speech synthesis using high-level spectral parameters, where CD-HMM stands for Context-Dependent HMM. The modules in dashed lines describe the add-on procedures of our proposed method for modeling the spectral envelopes using RBMs or DBNs.
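The closed-form generation of (19) can be illustrated by explicitly building $\mathbf{M}$ and solving the resulting linear system. The dense NumPy sketch below is a toy example for readability only; practical implementations exploit the band structure of $\mathbf{M}^\top \mathbf{U}_q^{-1} \mathbf{M}$ rather than forming dense matrices:

```python
import numpy as np

def build_M(T, D):
    """Build the (3TD x TD) matrix M mapping the static sequence c to
    o = Mc with per-frame blocks [c_t; delta; accel], eqs. (12)-(15)."""
    I = np.eye(D)
    S = np.eye(T)                            # static rows
    Dl = np.zeros((T, T))                    # delta window rows
    A = np.zeros((T, T))                     # acceleration window rows
    for t in range(1, T - 1):
        Dl[t, t - 1], Dl[t, t + 1] = -0.5, 0.5
        A[t, t - 1], A[t, t], A[t, t + 1] = 1.0, -2.0, 1.0
    Dl[0], Dl[T - 1] = Dl[1], Dl[T - 2]      # boundary copies, eq. (13)
    A[0], A[T - 1] = A[1], A[T - 2]          # boundary copies, eq. (15)
    M = np.zeros((3 * T * D, T * D))
    for t in range(T):                       # interleave rows frame by frame
        M[3*t*D:(3*t+1)*D] = np.kron(S[t], I)
        M[(3*t+1)*D:(3*t+2)*D] = np.kron(Dl[t], I)
        M[(3*t+2)*D:(3*t+3)*D] = np.kron(A[t], I)
    return M

def mlpg(means, variances):
    """Closed-form parameter generation, eq. (19):
    c* = (M^T U^-1 M)^-1 M^T U^-1 m_q.
    means, variances : (T, 3D) per-frame diagonal Gaussian statistics."""
    T, threeD = means.shape
    D = threeD // 3
    M = build_M(T, D)
    Uinv = np.diag(1.0 / variances.reshape(-1))
    A = M.T @ Uinv @ M
    return np.linalg.solve(A, M.T @ Uinv @ means.reshape(-1)).reshape(T, D)

# Toy usage: static means rise linearly, dynamic means are zero
T, D = 5, 1
means = np.zeros((T, 3 * D))
means[:, 0] = [0.0, 1.0, 2.0, 3.0, 4.0]
c_star = mlpg(means, np.ones((T, 3 * D)))
```

The generated trajectory is a compromise between the static means and the zero-mean dynamic constraints, which is exactly the smoothing effect that the constraints of the dynamic features impose.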
In the current implementation, the conventional context-dependent model training using high-level spectral parameters and single Gaussian state PDFs is conducted first to obtain the model clustering and state alignment results. The flowchart of our proposed method is shown in Fig. 2. During the acoustic feature extraction using the STRAIGHT vocoder, the original linear-frequency spectral envelopes² are stored alongside the spectral parameters. The context-dependent HMMs for the conventional spectral parameters and F0 features are first estimated according to the method introduced in Section III-A. A single Gaussian distribution is used to model the spectral parameters at each HMM state. Next, a state alignment of the acoustic features is performed. The state boundaries are used to gather the spectral envelopes for each clustered context-dependent state. Similar to the high-level spectral parameters, the feature vector of the spectral envelope at each frame consists of static, velocity, and acceleration components as in (11)-(15). Then, an RBM or a DBN is estimated under the maximum likelihood criterion for each state according to the methods introduced in Section II. The model estimation of the RBMs or DBNs is conducted only once, using the fixed state boundaries. Finally, the context-dependent RBM-HMMs or DBN-HMMs can be constructed for modeling the spectral envelopes.

At synthesis time, the same criterion as in (17) is followed to generate the spectral envelopes. The optimal sequence of spectral envelopes is estimated by maximizing the output probability from the RBM-HMM or DBN-HMM of the input sentence. When single Gaussian distributions are adopted as the state PDFs, there is a closed-form solution, shown in (19), to this maximum output probability parameter generation with the constraints of dynamic features once the state sequence has been determined [4]. However, the marginal distribution defined in (4) for an RBM or in (9) for a DBN is much more complex than a single Gaussian, which makes the closed-form solution impractical. Therefore, a Gaussian approximation is applied before parameter generation to simplify the problem. For each HMM state, a Gaussian distribution $\mathcal{N}(\mathbf{v}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is constructed, where $\mathbf{v}$ is the spectral envelope feature vector containing static, velocity, and acceleration components;

$$\boldsymbol{\mu} = \arg\max_{\mathbf{v}} \log P(\mathbf{v}) \quad (20)$$

is the mode estimated for each RBM or DBN, with $P(\mathbf{v})$ defined as in (4) or (9); and $\boldsymbol{\Sigma}$ is a diagonal covariance matrix estimated by calculating the sample covariances of the training samples of the state. These Gaussian distributions are used to replace the RBMs or DBNs as the state PDFs at synthesis time. Therefore, the conventional parameter generation algorithm with the constraints of dynamic features can be followed to predict the spectral envelopes by solving the group of linear equations (19). By incorporating the dynamic features of the spectral envelopes during model training and parameter generation, temporally smooth spectral trajectories can be generated at synthesis time.

² The mel-frequency spectral envelopes could also be used here to represent speech perception properties. In this paper, we adopt the linear-frequency spectral envelope because it is the most original description of the vocal tract characteristics, without any prior knowledge of or assumptions about spectral parameterization and speech perception.
The detailed algorithms for the mode estimation in (20), for an RBM and for a DBN model, are introduced in the following subsections.

C. Estimating the RBM mode

Here, we consider the RBM of the Gaussian-Bernoulli form because the spectral envelope features are real-valued. Given the estimated model parameters $\{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$ of an RBM, the probability density function (4) over the visible vector $\mathbf{v}$ can be further calculated as³

$$
\begin{aligned}
P(\mathbf{v}) &= \frac{1}{Z} \sum_{\mathbf{h}} \exp\left(-E(\mathbf{v}, \mathbf{h})\right) \\
&= \frac{1}{Z} \sum_{\mathbf{h}} \exp\Big(-\sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2} + \mathbf{b}^\top\mathbf{h} + \mathbf{v}^\top\mathbf{W}\mathbf{h}\Big) \\
&= \frac{1}{Z} \exp\Big(-\sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2}\Big) \prod_{j=1}^{H} \sum_{h_j \in \{0,1\}} \exp\big((b_j + \mathbf{v}^\top\mathbf{w}_j) h_j\big) \\
&= \frac{1}{Z} \exp\Big(-\sum_{i=1}^{V} \frac{(v_i - a_i)^2}{2}\Big) \prod_{j=1}^{H} \big(1 + \exp(b_j + \mathbf{v}^\top\mathbf{w}_j)\big),
\end{aligned} \quad (21)
$$

where $\mathbf{w}_j$ denotes the $j$-th column of matrix $\mathbf{W}$. Because there is no closed-form solution to (20) for an RBM, the gradient ascent algorithm is adopted here, i.e.,

$$\mathbf{v}^{(i+1)} = \mathbf{v}^{(i)} + \alpha \left.\frac{\partial \log P(\mathbf{v})}{\partial \mathbf{v}}\right|_{\mathbf{v} = \mathbf{v}^{(i)}}, \quad (22)$$

where $i$ denotes the iteration number, $\alpha$ is the step size, and

$$\frac{\partial \log P(\mathbf{v})}{\partial \mathbf{v}} = -(\mathbf{v} - \mathbf{a}) + \sum_{j=1}^{H} \frac{\exp(b_j + \mathbf{v}^\top\mathbf{w}_j)}{1 + \exp(b_j + \mathbf{v}^\top\mathbf{w}_j)} \mathbf{w}_j. \quad (23)$$

Thus, the estimated mode of the RBM model is determined by a non-linear transform of the model parameters $\{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$. In contrast to the single Gaussian distribution, this mode is no longer the Gaussian mean that is estimated by averaging the corresponding training vectors under the maximum likelihood criterion. Because the likelihood of an RBM is multimodal, the gradient ascent optimization in (22) only leads to a local maximum, and the result is sensitive to the initialization of $\mathbf{v}^{(0)}$. In order to find a representative $\mathbf{v}^{(0)}$, we first calculate the means of the conditional distributions $P(\mathbf{h} \mid \mathbf{v})$ for all training vectors $\mathbf{v}$. These means are averaged and binarized using a fixed threshold of 0.5 to get $\mathbf{h}^{(0)}$. Then, the initial $\mathbf{v}^{(0)}$ for the iterative updating in (22) is set to the mean of $P(\mathbf{v} \mid \mathbf{h}^{(0)})$.

D. Estimating the DBN mode

Estimating the mode of a DBN model is more complex than dealing with an RBM.
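Before turning to the DBN case, the RBM mode search of Section III-C, i.e., the gradient ascent of (22)-(23) with the initialization described above, can be sketched as follows. This is a minimal NumPy illustration with assumed toy parameters, not the settings used in the experiments:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def rbm_mode(a, b, W, v0, step=0.01, iters=500):
    """Gradient ascent on log P(v) for a Gaussian-Bernoulli RBM
    (sigma_i = 1), following eqs. (22)-(23)."""
    v = v0.copy()
    for _ in range(iters):
        grad = -(v - a) + W @ sigmoid(b + W.T @ v)   # eq. (23)
        v = v + step * grad                          # eq. (22)
    return v

def initial_v(a, b, W, train_v):
    """Initialization of Section III-C: average the hidden posterior
    means over the training vectors, threshold at 0.5 to get h(0),
    then take the conditional mean of P(v | h(0))."""
    ph = sigmoid(train_v @ W + b).mean(axis=0)
    h0 = (ph > 0.5).astype(float)
    return W @ h0 + a          # mean of the Gaussian P(v | h(0))

# Toy usage with random parameters and "training" vectors
rng = np.random.default_rng(2)
V, H = 4, 3
a = rng.standard_normal(V)
b = rng.standard_normal(H)
W = 0.1 * rng.standard_normal((V, H))
train_v = rng.standard_normal((10, V))
mode = rbm_mode(a, b, W, initial_v(a, b, W, train_v))
```

With small weights the objective is close to the Gaussian quadratic term, so the ascent converges quickly; with larger weights the multimodality noted in the text makes the result depend on the initialization.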
The marginal distribution of the visible variables in (9) can be rewritten as

$$P(\mathbf{v}) = \sum_{\mathbf{h}^1} \cdots \sum_{\mathbf{h}^L} P(\mathbf{v}, \mathbf{h}^1, \ldots, \mathbf{h}^L) \quad (24)$$
$$= \sum_{\mathbf{h}^1} P(\mathbf{v} \mid \mathbf{h}^1) P(\mathbf{h}^1), \quad (25)$$

where $P(\mathbf{v} \mid \mathbf{h}^1)$ is described in (7) and $P(\mathbf{h}^1)$ can be calculated by applying

$$P(\mathbf{h}^{l-1}) = \sum_{\mathbf{h}^l} P(\mathbf{h}^{l-1} \mid \mathbf{h}^l) P(\mathbf{h}^l) \quad (26)$$

³ The variance parameters $\sigma_i^2$ in (5) are fixed to 1 to simplify the notation.

recursively for each $l$ from $L-1$ down to 2. The conditional distribution $P(\mathbf{h}^{l-1} \mid \mathbf{h}^l)$ is represented by (8), and $P(\mathbf{h}^{L-1})$ is the marginal distribution (4) of the RBM representing the top two hidden layers. Similar to the RBM mode estimation, the gradient ascent algorithm could be applied to optimize (25) once the values of $P(\mathbf{h}^1)$ are determined for all possible $\mathbf{h}^1$. However, this leads to an exponential complexity with respect to the number of hidden units at each hidden layer due to the summations in (25) and (26). Thus, such optimization becomes impractical unless the number of hidden units is reasonably small. In order to obtain a practical solution to the DBN mode estimation, an approximation is made to (24) in this paper. The summation over all possible values of the hidden units is simplified by considering only the optimal hidden vectors at each layer, i.e.,

$$P(\mathbf{v}) \approx P(\mathbf{v}, \hat{\mathbf{h}}^1, \ldots, \hat{\mathbf{h}}^L), \quad (27)$$

where

$$\{\hat{\mathbf{h}}^{L-1}, \hat{\mathbf{h}}^L\} = \arg\max_{\{\mathbf{h}^{L-1}, \mathbf{h}^L\}} \log P(\mathbf{h}^{L-1}, \mathbf{h}^L) \quad (28)$$

and, for each $l \in \{L-1, \ldots, 3, 2\}$,

$$\hat{\mathbf{h}}^{l-1} = \arg\max_{\mathbf{h}^{l-1}} \log P(\mathbf{h}^{l-1} \mid \hat{\mathbf{h}}^l). \quad (29)$$

The joint distribution $P(\mathbf{h}^{L-1}, \mathbf{h}^L)$ in (28) is modeled by a Bernoulli-Bernoulli RBM according to the definition of the DBN in Section II-B. Because $\mathbf{h}^{L-1}$ and $\mathbf{h}^L$ are both binary stochastic vectors, the iterated conditional modes (ICM) algorithm [37] is adopted to solve (28). This algorithm determines the configuration that maximizes the joint probability of a Markov random field by iteratively maximizing the probability of each variable conditioned on the rest. Applying the ICM algorithm here, we simply update $\mathbf{h}^L$ by maximizing $P(\mathbf{h}^L \mid \mathbf{h}^{L-1})$ and update $\mathbf{h}^{L-1}$ by maximizing $P(\mathbf{h}^{L-1} \mid \mathbf{h}^L)$ iteratively. Both of these conditional distributions are multivariate Bernoulli distributions without cross-dimension correlation [20]. The optimal configuration at each step can be determined simply by applying a threshold of 0.5 to each binary unit. The initial $\mathbf{h}^{L-1}$ of the iterative updating is set to the $\hat{\mathbf{h}}^{L-1}$ obtained by solving (28) for the DBN with $L-1$ hidden layers.
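The ICM step for the top-layer Bernoulli-Bernoulli RBM can be sketched as follows. This is a toy NumPy illustration under assumed shapes (here `WL` is taken as an $H_{L-1} \times H_L$ matrix, `aL` and `bL` as the biases of $\mathbf{h}^{L-1}$ and $\mathbf{h}^L$); the initialization shown is random rather than the layer-wise warm start described in the text:

```python
import numpy as np

def icm_top_rbm(aL, bL, WL, h_init, iters=50):
    """Iterated conditional modes for the top Bernoulli-Bernoulli RBM,
    solving eq. (28): alternately maximize P(h^L | h^{L-1}) and
    P(h^{L-1} | h^L) by thresholding each sigmoid at 0.5."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h_lo = h_init.copy()                                  # h^{L-1}
    for _ in range(iters):
        h_hi = (sigmoid(bL + WL.T @ h_lo) > 0.5).astype(float)    # update h^L
        h_lo_new = (sigmoid(aL + WL @ h_hi) > 0.5).astype(float)  # update h^{L-1}
        if np.array_equal(h_lo_new, h_lo):
            break                                         # reached a fixed point
        h_lo = h_lo_new
    return h_lo, h_hi

# Toy usage with random parameters and a random initial h^{L-1}
rng = np.random.default_rng(3)
WL = rng.standard_normal((5, 4))
h_lo, h_hi = icm_top_rbm(np.zeros(5), np.zeros(4), WL,
                         (rng.random(5) > 0.5).astype(float))
```

Because each conditional is a product of independent Bernoullis, thresholding each unit's posterior at 0.5 is exactly the conditional mode, so every iteration cannot decrease the joint probability and the loop terminates at a local maximum.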
For each l from L-1 to 2, (29) can be solved recursively according to the conditional distribution in (8). After {ĥ^1, ..., ĥ^L} are determined, the mode of the DBN can be estimated by substituting (27) into (20). Considering that P(v | h^1) is a Gaussian distribution as in (7), we have

μ ≈ argmax_v log P(v, ĥ^1, ..., ĥ^L) = argmax_v log P(v | ĥ^1) = W^1 ĥ^1 + a^1.  (30)

IV. EXPERIMENTS

A. Experimental conditions

A 1-hour Chinese speech database produced by a professional female speaker was used in our experiments. It consisted of 1,000 sentences together with their segmental and prosodic labels. 800 sentences were selected randomly for training and the remaining 200 sentences were used as a test set. The waveforms were recorded in 16 kHz/16 bit format. When constructing the baseline system, 41-order mel-cepstra (including the 0-th coefficient for frame power) were derived from the spectral envelope by STRAIGHT analysis at a 5 ms frame shift. The F0 and spectral features consisted of static, velocity, and acceleration components. A 5-state left-to-right HMM structure with no skips was adopted to train the context-dependent phone models. The covariance matrix of the single Gaussian distribution at each HMM state was set to be diagonal. After the decision-tree-based model clustering, we obtained 1,612 context-dependent states in total for the mel-cepstral stream. The model parameters of these states were estimated by maximum likelihood training.

Fig. 3. The cumulative probability curve for the number of frames belonging to each context-dependent state. The arrows indicate the numbers of frames of the three example states used for the analysis in Section IV.B and Fig. 4.
In the spectral envelope modeling, the FFT length of the STRAIGHT analysis was set to 1024, which led to 513 × 3 = 1539 visible units in the RBMs and DBNs, corresponding to the spectral amplitudes within the frequency range [0, π] together with their dynamic components. After the HMMs for the mel-cepstral and F0 features were trained, a state alignment was conducted on the training and test sets to assign frames to each state for spectral envelope modeling and testing. The cumulative probability curve for the number of frames belonging to each context-dependent state is illustrated in Fig. 3. From this figure, we can see that the numbers of training samples vary greatly among different states. For each context-dependent state, the logarithmized spectral amplitudes at each frequency point were normalized to zero mean and unit variance. CD learning with 1-step Gibbs sampling (CD1) was adopted for the RBM training, and the learning rate was 0.0001. The batch size was set to 10 and 200 epochs were executed for estimating each RBM. The DBNs were estimated following the greedy layer-by-layer training algorithm introduced in Section II.B.
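A single CD1 parameter update of the kind described above can be sketched as follows; this is a hedged illustration assuming a Gaussian-Bernoulli RBM with unit-variance visibles, with variable names of our own choosing, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def cd1_update(W, a, b, v0, lr=1e-4):
    """One CD-1 step for a Gaussian-Bernoulli RBM (unit-variance visibles).

    v0 : (batch, n_vis) mini-batch of normalized spectral envelope frames
    W  : (n_vis, n_hid) weights; a, b : visible / hidden biases
    lr : learning rate (the text uses 0.0001 with batch size 10)
    """
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # Positive phase: hidden probabilities and a binary sample.
    ph0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # One Gibbs step: reconstruct visibles with the Gaussian mean,
    # then recompute the hidden probabilities.
    v1 = a + h0 @ W.T
    ph1 = sigmoid(b + v1 @ W)
    # Contrastive-divergence gradient estimates, averaged over the batch.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b
```

Training an RBM then amounts to looping this update over shuffled mini-batches for the stated 200 epochs.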

Fig. 4. The average log-probabilities on the training and test sets when modeling (a) the mel-cepstra and (b) the spectral envelopes of state A (left column), state B (middle column), and state C (right column) using different models. The numbers of training samples belonging to these three selected states are indicated in Fig. 3.

B. Comparison between GMM and RBM as state PDFs

At first we compared the performance of the GMM and the RBM in modeling the distribution of mel-cepstra and spectral envelopes for an HMM state. Three representative states were selected for this experiment, which have 270, 650, and 1530 training frames and 60, 130, and 410 test frames respectively. As shown in Fig. 3, the numbers of training frames of these three states correspond to the 0.1, 0.5, and 0.9 cumulative probabilities calculated over the numbers of training frames of all 1,612 context-dependent states. GMMs and RBMs were trained under the maximum likelihood criterion to model these three states.
The covariance matrices in the GMMs were set to be diagonal, and the number of Gaussian mixtures varied from 1 to 64. The number of hidden units in the RBMs varied from 1 to 1,000. The average log-probabilities on the training and test sets for the different models and states are shown in Fig. 4 for the mel-cepstra and the spectral envelopes respectively. Examining the difference between the training and test log-probabilities for both the mel-cepstra and the spectral envelopes, we see that the GMMs have a clear tendency to over-fit as model complexity increases. This over-fitting effect becomes less significant when a larger training set is available. On the other hand, the RBM shows consistently good generalization ability as the number of hidden units increases. This can be attributed to the binary hidden units, which create an information bottleneck and act as an effective regularizer during model training. The differences between the test log-probabilities of the best GMM or RBM models and the single Gaussian distributions for the three states are listed in Table I. From Fig. 4 and Table I, we can see that the model accuracy improvements obtained by using density models more complex than a single Gaussian distribution are relatively small when the mel-cepstra are used for spectral modeling. Once the spectral envelopes are used, such improvements become much more significant for both the GMM and RBM models. Besides, the RBM also gives much higher log-probability to the test data than the GMM when modeling the spectral envelopes. These results can be attributed to the fact that mel-cepstral analysis is a kind of decorrelation processing applied to the spectra. A GMM with multiple components is able to describe the inter-dimensional correlations of a multivariate distribution to some extent even if diagonal covariance matrices are used. An RBM with H hidden units can be considered as a GMM with 2^H structured mixture components according to (21).
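The mixture view above can be verified directly for small H by enumerating all 2^H hidden configurations; a sketch assuming unit-variance visible units (function names are ours, not from the paper):

```python
import itertools
import numpy as np

def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def rbm_exact_logprob(v, W, a, b):
    """Exact log p(v) of a Gaussian-Bernoulli RBM (unit-variance visibles)
    by enumeration: p(v) is a mixture of 2^H Gaussians N(a + W h, I) with
    weights proportional to exp(b.h + ||a + W h||^2 / 2).
    Feasible only for small H."""
    n_vis, n_hid = W.shape
    log_w, log_pdf = [], []
    for bits in itertools.product([0.0, 1.0], repeat=n_hid):
        h = np.array(bits)
        mu = a + W @ h
        log_w.append(b @ h + 0.5 * (mu @ mu))  # unnormalized mixture weight
        log_pdf.append(-0.5 * np.sum((v - mu) ** 2)
                       - 0.5 * n_vis * np.log(2.0 * np.pi))
    log_w = np.array(log_w)
    return logsumexp(log_w + np.array(log_pdf)) - logsumexp(log_w)
```

With W = 0 the mixture collapses to a single Gaussian N(a, I), which gives a quick correctness check.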
Therefore, it is good at analyzing the latent patterns embedded in high-dimensional raw data with inter-dimensional correlations. Fig. 5 shows the estimated weight matrices W in the RBMs when modeling the spectral envelopes for the three states. We can see that the weight matrices are somewhat sparse, indicating that each hidden unit tries to capture the characteristics of the spectral envelope in some specific frequency bands. This is similar to a frequency analysis for spectral envelopes which makes use of the amplitudes of the critical frequency bands context-dependently. For the spectral envelope modeling, we can further improve the model accuracy by training RBMs layer-by-layer and constructing a DBN. The average log-probabilities on the training and test sets when modeling the spectral envelopes using an RBM and a two-hidden-layer DBN are compared in Table II. Here, the lower bound estimate [31] of the log-probability of a DBN is adopted.

TABLE I
THE DIFFERENCES BETWEEN THE TEST LOG-PROBABILITIES OF THE BEST GMM OR RBM MODELS AND THE SINGLE GAUSSIAN DISTRIBUTIONS FOR THE THREE SELECTED STATES. THE NUMBERS IN BRACKETS INDICATE THE NUMBERS OF GAUSSIAN MIXTURES FOR THE GMMS AND THE NUMBERS OF HIDDEN UNITS FOR THE RBMS WHICH LEAD TO THE HIGHEST LOG-PROBABILITIES ON THE TEST SET.

            mel-cepstra              spectral envelope
State   GMM         RBM          GMM          RBM
A       2.40 (4)    9.32 (200)   321.84 (8)   421.67 (1000)
B       4.80 (8)    7.73 (200)   221.66 (8)   410.06 (1000)
C       1.80 (8)    3.14 (1000)  237.07 (32)  332.19 (1000)

TABLE III
SUMMARY OF THE DIFFERENT SYSTEMS CONSTRUCTED IN THE EXPERIMENTS.

System          Spectral Features   State PDF
Baseline        mel-cepstra         single Gaussian
GMM(1)          spectral envelope   single Gaussian
GMM(8)          spectral envelope   GMM, 8 mixtures
RBM(10)         spectral envelope   RBM, 10 hidden units
RBM(50)         spectral envelope   RBM, 50 hidden units
DBN(50-50)      spectral envelope   2-hidden-layer DBN, 50 hidden units each layer
DBN(50-50-50)   spectral envelope   3-hidden-layer DBN, 50 hidden units each layer

TABLE IV
AVERAGE LOG-PROBABILITIES OF THE SAMPLE MEANS AND THE ESTIMATED MODES FOR THE FOUR RBM- OR DBN-BASED SYSTEMS.

System          Sample Means   PDF Modes
RBM(10)         -1652.1        -1488.0
RBM(50)         -1847.2        -1534.2
DBN(50-50)      -1604.5        -1430.2
DBN(50-50-50)   -1648.5        -1432.3

Fig. 5. Visualization of the estimated weight matrices W in the RBMs when modeling the spectral envelopes for the three states. The number of hidden units is 50 and only the first 513 rows of the weight matrices are drawn. Each column in the gray-scale figures corresponds to the weights connecting one hidden unit with the 513 visible units which compose the static component of the spectral envelope feature vector.
From this table, we can observe a monotonic increase of the test log-probabilities when more hidden layers are used.

TABLE II
THE AVERAGE LOG-PROBABILITIES ON THE TRAINING AND TEST SETS WHEN MODELING THE SPECTRAL ENVELOPES USING AN RBM OF 50 HIDDEN UNITS AND A TWO-HIDDEN-LAYER DBN OF 50 HIDDEN UNITS AT EACH LAYER.

        RBM(50)                  DBN(50-50)
State   train       test         train       test
A       -1968.267   -2133.704    -1862.665   -2033.919
B       -1930.347   -2025.088    -1852.420   -1943.159
C       -1970.269   -2006.260    -1837.336   -1875.578

C. System construction

Seven systems were constructed, whose performance we compared in our experiments. The definitions of these systems are given in Table III. As shown in Table I, the model accuracy improvement achieved by adopting distributions more complicated than the single Gaussian is not significant when the mel-cepstra are used as spectral features. Therefore, we focus on the performance of spectral envelope modeling using different forms of state PDFs in our experiments. Considering the computational complexity of training state PDFs for all context-dependent states, the maximum number of hidden units in the RBM and DBN models was set to 50. All these systems shared the same decision trees for model clustering and the same state boundaries, which were derived from the Baseline system. The F0 and duration models of the seven systems were identical.

D. Mode estimation for the RBMs and DBNs

When constructing the RBM(10), RBM(50), DBN(50-50), and DBN(50-50-50) systems, the mode of each RBM or DBN trained for a context-dependent state was estimated for Gaussian approximation, following the methods introduced in Sections III.C and III.D. For each system, the average log-probabilities of the estimated modes and the sample means were calculated. The results are listed in Table IV. From this table, we see that the estimated modes have much higher log-probabilities than the sample means, which are known to have the highest probability under a single Gaussian distribution.
This means that when the RBMs or the DBNs are adopted to represent the state PDFs, the feature vector with the highest output probability is no longer the sample mean. This implies the superiority of the RBM and DBN over the single Gaussian distribution in alleviating the over-smoothing problem during parameter generation under the maximum output probability criterion. The spectral envelopes recovered from the modes of different systems for one HMM state⁴ are illustrated in Fig. 6. Here, only the static components of the spectral envelope feature vectors are drawn. The mode of the GMM(1) system is simply the Gaussian mean vector. The mode of the GMM(8) system is approximated as the Gaussian mean of the mixture with the highest mixture weight. Comparing GMM(8) with GMM(1), we can see that using more Gaussian mixtures

⁴This state is not one of the three states used in Section IV.B.
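Comparing the log-probability of a candidate vector such as a mode or a sample mean does not require the intractable partition function, since it cancels in the comparison. A hedged sketch of this ranking for a Gaussian-Bernoulli RBM with unit-variance visibles (names are ours):

```python
import numpy as np

def neg_free_energy(v, W, a, b):
    """Unnormalized log-probability -F(v) of a Gaussian-Bernoulli RBM with
    unit-variance visible units:
        -F(v) = -||v - a||^2 / 2 + sum_j softplus(b_j + (v @ W)_j).
    The partition function cancels when two candidates (e.g. the estimated
    mode and the sample mean) are compared, so -F suffices for ranking."""
    x = b + v @ W
    softplus = np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))  # stable form
    return -0.5 * np.sum((v - a) ** 2) + np.sum(softplus)
```

A vector with larger -F has strictly higher probability under the same model, which is how the mode/mean comparison in Table IV can be read.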

Fig. 6. The spectral envelopes recovered from the modes of different systems for one HMM state.

can help alleviate the over-smoothing effect on the spectral envelope. Besides, the estimated state modes of the RBM- and DBN-based systems have sharper formant structures than the GMM-based ones. Comparing RBM(50) with RBM(10), we can see the advantage of using more hidden units in an RBM, while the differences among the estimated modes of the RBM(50), DBN(50-50), and DBN(50-50-50) systems are less significant. We will investigate the performance of these systems further in the following subjective evaluation.

E. Subjective evaluation

Because the mel-cepstrum extraction can be considered as a kind of linear transform of the logarithmized spectral envelope, the spectral envelope recovered from the mean of the mel-cepstra in a state is very close to the one recovered from the mean of the corresponding logarithmized spectral envelopes. Therefore, the Baseline and GMM(1) systems had very similar synthetic results, and the Baseline system was adopted as a representative of these two systems in the subjective evaluation to simplify the test design. For the GMM(8) system, the EM-based parameter generation algorithm [4] could be applied to predict the spectral envelope trajectories by iterative updating. In order to obtain a closed-form solution, we made a single Gaussian approximation to the GMMs at synthesis time by using only the Gaussian mixture with the highest mixture weight at each HMM state. The first subjective evaluation compared the Baseline, GMM(8), RBM(10), and RBM(50) systems. Fifteen sentences out of the training database were selected and synthesized using these four systems respectively.
Five groups of preference tests were conducted, each comparing two of the four systems, as shown in the rows of Table V. Each pair of synthetic sentences was evaluated in random order by five Chinese-native listeners.⁵ The listeners were asked to identify which sentence in each pair sounded more natural. Table V summarizes the preference scores among these four systems and the p-values given by t-tests. From this table, we can see that introducing density models more complex than the single Gaussian, such as the GMM and the RBM, to model the spectral envelopes at each HMM state achieves significantly better naturalness than the single-Gaussian-based methods. Compared with the GMM(8) system, the RBM(50) system has a much higher preference score in naturalness, which demonstrates the superiority of the RBM over the GMM in modeling the spectral envelope features. A comparison between the spectral envelopes generated by the Baseline system and the RBM(50) system is shown in Fig. 7. From this figure, we can observe the enhanced formant structures after modeling the spectral envelopes using RBMs. Besides, we can also find in Table V that the performance of the RBM-based systems is influenced by the number of hidden units used in the model definition, when comparing RBM(10) with RBM(50).

⁵Some examples of the synthetic speech produced by the seven systems listed in Table III can be found at http://staff.ustc.edu.cn/~zling/dbnsyn/demo.html.

TABLE V
SUBJECTIVE PREFERENCE SCORES (%) AMONG SPEECH SYNTHESIZED USING THE Baseline, GMM(8), RBM(10), AND RBM(50) SYSTEMS, WHERE N/P DENOTES "NO PREFERENCE" AND p IS THE p-VALUE OF THE t-TEST BETWEEN THE TWO COMPARED SYSTEMS.

Baseline   GMM(8)   RBM(10)   RBM(50)   N/P     p
18.67      48.00    -         -         33.33   0.0014
12.00      -        50.67     -         37.33   0.00
5.33       -        -         70.67     24.00   0.00
-          16.00    -         69.33     14.67   0.00
-          -        9.33      37.33     53.33   0.00

TABLE VI
SUBJECTIVE PREFERENCE SCORES (%) AMONG THE RBM(50), DBN(50-50), AND DBN(50-50-50) SYSTEMS.

RBM(50)   DBN(50-50)   DBN(50-50-50)   N/P     p
25.33     17.33        -               57.33   0.2919
38.67     -            21.33           40.00   0.0520
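Significance for preference counts of this kind can also be checked with an exact sign test on the raw votes ("no preference" trials discarded); the paper reports t-test p-values, so the following stdlib sketch is an illustrative alternative rather than the authors' procedure:

```python
from math import comb

def sign_test_p(n_a, n_b):
    """Two-sided exact sign (binomial) test for a paired preference test.

    n_a, n_b : counts of trials preferring system A and system B
    ("no preference" trials are excluded). Under H0 each remaining
    preference is a fair coin flip.
    """
    n = n_a + n_b
    k = max(n_a, n_b)
    # Probability of a result at least this extreme in one tail.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

For heavily one-sided counts this gives very small p-values, matching the qualitative picture in Table V.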
These results are consistent with the formant sharpness of the estimated modes of the different systems shown in Fig. 6. In order to investigate the effect of extending the RBM to a DBN with more hidden layers, another subjective evaluation was conducted among the RBM(50), DBN(50-50), and DBN(50-50-50) systems. Another fifteen sentences out of the training database were used, and two groups of preference tests were conducted by five Chinese-native listeners. The results are shown in Table VI. We can see that there are no significant differences among these three systems at the 0.05 significance level. Although we can improve the model accuracy by introducing more hidden layers, as shown in Table II, the naturalness of the synthetic speech is not improved correspondingly. One possible reason is the approximation we make in (27) when estimating the DBN mode.

F. Objective evaluation

Besides the subjective evaluation, we also calculated the spectral distortions on the test set between the spectral envelopes generated by the systems listed in Table III and the ones extracted from the natural recordings.

Fig. 7. The spectrograms of a segment of synthetic speech using a) the Baseline system and b) the RBM(50) system. These spectrograms are not calculated by STFT analysis on the synthetic waveforms. For the Baseline system, the spectrogram is drawn based on the spectral envelopes recovered from the generated mel-cepstra. For the RBM(50) system, the spectrogram is drawn based on the generated spectral envelopes directly.

The synthetic spectral envelopes used the state boundaries of the natural recordings to simplify the frame alignment. Then, the natural and synthetic spectral envelopes at each frame were normalized to the same power, and the spectral distortion between them was calculated following the method introduced in [38]. For the Baseline system, the generated mel-cepstra were converted to spectral envelopes before the calculation. The average spectral distortions of all the systems are listed in Table VII. We can see that the objective evaluation results are inconsistent with the subjective preference scores shown in Table V. For example, the RBM(50) system has significantly better naturalness than the Baseline system in the subjective evaluation, while its average spectral distortion is the highest. The reason is that the spectral distortion in [38] is a Euclidean distance between two logarithmized spectral envelopes, which treats each dimension of the spectral envelopes independently and equally. However, the advantage of our proposed method is to provide a better representation of the cross-dimension correlations for spectral envelope modeling, which cannot be reflected by this spectral distortion measurement.
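The distortion measure discussed here (power normalization followed by a Euclidean distance between logarithmized envelopes) can be sketched as follows; the exact definition in [38] may differ in detail, so this is an approximation of ours rather than the reference implementation:

```python
import numpy as np

def spectral_distortion_db(s_nat, s_syn):
    """Frame-level spectral distortion (dB) between a natural and a
    synthetic spectral envelope: each frame is normalized to the same
    power, then the RMS difference of the log-amplitude envelopes is
    taken. Envelope values must be strictly positive."""
    # Power-normalize each envelope to unit energy.
    s1 = s_nat / np.sqrt(np.sum(s_nat ** 2))
    s2 = s_syn / np.sqrt(np.sum(s_syn ** 2))
    diff = 20.0 * np.log10(s1) - 20.0 * np.log10(s2)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Because of the power normalization, a uniform gain difference between the two envelopes contributes nothing to the distortion, which is consistent with the per-frame power matching described in the text.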
Similar inconsistencies between subjective evaluation results and objective acoustic distortions for speech synthesis have been observed in [12], [39].

TABLE VII
AVERAGE SPECTRAL DISTORTIONS (SD) ON THE TEST SET BETWEEN THE SPECTRAL ENVELOPES GENERATED BY THE SYSTEMS LISTED IN TABLE III AND THE ONES EXTRACTED FROM THE NATURAL RECORDINGS.

System          ave. SD (dB)
Baseline        3.85
GMM(1)          3.77
GMM(8)          3.86
RBM(10)         3.89
RBM(50)         4.11
DBN(50-50)      4.10
DBN(50-50-50)   4.10

V. CONCLUSION AND FUTURE WORK

We have proposed an RBM- and DBN-based spectral envelope modeling method for statistical parametric speech synthesis in this paper. The spectral envelopes extracted by the STRAIGHT vocoder are modeled by an RBM or a DBN for each HMM state. At synthesis time, the mode vectors of the trained RBMs and DBNs are estimated and used in place of the Gaussian means for parameter generation. Our experimental results show the superiority of the RBM and DBN over the Gaussian mixture model in describing the distribution of spectral envelopes as density models and in mitigating the over-smoothing effect in the synthetic speech. As discussed in Section I, there are also some other approaches that can significantly reduce the over-smoothing and improve the quality of the synthetic speech, such as GV-based parameter generation [11] and post-filtering techniques [6], [13]. In this paper, we focus on the acoustic modeling to tackle the over-smoothing problem. It is worth investigating alternative parameter generation and post-filtering algorithms appropriate for our proposed spectral envelope modeling method in the future. This paper only makes a preliminary exploration of applying the ideas of deep learning to statistical parametric speech synthesis. There are still several issues in the current implementation that require further investigation. First, it is worth examining the system performance when the number of hidden units in the RBMs keeps increasing. As shown in Fig.
3, the training samples are distributed among the many context-dependent HMM states in a highly unbalanced manner. Thus, it may be difficult to optimize the model complexity for all states simultaneously. An alternative solution is to train the joint distribution between the observations and the context labels using a single network [20] estimated on all training samples. A similar approach for statistical parametric speech synthesis has been studied in [30], where the joint distribution between the tonal syllable ID and the spectral and excitation features is modeled using a multi-distribution DBN. Second, increasing the number of hidden layers in the DBNs did not achieve improvement in our subjective evaluation. A better algorithm to estimate the mode of a DBN with less approximation is necessary. We plan as future work to implement the Gaussian approximation according to (25) when the number of hidden units is reasonably small, and to compare its performance with our current implementation. Another strategy is to adopt sampled outputs rather than the model modes during parameter generation. As better density models, the RBM and DBN are more appropriate than the GMM for generating acoustic features by sampling, which may help make the synthetic speech less monotonic and boring. Third, in the work presented in this paper, the decision trees for model clustering are still constructed using mel-cepstra and single Gaussian state PDFs. Extending the RBM and DBN modeling from PDF estimation for the clustered states to model clustering for the fully context-dependent states will also be a task of our future work. Besides the spectral