STATISTICAL F0 PREDICTION FOR ELECTROLARYNGEAL SPEECH ENHANCEMENT CONSIDERING GENERATIVE PROCESS OF F0 CONTOURS WITHIN PRODUCT OF EXPERTS FRAMEWORK


Kou Tanaka 1, Hirokazu Kameoka 2, Tomoki Toda 3, and Satoshi Nakamura 1

1 Graduate School of Information Science, Nara Institute of Science and Technology, Japan
2 NTT Communication Science Laboratories, NTT Corporation, Japan
3 Information Technology Center, Nagoya University, Japan
{ko-t, s-nakamura}@is.naist.jp, kameoka.hirokazu@lab.ntt.co.jp, tomoki@icts.nagoya-u.ac.jp

(This work was supported in part by JSPS KAKENHI.)

ABSTRACT

We have previously proposed a statistical fundamental frequency (F0) prediction method that makes it possible to predict the underlying F0 contour of electrolaryngeal (EL) speech from its spectral feature sequence. Although this method was shown to contribute to improving the naturalness of EL speech as a whole, the predicted F0 contour was still unnatural compared with that of normal speech. One possible way to improve the naturalness of the predicted F0 contours is to take account of the physical mechanism of vocal phonation. Recently, a statistical model of voice F0 contours was formulated by constructing a stochastic counterpart of the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration. This paper proposes a Product-of-Experts model that incorporates this generative model of voice F0 contours into the statistical F0 prediction model. Based on the constructed model, we derive algorithms for parameter training and F0 prediction. Experimental results revealed that the proposed method outperformed our previously proposed method in terms of the naturalness of the predicted F0 contours.

Index Terms: Electrolaryngeal speech enhancement, F0 prediction, Generative model, Product of Experts

1. INTRODUCTION

Speech is one of the most common tools in human communication. Since speech is produced by the vocal apparatus, the produced sounds are physically constrained by the conditions of the human body. Unfortunately, many people have disabilities that prevent them from producing speech freely, leading to communication barriers. Among them are laryngectomees, who have undergone an operation to remove the larynx, including the vocal folds, for reasons such as injury or laryngeal cancer. The ability of these people to generate excitation sounds is severely impaired because they no longer have their vocal folds. One alternative means of producing voice for these patients is electrolaryngeal (EL) speech, which is produced using excitation signals mechanically generated by an electrolarynx. EL speech is reasonably intelligible but somewhat unnatural, particularly because of the mechanical sound of the excitation signals. To address this problem, we have previously proposed methods that aim to convert EL speech to normal-sounding speech by predicting the fundamental frequency (F0) contour from the spectral sequence of the EL speech with Gaussian mixture models (GMMs) and then synthesizing the speech waveform according to the predicted acoustic parameters [1-3]. These methods were shown to contribute to improving the naturalness of EL speech [1, 2] while preserving its intelligibility [3]. However, the F0 contours predicted using these methods still sounded unnatural compared with those of normal speech. This was because the predicted F0 contours were not necessarily guaranteed to satisfy the physical constraints of the actual control mechanism of the thyroid cartilage, even though they were optimal in a statistical sense. In this regard, these methods still had plenty of room for improvement.

One possible way to improve the naturalness of the F0 contours of the converted speech is to incorporate a generative model of voice F0 contours into the statistical F0 prediction model so as to take account of the physical mechanism of vocal phonation. One of the authors previously proposed a statistical model of voice F0 contours [4-6], formulated by constructing a stochastic counterpart of the Fujisaki model [7], a well-founded mathematical model representing the control mechanism of vocal fold vibration. The Fujisaki model [7] assumes that an F0 contour on a logarithmic scale is the superposition of a phrase component, an accent component, and a base value. The phrase and accent components are considered to be associated with mutually independent types of movement of the thyroid cartilage with different degrees of freedom and muscular reaction times. The model proposed in [4-6] makes it possible to estimate the underlying Fujisaki-model parameters that best explain a given F0 contour by using powerful statistical inference techniques.

To incorporate the generative F0 contour model into the statistical F0 prediction framework, this paper proposes a Product-of-Experts (PoE) model [8] that combines the two models described above. Since a PoE model is obtained by multiplying the densities of different models, it usually becomes complicated because of the renormalization term. To avoid this, we introduce the latent trajectory model proposed in [9] to reformulate the prediction model so that it can be combined smoothly with the generative F0 contour model.

2. GMM-BASED STATISTICAL F0 PREDICTION

We briefly review our statistical F0 prediction method [1-3], which exploits the idea of statistical voice conversion techniques [10, 11]. The aim of this method is to predict F0 contours from the spectral parameters of EL speech. As with voice conversion methods, it consists of a training process and a prediction process.

In the training process, the parameters λ_G of the joint probability density p((x[k]^T, o[k]^T)^T | λ_G), described as a Gaussian mixture model (GMM), are trained, where ^T denotes transposition and x[k] and o[k] denote a source feature and a target feature at time frame k, respectively. The corresponding joint feature vectors are obtained by performing automatic frame alignment with dynamic time warping. As the source feature, a spectral segment feature of EL speech is extracted on a frame-by-frame basis from the mel-cepstra at multiple frames around the current frame k [12]. The target feature o[k] = (y[k], Δy[k])^T consists of the static and delta (time-derivative) components of the log-scaled F0 value y[k] extracted on a frame-by-frame basis from the target normal speech. Note that, to improve prediction accuracy, we interpolate unvoiced frames of the F0 patterns by spline interpolation and remove microprosody [13].

In the prediction process, given the spectral segment sequence x = (x[1]^T, ..., x[K]^T)^T of EL speech, the most likely F0 sequence y = (y[1], ..., y[K])^T is obtained as

$$\hat{y} = \mathop{\mathrm{argmax}}_{y} \, p(o \mid x, \lambda_G) \quad \text{subject to} \quad o = W y, \tag{1}$$

where o = (o[1]^T, ..., o[K]^T)^T denotes the joint static and dynamic feature vector sequence and W is a constant matrix that transforms the static feature vector sequence y into o; each row of W consists of the coefficients of an identity mapping operator or a time-differential operator. p(o | x, λ_G) is the GMM with the trained parameters, which we approximate as

$$p(o \mid x, \lambda_G) = \sum_{m} p(o \mid x, m, \lambda_G)\, p(m \mid x, \lambda_G) \simeq p(o \mid x, \hat{m}, \lambda_G)\, p(\hat{m} \mid x, \lambda_G), \tag{2}$$

with m̂ = argmax_m p(m | x, λ_G). Here m = (m_1, ..., m_K) indicates a sequence of mixture indices, and p(o | x, m, λ_G) is given as the product over k of p(o[k] | x[k], m_k, λ_G) = N(o[k]; e^{(o|x)}_{m_k}, D^{(o|x)}_{m_k}), where e^{(o|x)}_{m_k} and D^{(o|x)}_{m_k} are the conditional mean vector and covariance matrix of the m_k-th mixture component, respectively. Thanks to this approximation, the solution to Eq. (1) is given explicitly as

$$\hat{y} = \big(W^{\mathsf T} D^{(o|x)\,-1}_{\hat{m}} W\big)^{-1} W^{\mathsf T} D^{(o|x)\,-1}_{\hat{m}}\, e^{(o|x)}_{\hat{m}}, \tag{3}$$

where e^{(o|x)}_{m̂} is a stacked vector of the mean vectors e^{(o|x)}_{m̂_1}, ..., e^{(o|x)}_{m̂_K} and D^{(o|x)}_{m̂} is a block-diagonal matrix whose blocks are D^{(o|x)}_{m̂_1}, ..., D^{(o|x)}_{m̂_K}.
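As a concrete illustration of Eq. (3), the following is a minimal numerical sketch of the closed-form trajectory solution, where the per-frame means and covariances stand in for the selected-component statistics e^{(o|x)}_{m̂_k} and D^{(o|x)}_{m̂_k}. The delta-window coefficients and the toy inputs are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np
from scipy.linalg import block_diag

def build_W(K, delta=(-0.5, 0.0, 0.5)):
    """Map a K-frame static sequence y to the 2K-dim joint static+delta
    sequence o = W y (rows alternate identity and time-difference operators)."""
    W = np.zeros((2 * K, K))
    for k in range(K):
        W[2 * k, k] = 1.0                        # static row
        for j, c in enumerate(delta, start=-1):  # delta row
            if 0 <= k + j < K:
                W[2 * k + 1, k + j] += c
    return W

def predict_trajectory(means, covs):
    """means: list of K (2,) vectors; covs: list of K (2, 2) matrices."""
    K = len(means)
    W = build_W(K)
    E = np.concatenate(means)                        # stacked mean vector
    D_inv = block_diag(*[np.linalg.inv(D) for D in covs])
    A = W.T @ D_inv @ W                              # W^T D^-1 W
    b = W.T @ D_inv @ E                              # W^T D^-1 E
    return np.linalg.solve(A, b)                     # y_hat as in Eq. (3)

# toy usage with random per-frame statistics
rng = np.random.default_rng(0)
K = 100
means = [rng.normal(5.0, 0.1, size=2) for _ in range(K)]
covs = [np.diag(rng.uniform(0.01, 0.05, size=2)) for _ in range(K)]
y_hat = predict_trajectory(means, covs)
```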

Fig. 1. Original Fujisaki model [7].

3. GENERATIVE MODEL OF VOICE F0 CONTOURS

The generative model of F0 contours proposed in [4-6] is a stochastic counterpart of a discrete-time version of the Fujisaki model [7]. The Fujisaki model (shown in Fig. 1) assumes that a log-scaled F0 contour y(t) is the superposition of a phrase component y_p(t), an accent component y_a(t), and a base value µ_b. The phrase and accent components are assumed to be the outputs of two different second-order critically damped filters excited with Dirac deltas u_p(t) (phrase commands) and rectangular pulses u_a(t) (accent commands), respectively. It must be noted that the phrase and accent commands do not usually overlap each other. The base value is a constant related to the lower bound of the speaker's F0, below which no regular vocal fold vibration can be maintained. The log F0 contour y(t) is thus expressed as

$$y(t) = y_p(t) + y_a(t) + \mu_b, \tag{4}$$

where

$$y_p(t) = g_p(t) * u_p(t), \tag{5}$$
$$y_a(t) = g_a(t) * u_a(t). \tag{6}$$

Here $*$ denotes convolution over time, and g_p(t) and g_a(t) are the impulse responses of the two second-order systems, which are known to be almost constant within an utterance as well as across utterances for a particular speaker.

A key idea in the model proposed in [4-6] is that the sequence of phrase and accent command pairs (i.e., the underlying parameters of the Fujisaki model) is modeled as a path-restricted hidden Markov model (HMM) with Gaussian emission densities (shown in Fig. 2), so that estimating the state transitions of the HMM directly amounts to estimating the Fujisaki-model parameters. We hereafter use k to indicate the discrete time index. Given a state sequence s = (s_1, ..., s_K) of this HMM, the conditional distributions of the phrase command sequence u_p = (u_p[1], ..., u_p[K])^T and the accent command sequence u_a = (u_a[1], ..., u_a[K])^T are given as

$$p(u_p \mid s, \lambda_F) = \mathcal{N}(u_p;\ \mu_p, \Sigma_p), \tag{7}$$
$$p(u_a \mid s, \lambda_F) = \mathcal{N}(u_a;\ \mu_a, \Sigma_a), \tag{8}$$

respectively, where λ_F denotes the parameters of the HMM, µ_p and µ_a denote the mean sequences of the state emission densities, and Σ_p and Σ_a are diagonal matrices whose diagonal elements are the variances of the state emission densities.

Fig. 2. Command function modeling with HMM.

From Eqs. (5) and (6), the relationships between y_p = (y_p[1], ..., y_p[K])^T and u_p and between y_a = (y_a[1], ..., y_a[K])^T and u_a can be written as

$$G_p u_p = y_p, \tag{9}$$
$$G_a u_a = y_a, \tag{10}$$

where G_p and G_a are Toeplitz matrices whose rows are shifted copies of the convolution kernels g_p[1], ..., g_p[K] and g_a[1], ..., g_a[K], respectively.
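To make the generative process of Eqs. (4)-(10) concrete, here is a small sketch that synthesizes a log-F0 contour from toy phrase and accent commands by convolving them with second-order critically damped impulse responses. The filter constants, the frame shift, and the command values below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def critically_damped_impulse(rate_hz, n_frames, frame_shift_s):
    """g[k] = rate^2 * t * exp(-rate * t), sampled at the frame shift."""
    t = np.arange(n_frames) * frame_shift_s
    return (rate_hz ** 2) * t * np.exp(-rate_hz * t) * frame_shift_s

def fujisaki_logf0(u_p, u_a, mu_b, alpha=3.0, beta=20.0, frame_shift_s=0.005):
    K = len(u_p)
    g_p = critically_damped_impulse(alpha, K, frame_shift_s)
    g_a = critically_damped_impulse(beta, K, frame_shift_s)
    y_p = np.convolve(u_p, g_p)[:K]   # Eq. (5): phrase component
    y_a = np.convolve(u_a, g_a)[:K]   # Eq. (6): accent component
    return y_p + y_a + mu_b           # Eq. (4): log-F0 contour

# toy commands: one phrase impulse and one rectangular accent pulse
K = 600                               # 3 s at a 5 ms frame shift
u_p = np.zeros(K); u_p[20] = 1.0      # Dirac-delta-like phrase command
u_a = np.zeros(K); u_a[100:200] = 0.4 # rectangular accent command
log_f0 = fujisaki_logf0(u_p, u_a, mu_b=np.log(120.0))
```

The discrete convolutions in this sketch play the role of the Toeplitz matrices G_p and G_a in Eqs. (9) and (10).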

Using u_b to denote the baseline component, the log F0 sequence is given as y = y_p + y_a + u_b + n, where n is an additive noise component corresponding to microprosody. If we assume that n follows a Gaussian distribution with mean 0 and covariance Γ, the conditional distribution of y given u = (u_p^T, u_a^T, u_b^T)^T is defined as

$$p(y \mid u) = \mathcal{N}(y;\ G_p u_p + G_a u_a + u_b,\ \Gamma). \tag{11}$$

We further assume that u_b follows a Gaussian distribution with mean µ_b 1 and covariance Σ_b. Then, from Eqs. (7), (8), and (11), the conditional distribution of y given s is given as

$$p(y \mid s, \lambda_F) = \int p(y \mid u)\, p(u \mid s, \lambda_F)\, du = \mathcal{N}(y;\ \mu_F, \Sigma_F), \tag{12}$$

where µ_F = G_p µ_p + G_a µ_a + µ_b 1 and Σ_F = G_p Σ_p G_p^T + G_a Σ_a G_a^T + Σ_b + Γ.

4. PROPOSED MODEL

4.1. Product-of-Experts Strategy

PoE [8] is a general technique for modeling a complicated distribution of data by combining relatively simple distributions (experts). Since the combined distribution is obtained by multiplying the densities of the experts, the way the experts are combined is somewhat similar to an "and" operation. In this section, we construct a PoE model by treating the two models introduced in the previous sections as the experts. Training a PoE model by maximizing the likelihood of the data is usually difficult, since it is hard even to approximate the derivatives of the renormalization term. By contrast, we propose a formulation that allows the use of the EM algorithm for both parameter training and F0 prediction. To this end, we first introduce the latent trajectory model proposed in [9] to reformulate the GMM-based statistical F0 prediction model, which plays a key role in making this possible.

4.2. Latent-Trajectory-GMM-Based F0 Prediction

We reformulate the GMM-based statistical F0 prediction model presented in Sec. 2 by employing the idea proposed in [9]. Instead of treating o as a function of y, we treat o as a latent variable to be marginalized out that is related to y through a soft constraint o ≈ Wy. This relationship can be expressed through the conditional distribution

$$p(y \mid o) \propto \exp\Big\{ -\tfrac{1}{2} (W y - o)^{\mathsf T} \Lambda (W y - o) \Big\} \tag{13}$$
$$= \mathcal{N}(y;\ H o, V), \tag{14}$$

where H = (W^T Λ W)^{-1} W^T Λ and V = (W^T Λ W)^{-1}, and Λ is a constant positive definite matrix that can be set arbitrarily. As in Sec. 2, the joint distribution p(x, o | λ_G) is modeled as a GMM; namely, given mixture indices m, the conditional distribution p(x, o | m, λ_G) is defined as a Gaussian distribution. Thus, the joint distribution of y, x, o, and m can be described using the distributions defined above,

$$p(y, x, o, m \mid \lambda_G) = p(y \mid o)\, p(x, o \mid m, \lambda_G)\, p(m \mid \lambda_G), \tag{15}$$

where p(m | λ_G) is the product of the mixture weights of the GMM. By marginalizing o and m out, we can readily obtain the joint distribution p(y, x | λ_G), which can be used as a criterion to train λ_G and to predict the optimal y in a consistent manner, unlike the method presented in Sec. 2. We use this model to construct our PoE model in the next subsection.

4.3. Deriving the PoE Model

In the same way as Eq. (15), we write the model presented in Sec. 3 in the form of a joint distribution

$$p(y, u, s \mid \lambda_F) = p(y \mid u)\, p(u \mid s, \lambda_F)\, p(s \mid \lambda_F), \tag{16}$$

where p(u | s, λ_F) is given as the product of the state emission densities and p(s | λ_F) as the product of the state transition probabilities along the state sequence s. We construct a PoE model by combining Eqs. (15) and (16) and then marginalizing, rather than by simply combining the marginal distributions p(y, x | λ_G) and p(y | λ_F), which would make the parameter training and F0 prediction problems excessively hard. To do so, we first combine the densities p(y | o) and p(y | u) to obtain p(y | o, u). Since both of these distributions are Gaussian, the product of their densities is obtained easily by completing the square of the exponent:

$$p(y \mid o, u) \propto \mathcal{N}(y;\ H o, V)\, \mathcal{N}(y;\ G u, \Gamma) = \mathcal{N}(y;\ \mu_{ou}, \Sigma_{ou}), \tag{17}$$
$$\mu_{ou} = (V^{-1} + \Gamma^{-1})^{-1} (V^{-1} H o + \Gamma^{-1} G u), \tag{18}$$
$$\Sigma_{ou} = (V^{-1} + \Gamma^{-1})^{-1}, \tag{19}$$

where G = [G_p, G_a, I] and u = (u_p^T, u_a^T, u_b^T)^T. From Eqs. (15), (16), and (17), the joint distribution of y, x, o, u, m, and s can be constructed as

$$p(y, x, o, u, m, s \mid \lambda_G, \lambda_F) = \underbrace{p(y \mid o, u)\, p(x, o \mid m, \lambda_G)\, p(u \mid s, \lambda_F)}_{p(y, x, o, u \mid m, s, \lambda_G, \lambda_F)}\ p(m \mid \lambda_G)\, p(s \mid \lambda_F). \tag{20}$$

This can be used as the complete-data likelihood for parameter training and F0 prediction, as explained later. By marginalizing o and u out, we can readily obtain the joint distribution p(y, x, m, s | λ_G, λ_F), which can be used as a criterion to train λ_G and λ_F and to predict the optimal y in a consistent manner. Since both p(x, o | m, λ_G) and p(u | s, λ_F) are Gaussian, let us write them as

$$p(x, o \mid m, \lambda_G) = \mathcal{N}\!\left( \begin{bmatrix} x \\ o \end{bmatrix};\ \begin{bmatrix} \mu_x \\ \mu_o \end{bmatrix},\ \begin{bmatrix} P_{xx} & P_{xo} \\ P_{ox} & P_{oo} \end{bmatrix}^{-1} \right), \tag{21}$$
$$p(u \mid s, \lambda_F) = \mathcal{N}(u;\ \mu_u, P_u^{-1}). \tag{22}$$

Then, from Eqs. (17), (21), and (22), completing the square of the exponent shows that p(y, x, o, u | m, s, λ_G, λ_F) is, as a function of (y, o, u), a Gaussian of the form

$$\mathcal{N}\!\left( \begin{bmatrix} y \\ o \\ u \end{bmatrix};\ \begin{bmatrix} A_{11} b_1 + A_{12} b_2 \\ A_{21} b_1 + A_{22} b_2 \end{bmatrix},\ \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \right), \tag{23}$$

where the blocks correspond to the partition of (y, o, u) into y and (o, u). The block covariance matrix in Eq. (23) is the inverse of a precision matrix (Eq. (24)) assembled from Σ_ou^{-1}, V^{-1}H, Γ^{-1}G and their transposes, the GMM precision blocks P_xx, P_xo, P_ox, P_oo, and the command precision P_u, and b = (b_1^T, b_2^T)^T is the corresponding potential vector (Eq. (25)) assembled from P_xx µ_x + P_xo µ_o, P_ox µ_x + P_oo µ_o, and P_u µ_u. Note that A_11, A_12, A_21, and A_22 can be written explicitly using the blockwise inversion formula.
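The PoE combination in Eqs. (17)-(19) is simply the precision-weighted product of two Gaussian densities over y. The following minimal sketch illustrates that "and" operation numerically; the dimensions and the random expert statistics are illustrative assumptions.

```python
import numpy as np

def product_of_gaussians(mean_1, cov_1, mean_2, cov_2):
    """Return (mu, Sigma) of N(y; mean_1, cov_1) * N(y; mean_2, cov_2),
    up to the renormalization constant."""
    prec_1 = np.linalg.inv(cov_1)
    prec_2 = np.linalg.inv(cov_2)
    cov = np.linalg.inv(prec_1 + prec_2)              # Eq. (19)
    mean = cov @ (prec_1 @ mean_1 + prec_2 @ mean_2)  # Eq. (18)
    return mean, cov

# toy usage: combine the two experts' views of a K-frame log-F0 contour
rng = np.random.default_rng(1)
K = 50
Ho = rng.normal(4.8, 0.05, size=K)      # mean of the latent-trajectory expert
Gu = rng.normal(4.8, 0.05, size=K)      # mean of the F0-contour-model expert
V = 0.02 * np.eye(K)
Gamma = 0.01 * np.eye(K)
mu_ou, Sigma_ou = product_of_gaussians(Ho, V, Gu, Gamma)
```

Because the combined precision is the sum of the experts' precisions, the resulting mean is pulled toward whichever expert is more confident, which is what lets the generative F0 contour model constrain the GMM-based prediction.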

4.4. Parameter Training and F0 Prediction

The problems of parameter training and F0 prediction can be formulated as the following optimization problems:

$$\{\hat{\lambda}_G, \hat{\lambda}_F, \hat{m}, \hat{s}\} = \mathop{\mathrm{argmax}}_{\lambda_G, \lambda_F, m, s} \, \log p(\tilde{y}, \tilde{x}, m, s \mid \lambda_G, \lambda_F), \tag{26}$$
$$\{\hat{y}, \hat{m}, \hat{s}\} = \mathop{\mathrm{argmax}}_{y, m, s} \, \log p(y, \tilde{x}, m, s \mid \hat{\lambda}_G, \hat{\lambda}_F), \tag{27}$$

where ỹ and x̃ denote the observed F0 contour extracted from normal speech and the observed spectral sequence extracted from EL speech, respectively. Both of these problems can be solved with the EM algorithm by treating o and u as latent variables. Owing to space limitations, here we only derive an algorithm for solving Eq. (27). The likelihood of m and s given the complete data {y, x̃, o, u} is given by Eq. (20). By taking the conditional expectation of log p(y, x̃, o, u | m, s, λ̂_G, λ̂_F) with respect to o and u given y = ỹ, m = m′, and s = s′, and then adding log p(m | λ̂_G) + log p(s | λ̂_F), we obtain the auxiliary function

$$Q(\theta, \theta') = \mathbb{E}_{o, u \mid \tilde{y}, m', s'}\big[ \log p(\tilde{y}, \tilde{x}, o, u \mid m, s, \hat{\lambda}_G, \hat{\lambda}_F) \big] + \log p(m \mid \hat{\lambda}_G) + \log p(s \mid \hat{\lambda}_F), \tag{28}$$

where θ = {m, s} and θ′ = {m′, s′}. From Eq. (23) we obtain

$$\mathbb{E}\!\left[ \begin{bmatrix} o \\ u \end{bmatrix} \Big|\ \tilde{y}, m', s' \right] = A_{21} b_1 + A_{22} b_2 + A_{21} A_{11}^{-1} \big( \tilde{y} - A_{11} b_1 - A_{12} b_2 \big) =: \begin{bmatrix} \bar{o} \\ \bar{u} \end{bmatrix}, \tag{29}$$

$$\mathbb{E}\!\left[ \begin{bmatrix} o \\ u \end{bmatrix} \begin{bmatrix} o \\ u \end{bmatrix}^{\mathsf T} \Big|\ \tilde{y}, m', s' \right] = A_{22} - A_{21} A_{11}^{-1} A_{12} + \begin{bmatrix} \bar{o} \\ \bar{u} \end{bmatrix} \begin{bmatrix} \bar{o} \\ \bar{u} \end{bmatrix}^{\mathsf T}, \tag{30}$$

which are the quantities to be computed at the E-step by substituting θ′ into θ. At the M-step, we compute

$$\{\hat{m}, \hat{s}\} \leftarrow \mathop{\mathrm{argmax}}_{m, s} \, Q(\theta, \theta'). \tag{31}$$

The update equations are omitted owing to space limitations.
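As a concrete illustration of the E-step statistics in Eqs. (29) and (30), the sketch below conditions the latent block z = (o, u) of a jointly Gaussian vector on an observed contour. The random block covariances are stand-ins for A_11, A_12, A_21, A_22, and the mean blocks are taken directly instead of being formed from the potential vector b, so this is only an assumed, simplified illustration of the conditioning step.

```python
import numpy as np

def conditional_moments(mean_y, mean_z, A11, A12, A21, A22, y_obs):
    """Return E[z | y_obs] and E[z z^T | y_obs] for a joint Gaussian over
    (y, z) with mean (mean_y, mean_z) and covariance [[A11, A12], [A21, A22]]."""
    gain = A21 @ np.linalg.inv(A11)
    z_bar = mean_z + gain @ (y_obs - mean_y)          # cf. Eq. (29)
    cov_z = A22 - gain @ A12                          # posterior covariance
    second_moment = cov_z + np.outer(z_bar, z_bar)    # cf. Eq. (30)
    return z_bar, second_moment

# toy usage with a random symmetric positive definite joint covariance
rng = np.random.default_rng(2)
K, L = 10, 15                          # dims of y and z = (o, u)
M = rng.normal(size=(K + L, K + L))
cov = M @ M.T + (K + L) * np.eye(K + L)
A11, A12 = cov[:K, :K], cov[:K, K:]
A21, A22 = cov[K:, :K], cov[K:, K:]
mean = rng.normal(size=K + L)
z_bar, z_zT = conditional_moments(mean[:K], mean[K:], A11, A12, A21, A22,
                                  y_obs=rng.normal(size=K))
```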

5. EXPERIMENTAL EVALUATION

5.1. Experimental Conditions

We conducted objective and subjective evaluation experiments to assess the performance of the proposed method. For the objective evaluation, we computed the correlation coefficients between the predicted and target F0 contours. We also subjectively evaluated the naturalness of the F0 contours of the converted speech. The source speech was EL speech uttered by one male laryngectomee, and the target speech was normal speech uttered by a professional female speaker. Each speaker uttered about 50 sentences from the ATR phonetically balanced sentence set [15]. We conducted a 5-fold cross-validation test in which 40 utterance pairs were used for training and the remaining 10 utterance pairs were used for evaluation. The sampling frequency was 16 kHz. The settings of the HMM in the generative F0 contour model were the same as reported in [4-6]. To initialize the mixture index sequence m and the state sequence s, we performed the conventional GMM-based and Fujisaki-model-based methods. Note that the Fujisaki-model-based method refers to a post-processing method that first applies the GMM-based method and then fits the Fujisaki model to the predicted F0 contour using the method of [5, 6], similar to a post-processing method for HMM-based speech synthesis [16]. Note also that in this experiment we implemented a simplified, approximate version of the proposed method in which the E-step procedure is replaced with the conventional Fujisaki-model-based and GMM-based methods; therefore, the convergence of the algorithm implemented for the current evaluation is not strictly guaranteed. The speech used for evaluation was synthesized with STRAIGHT [17] given the mel-cepstrum sequence and the F0 contour. The methods selected for comparison were as follows:

- GMM-based: predict F0 contours with the GMM-based method.
- Fujisaki-model-based1: fit the Fujisaki model to the predicted F0 contours obtained with the GMM-based method.
- Proposed: predict F0 contours with the approximate version of the proposed method in which the E-step is replaced with the GMM-based and Fujisaki-model-based methods.
- Fujisaki-model-based2: fit the Fujisaki model to the predicted F0 contours obtained with Proposed.

Fig. 3. F0 correlation coefficients between the F0 patterns predicted from EL speech and the F0 patterns extracted from normal speech, plotted against the number of iterations for each method.

Fig. 4. Result of the opinion test on naturalness (mean opinion scores with 95% confidence intervals).

5.2. Experimental Results

As Fig. 3 shows, Proposed obtained the highest prediction accuracy. This is because Eq. (18) acts as an "and" operation on Eqs. (11) and (14), which confirms the effectiveness of constructing a PoE model. Furthermore, since Fujisaki-model-based2 yields higher correlation coefficients than Fujisaki-model-based1, the F0 contours predicted by Proposed benefit from taking into account not only the GMM-based method but also the Fujisaki model. Note that we used the F0 contours predicted by the GMM-based method as the input to the Fujisaki-model-based1 method in this experiment; for a fairer comparison, it would be necessary to modify the Fujisaki-model-based method so that it does not depend on the GMM-based method. In addition, for Proposed there is no large difference in the correlation coefficients across iterations. For a more exact evaluation, we would have to run the PoE model without replacing the E-step with the conventional methods.

As Fig. 4 shows, Proposed outperformed the conventional methods GMM-based and Fujisaki-model-based1. This result is reasonable, since Proposed obtained the highest prediction accuracy, as shown in Fig. 3.

6. CONCLUSIONS

In this paper, to improve F0 prediction performance in electrolaryngeal speech enhancement, we proposed a Product-of-Experts model that combines two conventional methods: a statistical F0 prediction method and a statistical F0 contour modeling method based on the generative process of F0 contours. Experimental results revealed that the proposed method outperformed our previously proposed method in terms of the naturalness of the predicted F0 contours.

7. REFERENCES

[1] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, vol. 54, January 2012.
[2] H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano, "Alaryngeal speech enhancement based on one-to-many eigenvoice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, January 2014.
[3] K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, "A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation," IEICE Transactions on Information and Systems, vol. E97-D, no. 6, June 2014.
[4] H. Kameoka, J. Le Roux, and Y. Ohishi, "A statistical model of speech F0 contours," in Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition, September 2010.
[5] K. Yoshizato, H. Kameoka, D. Saito, and S. Sagayama, "Statistical approach to Fujisaki-model parameter estimation from speech signals and its quantitative evaluation," in Proc. Speech Prosody, May 2012.
[6] H. Kameoka, K. Yoshizato, T. Ishihara, K. Kadowaki, Y. Ohishi, and K. Kashino, "Generative modeling of voice fundamental frequency contours," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, June 2015.
[7] H. Fujisaki, "A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour," in Vocal Fold Physiology: Voice Production, Mechanisms and Functions, 1988.
[8] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, August 2002.
[9] H. Kameoka, "Modeling speech parameter sequences with latent trajectory hidden Markov model," in Proc. 25th IEEE International Workshop on Machine Learning for Signal Processing, September 2015.
[10] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, March 1998.
[11] T. Toda, A. W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, November 2007.
[12] T. Toda, M. Nakagiri, and K. Shikano, "Statistical voice conversion techniques for body-conducted unvoiced speech enhancement," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 9, November 2012.
[13] K. J. Kohler, "Macro and micro F0 in the synthesis of intonation," in Papers in Laboratory Phonology I, 1990.
[14] S. Narusawa, N. Minematsu, K. Hirose, and H. Fujisaki, "A method for automatic extraction of model parameters from fundamental frequency contours of speech," in Proc. 2002 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 2002.
[15] M. Abe, Y. Sagisaka, T. Umeda, and H. Kuwabara, "Speech database," ATR Technical Report TR-I-0166, September 1990.
[16] T. Matsuda, K. Hirose, and N. Minematsu, "Applying generation process model constraint to fundamental frequency contours generated by hidden-Markov-model-based speech synthesis," Acoustical Science and Technology, vol. 33, no. 4, 2012.
[17] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3-4, April 1999.
