EXEMPLAR-BASED VOICE CONVERSION USING NON-NEGATIVE SPECTROGRAM DECONVOLUTION


8th ISCA Speech Synthesis Workshop, August 31 - September 2, 2013, Barcelona, Spain

Zhizheng Wu 1,2, Tuomas Virtanen 3, Tomi Kinnunen 4, Eng Siong Chng 1,2, Haizhou Li 1,2,5

1 School of Computer Engineering, Nanyang Technological University, Singapore
2 Temasek Laboratories@NTU, Nanyang Technological University, Singapore
3 Department of Signal Processing, Tampere University of Technology, Tampere, Finland
4 School of Computing, University of Eastern Finland, Joensuu, Finland
5 Human Language Technology Department, Institute for Infocomm Research, Singapore

wuzz@ntu.edu.sg

ABSTRACT

In traditional voice conversion, converted speech is generated using statistical parametric models (for example, the Gaussian mixture model) whose parameters are estimated from parallel training utterances. A well-known problem of the statistical parametric methods is that statistical averaging in parameter estimation results in over-smoothing of the speech parameter trajectories, and thus leads to low conversion quality. Inspired by the recent success of so-called exemplar-based methods in robust speech recognition, we propose a voice conversion system based on non-negative spectrogram deconvolution with similar ideas. Exemplars, which are able to capture temporal context, are employed to generate the converted speech spectrogram convolutively. The exemplar-based approach is a data-driven, non-parametric alternative to the traditional parametric approaches to voice conversion. Experiments on the VOICES database indicate that the proposed method outperforms the conventional joint density Gaussian mixture model by a wide margin in terms of both objective and subjective evaluations.

Index Terms: voice conversion, exemplar, non-negative matrix factorization, non-negative matrix deconvolution, temporal information

1. INTRODUCTION

Voice conversion is the process of modifying a source speaker's voice so that it sounds as if it were spoken by another (target) speaker.
It can be applied to speaker identity conversion in speech synthesis systems when only a few recorded samples from a specific target speaker are available. In general, voice conversion techniques operate on several different speech features, such as spectral envelope [1, 2], formants [3], fundamental frequency [4, 5] and duration [6]. The spectral envelope contains most of the speaker identity information and is the focus of most voice conversion studies, including this one.

Spectral conversion involves two phases, training and run-time conversion. During training, a transformation function is estimated from frame-aligned source-target feature vectors. The trained conversion model is then applied to unseen utterances at system run-time. Implementation of the conversion function is the most important part of a voice conversion system.

To implement a robust spectral conversion function, a number of data-driven statistical parametric methods have been proposed in the past two decades. A straightforward way to model the relationship between source and target speech is to employ vector quantization (VQ) to learn a codebook from the paired source-target frame vectors, and to apply this codebook during the conversion phase [7]. To alleviate the frame-to-frame discontinuity problem caused by VQ, the joint density Gaussian mixture model (JD-GMM) was proposed [8, 9, 1]. It implements a smoothed local linear transformation function for each frame. Other local linear transformation methods, such as partial least squares regression [10], trajectory GMM/hidden Markov model (HMM) [11], mixture of factor analyzers [12], local linear transformation [13], noisy channel model [2] and so on, have been proposed to reduce the over-smoothing and over-fitting problems of JD-GMM.
In addition to the linear transformation functions, which assume the source and target speech features to be linearly correlated, nonlinear methods, such as artificial neural networks [3, 14], support vector regression [15], kernel partial least squares regression [16], and the conditional restricted Boltzmann machine [17], have been studied to implement nonlinear conversion.

Due to the inherent statistical averaging in parametric methods, over-smoothed speech samples are generated from the averaged parameters, which leads to unnatural speech quality. Inspired by the success of so-called exemplar-based noise robust speech recognition [18, 19, 20], we propose a non-parametric exemplar-based voice conversion method as an alternative to statistical parametric methods. We define an exemplar to be a segment of speech spectrogram spanning multiple frames. Utilizing multiple frames, as opposed to the single frame used in conventional methods, allows contextual modeling, which helps to increase the resulting speech quality.

We study two exemplar-based voice conversion variants: non-negative spectrogram factorization (NMF) and non-negative spectrogram deconvolution (NMD). In the former variant, each spectrogram frame is represented as a convex combination of several basis spectra (atoms) forming a dictionary. In the deconvolution variant, a converted spectrogram is generated as a convolution of exemplars and activations. Compared with the most closely related work [21], our work makes the following novel contributions:

a) We utilize a multiple-frame exemplar rather than a single-frame spectrum as the basis in the dictionary;
b) We employ low-dimensional filter-bank energies instead of the original magnitude spectrum to represent the source spectrogram and source dictionary for efficient computation;
c) We employ a convolutive model to include temporal context information in the converted spectrogram.

2. BASELINE JOINT DENSITY GAUSSIAN MIXTURE MODEL METHOD

Among the statistical parametric methods, the joint density Gaussian mixture model (JD-GMM) method [8, 1] is one of the most successful, due to its probabilistic treatment and flexible implementation. Therefore, we employ the JD-GMM method as our baseline in this study.

The JD-GMM method involves two phases: off-line training and run-time conversion. During the training phase, given parallel training data from a source speaker X and a target speaker Y, the dynamic time warping (DTW) algorithm is used to align the source and target speech vectors to obtain the paired speech feature vector sequence $Z = [z_1, z_2, \ldots, z_t, \ldots, z_T]$, where $z_t = [x_n^\top, y_m^\top]^\top \in \mathbb{R}^{2d}$, and $x_n \in \mathbb{R}^d$ and $y_m \in \mathbb{R}^d$ are source and target speech feature vectors, respectively. A Gaussian mixture model (GMM) is adopted to model the distribution of the paired feature vector sequence $Z$, which represents the joint distribution of source speech $X$ and target speech $Y$. The joint probability density is given as follows:

$$P(X, Y) = P(Z) = \sum_{k=1}^{K} w_k^{(z)} \, \mathcal{N}\!\left(z \mid \mu_k^{(z)}, \Sigma_k^{(z)}\right), \qquad (1)$$

$$\mu_k^{(z)} = \begin{bmatrix} \mu_k^{(x)} \\ \mu_k^{(y)} \end{bmatrix}, \qquad \Sigma_k^{(z)} = \begin{bmatrix} \Sigma_k^{(xx)} & \Sigma_k^{(xy)} \\ \Sigma_k^{(yx)} & \Sigma_k^{(yy)} \end{bmatrix},$$

where $K$ is the number of Gaussian components, and $\mu_k^{(z)}$ and $\Sigma_k^{(z)}$ are the mean vector and the covariance matrix of the $k$-th Gaussian component $\mathcal{N}(z \mid \mu_k^{(z)}, \Sigma_k^{(z)})$, respectively. The prior probability $w_k^{(z)}$ of the $k$-th Gaussian component is constrained by $\sum_{k=1}^{K} w_k^{(z)} = 1$. To estimate the model parameters of the joint density Gaussian mixture model, $\lambda^{(z)} = \{w_k^{(z)}, \mu_k^{(z)}, \Sigma_k^{(z)} \mid k = 1, 2, \ldots, K\}$, the well-known expectation-maximization (EM) algorithm is adopted to maximize the likelihood of the training data.

In the run-time conversion phase, the JD-GMM model parameters are employed to implement the conversion function.
To be more specific, for each input source speech feature vector $x$, the conversion function $F(x)$, implemented with the minimum mean square error criterion and used to predict the target feature vector $\hat{y}$, is given as:

$$\hat{y} = F(x) = \sum_{k=1}^{K} p_k(x) \left( \mu_k^{(y)} + \Sigma_k^{(yx)} \left(\Sigma_k^{(xx)}\right)^{-1} \left(x - \mu_k^{(x)}\right) \right), \qquad (2)$$

$$p_k(x) = \frac{w_k \, \mathcal{N}\!\left(x \mid \mu_k^{(x)}, \Sigma_k^{(xx)}\right)}{\sum_{j=1}^{K} w_j \, \mathcal{N}\!\left(x \mid \mu_j^{(x)}, \Sigma_j^{(xx)}\right)},$$

where $p_k(x)$ is the posterior probability of the source vector $x$ being generated from the $k$-th Gaussian component.

We note that during the JD-GMM model parameter estimation process, the mean vector of each Gaussian component is updated as:

$$\mu_k^{(z)} = \frac{\sum_{t=1}^{T} z_t \, p_k(z_t, \lambda^{(z)})}{\sum_{t=1}^{T} p_k(z_t, \lambda^{(z)})}. \qquad (3)$$

Similarly, the covariance matrix of each Gaussian component is updated as:

$$\Sigma_k^{(z)} = \frac{\sum_{t=1}^{T} p_k(z_t, \lambda^{(z)}) \left(z_t - \mu_k^{(z)}\right)\left(z_t - \mu_k^{(z)}\right)^\top}{\sum_{t=1}^{T} p_k(z_t, \lambda^{(z)})}. \qquad (4)$$

From (3) and (4), we observe that when calculating the mean and covariance for each Gaussian component, all the training samples are used, which is the so-called statistical average. The statistical average results in over-smoothing of the converted speech. We also note that if the correlation between the paired source and target feature vectors is low, the values of the covariance matrix $\Sigma_k^{(yx)}$ will be very small; therefore, only $\mu_k^{(y)}$ contributes to the converted speech, as observed and reported in [22].

3. PROPOSED EXEMPLAR-BASED VOICE CONVERSION METHOD

To tackle the over-smoothing problem, we propose an exemplar-based method to generate the converted speech from spectrogram segments (exemplars). We employ two matrix factorization techniques to implement the exemplar-based method: non-negative spectrogram factorization and non-negative spectrogram deconvolution.
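As a point of comparison for the exemplar-based procedure, the baseline conversion function of Eq. (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `log_gaussian` and `jdgmm_convert` and the list-of-arrays parameter layout are assumptions made here, and the GMM parameters are taken as already trained.

```python
import numpy as np

def log_gaussian(x, mean, cov):
    """Log density of N(x | mean, cov) for a single vector x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def jdgmm_convert(x, weights, mu_x, mu_y, sigma_xx, sigma_yx):
    """MMSE conversion of one source vector x, following Eq. (2).

    weights, mu_x, mu_y, sigma_xx, sigma_yx are length-K sequences holding
    the per-component parameters (hypothetical layout chosen for this sketch).
    """
    K = len(weights)
    # Posterior p_k(x), computed in the log domain for numerical stability.
    log_post = np.array([
        np.log(weights[k]) + log_gaussian(x, mu_x[k], sigma_xx[k])
        for k in range(K)
    ])
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    # Posterior-weighted sum of the per-component linear regressions.
    y_hat = np.zeros_like(mu_y[0], dtype=float)
    for k in range(K):
        regression = sigma_yx[k] @ np.linalg.solve(sigma_xx[k], x - mu_x[k])
        y_hat += post[k] * (mu_y[k] + regression)
    return y_hat
```

With a single component, zero source mean, identity $\Sigma^{(xx)}$ and $\Sigma^{(yx)} = 2I$, the function reduces to $\hat{y} = \mu^{(y)} + 2x$. Setting $\Sigma_k^{(yx)}$ to zero makes the output a posterior-weighted average of the target means, which is the over-smoothing situation described above.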
Both implementations follow the same procedure:

1 Training: construct parallel source and target dictionaries;
2 Conversion:
2.a Extract the source spectrogram;
2.b Given the source spectrogram and the source dictionary, estimate the activation matrix;
2.c Use the activation matrix estimated in step 2.b and the target dictionary to generate the converted spectrogram.

The two implementations using matrix factorization techniques are briefly introduced in this section.

3.1. Non-negative spectrogram factorization (NMF)

The first exemplar-based method is based on non-negative spectrogram factorization. The basic idea of this method is to represent a magnitude spectrum as a linear combination of a set of basis spectra (speech atoms). It is formulated as follows:

$$x \approx \sum_{t=1}^{T} a_t^{(X)} h_t = A^{(X)} h, \qquad (5)$$

where $x \in \mathbb{R}^{p \times 1}$ represents the spectrum of one frame, $T$ is the total number of speech atoms, $A^{(X)} = [a_1^{(X)}, a_2^{(X)}, \ldots, a_T^{(X)}] \in \mathbb{R}^{p \times T}$ is the dictionary of speech atoms built from the training source speech, $a_t^{(X)}$ is the $t$-th speech atom, which has the same dimension as $x$, $h = [h_1, h_2, \ldots, h_T]^\top \in \mathbb{R}^{T \times 1}$ is the non-negative weight or activation vector, and $h_t$ is the activation of the $t$-th speech atom. Therefore, the spectrogram of each source utterance can be represented as:

$$X \approx A^{(X)} H, \qquad (6)$$

where $X \in \mathbb{R}^{p \times M}$ is the source spectrogram, and $H \in \mathbb{R}^{T \times M}$ is the activation matrix, each column of which is an activation vector as in Eq. (5).

In order to generate the converted speech spectrogram, we assume that the aligned source and target dictionaries share the same activation matrix. To this end, we represent the converted spectrogram as:

$$Y = A^{(Y)} H, \qquad (7)$$

where $Y \in \mathbb{R}^{q \times M}$ is the converted spectrogram, and $A^{(Y)} \in \mathbb{R}^{q \times T}$ is the dictionary of the target speech atoms from the target training data.
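Steps 2.b and 2.c for the NMF variant, i.e. Eqs. (6) and (7), can be sketched as follows. The sketch makes an assumption worth stating: it estimates the activations with the plain multiplicative update for the generalized Kullback-Leibler divergence while keeping the dictionary fixed, and it omits any sparsity penalty, so it is not necessarily the exact solver of [18]. The function names are hypothetical.

```python
import numpy as np

def kl_divergence(X, X_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(X || X_hat)."""
    return float(np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat))

def estimate_activations(X, A_src, n_iter=200, eps=1e-12):
    """Estimate H >= 0 such that X ~= A_src @ H (Eq. 6), with A_src fixed.

    X:     (p, M) non-negative source spectrogram
    A_src: (p, T) non-negative source dictionary
    """
    T = A_src.shape[1]
    M = X.shape[1]
    H = np.ones((T, M))
    # Denominator A^T 1 of the KL multiplicative update (constant over iterations).
    col_sums = A_src.sum(axis=0)[:, None] + eps
    for _ in range(n_iter):
        ratio = X / (A_src @ H + eps)
        H *= (A_src.T @ ratio) / col_sums
    return H

def convert_nmf(X, A_src, A_tgt, n_iter=200):
    """Share the activations between dictionaries: Y = A_tgt @ H (Eq. 7)."""
    H = estimate_activations(X, A_src, n_iter)
    return A_tgt @ H, H
```

Because only `H` is updated, the atoms of `A_src` and `A_tgt` stay paired column by column, which is what allows the activations estimated on the source side to be reused on the target side.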

Eqs. (6) and (7) are illustrated in Fig. 1. The source and target dictionaries $A^{(X)}$ and $A^{(Y)}$ are constructed from parallel training data and remain the same during the conversion phase. During the conversion phase, the source spectrogram is given and the activation matrix is obtained as a solution of non-negative matrix factorization as in [18]. The activation matrix estimated from Eq. (6) is then directly employed in Eq. (7) to generate the converted spectrogram.

Fig. 1. Illustration of non-negative spectrogram factorization for exemplar-based voice conversion.

3.2. Non-negative spectrogram deconvolution (NMD)

Although temporal constraints can be included in the estimation of the activation matrix by using multiple-frame exemplars as source speech atoms, the converted speech spectrogram is still generated frame-by-frame. In order to utilize temporal context in the generation of the converted spectrogram, we propose the non-negative spectrogram deconvolution (NMD) method for exemplar-based voice conversion.

In the NMD method, a spectrogram is represented as a convolution of exemplars and activations. The idea is formulated as follows:

$$X \approx \sum_{l=1}^{L} A_l^{(X)} \overset{(l-1)\rightarrow}{H}, \qquad (8)$$

$$Y = \sum_{l=1}^{L} A_l^{(Y)} \overset{(l-1)\rightarrow}{H}, \qquad (9)$$

where $A_l^{(X)} \in \mathbb{R}^{p \times T}$ and $A_l^{(Y)} \in \mathbb{R}^{q \times T}$ are the matrices consisting of the $l$-th frame of the source and target atoms, respectively, $L$ is the number of adjacent frames within an exemplar, and $H$ is the activation (weight) matrix as in Eq. (6). The operator $\overset{(l-1)\rightarrow}{(\cdot)}$ shifts the matrix entries (columns) to the right by $(l-1)$ units.

In practice, several consecutive frames around a given frame can be stacked into one supervector that represents that frame when constructing the source dictionary $A^{(X)} \in \mathbb{R}^{p \times T}$. Therefore, $p = L \times d$ rather than $p = d$, where $d$ is the dimension of the spectrum.

During conversion, a source spectrogram $X$ is first decomposed to estimate the activation matrix, and then the converted speech spectrogram $Y$ is generated as a convolution of the target speech atoms and the corresponding activation matrix. The activation matrix is obtained by minimizing the generalized Kullback-Leibler divergence as explained in [19].

3.3. Dictionary construction

As discussed above, the dictionary is important both for estimating the activation matrix and for generating the converted speech signal. Before describing how the dictionary is constructed, we first introduce the features used to represent the spectrum. In this work, the STRAIGHT [23] system is employed to extract the spectral envelope and fundamental frequency (F0). The following three features are involved in this study:

a) Magnitude spectra (MSP): Magnitude spectra consist of a sequence of spectral envelopes extracted by STRAIGHT. We use 513-dimensional spectra. Magnitude spectra can be passed to STRAIGHT to reconstruct the speech signal. In this work, the target dictionary and the converted spectrogram are always represented by MSP.

b) Mel-scale magnitude spectra (MMSP): MMSP is obtained by passing the magnitude spectrogram through a 23-channel Mel-scale filter-bank. The minimum frequency is set to be Hz, and the maximum frequency is set to be 6,855.5 Hz. In this work, MMSP is only used in the source dictionary to estimate the activation matrix, not for synthesizing speech.

c) Mel-cepstral coefficients (MCC): MCC is obtained by applying the mel-cepstral analysis technique to the magnitude spectrogram and keeping 24 coefficients as the feature. During synthesis, MCCs are converted back to a magnitude spectrogram, which is then passed to the STRAIGHT synthesis filter to reconstruct the speech signal. In this work, MCC is only used in the JD-GMM method and in the dynamic time warping that aligns two parallel utterances.

Given one pair of parallel utterances from source and target, the following process is employed to construct the dictionary:

1) Extract the magnitude spectrogram (spectral envelopes) from both the source and target speech signals using STRAIGHT;
2) Apply mel-cepstral analysis [24] to the spectrograms to obtain mel-cepstral coefficients (MCCs);
3) Apply the 23-channel Mel-scale filter-bank to obtain 23-dimensional MMSP;
4) Perform dynamic time warping on the source and target MCC sequences to align the speech and obtain source-target frame pairs;
5) Apply the alignment information to the source and target spectrograms. The resulting spectrum pairs are stored in the source and target dictionaries (as column vectors), respectively.

The above five steps are applied to all the parallel training utterances. All the spectrum pairs (column vectors in the source and target dictionaries) are used as speech atoms. In order to include multiple frames, consecutive frames are stacked into a supervector to represent one frame.

We note that, for simplicity of explanation, the same features (spectral envelopes on both sides) are used in step 5. Since the size of the activation matrix is independent of the dimensionality of the features (the column dimensionality), 23-dimensional MMSP can replace 513-dimensional MSP in the source dictionary, while 513-dimensional MSP is always used in the target dictionary for the purpose of synthesizing speech. More details are given with the experiments.
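With dictionaries built as described, the NMD conversion of Eqs. (8) and (9) can be sketched as below. This is an illustrative sketch under stated assumptions rather than the implementation of [19]: the dictionaries are stored as arrays of shape `(L, dim, T)` (slice `l` holds frame `l+1` of every atom), the activations are estimated with a multiplicative update derived from the generalized Kullback-Leibler divergence with the dictionary held fixed, and no sparsity term is used. All function names are hypothetical.

```python
import numpy as np

def shift_right(H, k):
    """Shift the columns of H to the right by k, zero-filling on the left."""
    if k == 0:
        return H.copy()
    out = np.zeros_like(H)
    out[:, k:] = H[:, :-k]
    return out

def shift_left(R, k):
    """Shift the columns of R to the left by k, zero-filling on the right."""
    if k == 0:
        return R.copy()
    out = np.zeros_like(R)
    out[:, :-k] = R[:, k:]
    return out

def nmd_reconstruct(A, H):
    """Convolutive model of Eqs. (8)/(9): sum over l of A[l] @ shift_right(H, l).

    A: (L, dim, T) dictionary; index l = 0..L-1 matches the shift (l-1) of the
       1-based index used in the text.
    H: (T, M) activation matrix.
    """
    return sum(A[l] @ shift_right(H, l) for l in range(A.shape[0]))

def kl_divergence(X, X_hat, eps=1e-12):
    """Generalized Kullback-Leibler divergence D(X || X_hat)."""
    return float(np.sum(X * np.log((X + eps) / (X_hat + eps)) - X + X_hat))

def nmd_activations(X, A_src, n_iter=200, eps=1e-12):
    """Estimate H >= 0 minimizing D(X || nmd_reconstruct(A_src, H))."""
    L, _, T = A_src.shape
    M = X.shape[1]
    H = np.ones((T, M))
    ones = np.ones_like(X)
    # Positive part of the gradient; it does not depend on H.
    denom = sum(A_src[l].T @ shift_left(ones, l) for l in range(L)) + eps
    for _ in range(n_iter):
        ratio = X / (nmd_reconstruct(A_src, H) + eps)
        numer = sum(A_src[l].T @ shift_left(ratio, l) for l in range(L))
        H *= numer / denom
    return H

def convert_nmd(X, A_src, A_tgt, n_iter=200):
    """Estimate H on the source side (Eq. 8), then generate Y with Eq. (9)."""
    H = nmd_activations(X, A_src, n_iter)
    return nmd_reconstruct(A_tgt, H), H
```

With `L = 1` the model reduces to the NMF product of Eqs. (6) and (7). The source and target dictionaries may have different feature dimensions (for example 23-dimensional MMSP and 513-dimensional MSP), since only the atom index and the frame index inside each exemplar have to match.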

4. EXPERIMENTS

To evaluate the proposed methods, we conduct experiments using the VOICES database [25]. Male-to-female and female-to-male conversions are conducted. For each conversion, 10 utterances from each speaker are selected as training data, and 20 utterances that are not included in the training data are used as testing data.

In the experiments, three methods are compared. They are summarized as follows:

a) JD-GMM: The joint density Gaussian mixture model method (Section 2). The number of Gaussian components is set to 32.
b) NMF: The proposed non-negative spectrogram factorization method (Section 3.1).
c) NMD: The proposed non-negative spectrogram deconvolution method (Section 3.2).

In the JD-GMM method, 24-dimensional MCC features are used to represent the spectral envelope and to synthesize the speech signal, while in the NMF and NMD methods, 513-dimensional MSP is used in the target dictionary and to synthesize the speech signal. Log-scale F0 is converted by equalizing the mean and variance of the source and target speech.

4.1. Objective evaluation

Two objective measures are employed to evaluate the proposed method. The first is a spectral distortion measure, the mel-cepstral distortion (MCD), which is calculated between a converted frame and the corresponding original target frame. We note that the frame alignment is obtained by performing dynamic time warping between parallel source and target sentences. The MCD for the $m$-th frame is calculated as:

$$\mathrm{MCD\,[dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{24} \left(c_{m,d} - c_{m,d}^{\mathrm{conv}}\right)^2}, \qquad (10)$$

where $c_{m,d}$ and $c_{m,d}^{\mathrm{conv}}$ are the $d$-th dimension of the original target and converted MCCs of the $m$-th frame, respectively, and $M$ is the number of frames in one utterance ($m = 1, \ldots, M$). We report the average MCD value over all the frames. A lower MCD value indicates smaller distortion.

The second objective measure is the correlation coefficient, which is calculated between the original target and the converted MCC parameter trajectories dimension-by-dimension.
The correlation coefficient $\gamma_d$ of the $d$-th MCC trajectory is computed as follows:

$$\gamma_d = \frac{\sum_{m=1}^{M} \left(c_{m,d} - \bar{c}_d\right)\left(c_{m,d}^{\mathrm{conv}} - \bar{c}_d^{\mathrm{conv}}\right)}{\sqrt{\sum_{m=1}^{M} \left(c_{m,d} - \bar{c}_d\right)^2} \sqrt{\sum_{m=1}^{M} \left(c_{m,d}^{\mathrm{conv}} - \bar{c}_d^{\mathrm{conv}}\right)^2}}, \qquad (11)$$

where $\bar{c}_d$ and $\bar{c}_d^{\mathrm{conv}}$ are the mean values of the original target and converted MCCs of the $d$-th dimension, respectively. We note that the correlation coefficient is calculated sentence-by-sentence, and we report the average correlation coefficient. Unlike MCD, the correlation coefficient focuses on trajectory-level similarity, which is not affected by the mean and variance of the MCC trajectory, and it has been used to measure fundamental frequency trajectory similarity [5, 26]. A larger correlation coefficient indicates higher similarity between the original target and the converted MCC trajectories. We report the average correlation coefficient over all dimensions.

In order to obtain comparable MCD and correlation results, in the NMF and NMD methods mel-cepstral analysis is applied to the converted spectrogram to get the 24-dimensional MCCs used for computing MCD and the correlation coefficient. Both the MCD and correlation coefficient results reported in this work are averaged over the conversion pairs.

As shown in Eqs. (6) and (7) and in Fig. 1, the dimensionality of the activation matrix is independent of the dimensionality of the exemplars in both the source and the target dictionaries. Therefore, we first evaluate the performance of NMF using different features in the source dictionary for estimating the activation matrix. We note that in all the experiments the target dictionary always uses the 513-dimensional magnitude spectra, as the target dictionary does not affect the activation matrix and is also used to synthesize the speech signal.

As discussed above, the dimensionality of the spectral envelope from STRAIGHT is 513 (1024-point FFT). If the original magnitude spectra (MSP) are used to estimate the activation matrix, as illustrated in Fig. 1, the dimensionality of the source dictionary $A^{(X)}$ will be $513 \times T$, assuming that each exemplar spans only one frame. If each exemplar spans 11 frames, the dimensionality of the source dictionary $A^{(X)}$ will be $5{,}643 \times T$, where $T$ is the number of atoms. Such a large source dictionary increases the computation and memory usage considerably.

To reduce computation and memory usage, low-dimensional features are a better choice. In this study, we propose to use 23-dimensional MMSP instead of the original 513-dimensional MSP to build the source dictionary for estimating the activation matrix, while the target dictionary remains the same as discussed above.

Table 1 presents the spectral distortions and correlations of NMF using 513-dimensional MSP and 23-dimensional MMSP in the source dictionary. Here, an exemplar spans only one frame. The results show that, even though the dimensionality is reduced from 513 to 23, the distortion only increases by 0.06 dB, and the correlation decreases by . The benefit of using 23-dimensional MMSP instead of 513-dimensional MSP in the source dictionary is that more consecutive frames can be included in the exemplar to estimate the activation matrix without increasing the computation cost and memory usage too much.

Table 1. Comparison of NMF results using 513-dimensional magnitude spectra (MSP) and 23-dimensional Mel-scale magnitude spectra (MMSP) in the source dictionary $A^{(X)}$. 513-dimensional MSP is always used in the target dictionary $A^{(Y)}$.

Features in source dictionary A^(X) | MCD (dB) | Correlation
MSP (513 dimensions)                |          |
MMSP (23 dimensions)                |          |

We then evaluate the performance of NMF using multiple frames in an exemplar for the source dictionary. The spectral distortion results as a function of the window size (number of consecutive frames) of an exemplar are presented in Fig. 2. For Mel-scale magnitude spectra, the window size of the exemplar is varied.
For the 513-dimensional magnitude spectra, only a single-frame spectrum is employed in the exemplar, owing to the computational restrictions discussed above. The results show that when the window size is larger than 3, 23-dimensional MMSP yields lower MCD and a higher correlation coefficient than 513-dimensional MSP. NMF with exemplars using MMSP and spanning 9 frames gives the lowest spectral distortion. We note that the dimensionality of an exemplar using MMSP and spanning 9 frames is 9 × 23 = 207, which is still much smaller than 513. The correlation results in Fig. 3 agree well with the spectral distortion results.

Fig. 2. The spectral distortion results of the NMF method using different features, with the baseline JD-GMM method as a reference. (Axes: window size of an exemplar vs. spectral distortion in dB; curves: NMF using 513-dimensional MSP, NMF using 23-dimensional MMSP.)

Fig. 3. The correlation coefficient results of the NMF method using different features, with the baseline JD-GMM method as a reference. (Axes: window size of an exemplar vs. correlation coefficient; curves: NMF using 513-dimensional MSP, NMF using 23-dimensional MMSP.)

Next, we evaluate the proposed non-negative deconvolution (NMD) method using 23-dimensional MMSP. As shown above, 9-frame exemplars give the lowest distortion in the NMF method; therefore, in the NMD method, we stack 9 consecutive frames around each frame to represent that frame. Therefore, in Eq. (8), p = 9 × 23 = 207. The spectral distortion results are presented in Fig. 4 as a function of the window size of an exemplar. Compared with the JD-GMM method, the NMD method always obtains lower spectral distortion. The NMD and NMF methods have similar performance in terms of spectral distortion when the window size is 5 or 7.

Fig. 4. Comparison of the spectral distortion results of the JD-GMM, NMF and NMD methods as a function of the window size of an exemplar. (Axes: window size of an exemplar vs. spectral distortion in dB.)

The correlation coefficient results are shown in Fig. 5. They clearly show that NMD has the highest correlation coefficients in all cases. We note that, unlike NMF, the NMD method utilizes multiple target frames (an exemplar) not only to estimate the activation matrix but also to generate the converted spectrogram.

Fig. 5. Comparison of the correlation results of the JD-GMM, NMF and NMD methods as a function of the window size of an exemplar. (Axes: window size of an exemplar vs. correlation coefficient.)

4.2. Subjective evaluation

To assess the similarity of the converted speech to the target speech, a similarity preference listening test was conducted. The JD-GMM method and the two proposed methods, NMF and NMD, are compared. 10 converted utterances from each method were randomly selected, including 5 utterances from the male-to-female conversion and 5 utterances from the female-to-male conversion. 11 subjects were asked to listen to a reference target speech sample and then to the three converted speech samples representing the three methods. After that, they were asked to decide which speech sample is closest to the reference target speech sample.

The preference scores with 95% confidence intervals are presented in Fig. 6. We can clearly observe that the proposed NMF and NMD methods are both able to generate speech samples that are more similar to the target speaker than those of the baseline JD-GMM method. We note that during the listening test, when the subjects are not able to distinguish the similarity across speech samples, they prefer to choose the one with better quality. Therefore, the similarity results also reflect the speech quality of the three methods to some degree.

Fig. 6. Similarity results: preference scores (%) with 95% confidence intervals.
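The two objective measures of Section 4.1 can be sketched as follows, assuming the target and converted MCC sequences have already been time-aligned and are stored as arrays of shape `(M, D)` (frames by cepstral dimensions). The function names are hypothetical, and the scaling of the distortion follows the usual mel-cepstral distortion definition, which is an assumption about the exact form of Eq. (10).

```python
import numpy as np

def mel_cepstral_distortion(c_target, c_conv):
    """Average per-frame MCD in dB between aligned (M, D) MCC sequences."""
    diff = c_target - c_conv
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())

def trajectory_correlation(c_target, c_conv):
    """Correlation coefficient of Eq. (11), averaged over the D dimensions.

    Each dimension is treated as a trajectory over the M frames. A constant
    trajectory has zero variance and would produce a division by zero here.
    """
    t = c_target - c_target.mean(axis=0)
    c = c_conv - c_conv.mean(axis=0)
    num = np.sum(t * c, axis=0)
    den = np.sqrt(np.sum(t ** 2, axis=0) * np.sum(c ** 2, axis=0))
    return float(np.mean(num / den))
```

Rescaling or offsetting a trajectory with a positive gain leaves `trajectory_correlation` unchanged while it increases `mel_cepstral_distortion`, which illustrates why the two measures are reported together.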

5. CONCLUSIONS

In this paper, we proposed an exemplar-based voice conversion method utilizing matrix/spectrogram factorization techniques. Two implementations, non-negative spectrogram factorization and non-negative spectrogram deconvolution, are proposed; both use the original target spectrogram directly, without any dimension reduction, to synthesize the converted speech. The experimental results show that the proposed method outperforms the conventional joint density Gaussian mixture model considerably.

6. ACKNOWLEDGEMENT

The work of Tuomas Virtanen (projects no. ) and Tomi Kinnunen was supported by the Academy of Finland (projects no. ). The authors would like to thank all the listeners who took part in the subjective evaluation test.

7. REFERENCES

[1] T. Toda, A.W. Black, and K. Tokuda, "Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8.
[2] D. Saito, S. Watanabe, A. Nakamura, and N. Minematsu, "Statistical voice conversion based on noisy channel model," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 6.
[3] M. Narendranath, H.A. Murthy, S. Rajendran, and B. Yegnanarayana, "Transformation of formants for voice conversion using artificial neural networks," Speech Communication, vol. 16, no. 2.
[4] B. Gillet and S. King, "Transforming F0 contours," in Proceedings of Eurospeech, 2003.
[5] Z.Z. Wu, T. Kinnunen, E.S. Chng, and H. Li, "Text-independent F0 transformation with non-parallel data for voice conversion," in Eleventh Annual Conference of the International Speech Communication Association.
[6] C.H. Wu, C.C. Hsia, T.H. Liu, and J.F. Wang, "Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4.
[7] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in ICASSP.
[8] A. Kain and M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," in ICASSP.
[9] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2.
[10] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, "Voice conversion using partial least squares regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5.
[11] H. Zen, Y. Nankaku, and K. Tokuda, "Continuous stochastic feature mapping based on trajectory HMMs," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 2.
[12] Z. Wu, T. Kinnunen, E. Chng, and H. Li, "Mixture of factor analyzers using priors from non-parallel speech for voice conversion," Signal Processing Letters, IEEE.
[13] V. Popa, H. Silén, J. Nurminen, and M. Gabbouj, "Local linear transformation for voice conversion," in ICASSP.
[14] S. Desai, E.V. Raghavendra, B. Yegnanarayana, A.W. Black, and K. Prahallad, "Voice conversion using artificial neural networks," in ICASSP.
[15] P. Song, Y.Q. Bao, L. Zhao, and C.R. Zou, "Voice conversion using support vector regression," Electronics Letters, vol. 47, no. 18.
[16] E. Helander, H. Silén, T. Virtanen, and M. Gabbouj, "Voice conversion using dynamic kernel partial least squares regression," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3.
[17] Z. Wu, E.S. Chng, and H. Li, "Conditional restricted Boltzmann machine for voice conversion," in the first IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP).
[18] J.F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-based sparse representations for noise robust automatic speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7.
[19] A. Hurmalainen, J. Gemmeke, and T. Virtanen, "Non-negative matrix deconvolution in noise robust speech recognition," in ICASSP.
[20] T.N. Sainath, B. Ramabhadran, D. Nahamoo, D. Kanevsky, D. Van Compernolle, K. Demuynck, J.F. Gemmeke, J.R. Bellegarda, and S. Sundaram, "Exemplar-based processing for speech recognition: An overview," IEEE Signal Processing Magazine, vol. 29, no. 6, Nov.
[21] R. Takashima, T. Takiguchi, and Y. Ariki, "Exemplar-based voice conversion in noisy environment," in Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012.
[22] Y. Chen, M. Chu, E. Chang, J. Liu, and R. Liu, "Voice conversion with smoothed GMM and MAP adaptation," in Eurospeech.
[23] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3.
[24] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in ICASSP.
[25] A. Kain, "High resolution voice transformation," Ph.D. Thesis, OGI School of Science and Engineering, Oregon Health and Science University.
[26] Y. Qian, Z. Wu, B. Gao, and F.K. Soong, "Improved prosody generation by maximizing joint probability of state and longer units," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 6.
