Vocoding approaches for statistical parametric speech synthesis

Vocoding approaches for statistical parametric speech synthesis. Ranniery Maia, Toshiba Research Europe Limited, Cambridge Research Laboratory. Speech Synthesis Seminar Series, CUED, University of Cambridge, UK, March 2nd, 2011.

2 Topics of this presentation
1. Existing methods to generate the speech waveform in statistical parametric speech synthesis
2. An idea for closing the gap between acoustic modeling and waveform generation

3 Notation and acronyms in this presentation

Notation:
x(n): a discrete-time signal
X(z): x(n) in the z-transform domain
X(e^{jω}): Discrete-Time Fourier Transform of x(n) (frequency-domain representation of x(n))
|X(e^{jω})|: magnitude response of x(n)
∠X(e^{jω}): phase response of x(n)
|X(e^{jω})|²: power spectrum of x(n)
x: a vector
X: a matrix

Acronyms:
OLA: OverLap and Add
MELP: Mixed Excitation Linear Prediction
STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum
FFT: Fast Fourier Transform
IFFT: Inverse Fast Fourier Transform
LF: Liljencrants-Fant model
LP: Linear Prediction
PCA: Principal Component Analysis
LSP: Line Spectral Pairs

4 Contents
Introduction
Vocoding methods for statistical parametric speech synthesis
  Fully parametric excitation methods
  Methods that attempt to mimic the LP residual
  Methods that work on source and vocal tract modeling
Joint acoustic modeling and waveform generation for statistical parametric speech synthesis
Conclusion

6 Statistical parametric speech synthesis
Speech synthesis methods:
1. Rule-based
   1.1 Parametric
   1.2 Unit concatenation
2. Corpus-based
   2.1 Unit selection and concatenation
   2.2 Statistical parametric
   2.3 Hybrid

7 Statistical parametric speech synthesis
1. Advantages: several voices, small data, small footprint, language portability, etc.
2. Disadvantage: unnatural synthesized speech, because
   2.1 a parametric model of speech production is used, and
   2.2 the parameters of the model are averaged.
How to alleviate this unnaturalness?
1. Statistical modeling
2. Choice of the speech production model
3. Choice of the parameters to represent such a model
4. Way of synthesizing speech with these parameters

9 Statistical parametric speech synthesis
Training time: from the speech waveform s = [s(0) ... s(N−1)]ᵀ, parameter extraction produces the speech parameters c = [c_0ᵀ ... c_{T−1}ᵀ]ᵀ; given the labels l, acoustic model training yields the acoustic model parameters

    λ̂_c = arg max_{λ_c} p(c | l, λ_c)

Synthesis time: given the labels and the acoustic model parameters λ̂_c, parameter generation yields

    ĉ = [ĉ_0ᵀ ... ĉ_{T−1}ᵀ]ᵀ = arg max_c p(c | l, λ̂_c)

and waveform generation turns ĉ into the speech waveform s = [s(0) ... s(N−1)]ᵀ.

10 Waveform generation part
1. Choice of the speech production mechanism:
   Simple: excitation plus speech synthesis filter
   Complete: excitation plus vocal tract, glottal and lip radiation filters
2. Appropriate parameters for the chosen speech mechanism, with good quantization/compression properties
3. Given the speech model and corresponding parameters, design the best way to synthesize the speech signal according to some criteria

12 Digital speech models
The complete model [Deller, Jr. et al., 2000]: a pulse train generator (driven by the pitch period) and an uncorrelated noise generator are switched (V/U) and scaled by a gain; the voice volume velocity then passes through a glottal filter G(z), a vocal tract filter H(z) and a lip radiation filter L(z) to produce the speech signal.
The simplified model, assumed for LP analysis: a pulse train generator and a white noise generator are switched (V/U), scaled by a gain, and filtered by a single all-pole filter to produce the speech signal.

13 Vocoding methods for statistical parametric speech synthesis

14 Standard vocoder for statistical parametric synthesis
Labels representing the text to be synthesized are fed to the trained acoustic models, which generate F0 and spectral parameters. A pulse train t(n) with period P_0 (from F0) or white noise w(n) is selected by the V/U decision to form the excitation e(n), which drives the synthesis filter to produce speech.
Very simple. Analysis: F0 extraction. Synthesis: pulse/white-noise switch. Poor speech quality!
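
To make the pulse/white-noise switch concrete, here is a minimal sketch (not from the talk) of such an excitation generator plus an all-pole synthesis filter; the per-frame F0 layout, sampling rate and function names are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def simple_excitation(f0, frame_len, fs=16000):
    """Pulse train for voiced frames (f0 > 0 Hz), white noise for unvoiced ones.

    f0: one F0 value per frame, 0 marking unvoiced (hypothetical input layout).
    """
    rng = np.random.default_rng(0)
    e = np.zeros(len(f0) * frame_len)
    next_pulse = 0.0
    for i, f in enumerate(f0):
        start = i * frame_len
        if f > 0:                                    # voiced: pulses every fs/f samples
            period = fs / f
            while next_pulse < start + frame_len:
                e[int(next_pulse)] = np.sqrt(period) # unit-power pulse train
                next_pulse += period
        else:                                        # unvoiced: unit-variance noise
            e[start:start + frame_len] = rng.standard_normal(frame_len)
            next_pulse = start + frame_len
    return e

# Speech from LP coefficients a = [1, a1, ..., ap] of the all-pole filter 1/A(z):
# s = lfilter([1.0], a, simple_excitation(f0, frame_len))
```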

15 Improved vocoding methods for statistical parametric synthesis
1. Methods that focus solely on the excitation signal
   1.1 Fully parametric excitation models
   1.2 Methods that attempt to mimic the LP residual
2. Methods that focus on source and vocal tract modeling

16 Fully parametric excitation methods

17 MELP mixed excitation [Yoshimura et al., 2001]
MELP excitation building part: the pulse train t(n), after position jittering and shaping by the Fourier magnitudes (H_v(z)), is filtered by H_p(z) to give the voiced excitation v(n); white noise w(n) is filtered by H_n(z) to give the unvoiced excitation u(n); their sum is the mixed excitation e(n).
Period jitter is derived from the voicing strengths for aperiodic frames.
The Fourier magnitudes simulate the glottal filter.
Filters H_p(z) and H_n(z) control the amount of pulse and noise in the final excitation e(n).

18 Pulse and noise shaping filters
Filters H_p(z) and H_n(z) switch between pulse and noise excitation in each band:

    H_p(z) = Σ_{j=0}^{J−1} Σ_{m=0}^{M} β̄_j h_j(m) z^{−m},   H_n(z) = Σ_{j=0}^{J−1} Σ_{m=0}^{M} (1 − β̄_j) h_j(m) z^{−m}

    β̄_j = 1 if β_j ≥ 0.5,   β̄_j = 0 if β_j < 0.5

where h_j(m) are the bandpass filter coefficients for the j-th band. The bandpass voicing strength β_j for the j-th band is obtained from a normalized correlation coefficient at the pitch lag t:

    β_j = f(r_t),   r_t = Σ_{n=0}^{N−1} s(n) s(n+t) / sqrt( [Σ_{n=0}^{N−1} s²(n)] [Σ_{n=0}^{N−1} s²(n+t)] )
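
A sketch of how the binarized voicing strengths could drive the per-band pulse/noise mixing; the 0.5 threshold and normalized correlation follow the slide, while the function names and helper layout are assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def voicing_strength(band_signal, t):
    """Normalized correlation r_t of one bandpass-filtered frame at pitch lag t."""
    x, y = band_signal[:-t], band_signal[t:]
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y)) + 1e-12
    return np.sum(x * y) / denom

def melp_mixed_excitation(pulse, noise, betas, band_filters):
    """Sum over bands of binarized pulse/noise selection (beta_j >= 0.5 -> pulse)."""
    e = np.zeros_like(pulse)
    for beta_j, h_j in zip(betas, band_filters):  # h_j: FIR coefficients of band j
        b = 1.0 if beta_j >= 0.5 else 0.0
        e += lfilter(h_j, [1.0], b * pulse + (1.0 - b) * noise)
    return e
```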

19 Application to statistical parametric synthesis
Additional parameters for acoustic modeling:
1. Bandpass voicing strengths: 5
2. Fourier magnitudes: 10

20 STRAIGHT excitation [Zen et al., 2007a]
STRAIGHT vocoder, excitation construction (no phase manipulation case): a zero-phase pulse spectrum T(e^{jω}) is weighted by H_p(e^{jω}) to give the voiced excitation V(e^{jω}); a random-phase noise spectrum W(e^{jω}) with |W(e^{jω})| = 1 is weighted by H_n(e^{jω}) to give the unvoiced excitation U(e^{jω}). Both weighting filters are derived from the aperiodicity parameters, and the sum E(e^{jω}) is transformed back to the time domain as the mixed excitation e(n).

21 STRAIGHT vocoder for statistical parametric synthesis
Aperiodicity parameters are extracted and averaged over specified frequency sub-bands: the band-aperiodicity parameters (BAP).
At synthesis time the generated BAP are converted into aperiodicity, and speech is synthesized in the frequency domain. Achieves very good quality.
Additional parameters for acoustic modeling: BAP, usually 5 coefficients.

22 Pulse and noise weighting filters
Filters H_p(e^{jω}) and H_n(e^{jω}) shape the pulse and noise inputs, just like in MELP. Their frequency responses are obtained from the aperiodicity parameters a(ω):

    |H_p(e^{jω})| = 1 − a(ω),   ∠H_p(e^{jω}) = 0,   0 ≤ ω ≤ π
    |H_n(e^{jω})| = a(ω),       ∠H_n(e^{jω}) = 0,   0 ≤ ω ≤ π
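
In the frequency domain this weighting reduces to a per-bin mix; a minimal sketch, assuming the pulse and noise spectra T(e^{jω}), W(e^{jω}) and the aperiodicity a(ω) are already sampled on the same rFFT grid.

```python
import numpy as np

def straight_mixed_excitation(T_spec, W_spec, a):
    """E = Hp*T + Hn*W with zero-phase weights |Hp| = 1 - a(w), |Hn| = a(w)."""
    a = np.clip(a, 0.0, 1.0)           # aperiodicity is a ratio in [0, 1]
    E = (1.0 - a) * T_spec + a * W_spec
    return np.fft.irfft(E)             # mixed excitation e(n) in the time domain
```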

23 Band aperiodicity parameters
Aperiodicity at frequency ω:

    a(ω) = [ ∫ w_ERB(λ; ω) |S(e^{jλ})|² Υ( |S_L(e^{jλ})|² / |S_U(e^{jλ})|² ) dλ ] / [ ∫ w_ERB(λ; ω) |S(e^{jλ})|² dλ ]

where |S(e^{jω})| is the speech spectral envelope, S_U(e^{jω}) the envelope constructed by connecting the peaks of |S(e^{jω})|, S_L(e^{jω}) the envelope constructed by connecting the valleys of |S(e^{jω})|, w_ERB(λ; ω) an auditory filter used to smooth |S(e^{jω})|, and Υ(·) a look-up table operation.
Band-aperiodicity:

    b_j = (1/|Ω_j|) ∫_{Ω_j} a(ω) dω

with Ω_j the j-th frequency band.
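
Band-aperiodicity extraction then amounts to averaging a(ω) over each band; a short sketch using the 5-band layout of the next slide (the frequency grid and names are assumptions).

```python
import numpy as np

def band_aperiodicity(a, freqs,
                      bands=((0, 1000), (1000, 2000), (2000, 4000),
                             (4000, 6000), (6000, 8000))):
    """b_j = average of a(w) over the j-th frequency band Omega_j (in Hz)."""
    return np.array([a[(freqs >= lo) & (freqs < hi)].mean() for lo, hi in bands])
```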

24 Aperiodicity and band aperiodicity: examples
[Figure: log-magnitude aperiodicity a(ω) and the piecewise-constant band aperiodicity over 0–8000 Hz, for (top) 5 bands: 0–1 kHz, 1–2 kHz, 2–4 kHz, 4–6 kHz, 6–8 kHz, and (bottom) 24 Bark critical bands.]

25 Methods that attempt to mimic the LP residual

26 State-dependent mixed excitation [Maia et al., 2007]
From the labels, the trained HMMs give a state sequence {s_0, ..., s_{Q−1}} with state durations; for each state, spectral parameters and F0 are generated, and a state-dependent pair of filters {H_v(z), H_u(z)} is selected. The pulse train t(n) filtered by H_v(z) gives the voiced excitation v(n); white noise w(n) filtered by H_u(z) gives the unvoiced excitation ũ(n); their sum, the mixed excitation ẽ(n), drives the synthesis filter H(z) to produce the synthesized speech.
Filters:

    H_v(z) = Σ_{m=−M/2}^{M/2} h(m) z^{−m},   H_u(z) = 1 / Σ_{l=0}^{L} g(l) z^{−l}

27 State-dependent mixed excitation: training
The residual e(n) is the target signal. A pulse train t(n) with positions {p_0, ..., p_{Z−1}} and amplitudes {a_0, ..., a_{Z−1}} is filtered by H_v(z) to form the voiced excitation v(n); white noise w(n) filtered by H_u(z), with G(z) = 1/H_u(z), forms the unvoiced excitation u(n), which plays the role of the error signal.
The filter coefficients

    h = [h(−M/2) ... h(M/2)]ᵀ,   g = [g(0) ... g(L)]ᵀ

and the pulse positions and amplitudes {p_0, ..., p_{Z−1}}, {a_0, ..., a_{Z−1}} are optimized in a way that

    {h, g, t(n)} = arg max_{g, h, t(n)} P(e(n) | g, h, t(n))

28 State-dependent mixed excitation: synthesis
Given F0 and the state sequence q = {s_0, ..., s_{Q−1}}, the pulse train t(n) is filtered by H_v^{(q)}(z) and scaled by a gain α to give the voiced component αṽ(n); white noise w(n) is filtered by H_u^{(q)}(z), high-pass filtered by H_HP(z) and time-modulated by ρ(n) to give ũ(n); the sum is the excitation ẽ(n).
The noise component is colored through:
1. high-pass filtering (F_c = 2 kHz);
2. time modulation with a pitch-synchronous triangular window ρ(n).
H_v(z) is normalized in energy. The gain α adjusts the energy of the voiced component so that the power of the excitation signal ẽ(n) becomes one.

29 Voiced filter effect
[Figure: time-domain waveforms (amplitude vs. time in ms) and magnitude spectra (dB vs. frequency, 0–8000 Hz) of the pulse train, the voiced filter H_v(z) and the resulting voiced excitation.]

30 Unvoiced filter effect
[Figure: time-domain waveforms (amplitude vs. time in ms) and magnitude spectra (dB vs. frequency, 0–8000 Hz) of the white noise, the unvoiced filter H_u(z) and the resulting unvoiced excitation.]

31 Deterministic plus stochastic residual modeling [Drugman et al., 2009]
Assumed model of the LP residual e(n):

    e(n) = e_d(n) + e_s(n)

where e_d(n) is the deterministic part and e_s(n) the stochastic part. The maximum voiced frequency F_m is the boundary between the deterministic and stochastic components, set to 4 kHz.

32 Deterministic modeling: eigenresidual calculation
Pipeline: the speech data is inverse-filtered (using the spectral parameters) to obtain the residual; GCI detection and F0 extraction give the pitch periods; the residual is windowed into segments, which are resampled and energy-normalized; PCA over the segments yields the eigenresiduals.
The residual segments are resampled to a normalized frequency F_0*, determined from F_Nyquist, F_m and F_{0,min}.
PCA: the eigenresiduals explain about 80% of the total dispersion.

33 Stochastic modeling
Stochastic component model:

    e_s(n) = ρ(n) · [h_u(n) ∗ w(n)]

where ρ(n) is a pitch-synchronous modulation window, h_u(n) the impulse response of an AR filter, and w(n) white noise.
The unvoiced filter h_u(n) is fixed, auto-regressive (all-pole), with coefficients obtained through LP analysis.
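
A sketch of this stochastic component, assuming the AR coefficients come from an LP analysis and that the triangular window spans each pitch period (the pitch-mark handling is an assumption).

```python
import numpy as np
from scipy.signal import lfilter

def stochastic_component(n_samples, ar_coeffs, pitch_marks, seed=0):
    """e_s(n) = rho(n) * [h_u(n) conv w(n)]: AR-shaped noise, pitch-synchronously
    modulated by a triangular window rho(n) built between consecutive pitch marks."""
    w = np.random.default_rng(seed).standard_normal(n_samples)
    shaped = lfilter([1.0], ar_coeffs, w)            # h_u(n) * w(n), all-pole h_u
    rho = np.zeros(n_samples)
    for m0, m1 in zip(pitch_marks[:-1], pitch_marks[1:]):
        t = np.linspace(0.0, 1.0, m1 - m0, endpoint=False)
        rho[m0:m1] = 1.0 - np.abs(2.0 * t - 1.0)     # triangle over one period
    return rho * shaped
```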

34 Application to statistical parametric synthesis
Additional parameters for acoustic modeling: 15 PCA weights. Using eigenresiduals of higher ranks makes no difference, so the stream of PCA weights can optionally be removed.
Synthesis part: the eigenresiduals are linearly combined using the PCA weights, and the result is resampled according to F0 to give the deterministic component r_d(n); white noise w(n) is filtered by H_u(z) and time-modulated by ρ(n) to give the stochastic component r_s(n); the two components are overlap-added (OLA) into the excitation e(n).

35 Waveform interpolation [Sung et al., 2010]
Waveform interpolation (WI): each cycle of the excitation signal is represented by a characteristic waveform (CW)

    e(n) = Σ_{k=0}^{P/2} [ A_k cos(2πkn/P) + B_k sin(2πkn/P) ]

where {A_k, B_k} are discrete-time Fourier series coefficients and P is the pitch period. The CW is extracted from the LP residual at a fixed rate.
Information to reconstruct the excitation signal: the pitch period P and the CW coefficients {A_k, B_k}.
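
Reconstructing one cycle from its CW coefficients is a direct evaluation of the Fourier series above; a minimal sketch (the array layout is an assumption).

```python
import numpy as np

def cw_reconstruct(A, B, P):
    """One excitation cycle e(n) = sum_k A_k cos(2 pi k n / P) + B_k sin(2 pi k n / P)
    for n = 0..P-1, with k = 0..P//2 (len(A) == len(B) == P//2 + 1)."""
    n = np.arange(P)
    k = np.arange(len(A))[:, None]
    phase = 2.0 * np.pi * k * n / P          # (K, P) grid of 2*pi*k*n/P
    return A @ np.cos(phase) + B @ np.sin(phase)
```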

36 Application to statistical parametric synthesis
Analysis is similar to the eigenresidual calculation of [Drugman et al., 2009]: speech data → WI analysis → CW coefficients → PCA → basis vector selection → basis vector coefficient computation.
Additional parameters for acoustic modeling: 8 coefficients of the basis vectors.
Synthesis: from the basis vector coefficients and the pitch, the CW is reconstructed and WI synthesis produces the speech.

37 Methods that work on source and vocal tract modeling

38 Glottal inverse filtering [Raitio et al., 2008]
Uses Iterative Adaptive Inverse Filtering (IAIF) [Alku, 1992]. From the speech signal, the method extracts the energy, the harmonic-to-noise ratio (HNR), F0, and the vocal tract spectrum V(z); glottal inverse filtering (IAIF) gives the estimated voice source signal g(n), whose LP analysis yields the voice source spectrum G(z).
Features for acoustic modeling:
1. F0
2. Energy
3. HNR in 4 bands: 0–2 kHz, 2–4 kHz, 4–6 kHz, 6–8 kHz
4. Voice source (glottal flow) spectrum: 10 LSPs
5. Vocal tract spectrum: 30 LSPs

39 At synthesis time
A library pulse is interpolated according to F0 and scaled by the gain (energy); noise is added according to the HNR, and spectral matching with G(z) followed by the lip radiation filter L(z) gives the voiced excitation; white noise with its own gain gives the unvoiced excitation, and the V/U switch selects between them.
The library pulse is extracted from the speech data through glottal inverse filtering. Noise is separately added to each band of the voiced excitation according to the HNR. G(z) implements the spectral shape of the glottal pulse. L(z) is a fixed lip radiation filter.

40 Glottal spectral separation (GSS) [Cabral et al., 2008]
Speech production model:

    S(e^{jω}) = D(e^{jω}) G(e^{jω}) V(e^{jω}) R(e^{jω})

where D(e^{jω}) is the pulse train, G(e^{jω}) the glottal pulse, V(e^{jω}) the vocal tract and R(e^{jω}) the lip radiation.
Simplified speech production model:

    S(e^{jω}) = D(e^{jω}) V(e^{jω})

What GSS does:
1. Estimate a model of the glottal flow derivative: E(e^{jω})
2. Remove its effect from the speech spectral envelope Ĥ(e^{jω}): V̂(e^{jω}) = Ĥ(e^{jω}) / E(e^{jω})
3. Re-synthesize speech: S(e^{jω}) = D(e^{jω}) E(e^{jω}) V̂(e^{jω})
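
Step 2 is a per-bin complex division of the envelope by the glottal spectrum; a sketch that guards against near-zero bins (the floor value is an assumption).

```python
import numpy as np

def glottal_spectral_separation(H_env, E_lf, floor=1e-8):
    """V_hat(e^jw) = H_hat(e^jw) / E(e^jw), with |E| floored to avoid blow-ups."""
    mag = np.maximum(np.abs(E_lf), floor)      # floored magnitude of E(e^jw)
    return H_env * np.exp(-1j * np.angle(E_lf)) / mag
```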

41 Glottal flow model utilized in GSS
The LF model is used to represent the glottal pulse.
[Figure: one period of the LF glottal flow derivative, marking t_0, t_p, t_e, t_a, t_c and the amplitude E_e over time.]
Parameters:
t_p: instant of maximum flow
t_e: instant of maximum excitation
t_c: instant of complete closure
T_a = t_a − t_e
T_0: fundamental period
E_e: amplitude of maximum excitation

42 Application to statistical parametric synthesis
Analysis: STRAIGHT gives the speech spectral envelope Ĥ(e^{jω}); inverse filtering gives the glottal flow derivative, which is parameterized by the LF model; the FFT of the LF pulse e_LF(n) gives E_LF(e^{jω}), and spectral separation yields the spectral parameters

    V(e^{jω}) = Ĥ(e^{jω}) / E_LF(e^{jω})

Acoustic modeling:
1. Spectral parameters: mel-cepstral coefficients
2. Band aperiodicity parameters
3. 5 LF-model parameters: t_e, t_p, T_a, E_e, T_0

43 Synthesis part
The generated LF-model parameters drive an LF-model generator; the pulse e_LF(n) is transformed by FFT into E_LF(e^{jω}) and weighted by H_p(e^{jω}); white noise w(n) is transformed into W(e^{jω}) and weighted by H_n(e^{jω}); the weighted sum is inverse-transformed (IFFT) into the mixed excitation.
The STRAIGHT vocoder is utilized; the original delta pulse is replaced by the pulse created from the generated LF-model parameters.

44 Vocal tract and LF-model plus noise separation [Lanchantin et al., 2010]
Assumed speech model:

    S(e^{jω}) = [ D(e^{jω}) G(e^{jω}) + W(e^{jω}) ] V(e^{jω}) R(e^{jω})

where D(e^{jω}) is the pulse train, G(e^{jω}) the glottal pulse, W(e^{jω}) white noise, V(e^{jω}) the vocal tract and R(e^{jω}) the lip radiation.
Parameterization:
1. Fundamental frequency → D(e^{jω})
2. LF model parameter → G(e^{jω})
3. Noise power → W(e^{jω})
4. Mel-cepstral coefficients → V(e^{jω})

45 Application to statistical parametric synthesis
1. Estimate a single parameter of a simplified LF model: R_d
2. Determine a maximum voiced frequency ω_VU
3. Estimate the power of the noise W(e^{jω}): σ_g²
4. Estimate the vocal tract parameters:

    V(e^{jω}) = τ_o( S(e^{jω}) / (γ R(e^{jω}) G(e^{jω})) ),  ω < ω_VU
    V(e^{jω}) = C_o( S(e^{jω}) / (Γ R(e^{jω}) G(e^{jω})) ),  ω ≥ ω_VU

where τ_o denotes cepstral analysis by fitting the harmonic peaks, C_o the power cepstrum, and Γ, γ are normalization terms.
Acoustic modeling:
1. Simplified LF-model parameter: R_d
2. Standard deviation of the noise component: σ_g
3. Cepstral coefficients that represent V(e^{jω})
4. F0

46 Synthesis time
Excitation construction for voiced segments: the R_d parameter drives a glottal pulse generator; the pulse g(n), after windowing and time modulation, is transformed by FFT into G(e^{jω}); white noise w(n) is transformed into W(e^{jω}) and high-pass filtered by H_HP(e^{jω}) above ω_VU; the sum gives the excitation E(e^{jω}).
Excitation construction for unvoiced segments: white noise w(n), windowed and transformed by FFT, gives E(e^{jω}).
Synthesized speech signal:

    S(e^{jω}) = E(e^{jω}) V(e^{jω}) R(e^{jω})

47 Vocoding methods: summary and examples

Method    | Description
Simple    | Pulse train/white noise simple switch
MELP      | MELP mixed excitation
STRAIGHT  | STRAIGHT mixed excitation
SDF       | State-dependent filtering mixed excitation
DSM       | Deterministic plus stochastic model of the residual
WI        | Waveform interpolation applied to statistical parametric synthesis
GlottHMM  | Glottal inverse filtering
GSS       | Glottal spectral separation
SVLN      | Separation of vocal tract and LF-model plus noise

(An audio sample was played for each method.)

48 Joint acoustic modeling and waveform generation for statistical parametric speech synthesis

49 Joint vocoding-acoustic modeling
The goal of any speech synthesizer is to reproduce the speech waveform. The parameters of a joint acoustic-excitation model are estimated by maximizing the probability of the speech waveform s = [s(0) ... s(N−1)]ᵀ given the labels l:

    λ̂ = arg max_λ p(s | l, λ) = arg max_λ ∫ p(s | c, λ) p(c | l, λ) dc

with the spectral parameters c = [c_0ᵀ ... c_{T−1}ᵀ]ᵀ as a hidden variable, and λ = {λ_c, λ_e} the acoustic-excitation model parameters: λ_c is the acoustic model part, λ_e the excitation model part.

50 Another viewpoint: waveform-level modeling
Comparison with typical modeling for parametric synthesis:
Typical: the state sequence q is the hidden variable, the spectrum c is the observation.
New model: the state sequence q and the spectrum c are hidden variables, the speech s is the observation.
Similar concepts: [Toda and Tokuda, 2008], factor analyzed trajectory HMM for spectral estimation; [Wu and Tokuda, 2009], closed-loop training for HMM-based synthesis.

51 A close look into the probabilities involved
Typical statistical modeling for parametric synthesis:

    λ̂_c = arg max_{λ_c} Σ_q p(c | q, λ_c) p(q | l, λ_c)

Augmented statistical modeling:

    λ̂ = arg max_λ Σ_q ∫ p(s | c, q, λ) p(c | q, λ) p(q | l, λ) dc

where p(s | c, q, λ) is the speech generative model (speech production from the spectrum), and p(c | q, λ) p(q | l, λ) can be modeled by existing machines, e.g. HMM, HSMM, trajectory HMM.

52 One possible speech generative model
A pulse train t(n), with positions p_0, ..., p_{Z−1} and amplitudes a_0, ..., a_{Z−1}, is filtered by H_v(z) to give the voiced excitation v(n); Gaussian white noise w(n) (zero mean, variance one) is filtered by H_u(z) to give the unvoiced excitation u(n); their sum, the excitation e(n), drives H_c(z) to produce the speech s(n).
Probability of the speech signal:

    p(s | H_c, q, λ_e) = |H_c|^{−1} N( H_c^{−1} s ; H_{v,q} t, Φ_q ),   Φ_q = (G_qᵀ G_q)^{−1}

where λ_e = {H_v, G, t} are the excitation model parameters.
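
For a short utterance the log of this density can be evaluated directly; a sketch, assuming H_c and G are already built as (N x N) convolution matrices and Hv_t is the filtered pulse train (all matrix construction omitted).

```python
import numpy as np

def log_lik_speech(s, Hc, Hv_t, G):
    """log p(s | Hc, q) = -log|Hc| + log N(Hc^{-1} s; Hv t, (G^T G)^{-1})."""
    e = np.linalg.solve(Hc, s)                     # excitation implied by the speech
    d = G @ (e - Hv_t)                             # whitened error, Phi^{-1} = G^T G
    n = len(s)
    return (-np.linalg.slogdet(Hc)[1]              # |Hc|^{-1} Jacobian term
            + 0.5 * np.linalg.slogdet(G.T @ G)[1]  # +0.5 log |Phi^{-1}|
            - 0.5 * (d @ d)                        # -0.5 (e - Hv t)^T Phi^{-1} (.)
            - 0.5 * n * np.log(2.0 * np.pi))
```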

53 Vocal tract filter impulse response and spectral parameters: relationship
We need p(s | c, q, λ), not p(s | H_c, q, λ): a mapping between H_c and c is necessary!
Two possibilities:
1. The relationship between H_c and c is represented as a Gaussian process.
2. The relationship between H_c and c is deterministic: cepstral coefficients!

54 Acoustic modeling
Trajectory HMM [Zen et al., 2007b] for acoustic modeling:

    p(c | l, λ_c) = Σ_q p(c | q, λ_c) p(q | l, λ_c)
    p(c | q, λ_c) = N(c ; c̄_q, P_q),   p(q | l, λ_c) = π_{q_0} Π_{t=0}^{T−1} α_{q_t q_{t+1}}

with

    R_q = P_q^{−1} = Wᵀ Σ_q^{−1} W,   r_q = Wᵀ Σ_q^{−1} μ_q,   c̄_q = P_q r_q

where W appends the dynamic features to c. Why: it models p(c | l, λ_c) instead of p(Wc | l, λ_c) (conventional HMM).
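
The trajectory mean c̄_q is the classic parameter-generation solve; a minimal single-dimension sketch with a static plus delta window (the window values and array layout are assumptions).

```python
import numpy as np

def trajectory_mean(mu, var):
    """Solve R c = r with R = W^T Sigma^{-1} W, r = W^T Sigma^{-1} mu, for one
    feature dimension; mu, var are (T, 2) arrays of [static, delta] statistics."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                          # static window
        for off, w in ((-1, -0.5), (1, 0.5)):      # delta window [-0.5, 0, 0.5]
            if 0 <= t + off < T:
                W[2 * t + 1, t + off] = w
    Sinv = np.diag(1.0 / var.reshape(-1))          # diagonal Sigma_q^{-1}
    R = W.T @ Sinv @ W                             # R_q = P_q^{-1}
    r = W.T @ Sinv @ mu.reshape(-1)                # r_q
    return np.linalg.solve(R, r)                   # c_bar_q = P_q r_q
```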

55 Joint acoustic-excitation model
Final joint model:

    p(s | l, λ) = Σ_q ∫ p(s | c, q, λ_e) p(c | q, λ_c) p(q | l, λ_c) dc

where the excitation model is p(s | c, q, λ_e) = |H_c|^{−1} N(H_c^{−1} s ; H_{v,q} t, Φ_q), and the acoustic model is p(c | q, λ_c) p(q | l, λ_c) = N(c ; c̄_q, P_q) π_{q_0} Π_{t=0}^{T−1} α_{q_t q_{t+1}}.

56 Training procedure
Initialization: initial cepstral analysis of the speech gives the initial cepstrum vectors c; trajectory HMM training gives the state sequence q̂; excitation model training gives H_v(z), G(z) and t(n).
Recursion: using R_q̂ and r_q̂, estimate the best cepstral coefficients ĉ; retrain the trajectory HMM and the excitation model with ĉ, and iterate.

57 Conclusion

58 Conclusions
Quality improvement of statistical parametric synthesizers comes through better waveform generation methods.
Existing approaches that use source-filter modeling can be classified into:
   methods that attempt to improve solely the excitation signal;
   methods that focus on the speech production model as a whole.
The naturalness degradation of statistical parametric synthesizers has basically two causes:
   acoustic modeling that produces averaged parameter trajectories;
   use of parametric speech production models.
Methods which integrate both acoustic modeling and speech production can address both causes.

59 Acknowledgments
Many thanks to
João Cabral, from University College Dublin, Ireland
Tuomo Raitio, from Helsinki University of Technology, Finland
Thomas Drugman, from Faculté Polytechnique de Mons, Belgium
Pierre Lanchantin, from IRCAM, France
June Sig Sung, from Seoul National University, South Korea
for providing samples of their systems.

60 References I
Alku, P. (1992). Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2-3):109-118.
Cabral, J., Renals, S., Richmond, K., and Yamagishi, J. (2008). Glottal spectral separation for parametric speech synthesis. In Proc. of Interspeech, pages 1829-1832.
Deller, Jr., J. R., Hansen, J. H. L., and Proakis, J. G. (2000). Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue, New York.
Drugman, T., Wilfart, G., and Dutoit, T. (2009). A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In Proc. of Interspeech, pages 1779-1782.
Lanchantin, P., Degottex, G., and Rodet, X. (2010). An HMM-based speech synthesis system using a new glottal source and vocal-tract separation method. In Proc. of ICASSP, pages 4630-4633.
Maia, R., Toda, T., Zen, H., Nankaku, Y., and Tokuda, K. (2007). An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. of the 6th ISCA Workshop on Speech Synthesis, pages 131-136.
Raitio, T., Suni, A., Pulakka, H., Vainio, M., and Alku, P. (2008). HMM-based Finnish text-to-speech system using glottal inverse filtering. In Proc. of Interspeech, pages 1881-1884.

61 References II
Sung, J., Kyung, D., Oh, H., and Kim, N. (2010). Excitation modeling based on waveform interpolation for HMM-based speech synthesis. In Proc. of Interspeech, pages 813-816.
Toda, T. and Tokuda, K. (2008). Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. of ICASSP, pages 3925-3928.
Wu, Y. J. and Tokuda, K. (2009). Minimum generation error training by using original spectrum as reference for log spectral distortion measure. In Proc. of ICASSP, pages 4013-4016.
Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (2001). Mixed excitation for HMM-based speech synthesis. In Proc. of EUROSPEECH.
Zen, H., Toda, T., Nakamura, M., and Tokuda, K. (2007a). Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005. IEICE Trans. on Inf. and Systems, E90-D(1):325-333.
Zen, H., Tokuda, K., and Kitamura, T. (2007b). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech and Language, 21(1):153-173.