Modeling the creaky excitation for parametric speech synthesis.


Modeling the creaky excitation for parametric speech synthesis
Thomas Drugman (University of Mons, Belgium), John Kane and Christer Gobl (Trinity College Dublin, Ireland)
September 11th, 2012, Interspeech, Portland, Oregon, USA

Creaky voice - examples. TTS corpora examples: American male, Finnish female, Finnish male. Conversational speech examples: Japanese female, American female, American male.

Creaky voice in speech: phonetic contrast (e.g., Jalapa Mazatec); phrase/sentence/turn boundaries (common in American English, Finnish, etc.); interactive speech (turn-taking, hesitations); expression of affective states; stylistic device.

Creaky voice - acoustic characteristics

Problem statement: the unique acoustic characteristics of creak are poorly modelled in standard vocoders. Silen et al. (2009) improved the robustness of f0 estimation and the voicing decision. Our aim: provide a method for modelling the creaky excitation so as to improve the timbre of creak in parametric synthesis.

Speech data: American male (BDL) and Finnish male (MV), 100 sentences containing creak.

Manual annotation: "a rough quality with the sensation of repeating impulses" (Ishi et al., 2008).

Glottal closure instants (GCIs): detected with the newly developed SE-VQ algorithm (Kane & Gobl, in press). [Figure: speech waveform with SEDREAMS GCIs, resonator output, and DEGG signal with GCIs, all plotted against time in seconds.]
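The SE-VQ and SEDREAMS algorithms are far more robust than anything a slide can show, but the core of GCI picking, keeping strong negative peaks in an excitation-like signal subject to a minimum pitch-period spacing, can be sketched as follows. This is a toy stand-in, not the method used in the paper; `pick_gcis` and its threshold values are illustrative:

```python
import numpy as np

def pick_gcis(residual, fs, f0_max=400.0, thresh=0.3):
    """Toy GCI picker: keep strong negative peaks separated by at least
    one minimal pitch period (assumes negative-going excitation peaks)."""
    min_dist = int(fs / f0_max)            # shortest allowed pitch period
    floor = thresh * np.min(residual)      # negative amplitude threshold
    cand = np.where(residual <= floor)[0]
    gcis = []
    for i in cand:
        if not gcis or i - gcis[-1] >= min_dist:
            gcis.append(int(i))
    return np.array(gcis)

# Synthetic residual: negative impulses every 80 samples (100 Hz at fs = 8 kHz)
fs = 8000
x = np.zeros(800)
x[40::80] = -1.0
gcis = pick_gcis(x, fs)
```

On real speech the residual peaks vary in amplitude and spacing, which is exactly why dedicated detectors such as SE-VQ are needed.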

The deterministic plus stochastic model (DSM): "The Deterministic plus Stochastic Model of the Residual Signal and its Applications", Drugman & Dutoit (2012), IEEE TASLP.

DSM - residual excitation

DSM - residual frames. Pipeline: the speech database undergoes MGC analysis and inverse filtering to obtain residual signals; GCI estimation provides the GCI positions; pitch-synchronous (PS) windowing of the residuals at the GCIs yields the dataset of PS residual frames.
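The residual-extraction step above can be sketched in a few lines, substituting plain LP analysis for the paper's MGC analysis (a simplification) and windowing two-pitch-period frames centred on each interior GCI; `lpc` and `extract_ps_frames` are illustrative names, not the paper's code:

```python
import numpy as np

def lpc(x, order):
    """LP coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(x)
    r = np.correlate(x, x, 'full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    return a

def extract_ps_frames(speech, gcis, order=12):
    """Inverse-filter the speech with A(z), then cut Blackman-windowed
    frames spanning two pitch periods around each interior GCI."""
    a = lpc(speech, order)
    resid = np.convolve(speech, a)[:len(speech)]   # A(z) * s(n)
    frames = []
    for prev, nxt in zip(gcis[:-2], gcis[2:]):
        frames.append(resid[prev:nxt] * np.blackman(nxt - prev))
    return frames

# Synthetic "speech": impulse train through a one-pole resonator
gcis = np.arange(40, 800, 80)
exc = np.zeros(800)
exc[gcis] = 1.0
speech = np.zeros_like(exc)
for n in range(len(exc)):
    speech[n] = exc[n] + (0.9 * speech[n - 1] if n else 0.0)
frames = extract_ps_frames(speech, gcis, order=4)
```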

DSM - deterministic modelling. The deterministic component: each PS residual frame is pitch-normalized (resampled to a common pitch F0*) and energy-normalized, yielding the dataset for the deterministic modelling.
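The two normalizations can be sketched as follows: resample every frame to one fixed length (a shared F0*) and scale it to unit energy. The linear resampling and the function name `normalize_frames` are illustrative simplifications:

```python
import numpy as np

def normalize_frames(frames, target_len=400):
    """Resample each PS residual frame to a common length (pitch
    normalization to a shared F0*) and scale it to unit energy."""
    out = []
    for f in frames:
        t_src = np.linspace(0.0, 1.0, len(f))
        t_dst = np.linspace(0.0, 1.0, target_len)
        g = np.interp(t_dst, t_src, f)          # crude linear resampling
        g = g / (np.linalg.norm(g) + 1e-12)     # unit energy
        out.append(g)
    return np.vstack(out)

# Frames of unequal length come out as one equal-length, unit-energy matrix
rng = np.random.default_rng(1)
frames = [rng.standard_normal(n) for n in (150, 170, 210)]
norm = normalize_frames(frames)
```

After this step every frame lives in the same vector space, which is what makes the eigen-analysis of the deterministic component possible.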

DSM - vocoder. The DSM vocoder: the deterministic and stochastic components of the excitation are combined and passed through the filter.

Extended DSM for creaky voice

DSM (creak) - fundamental period/opening phase. [Scatter plots of opening period vs. fundamental period, both in samples, for the two speakers.]

DSM (creak) - excitation modelling. Separate residual datasets are built for the opening phase (secondary peak → GCI) and the closed phase (GCI → secondary peak). Principal component analysis is applied to each dataset separately, and the excitation model combines the first eigenvectors to form the deterministic component. An energy envelope is also derived separately for the two datasets.
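The per-phase eigen-analysis amounts to taking the first principal component of each stack of normalized residual frames. A minimal sketch, with illustrative names (`first_eigenvector`, `creaky_cycle`) and with the phase-boundary handling glossed over:

```python
import numpy as np

def first_eigenvector(frames):
    """First principal component of a stack of equal-length,
    energy-normalized residual frames (SVD of the centred data)."""
    x = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def creaky_cycle(ev_closed, ev_opening):
    """Concatenate the per-phase eigenvectors into one excitation cycle:
    closed phase (GCI -> secondary peak), then opening phase
    (secondary peak -> next GCI)."""
    return np.concatenate([ev_closed, ev_opening])

# Sanity check on synthetic frames that all share one underlying shape
rng = np.random.default_rng(0)
base = np.sin(2 * np.pi * np.arange(200) / 200)
base /= np.linalg.norm(base)
frames = np.outer(rng.uniform(0.5, 1.5, 50), base)
ev = first_eigenvector(frames)
cycle = creaky_cycle(ev, ev)
```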

DSM (creak) - data-driven excitation signal. [Plots of the derived excitation waveforms: amplitude vs. time in samples.]

DSM (creak) - vocoder

Evaluation

Experimental setup: subjective evaluation with 22 participants. Copy-synthesis of short utterances by the American and Finnish speakers, using the standard DSM vocoder and the proposed method. ABX test: given the original utterance (X) and the two copy-synthesis versions (A and B), select the one most like the original. Comparative Mean Opinion Score (CMOS) test: copy-synthesis by both vocoders, with signal preference rated on a gradual 7-point CMOS scale.
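For an ABX test like this, the standard analysis is the preference rate for one system together with a significance test against the 50% chance level. A minimal sketch using a normal approximation to the binomial (the counts below are illustrative, not the paper's data):

```python
import math

def abx_preference(n_pref, n_total):
    """Preference rate for one system in an ABX test, with a two-sided
    p-value against the 50% chance level (normal approximation to the
    binomial)."""
    p_hat = n_pref / n_total
    se = math.sqrt(0.25 / n_total)      # standard error under H0: p = 0.5
    z = (p_hat - 0.5) / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return p_hat, p_value

# e.g. 70 of 100 trials preferring the proposed method (hypothetical counts)
p_hat, p_value = abx_preference(70, 100)
```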

Results - ABX

Results - Comparative Mean Opinion Score (CMOS)

Results - samples. American male: three utterances, each in four versions (original, standard HTS vocoder, DSM vocoder, DSM-creak). Finnish male: three utterances, in the same four versions.

Ongoing/future research directions: automate creak segmentation (see our poster at the special session on glottal source processing!); prediction of creaky regions from contextual features (e.g., phoneme, word stress, position in sentence, prosodic context, etc.); transformation of speakers' voice characteristics.

Acknowledgements: this work was supported by Science Foundation Ireland, Grant 07/CE/I1142 (Centre for Next Generation Localisation, www.cngl.ie) and Grant 09/IN.1/I2631 (FASTNET).

Thank you!