Modeling the creaky excitation for parametric speech synthesis.


Modeling the creaky excitation for parametric speech synthesis
Thomas Drugman (University of Mons, Belgium), John Kane and Christer Gobl (Trinity College Dublin, Ireland)
September 11th, 2012, Interspeech, Portland, Oregon, USA

Creaky voice - examples. TTS corpora examples: American male, Finnish female, Finnish male. Conversational speech examples: Japanese female, American female, American male.

Creaky voice in speech: phonetic contrast (e.g., Jalapa Mazatec); phrase/sentence/turn boundaries (common in American English, Finnish, etc.); interactive speech (turn-taking, hesitations); expression of affective states; stylistic device.

Creaky voice - acoustic characteristics

Problem statement: the unique acoustic characteristics of creak are poorly modelled in standard vocoders. Silen et al. (2009) improved the robustness of f0 estimation and the voicing decision. Our aim: provide a method for modelling the creaky excitation so as to improve the timbre of creak in parametric synthesis.

Speech data: American male (BDL) and Finnish male (MV), 100 sentences containing creak.

Manual annotation: "a rough quality with the sensation of repeating impulses" (Ishi et al., 2008).

Glottal closure instants (GCIs): detected with the newly developed SE-VQ algorithm (Kane & Gobl, in press). [Figure: speech waveform with SEDREAMS GCIs, resonator output, and DEGG signal with GCIs, all plotted against time in seconds.]
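The SE-VQ and SEDREAMS algorithms are far more robust than anything a slide can show, but the core of GCI picking, keeping strong negative peaks in an excitation-like signal subject to a minimum pitch-period spacing, can be sketched as follows. This is a toy stand-in, not the method used in the paper; `pick_gcis` and its threshold values are illustrative:

```python
import numpy as np

def pick_gcis(residual, fs, f0_max=400.0, thresh=0.3):
    """Toy GCI picker: keep strong negative peaks separated by at least
    one minimal pitch period (assumes negative-going excitation peaks)."""
    min_dist = int(fs / f0_max)            # shortest allowed pitch period
    floor = thresh * np.min(residual)      # negative amplitude threshold
    cand = np.where(residual <= floor)[0]
    gcis = []
    for i in cand:
        if not gcis or i - gcis[-1] >= min_dist:
            gcis.append(int(i))
    return np.array(gcis)

# Synthetic residual: negative impulses every 80 samples (100 Hz at fs = 8 kHz)
fs = 8000
x = np.zeros(800)
x[40::80] = -1.0
gcis = pick_gcis(x, fs)
```

On real speech the residual peaks vary in amplitude and spacing, which is exactly why dedicated detectors such as SE-VQ are needed.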

The deterministic plus stochastic model (DSM): "The Deterministic plus Stochastic Model of the Residual Signal and its Applications", Drugman & Dutoit (2012), IEEE TASLP.

DSM - residual excitation

DSM - residual frames. Pipeline: the speech database undergoes MGC analysis and inverse filtering to obtain residual signals; GCI estimation provides the GCI positions; pitch-synchronous (PS) windowing of the residuals at the GCIs yields the dataset of PS residual frames.
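The residual-extraction step above can be sketched in a few lines, substituting plain LP analysis for the paper's MGC analysis (a simplification) and windowing two-pitch-period frames centred on each interior GCI; `lpc` and `extract_ps_frames` are illustrative names, not the paper's code:

```python
import numpy as np

def lpc(x, order):
    """LP coefficients via the autocorrelation method (Levinson-Durbin)."""
    n = len(x)
    r = np.correlate(x, x, 'full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / e
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        e *= 1.0 - k * k
    return a

def extract_ps_frames(speech, gcis, order=12):
    """Inverse-filter the speech with A(z), then cut Blackman-windowed
    frames spanning two pitch periods around each interior GCI."""
    a = lpc(speech, order)
    resid = np.convolve(speech, a)[:len(speech)]   # A(z) * s(n)
    frames = []
    for prev, nxt in zip(gcis[:-2], gcis[2:]):
        frames.append(resid[prev:nxt] * np.blackman(nxt - prev))
    return frames

# Synthetic "speech": impulse train through a one-pole resonator
gcis = np.arange(40, 800, 80)
exc = np.zeros(800)
exc[gcis] = 1.0
speech = np.zeros_like(exc)
for n in range(len(exc)):
    speech[n] = exc[n] + (0.9 * speech[n - 1] if n else 0.0)
frames = extract_ps_frames(speech, gcis, order=4)
```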

DSM - deterministic modelling. The deterministic component: each PS residual frame is pitch-normalized (resampled to a common pitch F0*) and energy-normalized, yielding the dataset for the deterministic modelling.
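The two normalizations can be sketched as follows: resample every frame to one fixed length (a shared F0*) and scale it to unit energy. The linear resampling and the function name `normalize_frames` are illustrative simplifications:

```python
import numpy as np

def normalize_frames(frames, target_len=400):
    """Resample each PS residual frame to a common length (pitch
    normalization to a shared F0*) and scale it to unit energy."""
    out = []
    for f in frames:
        t_src = np.linspace(0.0, 1.0, len(f))
        t_dst = np.linspace(0.0, 1.0, target_len)
        g = np.interp(t_dst, t_src, f)          # crude linear resampling
        g = g / (np.linalg.norm(g) + 1e-12)     # unit energy
        out.append(g)
    return np.vstack(out)

# Frames of unequal length come out as one equal-length, unit-energy matrix
rng = np.random.default_rng(1)
frames = [rng.standard_normal(n) for n in (150, 170, 210)]
norm = normalize_frames(frames)
```

After this step every frame lives in the same vector space, which is what makes the eigen-analysis of the deterministic component possible.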

DSM - vocoder. The DSM vocoder: the deterministic and stochastic components of the excitation are combined and passed through the filter.

Extended DSM for creaky voice

DSM (creak) - fundamental period/opening phase. [Scatter plots of opening period vs. fundamental period, both in samples, for the two speakers.]

DSM (creak) - excitation modelling. Separate residual datasets are built for the opening phase (secondary peak → GCI) and the closed phase (GCI → secondary peak). Principal component analysis is applied to each dataset separately, and the excitation model combines the first eigenvectors to form the deterministic component. An energy envelope is also derived separately for the two datasets.
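The per-phase eigen-analysis amounts to taking the first principal component of each stack of normalized residual frames. A minimal sketch, with illustrative names (`first_eigenvector`, `creaky_cycle`) and with the phase-boundary handling glossed over:

```python
import numpy as np

def first_eigenvector(frames):
    """First principal component of a stack of equal-length,
    energy-normalized residual frames (SVD of the centred data)."""
    x = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def creaky_cycle(ev_closed, ev_opening):
    """Concatenate the per-phase eigenvectors into one excitation cycle:
    closed phase (GCI -> secondary peak), then opening phase
    (secondary peak -> next GCI)."""
    return np.concatenate([ev_closed, ev_opening])

# Sanity check on synthetic frames that all share one underlying shape
rng = np.random.default_rng(0)
base = np.sin(2 * np.pi * np.arange(200) / 200)
base /= np.linalg.norm(base)
frames = np.outer(rng.uniform(0.5, 1.5, 50), base)
ev = first_eigenvector(frames)
cycle = creaky_cycle(ev, ev)
```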

DSM (creak) - data-driven excitation signal. [Plots of the derived excitation waveforms: amplitude vs. time in samples.]

DSM (creak) - vocoder

Evaluation

Experimental setup: subjective evaluation with 22 participants. Copy-synthesis of short utterances by the American and Finnish speakers, using the standard DSM vocoder and the proposed method. ABX test: given the original utterance (X) and the two copy-synthesis versions (A and B), select the one most like the original. Comparative Mean Opinion Score (CMOS) test: copy-synthesis by both vocoders, with signal preference rated on a gradual 7-point CMOS scale.
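For an ABX test like this, the standard analysis is the preference rate for one system together with a significance test against the 50% chance level. A minimal sketch using a normal approximation to the binomial (the counts below are illustrative, not the paper's data):

```python
import math

def abx_preference(n_pref, n_total):
    """Preference rate for one system in an ABX test, with a two-sided
    p-value against the 50% chance level (normal approximation to the
    binomial)."""
    p_hat = n_pref / n_total
    se = math.sqrt(0.25 / n_total)      # standard error under H0: p = 0.5
    z = (p_hat - 0.5) / se
    p_value = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return p_hat, p_value

# e.g. 70 of 100 trials preferring the proposed method (hypothetical counts)
p_hat, p_value = abx_preference(70, 100)
```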

Results - ABX

Results - Comparative Mean Opinion Score (CMOS)

Results - samples. American male: three utterances, each in four versions (original, standard HTS vocoder, DSM vocoder, DSM-creak). Finnish male: three utterances, in the same four versions.

Ongoing/future research directions: automate creak segmentation (see our poster at the special session on glottal source processing!); prediction of creaky regions from contextual features (e.g., phoneme, word stress, position in sentence, prosodic context, etc.); transformation of speakers' voice characteristics.

Acknowledgements: this work was supported by Science Foundation Ireland, Grant 07/CE/I1142 (Centre for Next Generation Localisation, www.cngl.ie) and Grant 09/IN.1/I2631 (FASTNET).

Thank you!