Zeros of z-transform (ZZT) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals


Zeros of z-transform (ZZT) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals
Baris Bozkurt
Collaboration with LIMSI-CNRS, France
07/03/2017

What is new in this thesis?
- We present new spectral representations (ZZT and three group delay based representations) and new algorithms demonstrating applications of these representations in various speech analysis problems:
  - Source-tract separation
  - Glottal flow parameter estimation
  - Formant tracking
  - Feature extraction for automatic speech recognition
- We study the phase estimation problem in detail and propose solutions to existing problems.
- We discuss group delay characteristics of a mixed-phase speech model.

Contents
- Motivations
- Spectral analysis of signals
- Mixed-phase speech model and group delay characteristics
- Difficulties in group delay processing
- ZZT representation and chirp group delay processing
- Applications
  - Source-tract separation
  - Formant tracking
  - Glottal flow parameter estimation
  - Feature estimation for automatic speech recognition
- Conclusions and future work

Motivations
- Primary motivation: voice quality analysis for TTS
- Starting point (after a literature review): spectral methods
- Two main problems: source-tract separation and Fourier transform phase processing
Potential impact areas for this study:
- Source-tract separation: voice quality analysis, speech synthesis, emotion studies, speech therapy, speaker recognition
- Phase processing: speech perception, speech recognition, speech coding
- Group delay characteristics of the mixed-phase speech model: speech processing theory
Target application -> basic research -> larger impact


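Spectral analysis of signals
- z-transform: $X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n}$
- Fourier transform: $X(\omega) = X(z)\big|_{z=e^{j\omega}} = a(\omega) + j\,b(\omega)$
- Magnitude: $|X(\omega)| = \sqrt{a^2(\omega) + b^2(\omega)}$
- Phase: $\theta(\omega) = \arctan\big(b(\omega)/a(\omega)\big)$
- Group delay: $\tau(\omega) = -\,d\theta(\omega)/d\omega$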
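The three quantities on this slide can be computed numerically from the DFT. The following is a minimal sketch, assuming NumPy and a hypothetical helper name `spectrum_terms`; the group delay is taken as the negative derivative of the unwrapped phase, which is the usual sign convention and an assumption here.

```python
import numpy as np

def spectrum_terms(x, nfft=1024):
    """Magnitude, phase and group delay of x on the unit circle (0..pi)."""
    X = np.fft.rfft(x, nfft)            # X(w) = a(w) + j b(w)
    a, b = X.real, X.imag
    magnitude = np.sqrt(a**2 + b**2)
    # arctan2 keeps the quadrant; unwrap removes the 2*pi jumps
    phase = np.unwrap(np.arctan2(b, a))
    w = np.linspace(0.0, np.pi, len(X))  # frequency axis of rfft bins
    group_delay = -np.gradient(phase, w)  # numerical -d(theta)/dw
    return magnitude, phase, group_delay

# A unit impulse delayed by 5 samples has a flat group delay of 5 samples.
x = np.zeros(64)
x[5] = 1.0
mag, theta, tau = spectrum_terms(x)
```

Differentiating the unwrapped phase is simple but fragile when the spectrum has zeros near the unit circle, which is the difficulty the later slides address.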

All-pole filter response and causality
For causality detection, phase processing is essential.

Why study group delay processing?
[Figure: poles of an all-pole filter; animation: advantagegrpd.avi]
- Higher resolution
- Tilt free
- Mixed-phase information


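A mixed-phase model of speech*
- Maximum-phase glottal flow excitation, convolved (*) with
- the minimum-phase vocal tract filter (plus the glottal flow return phase),
- gives the mixed-phase speech signal.
[Figure: the same three components shown as waveforms (combined by convolution, *) and as spectra (combined by addition, +)]
Important note: the mixed-phase characteristic can only be observed in the phase/group delay spectrum.
*: after Gardner 1994 and Doval & d'Alessandro 1997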

Preliminary trials with chirp group delay processing
- Not robust, and the reason was not known (Bozkurt & Dutoit, VOQUAL 2003).
- A hint came from Prof. Kawahara at VOQUAL'03: windowing may play a role.


Problems in group delay analysis of speech
- Problem: group delay functions are most often very noisy.
- Reason: roots of the z-transform polynomial lie close to the unit circle (Yegnanarayana and Murthy, 1989).
- Conclusion: a systematic study of the roots of the z-transform for speech signals is needed, which today's technology makes feasible.
[Animation: difficultywithzeros.avi]


Zeros of Z-Transform (ZZT) representation
ZZT representation: the set of zeros $Z_m$ of the z-transform polynomial,
$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} = x(0)\, z^{-N+1} \prod_{m=1}^{N-1} (z - Z_m)$
The zeros are almost impossible to study analytically for most functions; therefore numerical methods are used (the roots function of Matlab).
Basic elementary signal: the power series $x(n) = a^n$, $n = 0, 1, \ldots, N-1$, for which
$X(z) = \sum_{n=0}^{N-1} a^n z^{-n} = \dfrac{1 - (a z^{-1})^N}{1 - a z^{-1}}$
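As an illustration of the numerical route, the sketch below computes a ZZT with NumPy's `np.roots` in place of the Matlab `roots` function named on the slide. The helper name `zzt` and the test values `a = 0.9`, `N = 16` are illustrative choices, not taken from the thesis.

```python
import numpy as np

def zzt(frame):
    """Zeros of the z-transform of a finite frame x(0), ..., x(N-1).

    X(z) = z^{-(N-1)} * (x(0) z^{N-1} + ... + x(N-1)), so the zeros are the
    roots of the polynomial whose coefficients are the samples in time order.
    """
    return np.roots(np.asarray(frame, dtype=float))

# Power series x(n) = a^n: the N-1 zeros lie on a circle of radius a.
# The would-be zero at z = a is cancelled by the pole of the closed form.
a, N = 0.9, 16
zeros = zzt(a ** np.arange(N))
radii = np.abs(zeros)
```

Root finding on long frames is costly and can lose accuracy, which is consistent with the "computationally heavy" remark made later for the zero-based methods.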

ZZT of elementary signals
- ZZT of a damped sinusoid [animation: expCoeffInDampedSinusoid.avi]
- ZZT of a causal all-pole filter response [animation: causalresponsezeros1.avi]

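Zero-patterns for the LF model* of the glottal flow derivative
- First phase: $g(t) = E_0\, e^{\alpha t} \sin(\omega_g t)$, $0 \le t \le t_e$
- Return phase: $g(t) = -\dfrac{E_e}{\varepsilon t_a}\left[e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)}\right]$, $t_e \le t \le t_c \le T_0$
*: Fant et al., 1985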

ZZT representation of speech
Synthetic mixed-phase speech:
- Periodicity results in many zeros on the unit circle.
- The first phase of the glottal flow adds zeros outside the unit circle.
- Vocal tract response zeros lie inside the unit circle.
The effect of windowing on the ZZT is drastic!

All-zero representation of windowed speech
[Figure panels: non-GCI-synchronous windowing (rectangular window); GCI-synchronous windowing (rectangular window)]

All-zero representation of speech: effect of window location on ZZT plots (2T0 case)
- Windowing of synthetic speech [animation: synth2t0blackman.avi]
- Windowing of real speech [animation: real2t0blackman.avi]

Effect of the window function on ZZT and group delay
- Best choices are Blackman, Gaussian and Hanning-Poisson.
- Smoothness can be adjusted by varying the coefficients in the Gaussian and Hanning-Poisson windows.
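For reference, a Hanning-Poisson window can be sketched as a Hanning window multiplied by a two-sided exponential. The function name `hanning_poisson` and the default `alpha = 2.0` are illustrative assumptions; the slides do not give the coefficient values used in the thesis.

```python
import numpy as np

def hanning_poisson(length, alpha=2.0):
    """Hanning window times a two-sided exponential (Poisson) window.

    alpha controls the decay: larger alpha gives a narrower, smoother taper.
    Assumes length >= 3.
    """
    half = (length - 1) / 2.0
    n = np.arange(length) - half                   # symmetric index, centre at 0
    hann = 0.5 * (1.0 + np.cos(np.pi * n / half))  # 1 at the centre, 0 at the ends
    poisson = np.exp(-alpha * np.abs(n) / half)
    return hann * poisson

w = hanning_poisson(101)
w_narrow = hanning_poisson(101, alpha=4.0)
```

For GCI-synchronous analysis, such a window would be centred on a glottal closure instant with a length of about two pitch periods, as in the 2T0 case mentioned on the slides.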

Effect of window size on ZZT
What do other people do? Pitch-asynchronous windowing, a 3T0 window size and a Hamming window: all are bad choices for phase processing.

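Recently proposed group delay representations in the literature
With $X(\omega) = FT\{x[n]\}$ and $Y(\omega) = FT\{n\,x[n]\}$, where subscripts $R$ and $I$ denote real and imaginary parts (e.g. $X_R(\omega) = \mathrm{real}\{FT\{x[n]\}\}$, $Y_I(\omega) = \mathrm{imag}\{FT\{n\,x[n]\}\}$):
- Group delay function: $\tau_p(\omega) = \dfrac{X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)}{|X(\omega)|^2}$. The denominator $|X(\omega)|^2$ is responsible for the spikes.
- Modified group delay function, MODGDF (Hegde et al., ICSLP 2004): $\tau_p(\omega) = \dfrac{X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)}{|S(\omega)|^{2\gamma}}$. The term responsible for the spikes is replaced by a cepstrally smoothed version, $S(\omega)$.
- Product spectrum, PS (Zhu and Paliwal, ICASSP 2004): $Q(\omega) = |X(\omega)|^2\, \tau_p(\omega) = X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)$. The term responsible for the spikes is completely removed.
Originality in our approach: studying zero patterns and trying to find means of avoiding or removing zeros close to the unit circle.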
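The group delay function and the product spectrum share the same numerator, so both can be sketched with two FFTs. This is a minimal illustration assuming NumPy; the function name and the `eps` guard against division by zero are additions for the example, and MODGDF is left out because its cepstral smoothing parameters are not given on the slide.

```python
import numpy as np

def group_delay_and_product_spectrum(x, nfft=1024, eps=1e-12):
    """Group delay tau_p(w) and product spectrum Q(w) from FT{x} and FT{n x}."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)        # X(w) = FT{x[n]}
    Y = np.fft.rfft(n * x, nfft)    # Y(w) = FT{n x[n]}
    product_spectrum = X.real * Y.real + X.imag * Y.imag
    # |X(w)|^2 in the denominator is what produces spikes near spectral zeros
    tau = product_spectrum / np.maximum(np.abs(X) ** 2, eps)
    return tau, product_spectrum

# A unit impulse at n = 7 has group delay 7 and product spectrum 7 everywhere.
x = np.zeros(32)
x[7] = 1.0
tau, q = group_delay_and_product_spectrum(x)
```

Computing the group delay this way avoids explicit phase unwrapping, but the division still blows up wherever a zero of X(z) sits on or near the unit circle.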

ZZT and group delay of GCI-synchronously windowed speech
Group delay of GCI-synchronously windowed speech: GDGCI
[Figure panels: ZZT; amplitude spectrum; GDGCI]

Group delay spectrogram using GDGCI (Hanning-Poisson window, 2T0)
The formant frequencies of a given speech signal can be estimated from the phase spectrum once windowing is properly performed.

Chirp group delay of GCI-synchronously windowed speech (CGDGCI)
Basic ideas:
- Remove unwanted zeros.
- Compute the chirp group delay away from the remaining zeros.
[Figure panels: CGD outside the unit circle (directly from the signal, after zero removal); CGD inside the unit circle (directly from the signal, after zero removal)]
Disadvantages: computationally heavy, GCI-synchronous.

Chirp group delay of the zero-phase version of the signal (CGDZP)
GCI-synchronous processing is not practical for ASR.
Processing chain: speech -> windowing -> fft -> abs -> ifft (zero phasing) -> CGD (R = 1.12) -> CGDZP
Constant frame size and shift, optimized by testing performance for incrementing values.
[Animations: whyzerophase.avi, CGDcompared2AmpSpec.avi]
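The processing chain on this slide can be sketched as follows, assuming NumPy. Evaluating the z-transform on a circle of radius R amounts to a DFT of the signal weighted by R^(-n); the function names, the Blackman window, the FFT size and the choice to keep the full-length zero-phase signal are assumptions of this sketch, not details confirmed by the slides.

```python
import numpy as np

def chirp_group_delay(x, radius, nfft=1024, eps=1e-12):
    """Group delay evaluated on the circle |z| = radius instead of |z| = 1."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    # X(radius * e^{jw}) = sum_n x(n) radius^{-n} e^{-jwn}
    xw = x * float(radius) ** (-n)
    X = np.fft.rfft(xw, nfft)
    Y = np.fft.rfft(n * xw, nfft)
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(np.abs(X) ** 2, eps)

def cgdzp(frame, radius=1.12, nfft=1024):
    """Chirp group delay of the zero-phase version of a speech frame."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.blackman(len(frame))
    # zero phasing: keep the magnitude spectrum, discard the phase
    zero_phase = np.fft.ifft(np.abs(np.fft.fft(windowed, nfft))).real
    return chirp_group_delay(zero_phase, radius, nfft)

# Sanity check: a delayed impulse has the same group delay on any circle.
impulse = np.zeros(32)
impulse[3] = 1.0
cgd_impulse = chirp_group_delay(impulse, 1.12)
features = cgdzp(np.random.default_rng(0).standard_normal(200))
```

Because the zero-phase signal has its zeros on or very near the unit circle, moving the analysis circle to R = 1.12 keeps the evaluation away from them, which is the motivation stated for the chirp variant.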


Zero-decomposition for source-tract separation
Kawahara et al., 2000. [Animation: GCIdetection.avi]
Bozkurt et al., ICSLP 2004-a

Zero-decomposition for source-tract separation: synthetic speech
[Figure panels: synthetic glottal excitation; original windowed speech; original amplitude spectrum; ZZT; zero-decomposition; original + reconstructed glottal excitation; original + reconstructed glottal amplitude spectrum; original + reconstructed vocal tract response; original + reconstructed tract transfer function]
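The idea behind zero-decomposition can be sketched by splitting the ZZT of a frame at the unit circle and rebuilding one signal from each group of zeros. This is a simplified illustration assuming NumPy; the function name is hypothetical, the reconstruction via `np.poly` is an assumption, and the GCI-synchronous windowing that the slides require beforehand is omitted.

```python
import numpy as np

def zero_decomposition(frame):
    """Split a frame into a maximum-phase and a minimum-phase component.

    Zeros outside the unit circle are attributed to the glottal source (first
    phase), zeros inside to the vocal tract. Assumes frame[0] != 0.
    """
    frame = np.asarray(frame, dtype=float)
    zeros = np.roots(frame)
    outside = zeros[np.abs(zeros) > 1.0]
    inside = zeros[np.abs(zeros) <= 1.0]
    # np.poly rebuilds monic polynomial coefficients from a set of roots
    source = np.atleast_1d(np.real(np.poly(outside)))
    tract = np.atleast_1d(np.real(np.poly(inside)))
    return frame[0], source, tract

# Toy frame: one zero at z = 1.5 (outside), a conjugate pair at |z| = 0.5 (inside).
source_true = np.array([1.0, -1.5])
tract_true = np.array([1.0, -0.5, 0.25])
frame = 2.0 * np.convolve(source_true, tract_true)
gain, source, tract = zero_decomposition(frame)
```

Convolving the two parts and applying the gain returns the original frame, so the split loses no information; what it assumes is that the unit circle cleanly separates source zeros from tract zeros.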

Zero-decomposition for source-tract separation: real speech
[Figure panels: original windowed speech; original amplitude spectrum; ZZT; zero-decomposition; reconstructed glottal excitation; reconstructed glottal amplitude spectrum; reconstructed vocal tract response; reconstructed tract transfer function]
[Audio examples: copy-synthesis; noise-excited tract]

Comparative example: ZZT-decomposition and PSIAIF
[Figure panels: glottal flow (GF); differential glottal flow (DGF); amplitude spectrum of DGF; group delay of DGF. Curves: original, ZZT-decomposition estimate, PSIAIF]

Robustness of ZZT-decomposition
- To GCI estimation errors
- To F1 variation (F1 = 200-350 Hz, 375-500 Hz, 525-700 Hz)
- To noise
- To return phase variations


Application 2: ZZT + group delay processing
Glottal formant frequency (Fg) estimation
[Figure panels: synthetic vowels a, u, i (f0 = 100 Hz, f0 = 200 Hz); real speech, with open quotient (OQ) from EGG]
NB: Fg = f(F0, 1/OpenQuotient, Asym.) (Doval et al., VOQUAL 2003)
Acknowledgement: open quotient estimate provided by Nathalie Henrich (Henrich, N., et al. 2000).


Application 3: ZZT + group delay processing
Formant tracking with CGDGCI: DPPT
Candidates: DPPT and two publicly available tools, WinSnoori and Praat

            Average percentage error      Formant miss rate
            F1     F2     F3     F4       F1     F2     F3     F4
DPPT        6.8    1.8    1.0    0.8      0      17.1   3.5    0
WinSnoori   2.8    1.9    0.6    -        0      0      0      100
Praat       3.8    3.8    4.7    13.8     0      0      0      24.4

Conclusion: these results, combined with real speech tests, show that Praat and DPPT are comparable in quality and superior to WinSnoori. The disadvantages of DPPT are its low speed and its dependency on GCI detection.

Application 3: ZZT + group delay processing
Formant tracking with CGDZP: Fast-DPPT
Inputs: speech data; frame size, frame shift, number of formants to track
1. Fixed frame-size and frame-shift Blackman windowing
2. Computation of the zero-phase version of the signal
3. Computation of the chirp group delay outside the unit circle (CGDZP)
4. Peak picking
5. Is the number of peaks equal to the number of formants to track?
   - NO: decrement/increment the radius of the analysis circle and return to step 3
   - YES: output the formant frequencies
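The radius-adjustment loop in this flow chart can be sketched for a single frame as below. The callable `cgd_at_radius` is hypothetical and stands for steps 1 to 3 (it returns the CGDZP curve of the frame at a given radius). The direction of the adjustment (move toward the unit circle when too few peaks are found, away from it when too many), the step size and the radius limits are assumptions of this sketch, not values from the slides.

```python
import numpy as np

def pick_peaks(curve):
    """Indices of local maxima of a 1-D curve (end points excluded)."""
    curve = np.asarray(curve, dtype=float)
    inner = curve[1:-1]
    is_peak = (inner > curve[:-2]) & (inner >= curve[2:])
    return np.where(is_peak)[0] + 1

def track_frame(cgd_at_radius, n_formants, radius=1.12, step=0.01,
                r_min=1.01, r_max=1.5, max_iter=50):
    """Adjust the analysis radius until n_formants peaks are found."""
    peaks = pick_peaks(cgd_at_radius(radius))
    for _ in range(max_iter):
        if len(peaks) == n_formants:
            break
        # assumed rule: too few peaks -> smaller radius, too many -> larger
        radius += -step if len(peaks) < n_formants else step
        radius = min(max(radius, r_min), r_max)
        peaks = pick_peaks(cgd_at_radius(radius))
    return peaks, radius

# Stand-in for the real CGDZP computation: the peak count grows as the
# radius shrinks, which is enough to exercise the loop.
def fake_cgd(radius):
    m = max(int(round((1.16 - radius) / 0.01)), 0)
    curve = np.zeros(200)
    curve[10 + 20 * np.arange(m)] = 1.0
    return curve

peaks, radius = track_frame(fake_cgd, n_formants=5)
```

The iteration cap and radius limits guarantee termination even when no radius yields the requested number of peaks; in that case the last peak set is returned as is.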

Formant tracking tests on real speech
- Stimuli: 10 real speech examples (5 female, 5 male) with large formant variations
- Candidates: Fast-DPPT and two publicly available tools, WaveSurfer and Praat
- Conclusion: they have similar quality; Fast-DPPT lacks a post-processing module for guaranteeing continuity of tracks.


Application of the mixed-phase model: MixLP* for glottal flow parameter estimation
- Most existing LP methods can only estimate resonances for the minimum-phase version of the signal; poles outside the unit circle are avoided.
- In MixLP, we look for poles outside the unit circle.
- Conclusion: works well for synthetic speech, but is not robust for analyzing real speech.
Bozkurt, Severin, Dutoit, 2004
*: implemented and tested by Francois Severin


Application 5: ZZT + group delay processing
Automatic Speech Recognition (ASR)

Computation of ASR features
[Block diagram of the ASR system: speech signal -> front-end -> acoustic model -> word decoder -> word sequence; labels: MLP, HMM topology, lexicon, grammar]
- Baseline method: MFCC
- Alternative method: replace the power spectrum function in MFCC by group delay functions (MODGDF, PS, GDGCI, CGDGCI, CGDZP)

Combining acoustic models
[Block diagram: speech signal -> MFCC front-end -> acoustic model 1; speech signal -> alternative front-end (MODGDF, PS, GDGCI, CGDGCI, CGDZP) -> acoustic model 2; combination of acoustic models -> word sequence]
Combine the HMM state probabilities (MLP outputs) at frame level as a weighted geometric average,
$P_{12}(s_i \mid v_t) = P_1(s_i \mid v_t)^{\lambda}\, P_2(s_i \mid v_t)^{1-\lambda}$,
with $\lambda$ optimized between 0 and 1.
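The frame-level combination rule is a one-liner in NumPy. The sketch below uses hypothetical names and adds an optional renormalization over states, which is not part of the formula on the slide.

```python
import numpy as np

def combine_posteriors(p1, p2, lam, normalize=False):
    """Weighted geometric average of two sets of state posteriors.

    p1, p2: arrays of shape (frames, states); lam: weight in [0, 1].
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    combined = p1 ** lam * p2 ** (1.0 - lam)
    if normalize:
        # optional extra step (an assumption): make each frame sum to 1 again
        combined = combined / combined.sum(axis=1, keepdims=True)
    return combined

p1 = np.array([[0.9, 0.1]])   # e.g. MFCC-based model, one frame, two states
p2 = np.array([[0.4, 0.6]])   # e.g. group-delay-based model
c = combine_posteriors(p1, p2, lam=0.5)
```

With lam = 1 the result reduces to the first model and with lam = 0 to the second, so sweeping lam between 0 and 1 interpolates between the two front-ends.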

Application 4: ZZT + group delay processing
Automatic Speech Recognition (ASR)*
The proposed group delay representations are compared with representations proposed in recent studies [Hegde2004, Alsteris2004] in an ASR experiment. Our proposed techniques provide better results and have the potential to improve ASR performance.

Feature      SNR (dB)
extraction   ∞             20            15            10            5             0             -5
MFCC         1.9 / -       6.7 / -       18.6 / -      45.2 / -      75.1 / -      88.8 / -      91.5 / -
MODGDF       3.2 / 2.1     19.0 / 8.5    41.7 / 23.9   68.7 / 52.7   86.1 / 79.5   91.0 / 89.5   92.3 / 91.5
PS           2.0 / 1.9     6.7 / 6.7     19.4 / 18.6   45.3 / 44.4   75.5 / 74.6   89.0 / 88.5   92.2 / 91.6
GDGCI        8.8 / 2.1     32.8 / 7.8    49.4 / 16.8   69.0 / 36.0   88.3 / 64.4   98.6 / 88.0   100.0 / 96.1
CGDGCI       3.2 / 1.8     12.3 / 5.8    25.6 / 12.2   50.8 / 29.1   80.8 / 58.0   97.0 / 83.8   99.8 / 93.8
CGDZP        1.8 / 1.7     5.8 / 5.0     12.2 / 10.4   29.4 / 24.8   62.6 / 52.7   88.7 / 82.3   97.6 / 91.1

Performance of the ASR system (word error rate, WER, in percent) for various feature extraction methods on the AURORA-2 task. Each cell lists two values; the second value, which is not given (-) for the MFCC baseline, appears to correspond to the combination with the MFCC acoustic model. The lexicon is reduced to English digits and no grammar is applied.
Training: 8440 noise-free utterances spoken by 110 speakers.
Evaluation: 4004 different noise-free utterances spoken by 104 other speakers.
*: ASR tests handled by Laurent Couvreur, TCTS Lab

Deficiencies, future work, work not included in the thesis
- Some of the algorithms are not completely tested due to time constraints and therefore stand only as demonstrations of potential:
  - ZZT-decomposition is not thoroughly tested.
  - ASR tests are limited.
- It is an experimental study; the analytical part is weak, since studying zero locations analytically is difficult, if not impossible.
Future work:
- Voice quality labeling/classification
- Source-tract decomposition using the complex cepstrum
- Restudying phase-related problems in speech processing
Studies on TTS during 2000-2002 are not included in the thesis.