Zeros of the z-transform (ZZT) representation and chirp group delay processing for analysis of source and filter characteristics of speech signals

Baris Bozkurt
Collaboration with LIMSI-CNRS, France
07/03/2017

What is new in this thesis?
- We present new spectral representations (ZZT and three group-delay-based representations) and new algorithms demonstrating applications of these representations in various speech analysis problems:
  - Source-tract separation
  - Glottal flow parameter estimation
  - Formant tracking
  - Feature extraction for automatic speech recognition
- We study the phase estimation problem in detail and propose solutions to existing problems.
- We discuss group delay characteristics of a mixed-phase speech model.

Contents
- Motivations
- Spectral analysis of signals
- Mixed-phase speech model and group delay characteristics
- Difficulties in group delay processing
- ZZT representation and chirp group delay processing
- Applications
  - Source-tract separation
  - Formant tracking
  - Glottal flow parameter estimation
  - Feature estimation for automatic speech recognition
- Conclusions and future work

Motivations
- Primary motivation: voice quality analysis for TTS
- Starting point (after a literature review): spectral methods
- Two main problems: source-tract separation and Fourier transform phase processing
Potential impact areas for this study
- Source-tract separation: voice quality analysis, speech synthesis, emotion studies, speech therapy, speaker recognition
- Phase processing: speech perception, speech recognition, speech coding
- Group delay characteristics of the mixed-phase speech model: speech processing theory
Target application -> basic research -> larger impact


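Spectral analysis of signals
z-transform:
$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n}$$
Fourier transform:
$$X(\omega) = X(z)\big|_{z = e^{j\omega}} = a(\omega) + j\, b(\omega)$$
Magnitude:
$$|X(\omega)| = \sqrt{a^2(\omega) + b^2(\omega)}$$
Phase:
$$\theta(\omega) = \arctan\frac{b(\omega)}{a(\omega)}$$
Group delay:
$$\tau(\omega) = -\frac{d\theta(\omega)}{d\omega}$$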
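The quantities on this slide can be sketched numerically as follows. This is a minimal illustration, assuming NumPy, a zero-padded FFT as the sampled Fourier transform, and the usual convention that group delay is the negative derivative of the unwrapped phase; the function name `spectrum_parts` is hypothetical.

```python
import numpy as np

def spectrum_parts(x, n_fft=1024):
    """Return frequency axis, magnitude, unwrapped phase and group delay of x."""
    X = np.fft.rfft(x, n_fft)                 # X(w) = a(w) + j b(w) on [0, pi]
    w = np.linspace(0.0, np.pi, len(X))       # frequency axis matching rfft bins
    magnitude = np.sqrt(X.real**2 + X.imag**2)
    phase = np.unwrap(np.arctan2(X.imag, X.real))
    group_delay = -np.gradient(phase, w)      # tau(w) = -d theta / d w
    return w, magnitude, phase, group_delay

# A unit impulse delayed by 5 samples has linear phase -5w,
# so its group delay is flat at 5 samples.
x = np.zeros(64)
x[5] = 1.0
w, mag, theta, tau = spectrum_parts(x)
```

Differentiating the unwrapped phase is the most direct reading of the definition, but it depends on the unwrapping step succeeding, which is one reason the later slides work with formulas that avoid explicit phase unwrapping.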

All-pole filter response and causality
For causality detection, phase processing is essential.

Why study group delay processing?
(Figure: poles of an all-pole filter)
- Higher resolution
- Tilt free
- Mixed-phase information
(animation: advantagegrpd.avi)


A mixed-phase model of speech*
Maximum-phase glottal flow excitation * minimum-phase vocal tract filter (plus the GF return phase) = mixed-phase speech signal
In the figure, the components combine by convolution (*) in the time domain and by addition (+) in the spectral panels.
Important note: the mixed-phase characteristic can only be observed in the phase/group delay spectrum.
*: after Gardner 1994 and Doval & D'Alessandro 1997

Preliminary trials with chirp group delay processing
- Not robust, and the reason was not known (Bozkurt & Dutoit, VOQUAL 2003).
- A hint came from Prof. Kawahara at VOQUAL'03: windowing may play a role.


Problems in group delay analysis of speech
- Problem: group delay functions are most often very noisy.
- Reason: roots of the z-transform polynomial close to the unit circle (Yegnanarayana and Murthy '89).
- Conclusion: a systematic study of the roots of the z-transform for speech signals is needed. Thanks to today's technology, this is now feasible.
(animation: difficultywithzeros.avi)


Zeros of the z-transform (ZZT) representation
ZZT representation: the set of zeros of the z-transform polynomial
$$X(z) = \sum_{n=0}^{N-1} x(n)\, z^{-n} = x(0)\, z^{-N+1} \prod_{m=1}^{N-1} (z - Z_m)$$
This is almost impossible to study analytically for most functions, so numerical methods are used (the roots function of Matlab).
Basic elementary signal: power series
$$x(n) = a^n, \quad n = 0, 1, \ldots, N-1$$
$$X(z) = \sum_{n=0}^{N-1} a^n z^{-n} = \frac{1 - (a/z)^N}{1 - a/z}$$
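As a rough numerical counterpart to the Matlab roots call mentioned on this slide, the ZZT of a frame can be sketched with NumPy's `np.roots`. The function name `zzt` is hypothetical. For the power series x(n) = a^n, the closed form of X(z) implies that the N-1 zeros sit on a circle of radius a at the angles 2πm/N, m = 1, ..., N-1 (the candidate at z = a is cancelled by the denominator), and the sketch reproduces that pattern.

```python
import numpy as np

def zzt(x):
    """Zeros of the z-transform of a finite frame x[0..N-1].

    X(z) = z^{-(N-1)} (x[0] z^{N-1} + ... + x[N-1]), so the zeros of X(z)
    are the roots of the polynomial whose coefficients, in descending
    powers of z, are simply the samples x[0], ..., x[N-1].
    """
    return np.roots(x)

a, N = 0.9, 16
zeros = zzt(a ** np.arange(N))   # power series x(n) = a^n
radii = np.abs(zeros)            # all close to a
```

For long speech frames the polynomial degree runs into the hundreds, so root finding is the expensive step; this sketch does not address that cost.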

ZZT of elementary signals
- ZZT of a damped sinusoid (animation: expCoeffInDampedSinusoid.avi)
- ZZT of a causal all-pole filter response (animation: causalresponsezeros1.avi)

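Zero patterns for the LF model* of the glottal flow derivative
First phase:
$$g(t) = E_0\, e^{\alpha t} \sin(\omega_g t), \quad 0 \le t \le t_e$$
Return phase:
$$g(t) = -\frac{E_e}{\varepsilon t_a} \left[ e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)} \right], \quad t_e \le t \le t_c \le T_0$$
*: Fant et al., 1985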

ZZT representation of speech
Synthetic mixed-phase speech:
- Periodicity results in many zeros on the unit circle.
- The first phase of the glottal flow adds zeros outside the unit circle.
- Vocal tract response zeros lie inside the unit circle.
The effect of windowing on the ZZT is drastic!
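The first observation, that periodicity places zeros on the unit circle, can be checked on a toy signal. This is only a sketch under simplifying assumptions: two exact repetitions of a decaying exponential, a rectangular window, and NumPy's `np.roots`. Repeating a period P(z) of length T gives X(z) = P(z)(1 + z^{-T}), and the factor 1 + z^{-T} contributes T zeros on the unit circle.

```python
import numpy as np

a, T = 0.8, 16
period = a ** np.arange(T)                       # one "pitch period", zeros at radius a
two_periods = np.concatenate([period, period])   # X(z) = P(z) (1 + z^-T)

zeros = np.roots(two_periods)
radii = np.abs(zeros)
on_circle = int(np.sum(np.abs(radii - 1.0) < 1e-6))   # zeros contributed by periodicity
inside = int(np.sum(radii < 1.0 - 1e-6))              # zeros of the single period
```

Real speech is never exactly periodic, so in practice these zeros scatter around the unit circle instead of lying exactly on it.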

All-zero representation of windowed speech
- Non-GCI-synchronous windowing (rectangular window)
- GCI-synchronous windowing (rectangular window)

All-zero representation of speech: effect of window location on ZZT plots (2T0 case)
- Windowing of synthetic speech (animation: synth2t0blackman.avi)
- Windowing of real speech (animation: real2t0blackman.avi)

Effect of the window function on ZZT and group delay
- The best choices are Blackman, Gaussian and Hanning-Poisson windows.
- Smoothness can be adjusted by varying the coefficients of the Gaussian and Hanning-Poisson windows.

Effect of window size on ZZT
What do other people do? Pitch-asynchronous analysis, 3T0 window size, Hamming window -> all are bad choices for phase processing.

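Recently proposed group delay representations in the literature
Group delay computed directly from the signal:
$$\tau_p(\omega) = \frac{X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)}{|X(\omega)|^2}$$
where the subscripts R and I denote real and imaginary parts of $X(\omega) = FT\{x[n]\}$ and $Y(\omega) = FT\{n\,x[n]\}$, for example $X_R(\omega) = \mathrm{real}\{FT\{x[n]\}\}$ and $Y_I(\omega) = \mathrm{imag}\{FT\{n\,x[n]\}\}$. The denominator $|X(\omega)|^2$ is responsible for the spikes.
- Modified group delay function, MODGDF (Hegde et al., ICSLP 2004): the denominator is replaced by a cepstrally smoothed version $S(\omega)$:
$$\tau_p(\omega) = \frac{X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)}{S(\omega)^{2\gamma}}$$
- Product spectrum, PS (Zhu and Paliwal, ICASSP 2004): the denominator is completely removed:
$$Q(\omega) = |X(\omega)|^2\, \tau_p(\omega) = X_R(\omega) Y_R(\omega) + X_I(\omega) Y_I(\omega)$$
Originality in our approach: studying zero patterns and trying to find means of avoiding or removing zeros close to the unit circle.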
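The group delay and product spectrum formulas can be sketched as below, assuming NumPy and a zero-padded FFT. The function name is hypothetical, and the cepstral smoothing needed for MODGDF is deliberately left out.

```python
import numpy as np

def group_delay_and_product_spectrum(x, n_fft=512):
    """Group delay tau_p(w) and product spectrum Q(w) of a finite frame."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)        # FT{x[n]}
    Y = np.fft.rfft(n * x, n_fft)    # FT{n x[n]}
    q = X.real * Y.real + X.imag * Y.imag   # product spectrum (numerator)
    power = np.abs(X) ** 2                  # denominator |X(w)|^2
    tau = q / power                         # spikes wherever power is near zero
    return tau, q

# Sanity check on a unit impulse delayed by 3 samples: the group delay is 3.
x = np.zeros(32)
x[3] = 1.0
tau, q = group_delay_and_product_spectrum(x)
```

The division by `power` is where zeros close to the unit circle show up as spikes, which is the term the two cited methods modify or remove.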

ZZT and group delay of GCI-synchronously windowed speech (GDGCI)
Panels: ZZT, amplitude spectrum, GDGCI

Group delay spectrogram using GDGCI (Hanning-Poisson window, 2T0)
The formant frequencies of a given speech signal can be estimated from the phase spectrum once windowing is properly performed.

Chirp group delay of GCI-synchronously windowed speech (CGDGCI)
Basic ideas:
- remove unwanted zeros
- compute the chirp group delay away from the remaining zeros
Panels:
- CGD outside the unit circle, computed directly from the signal after zero removal
- CGD inside the unit circle, computed directly from the signal after zero removal
Disadvantages: computationally heavy, GCI-synchronous

Chirp group delay of the zero-phase version of the signal (CGDZP)
GCI-synchronous processing is not practical for ASR.
Processing chain: speech -> windowing -> FFT -> abs -> IFFT (zero phasing) -> CGD (R = 1.12) -> CGDZP
- Constant frame size/shift
- R optimized by testing performance for incrementing values
(animations: whyzerophase.avi, CGDcompared2AmpSpec.avi)
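The processing chain above can be sketched as follows. This is an illustrative reading, not the thesis implementation: the names `chirp_group_delay` and `cgdzp` are hypothetical, the Blackman window is an assumption for the unspecified windowing step, and keeping only the first half of the symmetric zero-phase signal is also an assumption. The chirp group delay is obtained by evaluating the z-transform on the circle |z| = R, which amounts to weighting x[n] by R^{-n} before the Fourier transform.

```python
import numpy as np

def chirp_group_delay(x, radius, n_fft=1024):
    """Group delay of X(z) evaluated on the circle |z| = radius."""
    n = np.arange(len(x))
    y = x * radius ** (-n.astype(float))    # X(R e^{jw}) = FT{x[n] R^{-n}}
    Y = np.fft.rfft(y, n_fft)
    dY = np.fft.rfft(n * y, n_fft)
    return (Y.real * dY.real + Y.imag * dY.imag) / np.abs(Y) ** 2

def cgdzp(frame, radius=1.12, n_fft=1024):
    """Chirp group delay of the zero-phase version of a frame."""
    windowed = frame * np.blackman(len(frame))            # assumed window choice
    zero_phase = np.fft.ifft(np.abs(np.fft.fft(windowed, n_fft))).real
    # Assumption: the zero-phase signal is symmetric, so keep its first half only.
    return chirp_group_delay(zero_phase[: n_fft // 2], radius, n_fft)

# Toy frame: a damped sinusoid standing in for a speech frame.
n = np.arange(400)
frame = 0.99 ** n * np.sin(2 * np.pi * 0.1 * n)
features = cgdzp(frame)
```

Because the frame is turned into its zero-phase version first, no GCI information is needed, which is the property that makes this variant usable with a constant frame size and shift.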


Zero-decomposition for source-tract separation
References: Kawahara et al., 2000; Bozkurt et al., ICSLP 2004-a
(animation: GCIdetection.avi)

Zero-decomposition for source-tract separation: synthetic speech
Panels:
- Synthetic glottal excitation, synthetic speech
- Original windowed speech and its amplitude spectrum
- ZZT and zero-decomposition
- Original + reconstructed glottal excitation; original + reconstructed glottal amplitude spectrum
- Original + reconstructed vocal tract response; original + reconstructed tract transfer function

Zero-decomposition for source-tract separation: real speech
Panels:
- Original windowed speech and its amplitude spectrum
- ZZT and zero-decomposition
- Reconstructed glottal excitation; reconstructed glottal amplitude spectrum
- Reconstructed vocal tract response; reconstructed tract transfer function
Audio examples: copy synthesis, noise-excited tract
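The idea behind zero-decomposition can be sketched on a toy frame: zeros outside the unit circle are attributed to the first phase of the glottal flow and zeros inside to the vocal tract response, as stated on the ZZT slide earlier. The sketch below assumes NumPy, skips GCI-synchronous windowing entirely, and uses hypothetical function names. Log-magnitude spectra are rebuilt as sums of log distances to the zeros, since expanding a high-degree polynomial from its roots is numerically fragile.

```python
import numpy as np

def log_magnitude_from_zeros(zeros, w):
    """log |prod_m (e^{jw} - Z_m)|, computed as a sum of logs for stability."""
    z = np.exp(1j * w)[:, None]
    return np.sum(np.log(np.abs(z - zeros[None, :])), axis=1)

def zero_decomposition(frame, n_points=257):
    """Split the ZZT at the unit circle and return both log-magnitude spectra."""
    zeros = np.roots(frame)
    outside = zeros[np.abs(zeros) > 1.0]    # attributed to the glottal flow first phase
    inside = zeros[np.abs(zeros) <= 1.0]    # attributed to the vocal tract response
    w = np.linspace(0.0, np.pi, n_points)
    return w, log_magnitude_from_zeros(outside, w), log_magnitude_from_zeros(inside, w)

# Toy mixed-phase frame: a decaying exponential (zeros inside the unit circle)
# convolved with its time reverse (zeros outside the unit circle).
a, N = 0.9, 16
tract_like = a ** np.arange(N)
source_like = tract_like[::-1]
frame = np.convolve(source_like, tract_like)
w, source_log_mag, tract_log_mag = zero_decomposition(frame)
```

The two log spectra add up to the log spectrum of the frame up to the gain term log|x(0)|, so the decomposition loses nothing apart from how that gain is shared between source and tract.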

Comparative example: ZZT-decomposition and PSIAIF
Panels: glottal flow (GF), differential glottal flow (DGF), amplitude spectrum of DGF, group delay of DGF
Curves: original, ZZT-decomposition estimate, PSIAIF

Robustness of ZZT-decomposition
- To GCI estimation errors
- To F1 variation (F1 = 200-350 Hz, F1 = 375-500 Hz, F1 = 525-700 Hz)
- To noise
- To return phase variations


Application 2: ZZT + group delay processing
Glottal formant frequency (Fg) estimation
- Synthetic vowels: a, u, i
- Real speech: OQ from EGG*
- Panel labels: f0 = 100 Hz, f0 = 200 Hz
NB: Fg = f(F0, 1/OpenQuotient, Asym.) (Doval et al., VOQUAL 2003)
*Acknowledgement: open quotient estimate provided by Nathalie Henrich (Henrich, N., et al. 2000).


Application 3: ZZT + group delay processing
Formant tracking with CGDGCI: DPPT
Candidates: DPPT, WinSnoori, Praat (two publicly available tools)

             Average percentage error     Formant miss rate
             F1    F2    F3    F4         F1    F2    F3    F4
DPPT         6.8   1.8   1.0   0.8        0     17.1  3.5   0
WinSnoori    2.8   1.9   0.6   -          0     0     0     100
Praat        3.8   3.8   4.7   13.8       0     0     0     24.4

Conclusion: results combined with real speech tests show that Praat and DPPT are comparable in quality and superior to WinSnoori. The disadvantages of DPPT are its low speed and its dependency on GCI detection.

Application 3: ZZT + group delay processing
Formant tracking with CGDZP: Fast-DPPT
Inputs: speech data; frame size, frame shift, number of formants to track
1. Fixed frame-size and frame-shift Blackman windowing
2. Computation of the zero-phase version of the signal
3. Computation of the chirp group delay outside the unit circle (CGDZP)
4. Peak picking
5. Is the number of peaks equal to the number of formants to track?
   - NO: decrement/increment the radius of the analysis circle and repeat from step 3
   - YES: output the formant frequencies
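The peak-picking loop of this flowchart can be sketched as below. Several details are assumptions and not taken from the slides: the callable `curve_at_radius` is a hypothetical stand-in for a function returning the CGDZP curve of the current frame at a given radius, the starting radius, step size, radius limits and iteration cap are illustrative, and so is the rule that too few peaks moves the circle toward the unit circle while too many moves it away.

```python
import numpy as np

def pick_peaks(curve):
    """Indices of strict local maxima of a 1-D curve."""
    c = np.asarray(curve)
    return np.flatnonzero((c[1:-1] > c[:-2]) & (c[1:-1] > c[2:])) + 1

def track_formants(curve_at_radius, n_formants, radius=1.12, step=0.01,
                   r_min=1.01, r_max=1.5, max_iter=50):
    """Adjust the analysis radius until the peak count matches n_formants.

    curve_at_radius: hypothetical callable, radius -> CGDZP curve of one frame.
    Returns (radius, peak_indices), or (radius, None) if no radius was found.
    """
    for _ in range(max_iter):
        peaks = pick_peaks(curve_at_radius(radius))
        if len(peaks) == n_formants:
            return radius, peaks
        # Assumed rule: too few peaks -> move toward the unit circle (sharper
        # peaks); too many -> move away from it (smoother curve).
        radius += -step if len(peaks) < n_formants else step
        radius = min(max(radius, r_min), r_max)
    return radius, None
```

Peak indices would still need converting to Hz (index times sampling rate divided by FFT length), and nothing here enforces continuity of the tracks across frames.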

Formant tracking tests on real speech
- Stimuli: 10 real speech examples (5 female, 5 male) with large formant variations.
- Candidates: Fast-DPPT, WaveSurfer, Praat (two publicly available tools).
- Conclusion: they have similar quality; Fast-DPPT lacks a post-processing module for guaranteeing continuity of tracks.


Application of the mixed-phase model: MixLP* for glottal flow parameter estimation
- Most of the existing LP methods can only estimate resonances for the minimum-phase version of the signal; poles outside the unit circle are avoided.
- In MixLP, we look for poles outside the unit circle.
- Conclusion: works well for synthetic speech, but is not robust for analyzing real speech.
(Bozkurt, Severin, Dutoit, 2004) *: implemented and tested by Francois Severin


Application 5: ZZT + group delay processing
Automatic speech recognition (ASR)

Computation of ASR features
ASR system: speech signal -> front-end -> acoustic model (MLP) -> word decoder (HMM topology, lexicon, grammar) -> word sequence
- MFCC as the baseline method
- Alternative method: replace the power spectrum in the MFCC computation by group delay functions (MODGDF, PS, GDGCI, CGDGCI, CGDZP).

Combining acoustic models
- speech signal -> MFCC front-end -> acoustic model 1
- speech signal -> alternative front-end (MODGDF, PS, GDGCI, CGDGCI, CGDZP) -> acoustic model 2
- acoustic models 1 and 2 -> combination -> word sequence
Combine HMM state probabilities (MLP outputs) at frame level as a weighted geometric average,
$$P_{12}(s_i \mid v_t) = P_1(s_i \mid v_t)^{\lambda}\, P_2(s_i \mid v_t)^{1-\lambda}$$
with λ optimized between 0 and 1.
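The frame-level combination rule can be written in a few lines. This is a minimal sketch assuming NumPy arrays of per-state probabilities; the function name `combine_posteriors` is hypothetical, and no renormalization is applied because the slide does not mention one.

```python
import numpy as np

def combine_posteriors(p1, p2, lam):
    """Weighted geometric average P1^lam * P2^(1 - lam), element-wise per state."""
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    return p1 ** lam * p2 ** (1.0 - lam)

# Two acoustic models that disagree on a two-state frame, equally weighted.
p_mfcc = np.array([0.8, 0.2])
p_alt = np.array([0.2, 0.8])
p_combined = combine_posteriors(p_mfcc, p_alt, lam=0.5)
```

Setting `lam` to 1 falls back to the first model alone and setting it to 0 to the second, so searching λ over [0, 1] covers both single-model systems as special cases.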

Application 4: ZZT + group delay processing
Automatic speech recognition (ASR)*
The proposed group delay representations are compared, in an ASR experiment, with representations proposed in recent studies [hegde2004, Alsteris2004]. Our proposed techniques provide better results and have the potential to improve ASR performance.

Feature      SNR (dB)
extraction   ∞ (clean)     20            15            10            5             0             -5
MFCC         1.9 / -       6.7 / -       18.6 / -      45.2 / -      75.1 / -      88.8 / -      91.5 / -
MODGDF       3.2 / 2.1     19.0 / 8.5    41.7 / 23.9   68.7 / 52.7   86.1 / 79.5   91.0 / 89.5   92.3 / 91.5
PS           2.0 / 1.9     6.7 / 6.7     19.4 / 18.6   45.3 / 44.4   75.5 / 74.6   89.0 / 88.5   92.2 / 91.6
GDGCI        8.8 / 2.1     32.8 / 7.8    49.4 / 16.8   69.0 / 36.0   88.3 / 64.4   98.6 / 88.0   100.0 / 96.1
CGDGCI       3.2 / 1.8     12.3 / 5.8    25.6 / 12.2   50.8 / 29.1   80.8 / 58.0   97.0 / 83.8   99.8 / 93.8
CGDZP        1.8 / 1.7     5.8 / 5.0     12.2 / 10.4   29.4 / 24.8   62.6 / 52.7   88.7 / 82.3   97.6 / 91.1

Performance of the ASR system (word error rate, WER, in percent) for various feature extraction methods on the AURORA-2 task. The lexicon is reduced to English digits and no grammar is applied. Each cell lists two values; the second is taken here to be the WER after combination with the MFCC-based acoustic model, which does not apply to the MFCC row.
Training: 8440 noise-free utterances spoken by 110 speakers.
Evaluation: 4004 different noise-free utterances spoken by 104 other speakers.
*: ASR tests handled by Laurent Couvreur/TCTS Lab

Deficiencies, future work, work not included in the thesis
- Some of the algorithms are not completely tested due to time constraints and therefore stand only as demonstrations of potential:
  - ZZT-decomposition is not thoroughly tested
  - ASR tests are limited
- It is an experimental study; the analytical part is weak, since studying zero locations analytically is difficult if not impossible.
- Future work:
  - Voice quality labeling/classification
  - Source-tract decomposition using the complex cepstrum
  - Restudying phase-related problems in speech processing
- Studies on TTS during 2000-2002 are not included in the thesis.