Vocoding approaches for statistical parametric speech synthesis

Slide 1: Vocoding approaches for statistical parametric speech synthesis
Ranniery Maia, Toshiba Research Europe Limited, Cambridge Research Laboratory
Speech Synthesis Seminar Series, CUED, University of Cambridge, UK, March 2nd, 2011

Slide 2: Topics of this presentation
1. Existing methods to generate the speech waveform in statistical parametric speech synthesis
2. An idea for closing the gap between acoustic modeling and waveform generation

Slide 3: Notation and acronyms in this presentation
Notation:
- x(n): a discrete-time signal
- X(z): x(n) in the z-transform domain
- X(e^{jw}): Discrete-Time Fourier Transform of x(n) (frequency-domain representation of x(n))
- |X(e^{jw})|: magnitude response of x(n)
- arg X(e^{jw}): phase response of x(n)
- |X(e^{jw})|^2: power spectrum of x(n)
- x: a vector; X: a matrix
Acronyms:
- OLA: OverLap and Add
- MELP: Mixed Excitation Linear Prediction
- STRAIGHT: Speech Transformation and Representation using Adaptive Interpolation of weighted spectrum
- FFT / IFFT: Fast Fourier Transform / Inverse Fast Fourier Transform
- LF: Liljencrants-Fant model
- LP: Linear Prediction
- PCA: Principal Component Analysis
- LSP: Line Spectral Pairs

Slide 4: Contents
- Introduction
- Vocoding methods for statistical parametric speech synthesis
  - Fully parametric excitation methods
  - Methods that attempt to mimic the LP residual
  - Methods that work on source and vocal tract modeling
- Joint acoustic modeling and waveform generation for statistical parametric speech synthesis
- Conclusion

Slide 6: Statistical parametric speech synthesis
Speech synthesis methods:
1. Rule-based
   1.1 Parametric
   1.2 Unit concatenation
2. Corpus-based
   2.1 Unit selection and concatenation
   2.2 Statistical parametric
   2.3 Hybrid

Slide 7: Statistical parametric speech synthesis
1. Advantages: several voices, small data, small footprint, language portability, etc.
2. Drawback: unnatural synthesized speech
   2.1 Parametric model of speech production
   2.2 Parameters of the model are averaged
How to alleviate this unnaturalness?
1. Statistical modeling
2. Choice of the speech production model
3. Choice of the parameters to represent such a model
4. Way of synthesizing speech with these parameters

Slide 9: Statistical parametric speech synthesis
Training time: speech parameters c = [c_0 ... c_{T-1}] are extracted from the speech waveform s = [s(0) ... s(N-1)], and the acoustic model is trained from them and the labels l:
\hat{\lambda}_c = \arg\max_{\lambda_c} p(c \mid l, \lambda_c)
Synthesis time: given the labels and the trained acoustic model parameters, the speech parameters are generated,
\hat{c} = [\hat{c}_0 \cdots \hat{c}_{T-1}] = \arg\max_{c} p(c \mid l, \hat{\lambda}_c)
and waveform generation produces the speech waveform s = [s(0) ... s(N-1)].

Slide 10: Waveform generation part
1. Choice of the speech production mechanism
   - Simple: excitation plus speech synthesis filter
   - Complete: excitation plus vocal tract, glottal, and lip radiation filters
2. Appropriate parameters for the chosen speech mechanism, with good quantization/compression properties
3. Given the speech model and corresponding parameters, design the best way to synthesize the speech signal according to some criterion

Slide 12: Digital speech models
The complete model [Deller, Jr. et al., 2000]: a pulse train generator (driven by the pitch period) and an uncorrelated noise generator, selected by a voiced/unvoiced (V/U) switch and scaled by a gain, drive the glottal filter G(z), the vocal tract filter H(z), and the lip radiation filter L(z) to produce the speech signal.
The simplified model, assumed for LP analysis: a pulse train or white noise, selected by the V/U switch and scaled by a gain, drives a single all-pole filter.
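To make the simplified model concrete, here is a minimal Python sketch of an excitation driving a gain-scaled all-pole filter; the sampling rate, gain, and filter coefficients are illustrative placeholders, not values from the talk.

import numpy as np
from scipy.signal import lfilter

fs = 16000                          # sampling rate in Hz (assumed)
a = np.array([1.0, -1.8, 0.82])     # toy A(z): stable 2nd-order resonator (poles at 0.9 +/- 0.1j)
gain = 0.1

excitation = np.random.randn(fs)         # white noise: the unvoiced branch of the model
speech = lfilter([gain], a, excitation)  # s(n): excitation through the gain and 1/A(z)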

Slide 13: Contents (next: Vocoding methods for statistical parametric speech synthesis)

Slide 14: Standard vocoder for statistical parametric synthesis
Labels representing the text to be synthesized drive the trained acoustic models, which output F0 and spectral parameters. A pulse train t(n) with period P0 (from F0) or white noise w(n), selected by the V/U switch, forms the excitation e(n), which drives the synthesis filter to produce speech.
- Very simple: analysis is F0 extraction; synthesis is a pulse/white-noise switch
- Poor speech quality!
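As an illustration of this excitation scheme, the following hedged numpy sketch builds a pulse train in voiced frames and white noise in unvoiced frames; the frame shift and the F0 contour are made up, with F0 = 0 marking unvoiced frames.

import numpy as np

fs, frame_shift = 16000, 80                                  # 5 ms frames, assumed
f0 = np.r_[np.zeros(20), np.full(60, 120.0), np.zeros(20)]   # toy contour in Hz

e = np.zeros(len(f0) * frame_shift)
next_pulse = 0.0
for i, f in enumerate(f0):
    start, stop = i * frame_shift, (i + 1) * frame_shift
    if f > 0:                                  # voiced: pulses every P0 = fs/F0 samples
        period = fs / f
        next_pulse = max(next_pulse, start)
        while next_pulse < stop:
            e[int(next_pulse)] = np.sqrt(period)   # pulse scaled for unit power
            next_pulse += period
    else:                                      # unvoiced: unit-variance white noise
        e[start:stop] = np.random.randn(frame_shift)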

Slide 15: Improved vocoding methods for statistical parametric synthesis
1. Methods that focus solely on the excitation signal
   1.1 Fully parametric excitation models
   1.2 Methods that attempt to mimic the LP residual
2. Methods that focus on source and vocal tract modeling

Slide 16: Contents (next: Fully parametric excitation methods)

Slide 17: MELP mixed excitation [Yoshimura et al., 2001]
MELP excitation building: the pulse train t(n), after position jittering and shaping by the Fourier magnitudes, is filtered by H_p(z) into the voiced excitation v(n); white noise w(n) is filtered by H_n(z) into the unvoiced excitation u(n); their sum is the mixed excitation e(n).
- Period jitter, derived from the voicing strengths, is applied in aperiodic frames
- The Fourier magnitudes simulate the glottal filter
- The filters H_p(z) and H_n(z) control the amount of pulse and noise in the final excitation e(n)

Slide 18: Pulse and noise shaping filters
Filters H_p(z) and H_n(z) switch between pulse and noise excitation band by band:
H_p(z) = \sum_{j=0}^{J-1} \bar{\beta}_j \sum_{m=0}^{M} h_j(m) z^{-m}, \qquad H_n(z) = \sum_{j=0}^{J-1} (1 - \bar{\beta}_j) \sum_{m=0}^{M} h_j(m) z^{-m}
\bar{\beta}_j = \begin{cases} 1, & \beta_j \ge 0.5 \\ \beta_j, & \beta_j < 0.5 \end{cases}
where h_j(m) are the bandpass filter coefficients for the j-th band. The bandpass voicing strength for the j-th band is obtained from a normalized correlation coefficient:
\beta_j = f(r_t), \qquad r_t = \frac{\sum_{n=0}^{N-1} s(n)\, s(n+t)}{\sqrt{\left[\sum_{n=0}^{N-1} s^2(n)\right]\left[\sum_{n=0}^{N-1} s^2(n+t)\right]}}
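A small numpy/scipy sketch of the voicing-strength computation above: band-limit the frame, evaluate the normalized correlation r_t at the pitch lag, and apply the 0.5 threshold from the slide. The band edges, frame content, and lag are illustrative assumptions.

import numpy as np
from scipy.signal import butter, lfilter

def voicing_strength(s, t):
    # normalized correlation r_t between s(n) and s(n + t)
    s0, s1 = s[:-t], s[t:]
    return np.dot(s0, s1) / np.sqrt(np.dot(s0, s0) * np.dot(s1, s1) + 1e-12)

fs = 16000
frame = np.random.randn(4000)                       # stand-in analysis frame
b, a = butter(4, [500 / (fs / 2), 1000 / (fs / 2)], btype="band")
band = lfilter(b, a, frame)                         # j-th bandpass component
beta_j = voicing_strength(band, t=133)              # lag = pitch period (~120 Hz)
beta_bar_j = 1.0 if beta_j >= 0.5 else beta_j       # hard threshold at 0.5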

Slide 19: Application to statistical parametric synthesis
Additional parameters for acoustic modeling:
1. Bandpass voicing strengths: 5
2. Fourier magnitudes: 10

Slide 20: STRAIGHT excitation [Zen et al., 2007a]
STRAIGHT vocoder, excitation construction (no phase manipulation case): the pulse spectrum T(e^{jw}), with zero phase, is weighted by H_p(e^{jw}); the noise spectrum W(e^{jw}), with random phase and |W(e^{jw})| = 1, is weighted by H_n(e^{jw}). Both weighting filters are derived from the aperiodicity parameters, and the voiced part V(e^{jw}) and unvoiced part U(e^{jw}) are summed into the mixed excitation E(e^{jw}), i.e. e(n).

Slide 21: STRAIGHT vocoder for statistical parametric synthesis
- Aperiodicity parameters are extracted and averaged over specified frequency sub-bands: band-aperiodicity parameters (BAP)
- At synthesis time the generated BAP are converted back into aperiodicity values
- Speech is synthesized in the frequency domain
- Achieves very good quality
- Additional parameters for acoustic modeling: BAP, usually 5 coefficients

Slide 22: Pulse and noise weighting filters
Filters H_p(e^{jw}) and H_n(e^{jw}) shape the pulse and noise inputs, just like in MELP. Their frequency responses are obtained from the aperiodicity parameters a(w):
|H_p(e^{jw})| = 1 - a(w), \qquad |H_n(e^{jw})| = a(w), \qquad 0 \le w \le \pi

Slide 23: Band aperiodicity parameters
Aperiodicity at frequency w:
a(w) = \Upsilon\!\left( \frac{\int w_{\mathrm{ERB}}(\lambda; w)\, |S(e^{j\lambda})|^2\, \frac{|S_L(e^{j\lambda})|^2}{|S_U(e^{j\lambda})|^2}\, d\lambda}{\int w_{\mathrm{ERB}}(\lambda; w)\, |S(e^{j\lambda})|^2\, d\lambda} \right)
where:
- |S(e^{jw})|: speech spectral envelope
- |S_U(e^{jw})|: envelope constructed by connecting the peaks of |S(e^{jw})|
- |S_L(e^{jw})|: envelope constructed by connecting the valleys of |S(e^{jw})|
- w_ERB(lambda; w): auditory filter used to smooth |S(e^{jw})|
- Upsilon(.): look-up table operation
Band aperiodicity over the j-th frequency band Omega_j:
b_j = \frac{1}{|\Omega_j|} \int_{\Omega_j} a(w)\, dw
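A brief numpy sketch of the band averaging, and of how the weighting filters of slide 22 can be rebuilt from the BAP at synthesis time; the aperiodicity curve is synthetic, and the 5-band layout follows the example on the next slide.

import numpy as np

fs, nfft = 16000, 1024
freqs = np.fft.rfftfreq(nfft, 1.0 / fs)
aper = np.clip(np.linspace(0.05, 0.9, freqs.size), 0.0, 1.0)  # made-up a(w)

edges = [0, 1000, 2000, 4000, 6000, 8000]        # band edges in Hz
idx = np.digitize(freqs, edges[1:-1])            # band index 0..4 per FFT bin
bap = np.array([aper[idx == j].mean() for j in range(5)])   # b_j: mean a(w) per band

a_hat = bap[idx]                                 # synthesis: expand BAP to the FFT grid
H_p, H_n = 1.0 - a_hat, a_hat                    # pulse and noise weights (slide 22)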

Slide 24: Aperiodicity and band aperiodicity: examples
[Plots of aperiodicity and band aperiodicity (log magnitude vs. frequency in Hz) for two band layouts: 5 bands (0-1 kHz, 1-2 kHz, 2-4 kHz, 4-6 kHz, 6-8 kHz) and 24 Bark critical bands.]

Slide 25: Contents (next: Methods that attempt to mimic the LP residual)

Slide 26: State-dependent mixed excitation [Maia et al., 2007]
From the labels, the trained HMMs yield a state sequence {s_0, ..., s_{Q-1}} with state durations; spectral parameters and F0 are generated per state, and each state selects a filter pair H_v^{(q)}(z), H_u^{(q)}(z). The pulse train t(n) is filtered by H_v(z) into the voiced excitation v(n), white noise w(n) is filtered by H_u(z) into the unvoiced excitation u~(n), and the mixed excitation e~(n) drives the synthesis filter H(z).
Filters:
H_v(z) = \sum_{m=-M/2}^{M/2} h(m) z^{-m}, \qquad H_u(z) = \frac{1}{\sum_{l=0}^{L} g(l) z^{-l}}

Slide 27: State-dependent mixed excitation: training
The LP residual e(n) is the target signal. The pulse train t(n), with positions {p_0, ..., p_{Z-1}} and amplitudes {a_0, ..., a_{Z-1}}, is filtered by H_v(z) into the voiced excitation v(n); the remaining unvoiced excitation u(n) is whitened by G(z) = 1/H_u(z) into the error signal w(n), modeled as white noise.
The filter coefficients h = [h(-M/2) ... h(M/2)] and g = [g(0) ... g(L)], together with the pulse positions and amplitudes, are optimized so that
\{h, g, t(n)\} = \arg\max_{g, h, t(n)} P(e(n) \mid g, h, t(n))

Slide 28: State-dependent mixed excitation: synthesis
F0 drives the pulse train generator, and the state sequence q = {s_0, ..., s_{Q-1}} selects H_v^{(q)}(z) and H_u^{(q)}(z). The voiced branch, scaled by a gain alpha, and the noise branch, shaped by H_u^{(q)}(z), high-pass filtering H_HP(z), and time modulation rho(n), are summed into the excitation e~(n).
- The noise component is colored through: 1. high-pass filtering (F_c = 2 kHz); 2. time modulation with a pitch-synchronous triangular window rho(n)
- H_v(z) is normalized in energy
- The gain alpha adjusts the energy of the voiced component so that the power of the excitation signal e~(n) becomes one
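A hedged sketch of the noise coloring described above: white noise is high-pass filtered at 2 kHz and modulated by a pitch-synchronous triangular window; the pitch period and filter order are illustrative.

import numpy as np
from scipy.signal import butter, lfilter

fs, period = 16000, 133                          # ~120 Hz pitch, assumed
w = np.random.randn(10 * period)

b, a = butter(4, 2000 / (fs / 2), btype="high")  # F_c = 2 kHz, per the slide
u = lfilter(b, a, w)                             # high-pass filtered noise

tri = 1.0 - np.abs(np.linspace(-1.0, 1.0, period))   # one triangular period
rho = np.tile(tri, len(u) // period + 1)[:len(u)]    # rho(n), pitch-synchronous
u_tilde = rho * u                                # time-modulated unvoiced excitation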

Slide 29: Voiced filter effect
[Figure: waveforms (amplitude vs. time in ms) and spectra (magnitude in dB vs. frequency in Hz) of the pulse train, the voiced filter, and the resulting voiced excitation.]

Slide 30: Unvoiced filter effect
[Figure: waveforms and spectra of the white noise, the unvoiced filter, and the resulting unvoiced excitation.]

Slide 31: Deterministic plus stochastic residual modeling [Drugman et al., 2009]
Assumed model of the LP residual: e(n) = e_d(n) + e_s(n), where e_d(n) is the deterministic part and e_s(n) the stochastic part.
Maximum voiced frequency F_m: the boundary between deterministic and stochastic components, set to 4 kHz.

Slide 32: Deterministic modeling: eigenresidual calculation
Pipeline: inverse filtering of the speech data (using the spectral parameters) gives the residual; GCI detection and F0 extraction give the pitch periods; the residual is windowed into segments, which are resampled and energy-normalized; PCA of the segments yields the eigenresiduals.
The normalized frequency F_0^* used for resampling the residual segments is chosen so that F_0^* \le (F_{Nyquist} / F_m)\, F_{0,min}.
PCA: the retained eigenresiduals explain about 80% of the total dispersion.
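The pipeline can be mimicked in a few lines of numpy/scipy; here the pitch-synchronous residual segments are synthetic stand-ins, and PCA is done via the SVD of the centered, energy-normalized segment matrix.

import numpy as np
from scipy.signal import resample

target_len = 200                           # length at the normalized F0 (assumed)
segments = [np.random.randn(np.random.randint(150, 260)) for _ in range(500)]

X = np.stack([resample(seg, target_len) for seg in segments])   # length normalization
X /= np.linalg.norm(X, axis=1, keepdims=True)                   # energy normalization
X -= X.mean(axis=0)

_, svals, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt: eigenresiduals
explained = np.cumsum(svals**2) / np.sum(svals**2)
n_keep = int(np.searchsorted(explained, 0.8)) + 1     # ~80% of the dispersion
eigenresiduals = Vt[:n_keep]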

Slide 33: Stochastic modeling
Stochastic component model:
e_s(n) = \rho(n)\, [h_u(n) * w(n)]
where rho(n) is a pitch-synchronous modulation window, h_u(n) the impulse response of the unvoiced filter, and w(n) white noise.
Unvoiced filter h_u(n): fixed, auto-regressive (all-pole), with coefficients obtained through LP analysis.

Slide 34: Application to statistical parametric synthesis
Additional parameters for acoustic modeling: PCA weights (15). Using eigenresiduals of higher ranks makes no difference, so the stream of PCA weights can optionally be removed.
Synthesis part: the eigenresiduals are linearly combined with the PCA weights and resampled according to F0, giving the deterministic part r_d(n); white noise w(n) is filtered by H_u(z) and time-modulated by rho(n), giving the stochastic part r_s(n); OLA combines them into e(n).
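A sketch of the deterministic synthesis path just described: eigenresiduals are linearly combined, resampled to the target pitch, and overlap-added; all shapes, weights, and the window choice are illustrative assumptions.

import numpy as np
from scipy.signal import resample

norm_len, n_eig, period = 200, 15, 133
eigres = np.random.randn(n_eig, norm_len)       # stand-in eigenresiduals
weights = np.random.randn(n_eig)                # stand-in generated PCA weights

frame = resample(weights @ eigres, 2 * period)  # r_d: one two-period frame at target F0
frame *= np.hanning(2 * period)

r_d = np.zeros(12 * period)
for k in range(10):                             # pitch-synchronous overlap-add
    r_d[k * period : (k + 2) * period] += frame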

Slide 35: Waveform interpolation [Sung et al., 2010]
Waveform interpolation (WI): each cycle of the excitation signal is represented by a characteristic waveform (CW),
e(n) = \sum_{k=0}^{P/2} \left[ A_k \cos\!\left(\frac{2\pi k n}{P}\right) + B_k \sin\!\left(\frac{2\pi k n}{P}\right) \right]
where {A_k, B_k} are discrete-time Fourier series coefficients and P is the pitch period. The CW is extracted from the LP residual at a fixed rate.
Information needed to reconstruct the excitation signal: the pitch period P and the CW coefficients {A_k, B_k}.
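A minimal numpy sketch of the CW representation: analyze one residual cycle into the Fourier series coefficients {A_k, B_k} and resynthesize it with the formula above (the cycle itself is a random stand-in).

import numpy as np

P = 128
n = np.arange(P)
cycle = np.random.randn(P)            # stand-in for one residual cycle

K = P // 2
A = np.array([2.0 / P * np.sum(cycle * np.cos(2 * np.pi * k * n / P)) for k in range(K + 1)])
B = np.array([2.0 / P * np.sum(cycle * np.sin(2 * np.pi * k * n / P)) for k in range(K + 1)])
A[0] /= 2.0                           # DC coefficient is not doubled
A[K] /= 2.0                           # nor is the Nyquist coefficient (P even)

recon = sum(A[k] * np.cos(2 * np.pi * k * n / P) +
            B[k] * np.sin(2 * np.pi * k * n / P) for k in range(K + 1))
# recon reproduces cycle up to numerical precision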

Slide 36: Application to statistical parametric synthesis
Analysis is similar to the eigenresidual calculation [Drugman et al., 2009]: speech data -> WI analysis -> CW coefficients -> PCA -> basis vector selection -> basis vector coefficient computation.
Additional parameters for acoustic modeling: coefficients of the basis vectors (8).
Synthesis: the basis vector coefficients and the pitch drive CW reconstruction, followed by WI synthesis of the speech.

Slide 37: Contents (next: Methods that work on source and vocal tract modeling)

Slide 38: Glottal inverse filtering [Raitio et al., 2008]
Uses Iterative Adaptive Inverse Filtering (IAIF) [Alku, 1992]. From the speech signal, the analysis extracts the harmonic-to-noise ratio (HNR), the energy, F0, and the vocal tract spectrum V(z); glottal inverse filtering gives the estimated voice source signal g(n), whose LP analysis yields the voice source spectrum G(z).
Features for acoustic modeling:
1. F0
2. Energy
3. HNR in 4 bands: 0-2 kHz, 2-4 kHz, 4-6 kHz, 6-8 kHz
4. Voice source spectrum (glottal flow): 10 LSPs
5. Vocal tract spectrum: 30 LSPs

Slide 39: At synthesis time
A library pulse is interpolated to the target F0, scaled by the gain, noise is added according to the HNR, and spectral matching with G(z) is applied; the lip radiation filter L(z) and the V/U switch (white noise with its own gain in unvoiced segments) complete the excitation.
- The library pulse is extracted from the speech data through glottal inverse filtering
- Noise is added separately to each band of the voiced excitation according to the HNR
- G(z) implements the spectral shape of the glottal pulse
- L(z) is a fixed lip radiation filter

Slide 40: Glottal spectrum separation (GSS) [Cabral et al., 2008]
Speech production model:
S(e^{jw}) = D(e^{jw})\, G(e^{jw})\, V(e^{jw})\, R(e^{jw})
where D(e^{jw}) is the pulse train, G(e^{jw}) the glottal pulse, V(e^{jw}) the vocal tract, and R(e^{jw}) the lip radiation.
Simplified speech production model: S(e^{jw}) = D(e^{jw}) V(e^{jw}).
What GSS does:
1. Estimate a model of the glottal flow derivative => E(e^{jw})
2. Remove its effect from the speech spectral envelope H^(e^{jw}) => V(e^{jw}) = H^(e^{jw}) / E(e^{jw})
3. Re-synthesize speech: S(e^{jw}) = D(e^{jw}) E(e^{jw}) V(e^{jw})

Slide 41: Glottal flow model utilized in GSS
The LF model is used to represent the glottal pulse. [Figure: LF-model glottal flow derivative over one fundamental period, marking t_p, t_e, t_a, t_c and the amplitude E_e.]
Parameters:
- t_p: instant of maximum flow
- t_e: instant of maximum excitation
- T_a = t_a - t_e: return phase duration
- t_c: instant of complete closure
- E_e: amplitude of maximum excitation
- T_0: fundamental period

Slide 42: Application to statistical parametric synthesis
Speech is analyzed by inverse filtering into the glottal flow derivative, and by STRAIGHT into the speech spectral envelope H^(e^{jw}); LF-model parameterization of the glottal flow derivative gives the LF parameters, whose pulse e_LF(n) is transformed by FFT into E_LF(e^{jw}); spectral separation then yields the spectral parameters:
V(e^{jw}) = \frac{\hat{H}(e^{jw})}{E_{LF}(e^{jw})}
Acoustic modeling:
1. Spectral parameters: mel-cepstral coefficients
2. Band aperiodicity parameters
3. 5 LF-model parameters: t_e, t_p, T_a, E_e, T_0
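The separation step reduces to a guarded spectral division; a short numpy sketch, with synthetic placeholders for the envelope and the LF pulse:

import numpy as np

nfft = 1024
H_hat = np.abs(np.fft.rfft(np.random.randn(nfft)))  # stand-in envelope |H^(e^jw)|
e_lf = np.random.randn(256)                         # stand-in LF pulse e_LF(n)
E_lf = np.abs(np.fft.rfft(e_lf, nfft))              # |E_LF(e^jw)|

V = H_hat / np.maximum(E_lf, 1e-6)   # |V(e^jw)| = |H^| / |E_LF|, floored divisor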

Slide 43: Synthesis part
The generated LF-model parameters drive an LF pulse generator, whose output e_LF(n) is transformed by FFT into E_LF(e^{jw}) and weighted by H_p(e^{jw}); white noise w(n) is transformed by FFT into W(e^{jw}) and weighted by H_n(e^{jw}); both weights come from the aperiodicity parameters, and the IFFT of the sum gives the mixed excitation.
- The STRAIGHT vocoder is utilized
- The original delta pulse is replaced by the pulse created from the generated LF-model parameters

Slide 44: Vocal tract and LF-model plus noise separation [Lanchantin et al., 2010]
Assumed speech model:
S(e^{jw}) = \left[ D(e^{jw})\, G(e^{jw}) + W(e^{jw}) \right] V(e^{jw})\, R(e^{jw})
where D(e^{jw}) is the pulse train, G(e^{jw}) the glottal pulse, W(e^{jw}) white noise, V(e^{jw}) the vocal tract, and R(e^{jw}) the lip radiation.
Parameterization:
1. Fundamental frequency: D(e^{jw})
2. LF model parameter: G(e^{jw})
3. Noise power: W(e^{jw})
4. Mel-cepstral coefficients: V(e^{jw})

Slide 45: Application to statistical parametric synthesis
1. Estimate a single parameter for a simplified LF model: R_d
2. Determine a maximum voiced frequency w_VU
3. Estimate the power of the noise W(e^{jw}): sigma_g^2
4. Estimate the vocal tract parameters:
V(e^{jw}) = \begin{cases} \tau_o\!\left( \dfrac{S(e^{jw})}{\gamma\, R(e^{jw})\, G(e^{jw})} \right), & w < w_{VU} \\[2ex] C_o\!\left( \dfrac{S(e^{jw})}{\Gamma\, R(e^{jw})\, G(e^{jw})} \right), & w \ge w_{VU} \end{cases}
where tau_o denotes cepstral analysis by fitting the harmonic peaks, C_o the power cepstrum, and Gamma, gamma are normalization terms.
Acoustic modeling:
1. Simplified LF-model parameter: R_d
2. Standard deviation of the noise component: sigma_g
3. Cepstral coefficients that represent V(e^{jw})
4. F0

Slide 46: Synthesis time
Excitation construction for voiced segments: R_d drives a glottal pulse generator, whose output g(n) is transformed by FFT into G(e^{jw}); white noise w(n), after windowing and time modulation, is transformed by FFT into W(e^{jw}) and high-pass filtered by H_HP(e^{jw}) above w_VU; their sum is the excitation E(e^{jw}).
Excitation construction for unvoiced segments: windowed white noise, transformed by FFT, gives E(e^{jw}).
Synthesized speech signal:
S(e^{jw}) = E(e^{jw})\, V(e^{jw})\, R(e^{jw})

Slide 47: Vocoding methods: summary and examples
Method     Description
Simple     Pulse train/white noise simple switch
MELP       MELP mixed excitation
STRAIGHT   STRAIGHT mixed excitation
SDF        State-dependent filtering mixed excitation
DSM        Deterministic plus stochastic modeling of the residual
WI         Waveform interpolation applied to statistical parametric synthesis
GlottHMM   Glottal inverse filtering
GSS        Glottal spectrum separation
SVLN       Separation of vocal tract and LF-model plus noise
(Audio samples of each method were played during the talk.)

Slide 48: Contents (next: Joint acoustic modeling and waveform generation for statistical parametric speech synthesis)

Slide 49: Joint vocoding-acoustic modeling
The goal of any speech synthesizer is to reproduce the speech waveform. Parameters of a joint acoustic-excitation model are therefore estimated by maximizing the probability of the speech waveform s = [s(0) ... s(N-1)] given the labels l:
\hat{\lambda} = \arg\max_{\lambda} p(s \mid l, \lambda) = \arg\max_{\lambda} \int p(s \mid c, \lambda)\, p(c \mid l, \lambda)\, dc
with the spectral parameter sequence c = [c_0 ... c_{T-1}] as a hidden variable and lambda = {lambda_c, lambda_e} the acoustic-excitation model parameters: lambda_c the acoustic model part, lambda_e the excitation model part.

Slide 50: Another viewpoint: waveform-level modeling
Comparison with the typical modeling for parametric synthesis:
- Typical: the state sequence q is the hidden variable, the spectrum c is the observation
- New model: the state sequence q and the spectrum c are hidden variables, the speech s is the observation
Similar concepts: [Toda and Tokuda, 2008], factor-analyzed trajectory HMM for spectral estimation; [Wu and Tokuda, 2009], closed-loop training for HMM-based synthesis.

Slide 51: A close look into the probabilities involved
Typical statistical modeling for parametric synthesis:
\hat{\lambda}_c = \arg\max_{\lambda_c} \sum_{q} p(c \mid q, \lambda_c)\, p(q \mid l, \lambda_c)
Augmented statistical modeling:
\hat{\lambda} = \arg\max_{\lambda} \sum_{q} \int p(s \mid c, q, \lambda)\, p(c \mid q, \lambda)\, p(q \mid l, \lambda)\, dc
Here p(s | c, q, lambda) is the speech generative model (speech production from the spectrum), while p(c | q, lambda) p(q | l, lambda) can be modeled by existing machines, e.g. HMM, HSMM, trajectory HMM.

Slide 52: One possible speech generative model
The pulse train t(n), with positions {p_0, ..., p_{Z-1}} and amplitudes {a_0, ..., a_{Z-1}}, is filtered by H_v(z) into the voiced excitation v(n); Gaussian white noise w(n) (zero mean, variance one) is filtered by H_u(z) into the unvoiced excitation u(n); the excitation e(n) drives H_c(z) to give the speech s(n).
Probability of the speech signal:
p(s \mid H_c, q, \lambda_e) = |H_c|^{-1}\, N\!\left( H_c^{-1} s;\; H_{v,q}\, t,\; \Phi_q \right), \qquad \Phi_q = (G_q^{\top} G_q)^{-1}
lambda_e = {H_v, G, t}: excitation model parameters

Slide 53: Vocal tract filter impulse response and spectral parameters: relationship
We need p(s | c, q, lambda), not p(s | H_c, q, lambda), so a mapping between H_c and c is necessary. Two possibilities:
1. The relationship between H_c and c is represented as a Gaussian process
2. The relationship between H_c and c is deterministic: cepstral coefficients!

Slide 54: Acoustic modeling
Trajectory HMM [Zen et al., 2007b] for acoustic modeling:
p(c \mid l, \lambda_c) = \sum_{q} p(c \mid q, \lambda_c)\, p(q \mid l, \lambda_c) = \sum_{q} N(c;\, \bar{c}_q, P_q)\; \pi_{q_0} \prod_{t=0}^{T-1} \alpha_{q_t q_{t+1}}
with
R_q = P_q^{-1} = W^{\top} \Sigma_q^{-1} W, \qquad r_q = W^{\top} \Sigma_q^{-1} \mu_q, \qquad \bar{c}_q = P_q r_q
where W appends dynamic features to c.
Why: it models p(c | l, lambda_c) instead of p(Wc | l, lambda_c) as in the conventional HMM.
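The mean trajectory c_bar_q follows directly from these relations; a compact numpy sketch with a single static dimension, a simple delta window, and random placeholders for the state statistics:

import numpy as np

T = 50
W = np.zeros((2 * T, T))        # maps static c to stacked [static; delta] features
W[:T] = np.eye(T)
for t in range(T):              # delta window: 0.5 * (c[t+1] - c[t-1]), clipped at edges
    W[T + t, max(t - 1, 0)] -= 0.5
    W[T + t, min(t + 1, T - 1)] += 0.5

mu = np.random.randn(2 * T)     # stacked state means mu_q (placeholder)
prec = np.eye(2 * T)            # Sigma_q^{-1}; diagonal in practice

R = W.T @ prec @ W              # R_q = W' Sigma^{-1} W
r = W.T @ prec @ mu             # r_q = W' Sigma^{-1} mu_q
c_bar = np.linalg.solve(R, r)   # c_bar_q = P_q r_q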

Slide 55: Joint acoustic-excitation model
Final joint model:
p(s \mid l, \lambda) = \sum_{q} \int p(s \mid c, q, \lambda_e)\; p(c \mid q, \lambda_c)\, p(q \mid l, \lambda_c)\; dc
where the excitation model is p(s \mid c, q, \lambda_e) = |H_c|^{-1} N(H_c^{-1} s;\, H_{v,q} t, \Phi_q) and the acoustic model is p(c \mid q, \lambda_c)\, p(q \mid l, \lambda_c) = N(c;\, \bar{c}_q, P_q)\, \pi_{q_0} \prod_{t=0}^{T-1} \alpha_{q_t q_{t+1}}.

Slide 56: Training procedure
Initialization: initial cepstral analysis of the speech gives the initial cepstrum vector c; trajectory HMM training yields the state sequence q^ together with R_q^, r_q^; with the initial cepstrum vector, excitation model training yields H_v(z), G(z), t(n).
Recursion: estimate the best cepstral coefficients c^; retrain the trajectory HMM (updating the state sequence q^) and the excitation model with the updated cepstrum vector c^, and iterate.

Slide 57: Contents (next: Conclusion)

Slide 58: Conclusions
- Quality of statistical parametric synthesizers can be improved through better waveform generation methods.
- Existing approaches that use source-filter modeling can be classified into: methods that attempt to improve solely the excitation signal; methods that focus on the speech production model as a whole.
- Naturalness degradation of statistical parametric synthesizers has basically two causes: acoustic modeling that produces averaged parameter trajectories, and the use of parametric speech production models.
- Methods which can integrate both acoustic modeling and speech production address these causes jointly.

Slide 59: Acknowledgments
Many thanks to João Cabral (University College Dublin, Ireland), Tuomo Raitio (Helsinki University of Technology, Finland), Thomas Drugman (Faculté Polytechnique de Mons, Belgium), Pierre Lanchantin (IRCAM, France), and June Sig Sung (Seoul National University, South Korea) for providing samples of their systems.

References I
- Alku, P. (1992). Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11(2-3).
- Cabral, J., Renals, S., Richmond, K., and Yamagishi, J. (2008). Glottal spectral separation for parametric speech synthesis. In Proc. of Interspeech.
- Deller, Jr., J. R., Hansen, J. H. L., and Proakis, J. G. (2000). Discrete-Time Processing of Speech Signals. IEEE Press Classic Reissue, New York.
- Drugman, T., Wilfart, G., and Dutoit, T. (2009). A deterministic plus stochastic model of the residual signal for improved parametric speech synthesis. In Proc. of Interspeech.
- Lanchantin, P., Degottex, G., and Rodet, X. (2010). An HMM-based speech synthesis system using a new glottal source and vocal-tract separation method. In Proc. of ICASSP.
- Maia, R., Toda, T., Zen, H., Nankaku, Y., and Tokuda, K. (2007). An excitation model for HMM-based speech synthesis based on residual modeling. In Proc. of the 6th ISCA Workshop on Speech Synthesis.
- Raitio, T., Suni, A., Pulakka, H., Vainio, M., and Alku, P. (2008). HMM-based Finnish text-to-speech system using glottal inverse filtering. In Proc. of Interspeech.

References II
- Sung, J., Kyung, D., Oh, H., and Kim, N. (2010). Excitation modeling based on waveform interpolation for HMM-based speech synthesis. In Proc. of Interspeech.
- Toda, T. and Tokuda, K. (2008). Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM. In Proc. of ICASSP.
- Wu, Y. J. and Tokuda, K. (2009). Minimum generation error training by using original spectrum as reference for log spectral distortion measure. In Proc. of ICASSP.
- Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., and Kitamura, T. (2001). Mixed-excitation for HMM-based speech synthesis. In Proc. of EUROSPEECH.
- Zen, H., Toda, T., Nakamura, M., and Tokuda, K. (2007a). Details of the Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. IEICE Trans. on Inf. and Systems, E90-D(1).
- Zen, H., Tokuda, K., and Kitamura, T. (2007b). Reformulating the HMM as a trajectory model by imposing explicit relationships between static and dynamic feature vector sequences. Computer Speech and Language, 21(1).
