Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda

Size: px

Start display at page:

Download "Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation. Keiichi Tokuda"

Noah Norris
6 years ago
Views:

1 Mel-Generalized Cepstral Representation of Speech A Unified Approach to Speech Spectral Estimation Keiichi Tokuda Nagoya Institute of Technology Carnegie Mellon University Tamkang University March 13,

2 Conventional Speech Spectral Estimation Linear prediction (LPC) Cepstral analysis Subband filter bank Autoregressive (AR) model Exponential (EX) model Nonparametric Variations Model Pole-zero (ARMA) model Analysis window Adaptive analysis (sample by sample basis) Auditory characteristics Warped LPC, PLP, etc. Auditory frequency scales (mel, Bark) Loudness scales (log, sone) 2

3 Structure of This Talk 1. Conventional cepstral analysis 2. Introduction of generalized logarithmic function Generalized cepstral analysis 3. Introduction of auditory frequency scale Mel-generalized cepstral analysis 4. Applications to speech recognition and coding 3

4 History of Cepstral Analysis B.P. Bogert, M.J.R. Healy, J.W. Tukey (1963) Analysis of seismic signals decomposition into direct wave and echo Cepstrum, Quefrency, Lifter A.M. Noll (1964, 1967) Pitch extraction based on cepstrum A.V. Oppenheim (1966, 1968) Homomorphic deconvolution decomposition into source and vocal tract function Complex cepstrum 4

5 Definition of Cepstrum Fourier transform of signal s(n) Cepstrum S(e jω )=F [ s(n) ] C(m) =F 1 [ log S(e jω ) 2 ] (Bogert et al., Noll) C(m) =F 1 [ log S(e jω ) ] (Oppenheim) 5

6 Complex Cepstrum z-transform of signal s(n) Complex cepstrum S(z) =Z [ s(n) ] c(m) = Z 1 [ log S(z) ] = F 1 [ log S(e jω ) ] = F 1 [ log S(e jω ) + j arg S(e jω ) ] 6

7 Cepstrum and Complex Cepstrum log S(e jω ) = F [ C(m) ] =Re[ F [ c(m) ]] When it is minimum phase (all poles and zeros are located in the unit circle) c(m) = 0, m < 0 C(m), m =0 2C(m), m > 0 7

8 Homomorphic Deconvolution Pulse train White noise e(n) Linear time-invariant system h(n) s(n) =e(n) h(n) Speech s(n) =h(n) e(n) F S(e jω )=H(e jω )E(e jω ) log log S(e jω ) =log H(e jω ) +log E(e jω ) F 1 C(m) =C h (m)+c e (m) 8

9 Amplitude (db) C(m) s(n) Time (100μs) (a) Quefrency (100μs) (b) Frequency (khz) (d) Amplitude (db) Amplitude (db) Frequency (khz) (c) Frequency (khz) (e) 9

10 Spectrum of Periodic Signal e(n) w(n) log E w (e jω ) W(e jω ) 2π/N p N p 0 π Frequency e(n) h(n) s(n) log S w (e jω ) H(e jω ) 0 π Frequency 10

11 Problems of Homomorphic Processing (Cepstral Analysis) Linear smoothing of log spectrum affected by fine structure of FFT spectrum results in a large bias and variance Voiced speech (periodic) Envelope of peaks of spectral fine structure Improved cepstral analysis, PSE: Biased 11

12 Cost Function P (ω): I N (ω): Estimate of Power Spectrum Periodogram E = 1 2π π π { IN (ω) P (ω) log I } N(ω) P (ω) 1 dω min x: Gaussian Process Maximizing p(x c) Unbiased estimation of log spectrum equivalent to one used in LPC Minimization of energy of inverse filter output 12

13 Analysis of Natural Speech Log magnitude (db) Log magnitude (db) Frequency (khz) (a) Unbiased cepstral analysis Frequency (khz) (b) Linear prediction 13

14 Generalized Cepstrum Complex Cepstrum c(m) = Z 1 [ log S(z) ] log S(z) = Z [ c(m) ] Generalized Cepstrum c γ (m) = Z 1 [ s γ (S(z))] s γ (S(z)) = Z [ c γ (m) ] 14

15 Generalized logarithmic function s γ (w) = (w γ 1)/γ, 0 < γ 1 log w, γ =0 s ( x ) γ γ = < γ < 1 γ = 0 (log x ) 1 < γ < 0 γ = 1 x 15

16 Spectral Model Generalized Cepstrum: c γ (m) H(z) = s 1 γ = M m=0 1+γ exp c γ (m) z m M m=0 M m=0 c γ (m) z m 1/γ, 0 < γ 1 c γ (m) z m, γ =0 Inverse function of Generalized logarithm s 1 γ (w) = (1 + γw) 1/γ, 0 < γ 1 exp w, γ =0 16

17 Cost Function E = 1 2π π π { IN (ω) P (ω) log I } N(ω) P (ω) 1 dω min Estimate of Power Spectrum P (ω) = H(e jω ) 2 = σ 2 D(e jω ) 2 Interpretation in time-domain ε = E [ e 2 (n) ] min x(n) 1/D(z) e(n) 17

18 Advantage 1 γ 0: Convex function Global solutioncan easily be obtained The obtained system H(z) is minimum phase, e.g., stable γ = 1 Linear Prediction H(z) = 1 γ =0 Cepstrum H(z) = exp M m=0 M m=0 1 c γ (m)z m c γ (m)z m 18

19 Prediction Gain D(z) is minimum phase Gain of D(z) is one Predictor: Q(z) = k=1 a(k)z k Cost Function: ε = E [ e 2 (n) ] Prediction Gain: G = E [ x 2 (n) ] E [ e 2 (n) ] x(n) x(n) 1/D(z) Q(z) + e(n) e(n) 19

20 Analysis of synthetic signal (Generalized Cepstral Analysis) Log magnitude 10dB True γ =-1-3/4-1/2-1/ Frequency(Hz) Prediction gain(db) 8 7 (a) Example 1 M= (LPC) (UELS) γ 20

21 Analysis of synthetic signal (Generalized Cepstral Analysis) Log magnitude 10dB True γ =-1-3/4-1/2-1/ Frequency(Hz) Prediction gain(db) 5 4 (c) Example 3 M= (LPC) (UELS) γ 21

22 Analysis of synthetic signal (Generalized Cepstral Analysis) Log magnitude 10dB True γ =-1-3/4-1/2-1/ Frequency(Hz) Prediction gain(db) 7 6 (b) Example 2 M= (LPC) (UELS) γ 22

23 Analysis of natural speech (Generalized Cepstral Analysis) /e/ 80 Magnitude(dB) γ = -1 γ = -1/2 γ = -1/3 γ = Frequency(kHz) Frequency(kHz) Frequency(kHz) Frequency(kHz) Prediction gain(db) M= (LPC) (UELS) γ (a) male /e/ 23

24 Analysis of natural speech (Generalized Cepstral Analysis) /N/ 80 Magnitude(dB) γ = -1 γ = -1/2 γ = -1/3 γ = Frequency(kHz) Frequency(kHz) Frequency(kHz) Frequency(kHz) Prediction gain(db) M= (LPC) (UELS) γ (b) male /N/ 24

25 Structure of synthesis filter H(z) (γ = 1/n) 1st 3nd... n-th input 1 1 σ 1 C(z) C(z) C(z) output H(z) =σd(z) =σ { 1 C(z) } n C(z) = 1+γ M c γ (m) z m m=0 25

26 Structure of synthesis filter H(z) (γ =0) LMA filter Input F (z) F (z) F (z) F (z) A L,1 A L,2 A L,3 A L,4 Output + D(z) = exp F (z) R L (F (z)) = L l=1 L l=1 A L,l {F (z)} l A L,l { F (z)} l F (z) = M m=1 c γ (m) z m 26

27 Introduction of auditory frequency scale First-order all-pass function: z 1 α =Ψ(z) = z 1 α 1 αz 1 Phase Characteristics can be used for Frequency Transformation: ω = tan 1 (1 α 2 )sinω (1 + α 2 ) cos ω 2α where Ψ(e jω )=e j ω Warped frequency ω (rad) π π /2 mel scale 10kHz sampling α = π/2 π Frequency ω (rad) 27

28 Mel-Generalized Cepstral Analysis Mel-generalized cepstrum: c α,γ (m) H(z) = s 1 γ = M m=0 1+γ exp c α,γ (m) z m M m=0 M m=0 α c α,γ (m) z m α 1/γ, 0 < γ 1 c α,γ (m) z m α, γ =0 z 1 α = z 1 α 1 αz 1 28

29 (α, γ) =(0, 0) Cepstral model: H(z) = exp (α, γ) =(0, 1) AR model: M m=0 c α,γ (m)z m H(z) = 1 M m=0 1 c α,γ (m)z m (α, γ) =(0.35, 0) Mel-cepstral model: H(z) = exp M m=0 c α,γ (m) z m α (α, γ) =(0.47, 1) Warped AR model: H(z) = 1 M m=0 1 c α,γ (m) z m α 29

30 A Unified Approach to Speech Spectral Estimation α < 1, 1 γ 0 α =0 Generalized cepstral analysis Mel-generalized cepstral analysis γ = 1 Linear prediction Warped Linear Prediction γ =0 Unbiased Cepstral analysis Mel-cepstral analysis 30

31 Mel-generalized analysis of natural speech /N/ M =12 Magnitude(dB) Magnitude(dB) Magnitude(dB) α = 0 α = 0.35 α = Frequency(kHz) Frequency(kHz) Frequency(kHz) γ = 0 γ = 0.5 γ = 1 31

32 Example α =0 γ = 1 γ = 1/3 γ =0 Frequency(kHz) Frequency(kHz) Frequency(kHz) n a N b u d e w a (ms) (a) Waveform 40(dB) 40(dB) (α, γ, M) = (0, -1, 12) LP (α, γ, M) = (0, -1/3, 12) GCEP 40(dB) (α, γ, M) = (0, 0, 12) UELS (b) Spectral estimates ( α = 0) 32

33 Example α =0.35 γ = 1 γ = 1/3 γ =0 Frequency(kHz) Frequency(kHz) Frequency(kHz) n a N b u d e w a (ms) (a) Waveform 40(dB) 40(dB) (α, γ, M) = (0.35, -1, 12) WLP (α, γ, M) = (0.35, -1/3, 12) MGCEP 40(dB) (α, γ, M) = (0.35, 0, 12) MCEP (b) Spectral estimates ( α = 0.35) 33

34 Structure of synthesis filter H(z) (γ = 1/n) 1st 2nd... n-th input 1 1 σ 1 C(z) C(z) C(z) output Structure of H(z) H(z) =σd(z) =σ C(z) = ( 1+γ M m=0 { 1 C(z) } n c α,γ(m) z m α Output ) Input + α z 1 + z 1 + z 1 + z 1 γ(1 α 2 ) + b(1) + α + b(2) Structure of C(z) (M =3) + α b(3) 34

35 Structure of synthesis filter H(z) (γ =0) MLSA filter D(z) = exp F (z) R L (F (z)) F (z) = M m=0 c α,γ(m) z m α sufficient accuracy: maximum spectral error 0.24dB O(8M) multiply-add operations a sample guaranteed stability M multiply-add operations for filter coefficients calculation 35

36 Structure of MLSA filter Input F (z) F (z) F (z) F (z) A L,1 A L,2 A L,3 A L,4 Output + R L (F (z)) exp F (z) =D(z), L =4 Input 1 α 2 α - + α α z 1 + z 1 + z 1 + z b(1) b(2) b(3) + Basic filter F (z), M =3 + Output 36

37 The choice of α, γ for speech analysis/synthesis Analysis/synthesis system with fixed α and γ speech quality change with γ γ 1 Clear γ 0 Smooth When γ = 0, speech quality with (α, M) =(0.35, 15) is almost equivalent to that with (α,m) =(0, 30). When the analysis order is high enough, the difference becomes small. 37

38 Feature of Unified Approach Linear prediction analysis, Cepstral analysis are the special cases. Mathematically well-defined Physical interpretation Minimization of energy of inverse filter output x is Gaussian Minimization of p(x c) (ML estimation) Global solution, stability of the system function Synthesis filter for direct synthesis from the estimated coefficients LMA/MLSA/GMSLA filter Extension to adaptive analysis (sample by sample basis) Parameter transformation for speech recognition 38

39 Word Recognition based on HMM Spectral Analysis: 1. (α 1,γ 1,M 1 )=(0, 1, 12) Linear Prediction 2. (α 1,γ 1,M 1 )=(0.35, 1/3, 12) Mel-generalized cepstral analysis 3. (α 1,γ 1,M 1 )=(0.35, 0, 12) Mel-cepstral analysis Output vector of HMM: (α 2,γ 2,M 2 )=(0.35, 0, 12) Mel-cepstral coefficients and Δ (dynamic coefficients) H(z) =s 1 γ 1 M 1 m=0 c α1,γ 1 (m) z m α 1 = s 1 γ 2 m=0 c α2,γ 2 (m) z m α 2 39

40 Recognition Accuracy (Continuous HMM, 33 phoneme models, 2618 words) Linear Prediction (α 1, γ 1, M 1)= (0, -1, 12) Mel-Generalized Cepstral Analysis (α 1, γ 1, M 1)= (0.35, -1/3, 12) Mel-Cepstral Analysis (α 1, γ 1, M 1)= (0.35, 0, 12) 77.5% 80.7% 79.6% Recognition accuracy (%) 40

41 Application to 16kb/s wideband CELP coder Input Quantization of MGC Coef. MGC Analysis encoder CELP Excitation Generator Minimize MSE Synthesis Filter MGC coef. MGC coef. Perceptual Weighting Filter MGC coef. MGC coef. decoder CELP Excitation Generator Synthesis Filter Postfilter Output 41

42 Speech quality as a function of α (γ = 1/2) D M O S 3.5 average female male α 42

43 Subjective Evaluation MNRU(dB) G kb/s G kb/s G kb/s Conv.CELP 16kb/s MGC-CELP 16kb/s D M O S average female male 43

44 Summary A unified approach to speech spectral estimation A unified approach to Linear predicton and Cepstral analysis Introduction of auditory frequency scale Efficint representation of speech spectrum with an appropriate choice of α and γ Application to speech anaylysis/synthesis, speech coding, speech recognition Future work: Optimal α and γ (Phoneme/Speaker dependent?) Speech Signal Processing Toolkit: 44

Feature extraction 2

Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. Feature extraction 2 Dr Philip Jackson Linear prediction Perceptual linear prediction Comparison of feature methods