Improved Method for Epoch Extraction in High Pass Filtered Speech
D. Govind
Center for Computational Engineering & Networking, Amrita Vishwa Vidyapeetham (University), Coimbatore, Tamilnadu
d govind@cb.amrita.edu

S. R. Mahadeva Prasanna, Ramesh K
Department of Electronics & Electrical Engineering, Indian Institute of Technology Guwahati, Assam
{prasanna,kk.ramesh}@iitg.ernet.in

Abstract — The objective of the present work is to improve epoch estimation performance for high pass filtered (HPF) speech using the conventional zero frequency filtering (ZFF) approach. The strength of the impulse at zero frequency is significantly attenuated in HPF speech, and hence the ZFF approach shows significant degradation in epoch estimation performance. Since the linear prediction (LP) residual of speech is characterized by sharper impulse discontinuities at the epoch locations than the speech waveform, the present work uses the LP residual of HPF speech for epoch estimation by the ZFF method. Gabor filtering of the LP residual is carried out to further increase the strength of the impulses at the epoch locations, and the epoch locations are then estimated by ZFF of the Gabor filtered LP residual. The proposed method performs better than the existing Hilbert envelope based ZFF approach, with improved epoch identification accuracy.

I. INTRODUCTION

The epochs in speech are the time instants at which the excitation of the vocal tract is maximum [1], [2], [3], [4]. The epochs represent the instants of glottal closure in voiced speech and the onset of burst or frication in unvoiced speech. Due to the effect of the vocal tract characteristics, estimation of epochs from speech is a challenging task [2]. Hence many methods have been proposed in the literature for reliable estimation of epochs from speech [5], [2], [3], [4]. Due to their significance, many applications carry out processing anchored around the epoch locations [6], [7], [8].
Among the existing approaches, group delay (GD) based processing, DYPSA and zero frequency filtering (ZFF) are the popular methods for extracting epochs. Among these, ZFF is a well known approach for reliable epoch estimation with reduced computational complexity [8], [2]. The ZFF method exploits the impulse-like characteristics of epochs [2]. In the ZFF method, the speech is first passed through a cascade of two zero frequency resonators. The trend in the resonator output is then removed by local mean subtraction to obtain the zero frequency filtered signal, and the negative to positive zero crossings of this signal are hypothesized as the epoch locations. The ZFF method provides the best epoch estimation performance for clean speech signals, which have sufficient energy near zero frequency. However, due to significant attenuation of the low frequency components near zero frequency, the performance degrades for band limited signals such as high pass filtered (HPF) and telephone recorded speech [9]. There have been attempts to improve epoch estimation in HPF speech by the ZFF method [9]. In that work, the low frequency nature of the Hilbert envelope is used to emphasize the energy near zero frequency, and improved epoch estimation performance for HPF speech is obtained by passing the Hilbert envelope of the HPF speech, or of its residual, through the zero frequency resonator. Although the epoch estimation performance improves in terms of a higher epoch identification rate and lower miss and false alarm rates, the epoch identification error, measured as the deviation of the estimated epochs from the reference epochs, remains higher.
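For reference, the Hilbert envelope used in [9] is simply the magnitude of the analytic signal. A minimal sketch follows; the modulated tone is an illustrative input of my own, not data from the paper:

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_envelope(x):
    """Magnitude of the analytic signal: a smooth, low-frequency
    envelope, which is what restores energy near 0 Hz for HPF speech."""
    return np.abs(hilbert(x))

# 500 Hz tone with a slow 5 Hz amplitude modulation at 8 kHz
t = np.arange(0, 1, 1 / 8000.0)
x = (1.0 + 0.5 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 500 * t)
env = hilbert_envelope(x)
```

The envelope tracks the slow 5 Hz modulation while discarding the 500 Hz carrier, which illustrates the smoothing behaviour discussed next.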
The reduced epoch identification accuracy of this approach makes it less suitable for applications such as epoch based prosody modification, where the perceptual quality of the prosody modified speech depends largely on the accuracy with which the epochs are estimated [8]. The smoothing tendency of the Hilbert envelope at the epoch locations increases the deviation of the estimated epochs from the reference epoch locations obtained from the differenced electro-glottogram (EGG). Hence epoch estimation using the Hilbert envelope of HPF speech results in poor temporal resolution. Figure 1 compares the epochs estimated by ZFF of the Hilbert envelope of HPF speech with the reference epochs estimated from the corresponding differenced EGG. It can be observed from the figure that, though the discontinuities at the epoch locations are enhanced by the Hilbert envelope, its low frequency nature smooths the peaks, so the estimated epoch locations have poor temporal resolution. This can be confirmed by comparing them against the reference epoch locations given by the differenced EGG peaks in Figure 1. The samples of the LP residual are uncorrelated, and the higher prediction errors form strong impulse-like discontinuities at the epoch locations. Hence the present work focuses on exploiting the impulse-like discontinuities in the LP residual for epoch estimation from HPF speech. Even though ZFF of the LP residual of HPF speech gives better performance than conventional ZFF of HPF speech, it is not at par with ZFF of the Hilbert envelope of HPF speech; however, its epoch identification accuracy is found to be better. To further enhance the impulse-like discontinuities at the epoch locations, the LP residual of HPF speech is filtered using a Gabor filter having a shape similar to the discontinuity at the glottal pulse.
The epochs are then obtained by ZFF of the Gabor filtered residual sequence. The performance of the proposed method is confirmed by improved epoch identification accuracy compared to that of ZFF of the Hilbert envelope of HPF speech.

[Fig. 1. Deviation of the epochs estimated from the Hilbert envelope of HPF speech from the true locations: a voiced segment of HPF speech, its Hilbert envelope, the epochs estimated by ZFF of the Hilbert envelope, and the differenced EGG peaks showing the reference epoch locations.]

The rest of the paper is organized as follows. Section II describes the algorithmic steps of the ZFF method. The comparison of the epoch estimation performance of ZFF of HPF speech and of the LP residual of HPF speech is given in Section III. The description of the Gabor filter and the proposed ZFF method using Gabor filtering of the LP residual is given in Section IV. Finally, Section V summarizes the present work along with the scope for future work.

II. ZERO FREQUENCY FILTERING OF SPEECH

This section reviews the ZFF method for epoch estimation and the performance measures used for evaluating epoch extraction methods.

A. Epoch Estimation Using the ZFF Method

The algorithm for estimating the epochs in clean speech by ZFF is as follows [2]:

1) Difference the input speech signal s(n):

   x(n) = s(n) − s(n−1)   (1)

2) Compute the output of a cascade of two ideal digital resonators at 0 Hz:

   y(n) = Σ_{k=1}^{4} a_k y(n−k) + x(n)   (2)

   where a_1 = 4, a_2 = −6, a_3 = 4 and a_4 = −1.

3) Remove the trend:

   ŷ(n) = y(n) − ȳ(n)   (3)

   where ȳ(n) = (1/(2N+1)) Σ_{m=−N}^{N} y(n+m), and 2N+1 corresponds to the average pitch period computed over a longer segment of speech.

The trend removed signal ŷ(n) is termed the zero frequency filtered signal. Its negative to positive zero crossings give the locations of the epochs.

[Fig. 2. Illustration of the epoch estimation performance measures: epoch identification, miss, false alarm and identification accuracy.]
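The steps above can be sketched in code. This is my own illustration under common assumptions (8 kHz sampling, a 100 Hz toy excitation with glottal-pulse polarity, and the trend removal of Eq. (3) applied three times, as is usual in ZFF implementations, to fully cancel the polynomial growth of the resonator output):

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filter(s, fs, avg_pitch_s=0.01):
    """Eqs. (1)-(3): difference, cascade of two 0 Hz resonators,
    then local-mean trend removal."""
    x = np.diff(s, prepend=s[0])                       # Eq. (1)
    # Eq. (2): 1/(1 - z^-1)^4, i.e. a1 = 4, a2 = -6, a3 = 4, a4 = -1
    y = lfilter([1.0], [1.0, -4.0, 6.0, -4.0, 1.0], x)
    N = int(avg_pitch_s * fs) // 2                     # window = 2N+1 samples
    w = 2 * N + 1
    for _ in range(3):                                 # Eq. (3), applied thrice
        y = y - np.convolve(y, np.ones(w) / w, mode="same")
    return y

def epochs_from_zff(zff):
    """Negative-to-positive zero crossings of the filtered signal."""
    return np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0] + 1

# Toy excitation: negative impulses every 10 ms (100 Hz pitch) at 8 kHz
fs = 8000
s = np.zeros(fs)
s[::80] = -1.0
epochs = epochs_from_zff(zero_frequency_filter(s, fs))
```

On this synthetic excitation the detected crossings fall within a sample or two of the impulse locations, away from the edge effects of the moving-average trend removal.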
B. Performance Measures for Epoch Estimation

The performance measures proposed in [4], namely epoch identification rate, miss rate, false alarm rate and identification accuracy, are used for the performance analysis. These measures are defined as follows:

Larynx cycle: the range of samples (1/2)(l_{r−1} + l_r) < n < (1/2)(l_r + l_{r+1}), where l_r, l_{r−1} and l_{r+1} are the current, preceding and succeeding reference epoch locations, respectively.

Identification Rate (IDR): the percentage of larynx cycles for which exactly one epoch is detected.

Miss Rate (MR): the percentage of larynx cycles for which no epoch is detected.

False Alarm Rate (FAR): the percentage of larynx cycles for which more than one epoch is detected.

Identification Error (ζ): the timing error between the reference and detected epochs in larynx cycles for which exactly one epoch was detected.

Identification Accuracy (IDA, σ): the standard deviation of the identification error ζ. Small values of σ indicate high identification accuracy.

Figure 2 gives a graphical illustration of epoch identification, miss, false alarm and identification accuracy.

III. EPOCH ESTIMATION PERFORMANCE FOR CLEAN AND HIGH PASS FILTERED SPEECH

The performance is evaluated on the CMU Arctic database, which has simultaneous EGG recordings [10].
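The measures of Section II-B translate directly into code. The helper below is my own sketch, not code from the paper; it scores only interior larynx cycles, with epochs given as sample indices:

```python
import numpy as np

def epoch_measures(ref, est, fs):
    """IDR, MR, FAR (percent) and IDA (seconds) over larynx cycles.
    A cycle spans the midpoints between consecutive reference epochs."""
    ref, est = np.asarray(ref, float), np.asarray(est, float)
    bounds = (ref[:-1] + ref[1:]) / 2.0            # cycle boundaries
    hits = miss = false_alarm = 0
    errors = []
    for i in range(1, len(ref) - 1):               # interior cycles only
        in_cycle = est[(est >= bounds[i - 1]) & (est < bounds[i])]
        if len(in_cycle) == 0:
            miss += 1
        elif len(in_cycle) == 1:
            hits += 1
            errors.append((in_cycle[0] - ref[i]) / fs)  # timing error zeta
        else:
            false_alarm += 1
    n = hits + miss + false_alarm
    idr, mr, far = 100.0 * hits / n, 100.0 * miss / n, 100.0 * false_alarm / n
    ida = float(np.std(errors)) if errors else float("nan")
    return idr, mr, far, ida
```

For example, with reference epochs every 80 samples, a constant two-sample detection offset, and one detection deleted, the helper reports IDR 87.5%, MR 12.5%, FAR 0% and IDA 0 ms (the surviving errors are all identical).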
32 phonetically balanced utterances of three speakers (two male and one female) are used for the evaluation. The reference epochs are obtained by ZFF of the differenced EGG. All utterances of the CMU Arctic database are converted from the original recorded sampling rate of 32 kHz to 8 kHz. The HPF speech signals are generated by filtering the Arctic utterances with a high pass filter with a cutoff frequency of 500 Hz [9], selected so as to attenuate all the frequency components in the human pitch range. Table I compares the epoch estimation performance for clean and HPF speech using the ZFF method [9]. The table shows the effectiveness of the ZFF method in extracting accurate epoch locations from clean speech; however, a significant degradation is observed for the epochs estimated from HPF speech. As the LP residual shows sharp discontinuities at the epoch locations, the epochs estimated by ZFF of the LP residual of HPF speech give better performance than ZFF of HPF speech directly. The LP residual of HPF speech is computed by 10th order LP analysis with a frame size of 20 ms and a shift of 10 ms. However, the performance is still not at par with the clean speech case. A similar degradation of epoch estimation performance on HPF speech using DYPSA is reported in [9].

[Table I. Comparison of ZFF epoch estimation performance for clean speech, HPF speech and the LP residual of HPF speech on the CMU Arctic database. Columns: Speaker, IDR, MR, FAR, IDA (ms); the numerical entries are lost in this transcription.]

[Fig. 3. Gabor filter with parameters σ = .3, ω = .75 and N = 8.]

[Fig. 4. Gabor filtering of the LP residual: a voiced segment of HPF speech, the corresponding LP residual, the residual convolved twice with the Gabor filter, and the Gabor filtered residual sequence.]
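One way to reproduce this pipeline is sketched below. The 4th order Butterworth high-pass and the autocorrelation method for LP analysis are my choices (the paper does not specify either); the 10th order and 20 ms / 10 ms framing follow the setup described above for 8 kHz speech:

```python
import numpy as np
from scipy.signal import butter, lfilter

def highpass(s, fs, cutoff_hz=500.0):
    """High pass filter (Butterworth chosen here as one option)."""
    b, a = butter(4, cutoff_hz / (fs / 2.0), btype="highpass")
    return lfilter(b, a, s)

def lp_residual(s, order=10, frame_len=160, hop=80):
    """Frame-wise LP residual via the autocorrelation method
    (20 ms frames, 10 ms shift at 8 kHz)."""
    res = np.zeros(len(s))
    win = np.hamming(frame_len)
    for start in range(0, len(s) - frame_len + 1, hop):
        frame = s[start:start + frame_len] * win
        # autocorrelation lags 0..order, then solve the normal equations
        r = np.correlate(frame, frame, "full")[frame_len - 1:frame_len + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R + 1e-9 * np.eye(order), r[1:])
        # inverse filter with `order` samples of history from the signal
        lo = max(start - order, 0)
        ctx = s[lo:start + frame_len]
        pred = lfilter(np.concatenate(([0.0], a)), [1.0], ctx)  # sum_k a_k s(n-k)
        err = ctx - pred
        res[start:start + hop] = err[start - lo:start - lo + hop]
    return res
```

Usage would be `r = lp_residual(highpass(x, fs))`. On an autoregressive test signal the residual variance drops well below the signal variance, as inverse filtering should achieve.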
The epoch estimation performance for HPF speech can be further improved by enhancing the impulse-like discontinuities at the epoch locations of the LP residual. Section IV describes the proposed method of improving the epoch estimation performance of ZFF of the LP residual of HPF speech using a Gabor filter.

IV. EPOCH ESTIMATION FROM LP RESIDUAL USING GABOR FILTER

The impulse-like discontinuities at the epoch locations of the LP residual are sharpened by convolving the LP residual with a Gabor filter, i.e., a modulated Gaussian pulse. The expression for the Gabor filter is given by

   g(n) = (1/(√(2π) σ)) e^{−(n − N/2)²/(2σ²) + jωn}   (4)

where σ represents the spread of the Gaussian, ω is the frequency of the modulating sinusoid, n is the time index and N is the length of the filter [11], [12]. In the present work, the values of σ, ω and the filter length N are selected as .3, .75 and 8, respectively. From Figure 3, it can be observed that the shape of the Gabor filter is similar to the discontinuities at the reference epoch locations of the differenced EGG. To further sharpen the discontinuities, the residual of HPF speech is filtered twice with the Gabor filter, and the filtered residual is then subtracted from the residual of HPF speech. This is represented mathematically as

   y(n) = r(n) − r̃(n)   (5)

where r̃(n) is obtained by convolving the residual of HPF speech, r(n), twice with the Gabor filter coefficients g(n) given in Eq. (4). Hereafter, the sequence y(n) is termed the Gabor filtered residual sequence. Figure 4 plots a voiced frame of HPF speech, its LP residual, and the residual sequences obtained by convolution with the Gabor filter coefficients. Comparing the panels of Figure 4 shows sharper impulse-like discontinuities in the Gabor filtered residual sequence than in the residual of HPF speech. It should also be noted that the impulse-like discontinuities in other regions are suppressed in the Gabor filtered residual compared to the LP residual of HPF speech. The epochs in HPF speech are estimated by ZFF of the Gabor filtered residual signal.
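Eqs. (4) and (5) can be sketched as follows, using the parameter values listed above (σ read as 0.3, ω as 0.75, N = 8) and taking the real part of g(n) for filtering — an assumption, since the paper does not state how the complex-valued filter is applied:

```python
import numpy as np

def gabor_filter(sigma=0.3, omega=0.75, n_taps=8):
    """Eq. (4): modulated Gaussian; the real part is used below."""
    n = np.arange(n_taps)
    g = np.exp(-(n - n_taps / 2.0) ** 2 / (2.0 * sigma ** 2) + 1j * omega * n)
    return np.real(g) / (np.sqrt(2.0 * np.pi) * sigma)

def gabor_filtered_residual(r, g):
    """Eq. (5): y(n) = r(n) - r~(n), with r~ the residual
    convolved twice with the Gabor coefficients."""
    r2 = np.convolve(np.convolve(r, g, mode="same"), g, mode="same")
    return r - r2

# Toy residual: unit impulses every 100 samples
r = np.zeros(1000)
r[::100] = 1.0
y = gabor_filtered_residual(r, gabor_filter())
```

Because the double convolution spreads each impulse only over the short filter support, y stays zero away from the impulses while the impulse neighbourhoods are reshaped.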
Table II presents the epoch estimation performance obtained for each speaker of the CMU Arctic database. A significant improvement in epoch estimation performance is observed over ZFF of the HPF residual given in Table I.

[Table II. Performance evaluation of epoch estimation by ZFF of the Gabor filtered residual of HPF speech on CMU Arctic. Rows: speakers SLT, BDL, JMK and the total/average; columns: IDR, MR, FAR, IDA (ms), total reference epochs. The numerical entries are lost in this transcription.]

A. Comparison with Epoch Estimation by ZFF of the Hilbert Envelope of HPF Speech

Table III shows the epoch estimation performance of the Hilbert envelope (HE) of HPF speech and of the HE of the LP residual of HPF speech. Even though the Hilbert envelope of HPF speech gives significantly better epoch estimation performance in terms of a higher epoch identification rate and reduced miss and false alarm rates, it provides relatively poor epoch identification accuracy. ZFF of the Gabor filtered LP residual, in contrast, gives better identification accuracy than either the Hilbert envelope of HPF speech or the Hilbert envelope of the residual of HPF speech.

[Table III. Epoch estimation performance of ZFF using the Hilbert envelope of speech and of the LP residual of HPF speech, averaged over all speakers of the CMU Arctic database. Rows: HE-HPF speech, HE-LP residual of HPF speech; columns: IDR, MR, FAR, IDA (ms). The numerical entries and the total number of reference epochs are lost in this transcription.]

Figure 5 compares the zero frequency filtered signals and the epochs estimated by ZFF of HPF speech, of the Gabor filtered residual, and of the Hilbert envelope of HPF speech. The spurious zero crossings in the zero frequency filtered signal of HPF speech, shown in Figure 5(d), result in the false estimation of epochs in the conventional ZFF of HPF speech, given in Figure 5(g). The zero frequency filtered signal segment obtained by ZFF of the Gabor filtered residual, shown in Figure 5(e), is free from spurious zero crossings.

[Fig. 5. Comparison of epoch estimation by the ZFF method using HPF speech, the Gabor filtered residual and the Hilbert envelope of HPF speech: (a)-(c) a voiced segment of HPF speech, the Gabor filtered residual and the Hilbert envelope of HPF speech; (d)-(f) the corresponding zero frequency filtered signals; (g)-(i) the estimated epoch locations.]
Figure 5(c) shows the Hilbert envelope of HPF speech. It can be observed that the low pass nature of the Hilbert envelope smooths the impulse-like discontinuity around the epoch locations, and hence a smooth zero frequency filtered signal without spurious zero crossings is obtained in Figure 5(f). However, the deviation of the estimated epochs from the true locations can be observed by comparing the estimated epochs given in Figures 5(h) and 5(i).

Figure 6 shows the probability distributions of the σ values for the Hilbert envelope of HPF speech case and for the Gabor filtered residual case. Histograms of the standard deviation (σ) of the estimated epoch locations with respect to the reference epoch locations, computed for each utterance in the CMU Arctic database, are used to estimate the probability density functions. The plot indicates the deviations likely to occur over the utterances of the whole database. The epochs estimated by ZFF of the Hilbert envelope of HPF speech have an average identification accuracy of .59 ms, which is higher than the average deviation of .34 ms obtained for the proposed Gabor filtered residual case. The larger spread of the Hilbert envelope of HPF speech case in Figure 6 indicates a higher deviation of the estimated epochs from the reference epoch locations.

[Fig. 6. Comparison of the distributions of the estimated epoch deviation (σ) obtained by ZFF of the Hilbert envelope of HPF speech and by the proposed Gabor filtered residual of HPF speech.]

V. SUMMARY AND SCOPE FOR FUTURE WORK

A significant degradation in the epoch estimation performance of the ZFF method is observed for HPF speech.
The use of the Hilbert envelope of HPF speech improves the epoch identification rate at the cost of reduced identification accuracy. To improve the epoch identification accuracy, the strength of the impulse-like discontinuities at the epoch locations of the LP residual of HPF speech is enhanced using a Gabor filter. The identification accuracy of the epochs estimated by ZFF of the Gabor filtered residual is found to improve over that of ZFF of the Hilbert envelope of HPF speech. As HPF speech is a special case of bandlimited telephone speech (350 Hz–3.4 kHz), the performance of the proposed epoch estimation method has to be evaluated on a large telephone speech database.

VI. ACKNOWLEDGEMENTS

The work presented in this paper is part of the DST Fast Track project titled "Analysis, processing and synthesis of emotions in speech". We are thankful to the funding agency, the Science and Engineering Research Board (SERB), New Delhi, for supporting this project.

REFERENCES

[1] T. Drugman and T. Dutoit, "Glottal closure and opening instant detection from speech signals," in Proc. INTERSPEECH, 2009.
[2] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, Nov. 2008.
[3] R. Smits and B. Yegnanarayana, "Determination of instants of significant excitation in speech using group delay function," IEEE Trans. Acoustics, Speech and Signal Processing, Sep. 1995.
[4] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of glottal closure instants in voiced speech using the DYPSA algorithm," IEEE Trans. Audio, Speech and Lang. Process., vol. 15, no. 1, 2007.
[5] T. V. Ananthapadmanabha and B. Yegnanarayana, "Epoch extraction from linear prediction residual for identification of closed glottis interval," IEEE Trans. Acoust., Speech and Signal Process., vol. ASSP-27, no. 4, 1979.
[6] K. S. Rao and B. Yegnanarayana, "Prosody modification using instants of significant excitation," IEEE Trans. Audio, Speech and Language Processing, vol. 14, May 2006.
[7] E. A. P. Habets, N. D. Gaubitch, and P. A. Naylor, "Temporal selective dereverberation of noisy speech using one microphone," in Proc. ICASSP, 2008.
[8] S. R. M. Prasanna, D. Govind, K. S. Rao, and B. Yegnanarayana, "Fast prosody modification using instants of significant excitation," in Proc. Speech Prosody, May 2010.
[9] D. Govind, S. R. M. Prasanna, and D. Pati, "Epoch extraction in high pass filtered speech using Hilbert envelope," in Proc. INTERSPEECH, 2011.
[10] J. Kominek and A. Black, "CMU Arctic speech databases," in 5th ISCA Speech Synthesis Workshop, Pittsburgh, PA, 2004.
[11] D. Gabor, "Theory of communication," J. Inst. Elect. Eng., vol. 93, 1946.
[12] K. S. Rao, S. R. M. Prasanna, and B. Yegnanarayana, "Determination of instants of significant excitation in speech using Hilbert envelope and group delay function," IEEE Signal Processing Letters, vol. 14, Oct. 2007.
More informationGlottal Modeling and Closed-Phase Analysis for Speaker Recognition
Glottal Modeling and Closed-Phase Analysis for Speaker Recognition Raymond E. Slyh, Eric G. Hansen and Timothy R. Anderson Air Force Research Laboratory, Human Effectiveness Directorate, Wright-Patterson
More informationLinear Prediction 1 / 41
Linear Prediction 1 / 41 A map of speech signal processing Natural signals Models Artificial signals Inference Speech synthesis Hidden Markov Inference Homomorphic processing Dereverberation, Deconvolution
More informationA comparative study of time-delay estimation techniques for convolutive speech mixtures
A comparative study of time-delay estimation techniques for convolutive speech mixtures COSME LLERENA AGUILAR University of Alcala Signal Theory and Communications 28805 Alcalá de Henares SPAIN cosme.llerena@uah.es
More informationImproved system blind identification based on second-order cyclostationary statistics: A group delay approach
SaÅdhanaÅ, Vol. 25, Part 2, April 2000, pp. 85±96. # Printed in India Improved system blind identification based on second-order cyclostationary statistics: A group delay approach P V S GIRIDHAR 1 and
More informationMULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES. Mingzi Li,Israel Cohen and Saman Mousazadeh
MULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES Mingzi Li,Israel Cohen and Saman Mousazadeh Department of Electrical Engineering, Technion - Israel
More informationDigital Signal Processing
Digital Signal Processing 0 (010) 157 1578 Contents lists available at ScienceDirect Digital Signal Processing www.elsevier.com/locate/dsp Improved minima controlled recursive averaging technique using
More informationLECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES
LECTURE NOTES IN AUDIO ANALYSIS: PITCH ESTIMATION FOR DUMMIES Abstract March, 3 Mads Græsbøll Christensen Audio Analysis Lab, AD:MT Aalborg University This document contains a brief introduction to pitch
More information5Nonlinear methods for speech analysis
5Nonlinear methods for speech analysis and synthesis Steve McLaughlin and Petros Maragos 5.1. Introduction Perhaps the first question to ask on reading this chapter is why should we consider nonlinear
More informationLAB 6: FIR Filter Design Summer 2011
University of Illinois at Urbana-Champaign Department of Electrical and Computer Engineering ECE 311: Digital Signal Processing Lab Chandra Radhakrishnan Peter Kairouz LAB 6: FIR Filter Design Summer 011
More informationAutomatic Speech Recognition (CS753)
Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short
More informationImproved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR
Improved Speech Presence Probabilities Using HMM-Based Inference, with Applications to Speech Enhancement and ASR Bengt J. Borgström, Student Member, IEEE, and Abeer Alwan, IEEE Fellow Abstract This paper
More informationGMM-Based Speech Transformation Systems under Data Reduction
GMM-Based Speech Transformation Systems under Data Reduction Larbi Mesbahi, Vincent Barreaud, Olivier Boeffard IRISA / University of Rennes 1 - ENSSAT 6 rue de Kerampont, B.P. 80518, F-22305 Lannion Cedex
More informationDesign Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation
CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation
More informationIndependent Component Analysis and Unsupervised Learning. Jen-Tzung Chien
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood
More informationBayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement
Bayesian Estimation of Time-Frequency Coefficients for Audio Signal Enhancement Patrick J. Wolfe Department of Engineering University of Cambridge Cambridge CB2 1PZ, UK pjw47@eng.cam.ac.uk Simon J. Godsill
More informationA REVERBERATOR BASED ON ABSORBENT ALL-PASS FILTERS. Luke Dahl, Jean-Marc Jot
Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, December 7-9, 000 A REVERBERATOR BASED ON ABSORBENT ALL-PASS FILTERS Lue Dahl, Jean-Marc Jot Creative Advanced
More informationA SPARSENESS CONTROLLED PROPORTIONATE ALGORITHM FOR ACOUSTIC ECHO CANCELLATION
6th European Signal Processing Conference (EUSIPCO 28), Lausanne, Switzerland, August 25-29, 28, copyright by EURASIP A SPARSENESS CONTROLLED PROPORTIONATE ALGORITHM FOR ACOUSTIC ECHO CANCELLATION Pradeep
More informationSignal Modeling Techniques in Speech Recognition. Hassan A. Kingravi
Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction
More informationThursday, October 29, LPC Analysis
LPC Analysis Prediction & Regression We hypothesize that there is some systematic relation between the values of two variables, X and Y. If this hypothesis is true, we can (partially) predict the observed
More informationApplication of the Bispectrum to Glottal Pulse Analysis
ISCA Archive http://www.isca-speech.org/archive ITRW on Non-Linear Speech Processing (NOLISP 3) Le Croisic, France May 2-23, 23 Application of the Bispectrum to Glottal Pulse Analysis Dr Jacqueline Walker
More informationReal-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization
Real-Time Pitch Determination of One or More Voices by Nonnegative Matrix Factorization Fei Sha and Lawrence K. Saul Dept. of Computer and Information Science University of Pennsylvania, Philadelphia,
More informationNon-Negative Matrix Factorization And Its Application to Audio. Tuomas Virtanen Tampere University of Technology
Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi 2 Contents Introduction to audio signals Spectrogram representation
More informationDIGITAL SIGNAL PROCESSING LECTURE 1
DIGITAL SIGNAL PROCESSING LECTURE 1 Fall 2010 2K8-5 th Semester Tahir Muhammad tmuhammad_07@yahoo.com Content and Figures are from Discrete-Time Signal Processing, 2e by Oppenheim, Shafer, and Buck, 1999-2000
More informationDesign of a CELP coder and analysis of various quantization techniques
EECS 65 Project Report Design of a CELP coder and analysis of various quantization techniques Prof. David L. Neuhoff By: Awais M. Kamboh Krispian C. Lawrence Aditya M. Thomas Philip I. Tsai Winter 005
More informationEmpirical Mean and Variance!
Global Image Properties! Global image properties refer to an image as a whole rather than components. Computation of global image properties is often required for image enhancement, preceding image analysis.!
More informationIntroduction to Computer Vision. 2D Linear Systems
Introduction to Computer Vision D Linear Systems Review: Linear Systems We define a system as a unit that converts an input function into an output function Independent variable System operator or Transfer
More informationChapter 10 Applications in Communications
Chapter 10 Applications in Communications School of Information Science and Engineering, SDU. 1/ 47 Introduction Some methods for digitizing analog waveforms: Pulse-code modulation (PCM) Differential PCM
More informationLOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES
LOW COMPLEXITY WIDEBAND LSF QUANTIZATION USING GMM OF UNCORRELATED GAUSSIAN MIXTURES Saikat Chatterjee and T.V. Sreenivas Department of Electrical Communication Engineering Indian Institute of Science,
More informationNovelty detection. Juan Pablo Bello MPATE-GE 2623 Music Information Retrieval New York University
Novelty detection Juan Pablo Bello MPATE-GE 2623 Music Information Retrieval New York University Novelty detection Energy burst Find the start time (onset) of new events (notes) in the music signal. Short
More informationAntialiased Soft Clipping using an Integrated Bandlimited Ramp
Budapest, Hungary, 31 August 2016 Antialiased Soft Clipping using an Integrated Bandlimited Ramp Fabián Esqueda*, Vesa Välimäki*, and Stefan Bilbao** *Dept. Signal Processing and Acoustics, Aalto University,
More informationAUTOREGRESSIVE (AR) modeling identifies and exploits
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 55, NO. 11, NOVEMBER 2007 5237 Autoregressive Modeling of Temporal Envelopes Marios Athineos, Student Member, IEEE, and Daniel P. W. Ellis, Senior Member, IEEE
More informationText-to-speech synthesizer based on combination of composite wavelet and hidden Markov models
8th ISCA Speech Synthesis Workshop August 31 September 2, 2013 Barcelona, Spain Text-to-speech synthesizer based on combination of composite wavelet and hidden Markov models Nobukatsu Hojo 1, Kota Yoshizato
More informationFrequency Domain Speech Analysis
Frequency Domain Speech Analysis Short Time Fourier Analysis Cepstral Analysis Windowed (short time) Fourier Transform Spectrogram of speech signals Filter bank implementation* (Real) cepstrum and complex
More informationIndependent Component Analysis and Unsupervised Learning
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent
More informationSource/Filter Model. Markus Flohberger. Acoustic Tube Models Linear Prediction Formant Synthesizer.
Source/Filter Model Acoustic Tube Models Linear Prediction Formant Synthesizer Markus Flohberger maxiko@sbox.tugraz.at Graz, 19.11.2003 2 ACOUSTIC TUBE MODELS 1 Introduction Speech synthesis methods that
More informationMonaural speech separation using source-adapted models
Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications
More informationRe-estimation of Linear Predictive Parameters in Sparse Linear Prediction
Downloaded from vbnaaudk on: januar 12, 2019 Aalborg Universitet Re-estimation of Linear Predictive Parameters in Sparse Linear Prediction Giacobello, Daniele; Murthi, Manohar N; Christensen, Mads Græsbøll;
More informationReformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features
Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.
More informationSCELP: LOW DELAY AUDIO CODING WITH NOISE SHAPING BASED ON SPHERICAL VECTOR QUANTIZATION
SCELP: LOW DELAY AUDIO CODING WITH NOISE SHAPING BASED ON SPHERICAL VECTOR QUANTIZATION Hauke Krüger and Peter Vary Institute of Communication Systems and Data Processing RWTH Aachen University, Templergraben
More informationIll-Conditioning and Bandwidth Expansion in Linear Prediction of Speech
Ill-Conditioning and Bandwidth Expansion in Linear Prediction of Speech Peter Kabal Department of Electrical & Computer Engineering McGill University Montreal, Canada February 2003 c 2003 Peter Kabal 2003/02/25
More informationEfficient Use Of Sparse Adaptive Filters
Efficient Use Of Sparse Adaptive Filters Andy W.H. Khong and Patrick A. Naylor Department of Electrical and Electronic Engineering, Imperial College ondon Email: {andy.khong, p.naylor}@imperial.ac.uk Abstract
More informationwhere =0,, 1, () is the sample at time index and is the imaginary number 1. Then, () is a vector of values at frequency index corresponding to the mag
Efficient Discrete Tchebichef on Spectrum Analysis of Speech Recognition Ferda Ernawan and Nur Azman Abu Abstract Speech recognition is still a growing field of importance. The growth in computing power
More informationDIRECTION ESTIMATION BASED ON SOUND INTENSITY VECTORS. Sakari Tervo
7th European Signal Processing Conference (EUSIPCO 9) Glasgow, Scotland, August 4-8, 9 DIRECTION ESTIMATION BASED ON SOUND INTENSITY VECTORS Sakari Tervo Helsinki University of Technology Department of
More informationSound 2: frequency analysis
COMP 546 Lecture 19 Sound 2: frequency analysis Tues. March 27, 2018 1 Speed of Sound Sound travels at about 340 m/s, or 34 cm/ ms. (This depends on temperature and other factors) 2 Wave equation Pressure
More informationChapter 2 Speech Production Model
Chapter 2 Speech Production Model Abstract The continuous speech signal (air) that comes out of the mouth and the nose is converted into the electrical signal using the microphone. The electrical speech
More informationModeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring
Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Kornel Laskowski & Qin Jin Carnegie Mellon University Pittsburgh PA, USA 28 June, 2010 Laskowski & Jin ODYSSEY 2010,
More informationAdapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017
Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 I am v Master Student v 6 months @ Music Technology Group, Universitat Pompeu Fabra v Deep learning for acoustic source separation v
More informationImage Enhancement in the frequency domain. GZ Chapter 4
Image Enhancement in the frequency domain GZ Chapter 4 Contents In this lecture we will look at image enhancement in the frequency domain The Fourier series & the Fourier transform Image Processing in
More informationSTATISTICAL MODELLING OF MULTICHANNEL BLIND SYSTEM IDENTIFICATION ERRORS. Felicia Lim, Patrick A. Naylor
STTISTICL MODELLING OF MULTICHNNEL BLIND SYSTEM IDENTIFICTION ERRORS Felicia Lim, Patrick. Naylor Dept. of Electrical and Electronic Engineering, Imperial College London, UK {felicia.lim6, p.naylor}@imperial.ac.uk
More informationA POSTERIORI SPEECH PRESENCE PROBABILITY ESTIMATION BASED ON AVERAGED OBSERVATIONS AND A SUPER-GAUSSIAN SPEECH MODEL
A POSTERIORI SPEECH PRESENCE PROBABILITY ESTIMATION BASED ON AVERAGED OBSERVATIONS AND A SUPER-GAUSSIAN SPEECH MODEL Balázs Fodor Institute for Communications Technology Technische Universität Braunschweig
More information