Wavelet Transform in Speech Segmentation


M. Ziółko (1), J. Gałka (1) and T. Drwięga (2)

(1) Department of Electronics, AGH University of Science and Technology, Kraków, Poland, ziolko@agh.edu.pl, jgalka@agh.edu.pl
(2) Faculty of Applied Mathematics, AGH University of Science and Technology, Kraków, Poland, drwiega@wms.mat.agh.edu.pl

Summary. A non-uniform speech segmentation method based on the discrete wavelet transform is used to localise phoneme boundaries. A vector of real values representing the digital speech signal is decomposed into phone-like units by placing segment borders according to the result of the multiresolution analysis. The final decision on the location of the boundaries is taken by analysing the energy flow among the decomposition levels. Distribution-like event functions indicate events, which are regarded as the segment boundaries.

1 Introduction

Many speech segmentation algorithms (see [1], [2]) have been used in speech technology systems, but only a few use wavelet spectra [1, 5]. Wavelet methods are known to be very useful in the time-frequency analysis of signals. The wavelet transform combines the best properties of classic frequency and time analysis in a single tool. Most segmentation methods utilise some kind of statistical modelling of the signals and use optimisation methods such as Viterbi decoding or dynamic time warping (DTW) (see [4]). These methods can only be used if proper models of the language are known, so such models have to be prepared first, which is usually a difficult and time-consuming task. The algorithm proposed in this paper is feature-driven and thus does not need any additional language models. A phonetically annotated database of spoken Polish, Corpora 97, was used for tuning and testing the method.

2 Wavelet Decomposition

The discrete wavelet transformation (DWT) belongs to the group of frequency transformations and is used to obtain a time-frequency spectrum (see [3, 8]) of a signal $\{s(n)\}$.
This encourages us to use the DWT as an artificial method of speech analysis. Dyadic frequency division makes the DWT much more compatible than other methods with the way the human hearing system operates. That system is equipped with a subsystem for frequency analysis, which reveals the information important for speech recognition.

In order to obtain the DWT, the coefficients $c_{m+1,i}$ of the series

$$ s(n) = \sum_i c_{m+1,i}\,\varphi_{m+1,i}(n) \tag{1} $$

are computed for $m = M, M-1, \ldots, 1$, where

$$ \varphi_{m,i}(n) = 2^{m/2}\,\varphi\left(2^{m} n\,\Delta t - i\right) \tag{2} $$

is the $i$th wavelet function at the $m$th resolution level and $\Delta t$ is the sampling density (the interval between samples). An example of the function $\varphi(t)$ and its spectrum is presented in Fig. 1. Due to the orthogonality of the wavelet functions $\{\varphi_{m+1,i}\}_i$ we obtain

$$ c_{m+1,i} = 2^{\frac{m+1}{2}} \int_{-\infty}^{\infty} s_a(t)\,\varphi\left(2^{m+1} t - i\right) dt = 2^{\frac{m+1}{2}} \sum_{n=-\infty}^{\infty} s_a(n\,\Delta t) \int_{-\infty}^{+\infty} \varphi\left(2^{m+1} t - i\right) \frac{\sin\left(\pi (t - n\,\Delta t)/\Delta t\right)}{\pi (t - n\,\Delta t)/\Delta t}\, dt, \tag{3} $$

where $s_a(t)$ is an analog signal whose samples create the digital signal, i.e. $s_a(n\,\Delta t) = s(n)$.

Fig. 1. Spectrum (left figure) and its Meyer scale function with N = 33 samples (right figure)

Formula (3) has two disadvantages that matter from the computational point of view. First, it is difficult to compute the integrals numerically when the wavelet supports are unbounded. Second, numerical integration is time-consuming, because high-quality processing requires series (1) to be evaluated many times for each second of the recorded speech signal. Therefore, instead of formula (3), we used the approximation

$$ c_{m+1,i} = \sum_{n \in D_i} s(n)\,\varphi_{m+1,i}(n), \tag{4} $$

where $D_i$ are the compact supports of $\varphi_{m+1,i}$. The support of the scale function $\varphi(t)$ must be compact to allow fast calculations in real time. It is a common feature of scale functions that $\varphi(t) \to 0$ very fast as $|t| \to +\infty$. In practice the support can be limited to the segment $[-T, T]$, where

$$ T = \max\{t \in \mathbb{R} : |\varphi(t)| \ge h\}. \tag{5} $$

The threshold $h$ should depend on the extreme value of the scale function. We choose the condition $h = \alpha \max_t |\varphi(t)|$, where $\alpha$ can be chosen arbitrarily, e.g. $\alpha = $ …. In that way, the support of the scale function was bounded to obtain a reasonable compromise between fast real-time computation and relatively small errors. The number of samples should be the smallest integer $N$ which satisfies the inequality $(N-1)\,\Delta t \ge 2T$, that is $N \ge 2T/\Delta t + 1 = 32000\,T + 1$, because the sampling frequency is $f_s = 1/\Delta t = 16000$ Hz. The sampling density in the frequency domain is $\Delta f = 0.5/T$, and $(N-1)\,\Delta f \ge 16000$ Hz, because the whole frequency band spreads from $-8000$ to $8000$ Hz.

The coefficients of the lower level are calculated by applying the well-known (see [3, 9]) formulae

$$ c_{m,n} = \sum_i h_{i-2n}\, c_{m+1,i} \tag{6} $$

$$ d_{m,n} = \sum_i g_{i-2n}\, c_{m+1,i} \tag{7} $$

where $\{h_i\}$ and $\{g_i\}$ are the coefficients which depend on the assumed pair: scale function $\varphi$ and wavelet $\psi$. In other words, the speech spectrum is decomposed using the digital filtering and downsampling procedures defined by (6) and (7): given the coefficients $c_{m+1,i}$ of the $(m+1)$th resolution level, (6) and (7) are applied to compute the coefficients of the $m$th resolution level. The coefficients of the next resolution levels are calculated recursively in the same way. The multiresolution analysis gives a hierarchical and fast scheme for computing the wavelet spectrum of a given signal $s$. The experiments undertaken show that decomposing the speech signal into six levels is sufficient (see Table 1) to cover the frequency band of voice.
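The recursion defined by (6) and (7) can be illustrated with a short Python sketch. It is not the authors' implementation: the Haar filter pair, the zero-padding at the right edge and the names `analysis_step` and `decompose` are assumptions made only for this example (the paper itself works with Meyer functions).

```python
import numpy as np

# Haar analysis filters: an illustrative assumption, not the Meyer pair of the paper.
H_HAAR = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass coefficients {h_i}
G_HAAR = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass coefficients {g_i}


def analysis_step(c_next, h, g):
    """One application of (6)-(7): filter c_{m+1} with h and g, keep every 2nd output.

    Filters are indexed from 0, so c[n] = sum_j h[j] * c_next[2n + j].
    """
    c_next = np.asarray(c_next, dtype=float)
    h = np.asarray(h, dtype=float)
    g = np.asarray(g, dtype=float)
    n_out = len(c_next) // 2
    # Zero-pad on the right so every filter window is complete (assumed boundary rule).
    padded = np.concatenate([c_next, np.zeros(max(len(h), len(g)))])
    c = np.array([h @ padded[2 * n: 2 * n + len(h)] for n in range(n_out)])
    d = np.array([g @ padded[2 * n: 2 * n + len(g)] for n in range(n_out)])
    return c, d


def decompose(s, levels, h=H_HAAR, g=G_HAAR):
    """Apply the step recursively; returns ([d_M, ..., d_1], c_1) with M = levels."""
    c = np.asarray(s, dtype=float)
    details = []
    for _ in range(levels):
        c, d = analysis_step(c, h, g)
        details.append(d)  # the first entry is the finest level, d_M
    return details, c


# Usage: six levels on a 64-sample frame leave a single approximation coefficient.
details, approx = decompose(np.arange(64.0), levels=6)
```

Each pass halves the number of coefficients, which is why the whole spectrum costs only on the order of the signal length to compute.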
The energy of the speech signal above 8 kHz and below 125 Hz is very low and can be neglected. The wavelet decomposition presented above leads to the series

$$ s(n) = \sum_i c_{1,i}\,\varphi_{1,i}(n) + \sum_{m=1}^{M} \sum_i d_{m,i}\,\psi_{m,i}(n) \tag{8} $$

where

$$ \varphi_{1,i}(n) = 2^{(1-M)/2} \begin{cases} \varphi\left(\left(2^{1-M} n - i\right)\Delta t\right) & \text{if } 0 \le 2^{1-M} n - i \le N-1 \\ 0 & \text{for other } 2^{1-M} n - i \end{cases} \tag{9} $$

and

$$ \psi_{m,i}(n) = 2^{(m-M)/2} \begin{cases} \psi\left(\left(2^{m-M} n - i\right)\Delta t\right) & \text{if } 0 \le 2^{m-M} n - i \le N-1 \\ 0 & \text{for other } 2^{m-M} n - i \end{cases} \tag{10} $$

The elements of the DWT for the $m$th level may be collected into a vector $d_m = (d_{m,1}, d_{m,2}, \ldots)^T$. In this way the values of the DWT for $M + 1$ levels can be obtained. It means that the discrete wavelet spectrum

$$ \mathrm{DWT}(s) = \{d_M, d_{M-1}, \ldots, d_1, c_1\} \tag{11} $$

is created from the coefficients of series (8).

Table 1. Frequency division obtained for M = 6 levels of dyadic wavelet decomposition. Sampling frequency $f_s$ = 16 kHz. (Columns: decomposition level $m$, frequency band [Hz]; the last row is the approximation.)

3 Segmentation Scheme

The role of the segmentation algorithm is to detect significant transitions of energy among the wavelet sub-bands. When a sufficiently significant transition is found, it is marked and scored as a spectral-phonetic event. It is assumed that events occur when the energy transition changes the order of the power-sorted bands. The non-uniform segmentation algorithm consists of the following steps:

1. Decompose the signal $s$ into six DWT levels, $\mathrm{DWT} = \{d_{6,n}, d_{5,n}, \ldots, d_{1,n}\}$.

2. Calculate the sum of the power samples in all frequency sub-bands according to the rule

$$ B_{m,k} = \sum_{n=(k-1)2^{6-m}+1}^{k\,2^{6-m}} d_{m,n}^{2}. \tag{12} $$

3. Calculate the power envelopes as running mean values

$$ B^{env}_{m,k} = \frac{1}{K} \sum_{n=k-K/2}^{k+K/2} B_{m,n}, \tag{13} $$

where $K = 2^{-M} t_\mu f_s$ for the expected mean duration $t_\mu$ of a speech segment. For $t_\mu$ = 100 ms, $f_s$ = 16 kHz and $M = 6$ we obtain $K = 25$ samples.

4. Generate the importance matrix $\mathbf{M} = [M_{m,k}] \in \mathbb{R}^{6 \times L}$ of the frequency bands by sorting the envelopes at each time position $k$, i.e.

$$ \mathbf{M}_k = \{m_i\}_{i=1}^{6} : \; B^{env}_{m_1,k} \ge B^{env}_{m_2,k} \ge B^{env}_{m_3,k} \ge B^{env}_{m_4,k} \ge B^{env}_{m_5,k} \ge B^{env}_{m_6,k}, $$

where $L$ depends on the length of the speech signal.

5. Compute the event function

$$ f(k) = \sum_{m=1}^{6} \frac{\left| M_{m,k+1} - M_{m,k} \right|}{m}. \tag{14} $$

6. The locations of the segment borders can now be extracted from $f(k)$ by choosing its local maxima which fulfil two conditions:
- each chosen maximum has to be the highest value within a neighbourhood of $t_{min}$ milliseconds, which is related to the minimal assumed segment duration,
- the local maximum is greater than a specified threshold $f_{tr}$.

The time-range condition rejects multiple changes related to the same border and segments shorter than $t_{min}$. The threshold adjusts the sensitivity of the segmentation: increasing its value reduces the number of chosen events. It is reasonable to set its value on-line, according to

$$ f_{tr}(k) = \frac{\beta}{2P} \sum_{n=-P}^{P} f(k-n), \tag{15} $$

where $P$ is the adaptation range corresponding to 100 milliseconds.

4 Conclusions

The presented algorithm was tested using the Polish annotated speech database Corpora 97. The speech of five different speakers, comprising 1825 utterances, was used for evaluation. These utterances include all 37 phonemes of the Polish language and their natural concatenations. The reference phonetic annotation of the speech was known, since it had been prepared earlier. Various values of the detection parameters $t_{min}$ and $\beta$ were used in order to find the combination producing the smallest number of errors. The best results were obtained for $t_{min}$ set in the range … milliseconds. In this range the phone recognition, insertion and deletion rates take their best values. The threshold adaptation factor $\beta$ does not affect these rates when it is set within the range 0 to 1. When $\beta$ takes values greater than 1, the results degrade considerably because the rate of deletions increases, and deletions are the most damaging errors in speech segmentation (see [6]).

It must be mentioned that the segmentation procedure uses acoustic, not phonetic, features of speech. This results in an increased insertion rate, because some phonemes are not acoustically uniform. This feature, however, does not affect the overall performance of speech recognition systems (see [6, 7]).

Wavelet analysis turns out to be an effective tool for finding the boundaries between phonemes. Non-uniform segmentation reduces the total number of segments to be processed by the higher-level parts of ASR systems (HMM modelling). The effect is a significant decrease of the Viterbi decoding search space and of the computational cost.

5 Acknowledgments

We would like to thank Stefan Grocholewski from the Institute of Computer Science, Poznań University of Technology, for providing the corpus of spoken Polish, Corpora 97. This work was supported by grant R

References

1. A. Alani and M. Deriche, Proceedings of The Fifth International Symposium on Signal Processing and its Applications (1999)
2. S. Cheng and H. Wang, Proceedings of 8th European Conference on Speech Communication and Technology - EUROSPEECH (2003)
3. I. Daubechies, Ten Lectures on Wavelets (SIAM, 1992)
4. K. Demuynck and T. Laureys, Proceedings of the 5th International Conference on Text, Speech and Dialogue (2002)
5. O. Farooq and S. Datta, IEE Proceedings: Vision, Image and Signal Processing, 151(3) (2004)
6. J. Gałka and B. Ziółko, NAUN International Journal of Circuits, Systems and Signal Processing, 2(1) (2007)
7. S. Grocholewski, Proceedings of International Conference on Language Resources and Evaluation (1998)
8. Y. Meyer, Wavelets and Applications (Masson, 1991)
9. O. Rioul and M. Vetterli, IEEE Signal Processing Magazine, 8 (1991)
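
As an illustration of the segmentation scheme of Section 3 (steps 1 to 6), the following Python sketch strings the steps together. It is a rough sketch under stated assumptions, not the authors' code: a Haar DWT stands in for the Meyer-based decomposition, level index m counts filter-bank stages from the finest band (so that level m contributes 2^(6-m) coefficients per frame, as in (12)), bands are sorted by decreasing envelope, the running means are zero-padded at the signal edges, and all function names as well as the default `t_min` of 20 ms are hypothetical.

```python
import numpy as np


def haar_dwt(s, levels=6):
    """Step 1 (assumed Haar filters): returns ([d_1..d_levels], approximation)."""
    c = np.asarray(s, dtype=float)
    c = c[: len(c) - len(c) % 2 ** levels]      # keep whole frames only
    details = []
    for _ in range(levels):
        even, odd = c[0::2], c[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        c = (even + odd) / np.sqrt(2.0)
    return details, c


def band_powers(details):
    """Step 2, rule (12): sum of squared coefficients per band and frame."""
    M, L = len(details), len(details[-1])
    B = np.empty((M, L))
    for m, d in enumerate(details, start=1):
        per_frame = 2 ** (M - m)
        B[m - 1] = np.sum(d[: L * per_frame].reshape(L, per_frame) ** 2, axis=1)
    return B


def envelopes(B, K):
    """Step 3: centred running mean over K frames (needs at least K frames)."""
    kernel = np.ones(K) / K
    return np.vstack([np.convolve(row, kernel, mode="same") for row in B])


def importance_matrix(env):
    """Step 4: row r holds the 1-based band index ranked r-th by decreasing envelope."""
    return np.argsort(-env, axis=0, kind="stable") + 1


def event_function(M):
    """Step 5, rule (14): rank changes between frames, weighted by 1/m."""
    weights = 1.0 / np.arange(1, M.shape[0] + 1)
    return weights @ np.abs(np.diff(M, axis=1))


def pick_boundaries(f, beta, P, min_gap):
    """Step 6: maxima above threshold (15) that dominate +/- min_gap frames."""
    f_tr = np.convolve(f, np.full(2 * P + 1, beta / (2 * P)), mode="same")
    peaks = []
    for k in range(len(f)):
        lo, hi = max(0, k - min_gap), min(len(f), k + min_gap + 1)
        # argmax returns the first maximum, so ties keep only the earliest frame
        if f[k] > f_tr[k] and k == lo + int(np.argmax(f[lo:hi])):
            peaks.append(k)
    return peaks


def segment(s, fs=16000, levels=6, t_mu=0.1, t_min=0.02, beta=1.0):
    """Returns (boundary frame indices, event function); one frame = 2**levels samples."""
    details, _ = haar_dwt(s, levels)
    frame_rate = fs / 2 ** levels
    K = int(round(t_mu * frame_rate))           # K = 2^-M * t_mu * f_s (25 by default)
    P = int(round(0.1 * frame_rate))            # adaptation range of 100 ms
    min_gap = max(1, int(round(t_min * frame_rate)))  # t_min value is a placeholder
    env = envelopes(band_powers(details), K)
    f = event_function(importance_matrix(env))
    return pick_boundaries(f, beta, P, min_gap), f
```

A boundary at frame k corresponds to sample 64·k at the default settings. The tie-breaking rule in `pick_boundaries` is a design choice of this sketch: the event function takes only a few distinct values, so plateaus are common and exactly one frame per plateau is kept.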
