Phoneme segmentation based on spectral metrics

Size: px

Start display at page:

Download "Phoneme segmentation based on spectral metrics"

Joy Paul
5 years ago
Views:

1 Phoneme segmentation based on spectral metrics Xianhua Jiang, Johan Karlsson, and Tryphon T. Georgiou Introduction We consider the classic problem of segmenting speech signals into individual phonemes. This is a key step in current-day speech recognition applications, but can be equally well motivated based on speech coding and data compression. The mathematical problem is that of identifying time instances where a non-stationary waveform undergoes significant changes. It is customary to identify and employ a variety of features and cues for knowledge-based recognition and, accordingly, classification of sounds. There is a substantial literature on the subject for which we refer to [1, 7, 8, 9, 1] and the reference therein. The purpose of this paper is to explore the potential of certain natural metrics for comparing power spectra. Thus, we consider signals as consisting of stationary segments where second-order statistics contain most useful information, we estimate the power spectra based on sliding windows, and base our segmentation process on the relative distance between power spectra of neighboring windows. We utilize a (new) metric which quantifies differences in predictive qualities of two power spectra, and compare this with two other natural notions of distance (which can be motivated by analogous rationale). We also compare these with the well-known Itakura-Saito distance which has similarities with the aforementioned metric. We present results on a particular voice signal which Partially supported by the National Science Foundation and the Swedish research council. Dept. of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 5555; jiang8@ece.umn.edu Department of Mathematics, Division of Optimization and Systems Theory, Royal Institute of Technology, 1 Stockholm, Sweden, johan.karlsson@math.kth.se Dept. of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 5555; tryphon@ece.umn.edu

2 has been marked by a human specialist ( expert ). These data comes from the American-English DARPA-TIMIT database of []. Interestingly, the result of our, rather direct usage of metrics, is very similar to the marking by the expert. Approach We first motivate a metric between power spectra which was introduced in [3]. Our starting point is the degradation of the variance of the prediction error when the design of the predictor is based on the wrong choice between two alternatives. To this end, let f, f 1 represent spectral density functions of discrete-time zero-mean stochastic processes u fi (k) for i {, 1} and k Z, and let p fi (l) (l {1,, 3,...}) represent values for the coefficients that minimize the linear prediction error variance E{ u fi () p(l)u fi ( l) }. l=1 The optimal set of coefficients depends on the power spectral density function of the process. It is reasonable to take as a distance between f and f 1 the degradation of predictive error variance when the coefficients p(l) are selected assuming one of the two, and then used to predict the other process. The ratio of the degraded predictive error variance over the optimal variance turns out to be ρ(f, f 1 ) := E{ u f () l=1 p f 1 (l)u f ( l) } E{ u f () l=1 p f (l)u f ( l) } ρ(f, f 1 ) = exp ( π f (θ) dθ f 1(θ) π ) ( π f(θ) log( f ) dθ 1(θ) π (1) ), () see [3]. The derivation is based on the Szegö-Kolmogorov theory [5, 6]. Since the above ratio is greater than or equal to 1, ρ(f, f 1 ) 1 =: γ(f, f 1 ) (3a) takes values in [, ] and can be used to quantify dissimilarity between the shapes of f and f 1. If we consider the distance γ(f, f + ) between a nominal power spectral density f and a perturbations f +, and eliminate cubic terms and beyond, we are led (modulo a scaling factor of ) to the Riemannian pseudo-metric g f ( ) := π ( ) ( (θ) dθ π ) f(θ) π (θ) dθ () f(θ) π on density functions. It turns out that geodesic paths f τ (τ [, 1]) connecting spectral densities f, f 1 and having minimal length can be explicitly computed [3].

3 Furthermore, the length along such geodesics can be also explicitly computed in terms of the end points and becomes π ( log f ) ( 1(θ) dθ π f (θ) π log f ) 1(θ) dθ. (5) f (θ) π It is seen that this is a standard-deviation -like expression of the difference log(f 1 ) log(f ). It is interesting to observe that the logarithm maps spectra into a linear space where the L -norm relates to optimal prediction as above. Earlier distance measures, such as the Itakura-Saito distance [], as well as other distances between the spectral logarithms, have similar qualities and will also be used in our case study below. For the experimental component of this work, we apply the geodesic distance to speech signal analysis. We compare neighboring sliding windows and compute the distance between the corresponding power spectra. We use the following four distances: geodesic dist. ( π log f1(θ) f (θ) ) dθ π ( π ) f1(θ) dθ log f (θ) π L -log log(f ) log(f 1 ) L 1 -log log(f ) log(f 1 ) 1 Itakura-Saito f f 1 1 log( f f 1 ) 1 The demarkation corresponding to maximal distance is taken as the separation between nearby phonemes. The segmentation is compared with that drawn by an expert, and it was often found to be remarkably accurate. The speech signal and phoneme segmentation results are shown in Figure 1. The speech segment corresponds to the precise sentence She has your dark suit in greasy wash water all year. In the top sub-figure, diamonds mark phoneme boundaries as decided by a human expert. Asterisks, circles, squares and crosses indicate the boundaries between phonemes that were obtained by comparing the geodesic distance, L -log distance, L 1 -log distance and the Itakura-Saito distance, respectively, between spectra of neighboring windows. Sub-figures, 3, and, are plots of the distance between such neighboring windows obtained by the respective metrics, as a function of the border between neighboring windows. Local maxima suggest boundaries between phonemes. Because the Itakura-Saito distance fluctuates considerably, we plot this on a logarithmic scale. Figure shows more clearly the detail about a part of the sentence that corresponds to she has. We can see that the segmentation markers by all these methods are consistent with those drawn by the expert. The segmentation in this last figure sparates phonemes corresponding to the sounds S, i, h, a, s, e.

4 Geodesic Distance x 1 x 1 L log Distance 1 5 x 1 L1 log Distance 1 5 x 1 Itakura Saito Distance x 1 Figure 1. Speech sentence and phoneme segmentation. In the top subfigure, diamonds designate segmentation by an expert. Asterisks, circles, squares and crosses correspond to segmentation based on geodesic distance, L -log distance, L 1 -log distance and the Itakura-Saito distance respectively Geodesic Distance L log Distance L1 log Distance Itakura Saito Distance Figure. Part of the sentence and phoneme segmentation results. In the top sub-figure, diamonds indicate markings by an expert. Asterisks, circles, squares and crosses are marked based on geodesic distance, L -log distance, L 1 -log distance and the Itakura-Saito distance respectively.

5 Bibliography [1] A.M. Abdelatty Ali, J. Van der Spiegel, P. Mueller, G. Haentjens, and J. Berman, An Acoustic-Phonetic feature-based system for automatic phoneme recognition in continuous speech, in Proc. ISCAS, volume 3, pages 11811, May [] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, DARPA TIMIT acoustic-phonetic continuous speech corpus, U.S. Department of Commerce, NIST, Gaithersburg, Md, USA, February [3] T.T. Georgiou, Distances between power spectral densities, preprint at: ( IEEE Trans. on Signal Processing, 55(8): , 7. [] R. Gray, A. Buzo, A. Gray, and Matsuyama, Distortion measures for speech processing, IEEE Trans. on Acoustics, Speech, and Signal Proc., 8(): , Aug [5] U. Grenander and G. Szegö, Toeplitz Forms and their Applications, Chelsea, [6] P. Stoica and R. Moses, Introduction to Spectral Analysis, Prentice Hall, 5. [7] D.B. Grayden and M.S. Scordilis, Phonemic segmentation of fluent speech, in Proc. ICASSP, vol.1, I/73-I/76, Apr 199. [8] B. Zioko, S. Manandhar, and R.C. Wilson, Phoneme segmentation of speech, in Proc. ICPR, vol., 8-85, 6. [9] C.J. Weinstein, S.S. Mccandless, L.F. Mondshein and V.W. Zue System for Acoustic-Phonetic Analysis Continuous Speech, IEEE Trans. on Acoustics, Speech, and Signal Proc., 3(1): 5-67, Feb [1] K. Demuynck and T. Laureys, A Comparison of Different Approaches to Automatic Speech Segmentation, Proceedings of the 5th International Conference on Text, Speech and Dialogue, 77-8,.

INPUT-TO-STATE COVARIANCES FOR SPECTRAL ANALYSIS: THE BIASED ESTIMATE

INPUT-TO-STATE COVARIANCES FOR SPECTRAL ANALYSIS: THE BIASED ESTIMATE JOHAN KARLSSON AND PER ENQVIST Abstract. In many practical applications second order moments are used for estimation of power spectrum.