MODEL SELECTION CRITERIA FOR ACOUSTIC SEGMENTATION

Mauro Cettolo and Marcello Federico
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica, I-38050 Povo, Trento, Italy

ABSTRACT

Robust acoustic segmentation has become a critical issue in applying speech recognition to audio streams with variable acoustic content, e.g. radio programs. Many techniques in the literature base segmentation on statistical model selection, by applying the Bayesian Information Criterion. This work reviews alternative model selection criteria and presents comparative experiments both under controlled conditions and on a broadcast news corpus.

1. INTRODUCTION

The problem of acoustic segmentation and classification has become crucial to the application of automatic speech recognition to audio stream processing. For instance, in order to generate transcripts of broadcast news programs, it is necessary to isolate and filter out portions of the signal which do not contain speech, e.g. jingles and signature tunes. Moreover, transcription accuracy can be significantly improved by using condition-dependent acoustic models, if the speech signal is segmented and classified according to bandwidth, speaker gender, and speaker identity.

In recent years, several algorithms have been presented which use a statistical decision criterion to detect spectral changes (SCs) within the feature space of the signal. Assuming that data are generated by a Gaussian process, SCs are detected within a sliding window through a model selection method. The most likely SC is tested by comparing two hypotheses: (i) the window contains data generated by the same distribution; (ii) the left and right semi-windows, with respect to the SC point, contain data drawn from two different distributions. The test is performed with a likelihood ratio that, besides the maximum likelihood of each hypothesis, takes into account the different sizes of the corresponding models. Usually, the Bayesian Information Criterion (BIC) [11] is applied to select the simplest and best fitting model. This paper reviews alternative model selection criteria and presents comparative experiments both on synthetic data and on real audio data.

2. SEGMENTATION PROBLEM

Acoustic segmentation can be seen as a particular instance of the more general problem of partitioning data into distinct homogeneous regions [2]. The data partitioning problem arises in all applications which require splitting data into chunks, e.g. image processing, data mining, text processing, etc.

The problem can be formulated as follows. Let x_1, x_2, ..., x_n be an ordered sample of data in the R^d space. We assume that the data are generated by a Gaussian process with at most c transitions. The problem of segmentation is that of detecting all the transition points in the data set. The general problem can be approached, without loss of generality, by first considering the simplest case c = 1.

Single Transition Detection

The search for one potential transition point goes through the definition of n different statistical models:

• n-1 two-segment models M_t (t = 1, ..., n-1), each of them assuming:

    x_1, \dots, x_t \sim \text{iid } N_d(\mu_1, \Sigma_1)    (1)
    x_{t+1}, \dots, x_n \sim \text{iid } N_d(\mu_2, \Sigma_2)    (2)

• one single-segment model M_n, which assumes:

    x_1, x_2, \dots, x_n \sim N_d(\mu, \Sigma)    (3)

The basic idea is to choose the model (M_t : t = 1, ..., n) that best fits the observations.
The application of the maximum likelihood principle, however, would invariably lead to choosing one of the two-segment models, and hence to hypothesizing a break point at some t = 1, ..., n-1, as they have a higher number of free parameters than the one-segment model. In order to take into account the notion of dimension of the model, the following extension to the maximum likelihood principle was first proposed by Akaike [1]. The AIC (Akaike's Information Criterion) suggests maximizing the likelihood of each model i separately, obtaining say L_i = L_i(x_1, x_2, ..., x_n), and then choosing the model for which \log L_i - k_i is largest, where k_i is the dimension of the model.

Computations

Given a sample x_1, x_2, ..., x_n \sim iid N_d(\mu, \Sigma), the likelihood function achieves the maximum value [12]:

    L(x_1, x_2, \dots, x_n) = (2\pi)^{-nd/2} \, |\hat{\Sigma}|^{-n/2} \, e^{-nd/2}    (4)

at \hat{\mu} = \bar{x}, the sample mean, and

    \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'    (5)

the maximum-likelihood estimate of the covariance matrix.
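
As a quick numerical check of (4) and (5), here is a minimal sketch, not part of the original paper, assuming NumPy and SciPy are available: the closed form is compared against a direct evaluation of the Gaussian log-density at the maximum likelihood estimates.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    n, d = 300, 5
    x = rng.normal(size=(n, d))

    mu_hat = x.mean(axis=0)                          # sample mean
    sigma_hat = (x - mu_hat).T @ (x - mu_hat) / n    # ML covariance, eq. (5)

    # Direct evaluation of the log-likelihood at the ML estimates
    direct = multivariate_normal(mu_hat, sigma_hat).logpdf(x).sum()

    # Logarithm of the closed form (4)
    _, logdet = np.linalg.slogdet(sigma_hat)
    closed = -0.5 * n * d * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * n * d

    print(direct, closed)   # the two values coincide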

    Name   Author             Year  Reference  Penalty
    AIC    Akaike             1972  [1]        k
    BIC    Schwarz            1978  [11]       (k/2) \log n
    CAIC   Bozdogan           1987  [3]        (k/2) \log n + k/2
    CAICF  Bozdogan           1987  [3]        k + (k/2) \log n + (1/2) \log |I(\theta)|
    MDL    Rissanen           1987  [10]       (k/2) \log n + (k+1) \log(k+2)
    MML    Wallace & Freeman  1987  [14]       (d/2)(1 + \log \kappa_d) + (1/2) \log |I(\theta)|

    Notation:
    k           number of free parameters in the model
    n           size of the data sample
    d           dimension of the data space
    I(\theta)   Fisher information matrix of the model
    \kappa_d    constant of the optimal d-dimensional quantizing lattice [6]

    Table 1: Model Dimension Estimates.

The number of free parameters of a multivariate normal distribution is equal to the dimension of the mean plus the number of variances and covariances to be estimated. For a full covariance matrix it is:

    k = d + \frac{d(d+1)}{2}    (6)

Decision Rule

Several model selection criteria have been proposed in the literature that can be applied to Akaike's framework of model selection. In general, each criterion proposes a penalty function P that takes into account the model dimension. By computing the likelihood function of each model, the following decision rule can be derived. Look for the best two-segment model for the data:

    \log L_t - P_t = \max_{t=1,\dots,n-1} \left\{ -\frac{t}{2} \log |\hat{\Sigma}_1| - \frac{n-t}{2} \log |\hat{\Sigma}_2| - P_t \right\}    (7)

then take the one-segment model function:

    \log L_n - P_n = -\frac{n}{2} \log |\hat{\Sigma}| - P_n    (8)

and choose to segment the data at point t if and only if:

    (\log L_t - \log L_n) - (P_t - P_n) > 0    (9)

In the experimental part it will be shown that the performance of the rule can be tuned by replacing the zero threshold with a value to be empirically estimated.
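
The single-transition rule (7)-(9) is compact enough to sketch directly. The following Python fragment is an illustrative sketch, not the authors' code, assuming NumPy; it scores every candidate split with a pluggable penalty function, the BIC entry of Table 1 being one choice.

    import numpy as np

    def seg_loglik(x):
        """Maximized log-likelihood of one Gaussian segment (eq. 4),
        dropping the additive constant common to all competing models."""
        n = len(x)
        mu = x.mean(axis=0)
        sigma = (x - mu).T @ (x - mu) / n         # ML covariance, eq. (5)
        return -0.5 * n * np.linalg.slogdet(sigma)[1]

    def bic_penalty(k, n):
        """Schwarz penalty from Table 1."""
        return 0.5 * k * np.log(n)

    def detect_single_change(x, penalty=bic_penalty):
        """Best split t and the score of rule (9); hypothesize a SC iff score > 0."""
        n, d = x.shape
        k = d + d * (d + 1) // 2                  # free parameters per Gaussian, eq. (6)
        best_t, best = None, -np.inf
        for t in range(d + 1, n - d):             # keep both covariances estimable
            two_seg = seg_loglik(x[:t]) + seg_loglik(x[t:]) - penalty(2 * k, n)
            if two_seg > best:                    # maximization of eq. (7)
                best, best_t = two_seg, t
        one_seg = seg_loglik(x) - penalty(k, n)   # eq. (8)
        return best_t, best - one_seg             # left side of eq. (9)

Setting penalty to a function returning zero yields the pure maximum likelihood test discussed in Section 3; since the two-segment family contains the single-segment model as a special case, the score is then never negative, which is exactly the overfitting problem the penalty terms correct.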
Multiple Transition Detection

The extension of the method to an arbitrarily large number of potential segments requires considering a number of competing models that grows combinatorially with n and c. In general, application-dependent simplifications are introduced to reduce the complexity of the problem. For acoustic segmentation, the audio signal can be processed through a sliding window. By keeping the window size sufficiently large to reliably apply the method, and sufficiently short to avoid multiple transitions, a segmentation algorithm can be devised that relies on the basic c = 1 case.

In Figure 1 an algorithm is proposed [5] that was derived from the one described in [7]. The main idea is to have a shifting, variable-size window in which a SC can be hypothesized according to (9). To reduce computation, the maximization (7) is not performed over all points 1, ..., n-1, but at a lower resolution rate. The resolution rate is increased when a potential SC is detected, in order to validate it and refine its time position.

Let us refer to Figure 1. The starting window (WinMin) has to be small enough to contain no more than one SC, but large enough to allow reliable statistics of the criterion to be computed. It is located at the beginning of the input audio stream. Values of the criterion are computed at a low resolution rate (ResLow), e.g. every 30 observations (step 2). The window is enlarged (step 11) until a potential SC is detected (step 3), or a maximum size (WinMax) is reached (step 10). In the first case, the potential SC is validated by computing the criterion values on the window centered around the candidate, using a higher resolution rate (ResHigh) (step 4). In the second case, the window is shifted to the right (step 13). If the potential SC is validated (steps 5, 7), it is stored (step 6); then the window is set to the minimum size (step 8) and placed just after the detected SC (step 9). Steps 2-13 are repeated until all the input audio data have been processed (step 1).

3. MODEL SELECTION CRITERIA

Several model selection criteria have been proposed starting from the early 70s. As mentioned before, the seminal work of Akaike tried to extend the maximum likelihood principle with a term that estimates the dimension, or complexity, of the considered statistical model. Refinements to Akaike's Information Criterion (AIC) were proposed by Schwarz [11], with the Bayesian Information Criterion (BIC), and by Bozdogan [3], with the Consistent AIC (CAIC) and the Consistent AIC with Fisher information (CAICF).

Parameters:
    WinMin:   minimum window size
    WinMax:   maximum window size
    WinDelta: window increase step
    WinStep:  window shift step
    ResLow:   low resolution
    ResHigh:  high resolution
    N:        input data length
    Thresh:   threshold for the used criterion

Variables:
    WinStart: left boundary of the window
    WinSize:  current window size
    SC:       detected spectral changes

Subroutine:
    MaxSearch(WinStart, WinSize, Res): returns the best potential SC and its
    score, computed by a given criterion, inside the specified window at the
    given resolution Res.

Initialization:
    WinStart = 1; WinSize = WinMin; SC = ()

Algorithm:
     1. while (WinStart + WinSize < N)
     2.     (max, t) = MaxSearch(WinStart, WinSize, ResLow)
     3.     if (max > Thresh)
     4.         (max, t) = MaxSearch(t - WinSize/2, WinSize, ResHigh)
     5.         if (max > Thresh)
     6.             push SC t
     7.     if (max > Thresh)
     8.         WinSize = WinMin
     9.         WinStart = t
    10.     else if (WinSize < WinMax)
    11.         WinSize = WinSize + WinDelta
    12.     else
    13.         WinStart = WinStart + WinStep

Figure 1: Algorithm for detecting multiple spectral changes.

By following an information and coding theory approach to statistical modeling and stochastic complexity, Rissanen [10] and Wallace and Freeman [14] proposed in the 80s two different criteria, respectively called Minimum Description Length (MDL) and Minimum Message Length (MML). Without going into the details of each method, which would require too much space, the penalty terms derived from each of the mentioned criteria are given in Table 1. For the sake of comparison, a version of Hotelling's T^2 test and the maximum likelihood method are also considered.

Hotelling's Test

The Hotelling T^2 (T2) test [13] computes the maximum likelihood estimate of a change point of the mean in the sample by:

    \hat{t} = \arg\max_{t=1,\dots,n-1} T_t^2
            = \arg\max_{t=1,\dots,n-1} \frac{t(n-t)}{n} (\bar{x}_1 - \bar{x}_2)' S_p^{-1} (\bar{x}_1 - \bar{x}_2)    (10)

where S_p is the pooled variance:

    S_p = \frac{1}{n-2} \left( t \hat{\Sigma}_1 + (n-t) \hat{\Sigma}_2 \right)    (11)

and (\bar{x}_1, \hat{\Sigma}_1) and (\bar{x}_2, \hat{\Sigma}_2) are, respectively, the sample mean and covariance estimates on x_1, ..., x_t and x_{t+1}, ..., x_n. The hypothesis of a change point can then be accepted with confidence level (1 - \alpha) if:

    \frac{n-d-1}{(n-2)d} T_t^2 \geq F_{d,n-d-1;\alpha}    (12)

where F_{d,n-d-1;\alpha} is the upper 100\alpha% point of the F-distribution with (d, n-d-1) degrees of freedom.

Maximum Likelihood Test

The Maximum Likelihood (ML) criterion corresponds to a model selection criterion with a zero penalty function. Hence, a SC is detected whenever the two-segment model fits the data better than the single-segment model.
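
For reference, the Hotelling test just described admits an equally short sketch (again illustrative, not from the paper, assuming NumPy and SciPy; the scatter matrices t \hat{\Sigma}_1 and (n-t) \hat{\Sigma}_2 are accumulated directly):

    import numpy as np
    from scipy.stats import f as f_dist

    def hotelling_change(x, alpha=0.05):
        """Most likely mean change point (eq. 10) with the F-test of eq. (12)."""
        n, d = x.shape
        best_t, best_T2 = None, -np.inf
        for t in range(d + 1, n - d):
            x1, x2 = x[:t], x[t:]
            diff = x1.mean(axis=0) - x2.mean(axis=0)
            s1 = (x1 - x1.mean(axis=0)).T @ (x1 - x1.mean(axis=0))  # t * Sigma_1
            s2 = (x2 - x2.mean(axis=0)).T @ (x2 - x2.mean(axis=0))  # (n-t) * Sigma_2
            sp = (s1 + s2) / (n - 2)                                # eq. (11)
            T2 = t * (n - t) / n * diff @ np.linalg.solve(sp, diff)
            if T2 > best_T2:
                best_T2, best_t = T2, t
        stat = (n - d - 1) / ((n - 2) * d) * best_T2
        accept = stat >= f_dist.ppf(1 - alpha, d, n - d - 1)        # eq. (12)
        return best_t, accept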
4. EVALUATION METRICS

Segmenting an audio stream, like a broadcast news program, requires in general the detection of spectral changes regarding:

• acoustic sources, i.e. female/male speech, music;
• acoustic channels, i.e. wide/narrow band.

According to [9], performance of automatic SC detection should be calculated with respect to a set of target SCs. To each target SC there is usually associated a time interval [S_SC, E_SC], rather than a single point. This is because silence or other non-speech events may occur between changes. Tolerances in detecting SCs can be introduced by extending such intervals. Hence, a hypothesized SC is considered correct if it falls inside one of the augmented target intervals [S_SC - tol, E_SC + tol], where tol is the admitted tolerance.

For comparing target and hypothesized SCs, one can adopt the recall and precision measures:

    \text{recall} = 100 \cdot \frac{a}{a+c}    (13)

    \text{precision} = 100 \cdot \frac{a}{a+b}    (14)

where a is the number of hypothesized SCs that fall inside the target SC intervals, b is the number of hypothesized SCs that do not fall inside any target SC interval, and c is the number of target SC intervals inside which no hypothesized SC falls.
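
Given the above definitions of a, b and c, the scoring itself is a few lines of plain Python (an illustrative sketch; targets are the (S_SC, E_SC) intervals, hypotheses are time points):

    def precision_recall(targets, hyps, tol):
        """Eqs. (13)-(14); returns percentages."""
        intervals = [(s - tol, e + tol) for s, e in targets]   # augmented targets
        a = sum(any(lo <= h <= hi for lo, hi in intervals) for h in hyps)
        b = len(hyps) - a                                      # false alarms
        c = sum(not any(lo <= h <= hi for h in hyps)
                for lo, hi in intervals)                       # missed targets
        return 100.0 * a / (a + b), 100.0 * a / (a + c)        # precision, recall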

5. EXPERIMENTS UNDER CONTROLLED CONDITIONS

Comparison of the model selection criteria was first performed under controlled conditions. Random samples of size n = 300 were generated according to different multivariate normal distributions, for values of the dimension d = 1, 5, 10. In particular, random samples were generated either by shifting the mean or by scaling the variances of a standard normal distribution.

Mean Shifting

Random samples of size n = 300 were generated according to the following scheme:

    x_1, \dots, x_{n/2} \sim N_d(0, I)    (15)
    x_{n/2+1}, \dots, x_n \sim N_d(\alpha \mathbf{1}, I)    (16)

with \alpha = 0.1, 0.2, ..., 0.5, where \mathbf{1} denotes the all-ones vector.

Variance Scaling

Random samples of size n = 300 were generated according to the following scheme:

    x_1, \dots, x_{n/2} \sim N_d(0, I)    (17)
    x_{n/2+1}, \dots, x_n \sim N_d(0, \alpha I)    (18)

with \alpha = 0.5, 0.6, ..., 0.9.

Experimental Conditions

The basic segmentation algorithm (c = 1) was applied to the above problems with a slight variation: the two-segment models (7) were only evaluated on the central third of the data set, i.e. n/3 <= t <= 2n/3, in order to reliably compute the model parameters. For each focus condition, multiple data samples were generated. Since for each condition the correct model has a diagonal covariance matrix, the number of free parameters k was set equal to 2d. Performance in terms of precision/recall was computed, for each condition, by considering transition detections correct if they fell within the interval [140, 160]. Moreover, each method was also evaluated on homogeneous data samples generated according to a standard normal distribution; the statistic b of equation (14) was then estimated by counting the number of hypothesized transition points found in the homogeneous samples. Finally, the T2 method was only applied to detect mean shifts in the data, with a confidence level 1 - \alpha = 0.95.

Experimental Results

Experimental results are reported in Figure 2. The three vertical plots on the left side correspond to experiments applying mean shifts, while the three plots on the right correspond to experiments applying variance scalings. Increasing values of the dimension d of the data are considered going from the top to the bottom plots. Each plot shows the precision versus recall performance of each criterion under different shifting/scaling conditions. Vertical slices correspond, going rightward, to easier segmentation tasks. According to the definition of the precision/recall measures, the best performing methods are those closest to the top-right corner of a slice.

Looking at the two upper plots, which correspond to dimension d = 1, it turns out that the best two performing criteria are  and . MML follows with a lower recall, which gets closer to the best methods as the task gets easier. With a slightly better precision, but much lower recall, , T2 (just for mean shifting), and  follow in that order. CAICF often keeps abreast of the best methods in terms of recall, but scores much lower in terms of precision.

Results change significantly for dimensions d = 5 and d = 10:  performs significantly better than , especially in the mean shifting case. MML worsens significantly and gets close to the best methods only in the easiest variance scaling case. T2 provides the best precision/recall trade-off under the mean shifting conditions.  shows a good precision-recall trade-off for both dimensions and conditions; in particular, it yields the highest precision under the most difficult conditions (leftmost plot slices).
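
For concreteness, the two sampling schemes (15)-(18) and the basic detector can be wired together as follows. This is a sketch assuming NumPy and the detect_single_change function sketched in Section 2; the restriction of the search to the central third of the sample is omitted for brevity.

    import numpy as np

    rng = np.random.default_rng(1)

    def mean_shift_sample(n, d, alpha):
        """Scheme (15)-(16): shift every component of the mean by alpha."""
        return np.vstack([rng.normal(size=(n // 2, d)),
                          rng.normal(loc=alpha, size=(n - n // 2, d))])

    def variance_scale_sample(n, d, alpha):
        """Scheme (17)-(18): scale the variances by alpha."""
        return np.vstack([rng.normal(size=(n // 2, d)),
                          rng.normal(scale=np.sqrt(alpha), size=(n - n // 2, d))])

    n, d = 300, 5
    x = mean_shift_sample(n, d, alpha=0.3)
    t_hat, score = detect_single_change(x)   # from the Section 2 sketch
    print(t_hat, score > 0)                  # a hit if t_hat falls within the
                                             # tolerance interval around n/2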
6. EXPERIMENTS ON REAL DATA

Experiments with all the segmentation criteria were performed on audio data coming from a broadcast news database. The aim is to detect spectral changes occurring within the signal that are mainly due to channel and source switches.

The IBNC Corpus

For testing purposes, data from the IBNC (Italian Broadcast News Corpus) database, developed at ITC-irst [8], were employed. The IBNC consists of 30 hours of radio news recordings, which were manually transcribed, segmented and labeled.

The test set consists of six radio news programs that were selected as a representative sample of the whole corpus with respect to all the issues concerning automatic broadcast news transcription [4]. Table 2 reports statistics on the segments of the test set. A segment is defined as a contiguous portion of audio signal, homogeneous in terms of acoustic source and channel.

[Table 2: Statistics of segments in the test set; per-class counts and average durations in seconds for music and speech segments. Numeric entries not recoverable from the transcription.]

The test set contains a total of 212 SCs (218 segments distributed among the six news programs).

Experimental Conditions

Multivariate observations of dimension 13 were used, i.e. 12 mel-scaled cepstral coefficients and the log-energy. SC detection was performed using a tolerance value of 500 ms. Multiple SC detection was performed by means of the algorithm shown in Figure 1. Moreover, in order to compute a precision/recall operating curve for each method, an empirical threshold was introduced in the decision criteria (9) and (12); the threshold can in fact be seen as an empirically estimated additional penalty for the method. Different values of the threshold were tested and the resulting precision/recall statistics were computed.
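
Since the threshold acts as an additive penalty, sweeping it only changes the final comparison of rule (9). The sketch below reuses the earlier illustrative helpers (detect_single_change with its default BIC penalty); in the full system the decision is applied window by window through the algorithm of Figure 1.

    import numpy as np

    rng = np.random.default_rng(2)

    # Ten windows: five homogeneous, five with a mean shift halfway through
    windows = [rng.normal(size=(300, 5)) for _ in range(5)]
    windows += [np.vstack([rng.normal(size=(150, 5)),
                           rng.normal(loc=0.8, size=(150, 5))])
                for _ in range(5)]

    # The eq. (9) score of each window, computed once
    scores = [detect_single_change(w)[1] for w in windows]

    # Each threshold value yields one precision/recall operating point
    for thresh in (0.0, 10.0, 20.0, 40.0):
        print(thresh, sum(s > thresh for s in scores))
        # higher thresholds suppress weaker (often spurious) change points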

Experimental Results

Precision vs. recall points of each method are shown in Figure 3. As a reference, complete curves are plotted for the  and T2 methods. The leftmost points of all the model selection criteria correspond to setting the threshold to its original value, i.e. zero. By looking at Figure 3, the following can be observed:

• straightforward application of the methods to audio data provides high recall but very low precision;
• by suitably tuning the threshold value of each single method, much better performance can be achieved;
• optimal values of the threshold make all methods, with the exception of T2, perform comparably well;
• T2 performs significantly worse than all other methods; moreover, no improvement was achieved even by using a universal pooled variance estimated as suggested in [15];
• , , and  confirm to be among the best performing methods;
• the pure ML method, empirically tuned, performs as well as the best model selection methods.

7. CONCLUSIONS

Several model selection methods for acoustic segmentation were presented and tested, both on synthetic and on real audio data. Tangible differences among the methods appeared in the experiments performed under controlled conditions. In particular, methods with simple penalty functions proved to perform better with multivariate data. Methods based on the Fisher information (i.e. MML and CAICF) did not prove competitive against simpler methods, at least on the segmentation problems considered here.

Application of any method to real audio data requires introducing an empirical threshold on the decision criterion. Tuning the threshold of each method permits significantly better retrieval performance to be achieved. Almost all the considered methods reached very similar optimal performance. Besides, the methods which performed best on the synthetic data sets also worked well on the audio data.

To conclude, a major point in applying the considered methods to audio data concerns their robustness with respect to the normality assumption on the data source. With the introduction of an empirical threshold in the decision criterion, all the tested selection criteria proved to be reasonably robust. Future work will be devoted to the development and evaluation of non-parametric methods for the acoustic segmentation problem.

ACKNOWLEDGMENTS

The work presented here was carried out within the European project CORETEX (IST-1999-11876). The authors thank R. A. Baxter and D. Giuliani for their help and useful suggestions.

REFERENCES

[1] H. Akaike. On entropy maximization principle. In P. R. Krishnaiah, editor, Applications of Statistics, pages 27-41. North-Holland, Amsterdam, The Netherlands, 1977.
[2] R. A. Baxter. Minimum Message Length Inference: Theory and Applications. PhD thesis, Department of Computer Science, Monash University, Clayton, Victoria, Australia.
[3] H. Bozdogan. Model selection and Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3):345-370, 1987.
[4] F. Brugnara, M. Cettolo, M. Federico, and D. Giuliani. A system for the segmentation and transcription of Italian radio news. In Proceedings of RIAO, Content-Based Multimedia Information Access, Paris, France, 2000.
[5] M. Cettolo. Segmentation, classification and clustering of an Italian broadcast news corpus. In Proceedings of the RIAO International Conference, Paris, France, 2000.
[6] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices and Groups. Springer-Verlag, Berlin, Germany.
[7] P. Delacourt, D. Kryze, and C. J. Wellekens. Speaker-based segmentation for audio data indexing. In Proceedings of the ESCA ETRW Workshop on Accessing Information in Spoken Audio, Cambridge, UK, 1999.
[8] M. Federico, D. Giordani, and P. Coletti. Development and evaluation of an Italian broadcast news corpus. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, 2000.
[9] D. Liu and F. Kubala. Fast speaker change detection for broadcast news transcription and indexing. In Proceedings of the 6th European Conference on Speech Communication and Technology, Budapest, Hungary, 1999.
[10] J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society, Series B, 49(3):223-239, 1987.
[11] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[12] G. A. F. Seber. Multivariate Observations. John Wiley & Sons, New York, NY, 1984.
[13] M. S. Srivastava and E. M. Carter. An Introduction to Applied Multivariate Statistics. North-Holland, New York, NY, 1983.
[14] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B, 49(3):240-265, 1987.
[15] S. Wegmann, P. Zhan, and L. Gillick. Progress in broadcast news transcription at Dragon Systems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 33-36, Phoenix, AZ, 1999.

[Figure: six precision vs. recall plots, one per combination of condition (mean shifting, \alpha = 0.1, ..., 0.5; variance scaling, \alpha = 0.9, ..., 0.5) and dimension (d = 1, 5, 10).]

Figure 2: Results of experiments under controlled conditions.

[Figure: precision vs. recall operating points and curves for the aic, bic, caic, caicf, mdl, ml, mml and t2 methods.]

Figure 3: Precision vs. recall curve by the different methods on an audio segmentation task.
