MODEL SELECTION CRITERIA FOR ACOUSTIC SEGMENTATION

Mauro Cettolo and Marcello Federico
ITC-irst - Centro per la Ricerca Scientifica e Tecnologica
I-38050 Povo, Trento, Italy

ABSTRACT

Robust acoustic segmentation has become a critical issue in applying speech recognition to audio streams with variable acoustic content, e.g. radio programs. Many techniques in the literature base segmentation on statistical model selection, by applying the Bayesian Information Criterion. This work reviews alternative model selection criteria and presents comparative experiments both under controlled conditions and on a broadcast news corpus.

1. INTRODUCTION

The problem of acoustic segmentation and classification has become crucial to the application of automatic speech recognition to audio stream processing. For instance, in order to generate transcripts of broadcast news programs, it is necessary to isolate and filter out portions of the signal which do not contain speech, e.g. jingles and signature tunes. Moreover, transcription accuracy can be significantly improved by using condition-dependent acoustic models, if the speech signal is segmented and classified according to bandwidth, speaker gender, and speaker identity.

In recent years, several algorithms have been presented which use a statistical decision criterion to detect spectral changes (SCs) within the feature space of the signal. Assuming that data are generated by a Gaussian process, SCs are detected within a sliding window through a model selection method. The most likely SC is tested by comparing two hypotheses: (i) the window contains data generated by the same distribution; (ii) the left and right semi-windows, with respect to the SC point, contain data drawn from two different distributions. The test is performed with a likelihood ratio that, besides the maximum likelihood of each hypothesis, takes into account the different size of the corresponding models.
Usually, the Bayesian Information Criterion (BIC) [11] is applied to select the simplest and best fitting model. This paper reviews alternative model selection criteria and presents comparative experiments both on synthetic data and real audio data.

2. SEGMENTATION PROBLEM

Acoustic segmentation can be seen as a particular instance of the more general problem of partitioning data into distinct homogeneous regions [2]. The data partitioning problem arises in all applications that require partitioning data into chunks, e.g. image processing, data mining, text processing, etc.

The problem can be formulated as follows. Let $x_1, \ldots, x_n$ be an ordered sample of data in the space $\mathbb{R}^d$. We assume that the data are generated by a Gaussian process with at most a given number of transitions. The problem of segmentation is that of detecting all the transition points in the data set. The general problem can be approached, without loss of generality, by first considering the simplest case, in which at most one transition occurs.

Single Transition Detection

The search for one potential transition point goes through the definition of different statistical models: two-segment models $M_i$ ($i = 1, \ldots, n-1$), each of them assuming:

\[ x_1, \ldots, x_i \sim \text{i.i.d. } N_d(\mu_1, \Sigma_1) \tag{1} \]
\[ x_{i+1}, \ldots, x_n \sim \text{i.i.d. } N_d(\mu_2, \Sigma_2) \tag{2} \]

and one single-segment model $M_0$, which assumes:

\[ x_1, \ldots, x_n \sim \text{i.i.d. } N_d(\mu, \Sigma) \tag{3} \]

The basic idea is to choose the model $M_i$ ($0 \le i \le n-1$) that better fits the observations. The application of the maximum likelihood principle would however invariably lead to choosing one of the two-segment models, and hence to hypothesizing a break point at some $i > 0$, as they have a higher number of free parameters than the one-segment model. In order to take into account the notion of dimension of the model, the following extension to the maximum likelihood principle was first proposed by Akaike [1]. The AIC (Akaike's Information Criterion) suggests maximizing the likelihood for each model $M_i$ separately, obtaining say $L_i$, and then choosing the model for which $\log L_i - k_i$ is largest, where $k_i$ is the dimension of the model.

Computations

Given a sample $x_1, \ldots, x_n \sim$ i.i.d. $N_d(\mu, \Sigma)$,
the likelihood function achieves the maximum value [12]:

\[ L(\hat\mu, \hat\Sigma) = (2\pi)^{-nd/2} \, |\hat\Sigma|^{-n/2} \, \exp(-nd/2) \tag{4} \]

at $\hat\mu = \bar{x}$, the sample mean, and

\[ \hat\Sigma = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' \tag{5} \]

the maximum-likelihood estimate of the covariance matrix. The number of free parameters of a multivariate normal distribution is equal to the dimension of the mean plus the number of variances and covariances to be estimated. For a full covariance matrix it is:

\[ k = d + \frac{d(d+1)}{2} \tag{6} \]

Name   Author             Year  Reference  Penalty
AIC    Akaike             1972  [1]        $k$
BIC    Schwarz            1978  [11]       $\frac{k}{2}\log n$
CAIC   Bozdogan           1987  [3]        $\frac{k}{2}\log n + \frac{k}{2}$
CAICF  Bozdogan           1987  [3]        $k + \frac{k}{2}\log n + \frac{1}{2}\log|I(\hat\theta)|$
MDL    Rissanen           1987  [10]       $\frac{k}{2}\log n + (\frac{k}{2}+1)\log(k+2)$
MML    Wallace & Freeman  1987  [14]       $\frac{k}{2}(1+\log\kappa_k) + \frac{1}{2}\log|I(\hat\theta)|$

Notation:
  $k$         Number of free parameters in the model.
  $n$         Size of the data sample.
  $d$         Dimension of the data space.
  $I(\theta)$ Fisher Information matrix of the model.
  $\kappa_k$  Constant of the optimal $k$-dimensional quantizing lattice [6].

Table 1: Model Dimension Estimates.

Decision Rule

Several model selection criteria have been proposed in the literature that can be applied to Akaike's framework of model selection. In general, each criterion proposes a penalty function $P$ that takes into account the model dimension. By computing the likelihood function of each model, the following decision rule can be derived. Look for the best two-segment model for the data:

\[ \hat\imath = \arg\max_{i=1,\ldots,n-1} \left( \log L_i - P_i \right) = \arg\max_{i=1,\ldots,n-1} \left( -\frac{i}{2}\log|\hat\Sigma_1| - \frac{n-i}{2}\log|\hat\Sigma_2| - P_i \right) \tag{7} \]

then take the one-segment model function:

\[ \log L_0 - P_0 = -\frac{n}{2}\log|\hat\Sigma| - P_0 \tag{8} \]

and choose to segment the data at point $\hat\imath$ if and only if:

\[ (\log L_{\hat\imath} - \log L_0) - (P_{\hat\imath} - P_0) > 0 \tag{9} \]

Here $\hat\Sigma_1$ and $\hat\Sigma_2$ are the estimates (5) computed on $x_1, \ldots, x_i$ and on $x_{i+1}, \ldots, x_n$, $P_i$ and $P_0$ are the penalties of the two-segment and one-segment models, and the additive term $-\frac{nd}{2}(1 + \log 2\pi)$, which by (4) is shared by all models, is omitted from (7) and (8). In the experimental part it will be shown that performance of the rule can be tuned by replacing the zero threshold with a value to be empirically estimated.

Multiple Transition Detection

The extension of the method to an arbitrarily large number of potential segments requires considering a number of competing models that grows combinatorially with the sample size and the number of transitions. In general, application-dependent simplifications are introduced to reduce the complexity of the problem.
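Before moving on to the sliding-window procedure, the single-transition rule of (7)-(9) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes diagonal covariance matrices (so that $k = 2d$ and the log-determinant is a sum of log-variances), it gives the two-segment model twice the parameters of the one-segment model, and it evaluates both penalties on the full sample size. The names `log_det_diag`, `aic_penalty`, `bic_penalty`, `detect_single_change`, and the `margin` parameter are hypothetical.

```python
import math


def log_det_diag(rows):
    """log|Sigma_hat| of the ML covariance, ASSUMING a diagonal covariance."""
    n, d = len(rows), len(rows[0])
    total = 0.0
    for j in range(d):
        mean = sum(r[j] for r in rows) / n
        var = sum((r[j] - mean) ** 2 for r in rows) / n
        total += math.log(max(var, 1e-12))  # floor avoids log(0) on constant data
    return total


def aic_penalty(k, n):
    """AIC penalty of Table 1: the number of free parameters."""
    return float(k)


def bic_penalty(k, n):
    """BIC penalty of Table 1: (k / 2) log n."""
    return 0.5 * k * math.log(n)


def detect_single_change(rows, penalty=bic_penalty, threshold=0.0, margin=10):
    """Apply rule (7)-(9); return (i, gain).

    i is the size of the left segment, or None when no change is accepted.
    margin (an assumption of this sketch) keeps both segments large enough
    for the variance estimates to be meaningful.
    """
    n, d = len(rows), len(rows[0])
    k = 2 * d  # d means + d variances for one diagonal Gaussian
    one_segment = -0.5 * n * log_det_diag(rows) - penalty(k, n)      # eq. (8)
    best_i, best_score = None, -math.inf
    for i in range(margin, n - margin + 1):                          # eq. (7)
        score = (-0.5 * i * log_det_diag(rows[:i])
                 - 0.5 * (n - i) * log_det_diag(rows[i:])
                 - penalty(2 * k, n))
        if score > best_score:
            best_i, best_score = i, score
    gain = best_score - one_segment                                  # eq. (9)
    return (best_i if gain > threshold else None), gain
```

Passing a different `penalty` function swaps the criterion, and a positive `threshold` plays the role of the empirically tuned value mentioned above.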
For the acoustic segmentation, the audio signal can be segmented through a sliding window. By keeping the window size sufficiently large to reliably apply the method, and sufficiently short to avoid multiple transitions, a segmentation algorithm can be devised that relies on the basic single-transition case. In Figure 1 an algorithm is proposed [5] that was derived from the one described in [7]. The main idea is to have a shifting variable-size window in which an SC can be hypothesized according to (9). To reduce computations, the maximization (7) is not computed over all points, but at a lower resolution rate. The resolution rate is increased when a potential SC is detected, in order to validate it and refine its time position.

Let us refer to Figure 1. The starting window (WinMin) has to be small enough to contain no more than one SC, but large enough to allow reliable statistics of the criterion to be computed. It is located at the beginning of the input audio stream. Values of the criterion are computed with a low resolution rate (ResLow), e.g. every 3 observations (step 2.). The window is enlarged (11.) until a potential SC is detected (3.), or a maximum size (WinMax) is reached (10.). In the first case, the potential SC is validated by computing the criterion values on the window centered around the candidate, and using a higher resolution rate (ResHigh) (4.). In the second case, the window is shifted to the right (13.). If the potential SC is validated (5., 7.) it is stored (6.); then, the window is set to the minimum size (8.) and placed just after the detected SC (9.). Steps 2-13 are repeated until all the input audio data have been processed (1.).

3. MODEL SELECTION CRITERIA

Several model selection criteria have been proposed starting from the early 1970s. As mentioned before, the seminal work of Akaike tried to extend the maximum likelihood principle with a term that estimates the dimension or complexity of the considered statistical model.
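Returning to the sliding-window procedure of Section 2 (Figure 1), the following Python sketch mirrors its control flow for one-dimensional data. It is only an illustration under several assumptions: indices are 0-based, the criterion inside the hypothetical `max_search` is a univariate BIC-penalized gain, the comparison operators are the natural reading of Figure 1, and the validation window is clamped so that it never starts before the current window (a safeguard added here so that an already stored SC cannot be detected twice). All default parameter values are arbitrary.

```python
import math


def _log_var(seg):
    """Log of the ML variance of a list of floats (floored to avoid log(0))."""
    n = len(seg)
    mean = sum(seg) / n
    return math.log(max(sum((v - mean) ** 2 for v in seg) / n, 1e-12))


def max_search(x, start, size, res, margin=5):
    """Best candidate SC in x[start:start+size], scanned every `res` points.

    Returns (gain, t): gain is the penalized log-likelihood gain of splitting
    the window, t is the absolute index of the first sample after the split.
    """
    win = x[start:start + size]
    n = len(win)
    if n < 2 * margin:
        return -math.inf, None
    base = -0.5 * n * _log_var(win)
    penalty = 0.5 * 2 * math.log(n)  # BIC with 2 extra parameters (mean, variance)
    best_gain, best_t = -math.inf, None
    for i in range(margin, n - margin + 1, res):
        split = -0.5 * i * _log_var(win[:i]) - 0.5 * (n - i) * _log_var(win[i:])
        gain = split - base - penalty
        if gain > best_gain:
            best_gain, best_t = gain, start + i
    return best_gain, best_t


def segment(x, win_min=100, win_max=400, win_delta=50, win_step=50,
            res_low=10, res_high=1, thresh=0.0):
    """Detect multiple SCs following the step numbers of Figure 1."""
    changes = []
    win_start, win_size = 0, win_min
    while win_start + win_size <= len(x):                            # step 1
        score, t = max_search(x, win_start, win_size, res_low)       # step 2
        validated = False
        if score > thresh:                                           # step 3
            centred = max(t - win_size // 2, win_start)              # clamp: see lead-in
            score, t = max_search(x, centred, win_size, res_high)    # step 4
            if score > thresh:                                       # step 5
                changes.append(t)                                    # step 6
                validated = True
        if validated:                                                # step 7
            win_size = win_min                                       # step 8
            win_start = t + 1                                        # step 9
        elif win_size < win_max:                                     # step 10
            win_size += win_delta                                    # step 11
        else:                                                        # step 12
            win_start += win_step                                    # step 13
    return changes
```

Every iteration either moves the window start forward or enlarges a bounded window, so the loop terminates on any finite input.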
Parameters:
  WinMin:   minimum window size
  WinMax:   maximum window size
  WinDelta: window increase step
  WinStep:  window shift step
  ResLow:   low resolution
  ResHigh:  high resolution
  N:        input data length
  Thresh:   threshold for the used criterion
Variables:
  WinStart: left boundary of the window
  WinSize:  current window size
  SC:       detected spectral changes
Subroutine:
  MaxSearch(WinStart,WinSize,Res): returns the best potential SC and its
  score, computed by a given criterion, inside the specified window at the
  given resolution Res.
Initialization:
  WinStart=1
  WinSize=WinMin
  SC=()
Algorithm:
   1. while (WinStart+WinSize <= N)
   2.   (max,t)=MaxSearch(WinStart,WinSize,ResLow)
   3.   if (max > Thresh)
   4.     (max,t)=MaxSearch(t - WinSize/2,WinSize,ResHigh)
   5.     if (max > Thresh)
   6.       push SC t
   7.   if (max > Thresh)
   8.     WinSize=WinMin
   9.     WinStart=t + 1
  10.   else if (WinSize < WinMax)
  11.     WinSize=WinSize + WinDelta
  12.   else
  13.     WinStart=WinStart + WinStep

Figure 1: Algorithm for detecting multiple spectral changes.

Refinements to Akaike's Information Criterion (AIC) were proposed by Schwarz [11], with the Bayesian Information Criterion (BIC), and by Bozdogan [3], with the Consistent AIC (CAIC) and the Consistent AIC with Fisher information (CAICF). By following an information and coding theory approach to statistical modeling and stochastic complexity, Rissanen [10] and Wallace and Freeman [14] proposed in the 1980s two different criteria, respectively called Minimum Description Length (MDL) and Minimum Message Length (MML). Without going into the details of each method, which would require too much space, the penalty terms derived by each of the mentioned criteria are given in Table 1. For the sake of comparison, a version of the Hotelling's test and the maximum likelihood method are also considered.

Hotelling's Test

The Hotelling's (T2) test [13] computes the maximum likelihood estimate of a changing point of the mean in the sample by:

\[ \hat\imath = \arg\max_{i=1,\ldots,n-1} T_i^2 = \arg\max_{i=1,\ldots,n-1} \frac{i(n-i)}{n} \, (\bar{x}_1 - \bar{x}_2)' \, S^{-1} \, (\bar{x}_1 - \bar{x}_2) \tag{10} \]

where $S$ is the pooled variance:

\[ S = \frac{1}{n-2} \left( i \, \hat\Sigma_1 + (n-i) \, \hat\Sigma_2 \right) \tag{11} \]

and $\bar{x}_1$, $\hat\Sigma_1$ and $\bar{x}_2$, $\hat\Sigma_2$ are, respectively, the sample means and covariance estimates on $x_1, \ldots, x_i$ and $x_{i+1}, \ldots, x_n$.
The hypothesis of a changing point can again be accepted with a confidence level $(1 - \alpha)$ if:

\[ \frac{n-d-1}{d(n-2)} \, T_{\hat\imath}^2 > F_{d,n-d-1,\alpha} \tag{12} \]

where $F_{d,n-d-1,\alpha}$ is the upper $\alpha \cdot 100\%$ point of the F-distribution with $(d, n-d-1)$ degrees of freedom.

Maximum Likelihood Test

The Maximum Likelihood (ML) criterion corresponds to a model selection criterion with a zero penalty function. Hence, an SC is detected if the two-segment model fits the data better than the single-segment model.

4. EVALUATION METRICS

Segmenting an audio stream, like a broadcast news program, requires in general detecting spectral changes regarding: (i) acoustic sources, i.e. female/male speech, music; (ii) acoustic channels, i.e. wide/narrow band. According to [9], performance of automatic SC detection should be calculated with respect to a set of target SCs. Each target SC is usually associated with a time interval $[t_s, t_e]$, rather than with a single point. This is because silence or other non-speech events may occur between changes. Tolerances in detecting SCs can be introduced by extending such intervals. Hence, a hypothesized SC is considered correct if it falls inside one of the augmented target intervals $[t_s - \mathrm{tol}, \, t_e + \mathrm{tol}]$, where $\mathrm{tol}$ is the admitted tolerance.

For comparing target and hypothesized SCs, one can adopt the recall and precision measures:

\[ \text{recall} = \frac{c}{c + m} \tag{13} \]
\[ \text{precision} = \frac{c}{c + f} \tag{14} \]

where $c$ is the number of hypothesized SCs that fall inside the target SC intervals, $f$ is the number of hypothesized SCs that do not fall inside any target SC interval, and $m$ is the number of target SC intervals which no hypothesized SC falls inside.

5. EXPERIMENTS UNDER CONTROLLED CONDITIONS

Comparison of model selection criteria was first performed under controlled conditions. Random samples of size $n$ were generated according to different multivariate normal distributions, and for several values of the dimension $d$. In particular, random samples were generated either by shifting the mean or by scaling the variances of a standard normal distribution.
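The recall and precision measures of Section 4 can be computed as in the following Python sketch. The function name `precision_recall` and the convention of returning 0.0 when a denominator is empty are assumptions of this illustration. Target SCs are given as (start, end) time intervals and are widened by the tolerance on both sides before matching.

```python
def precision_recall(hypothesized, targets, tol=0.0):
    """Return (precision, recall) as in equations (14) and (13).

    hypothesized: list of hypothesized SC times.
    targets:      list of (start, end) target SC intervals.
    tol:          admitted tolerance added on both sides of each interval.
    """
    augmented = [(start - tol, end + tol) for start, end in targets]

    def inside_any(t):
        return any(start <= t <= end for start, end in augmented)

    # Hypothesized SCs that fall inside some augmented target interval.
    correct = sum(1 for t in hypothesized if inside_any(t))
    # Hypothesized SCs that fall inside no target interval.
    false_alarms = len(hypothesized) - correct
    # Target intervals that contain no hypothesized SC.
    missed = sum(
        1 for start, end in augmented
        if not any(start <= t <= end for t in hypothesized)
    )

    recall = correct / (correct + missed) if correct + missed else 0.0
    precision = correct / (correct + false_alarms) if correct + false_alarms else 0.0
    return precision, recall
```

For example, with hypothesized SCs at 1.0, 5.2 and 9.0, target intervals (0.9, 1.1), (5.0, 5.1) and (7.0, 7.2), and a tolerance of 0.2, two hypotheses are correct, one is a false alarm and one target is missed, so both measures equal 2/3.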

Mean Shifting

Random samples of size $n$ =3 were generated according to the following scheme:

\[ x_1, \ldots, x_{n/2} \sim \text{i.i.d. } N_d(\mathbf{0}, I) \tag{15} \]
\[ x_{n/2+1}, \ldots, x_n \sim \text{i.i.d. } N_d(\mathbf{0} + \alpha\mathbf{1}, I) \tag{16} \]

with $\alpha$ = .1, ..., .5.

Variance Scaling

Random samples of size $n$ =3 were generated according to the following scheme:

\[ x_1, \ldots, x_{n/2} \sim \text{i.i.d. } N_d(\mathbf{0}, I) \tag{17} \]
\[ x_{n/2+1}, \ldots, x_n \sim \text{i.i.d. } N_d(\mathbf{0}, \Sigma_\alpha) \tag{18} \]

where $\Sigma_\alpha$ denotes the identity covariance with its variances scaled through $\alpha$, with $\alpha$ = .5, .6, ..., .9.

Experimental Conditions

The basic single-transition segmentation algorithm was applied to the above problems with a slight variation. Two-segment models (7) were only evaluated on the central third of the data set, i.e. for $i$ between $n/3$ and $2n/3$. This was done to reliably compute the model parameters. Data samples were generated for each focus condition. Finally, as for each condition the correct model has a diagonal covariance matrix, the number of free parameters $k$ was set equal to $2d$.

Performance in terms of precision/recall was computed, for each condition, by assuming transition detections correct if they fall within a fixed interval around the true transition point. Moreover, each method was also evaluated on homogeneous data samples generated according to a standard normal distribution. Hence, the statistic $f$ of equation (14), i.e. the number of hypothesized SCs that fall inside no target interval, was estimated by counting the number of hypothesized transition points found in the homogeneous samples. Finally, the T2 method was only applied to detect mean shifts in the data, at a fixed confidence level.

Experimental Results

Experimental results are reported in Figure 2. The three vertical plots on the left side correspond to experiments applying mean shifts, while the three plots on the right correspond to experiments applying variance scalings. Increasing values of the dimension of the data are considered going from the top to the bottom plots. Each single plot shows precision versus recall performance of each criterion, under different shifting/scaling conditions. Vertical slices correspond, going rightward, to easier segmentation tasks. According to the definition of the precision/recall measures, the best performing methods are those which are closest to the top-right corner of a slice.
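The two sampling schemes of this section (mean shifting and variance scaling) can be mimicked with the short Python sketch below. It rests on assumptions that should be checked against the original formulas: the transition is placed at the middle of the sample, the mean shift adds the same amount to every component, and the scale factor multiplies each variance directly (so the standard deviation is its square root). The function names and the `rng` argument are hypothetical.

```python
import math
import random


def mean_shift_sample(n, d, alpha, rng):
    """First half ~ N(0, I); second half ~ N(alpha * 1, I)."""
    half = n // 2
    first = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(half)]
    second = [[rng.gauss(alpha, 1.0) for _ in range(d)] for _ in range(n - half)]
    return first + second


def variance_scaling_sample(n, d, alpha, rng):
    """First half ~ N(0, I); second half has every variance multiplied by alpha.

    ASSUMPTION: alpha scales the variances, not the standard deviations.
    """
    half = n // 2
    sd = math.sqrt(alpha)
    first = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(half)]
    second = [[rng.gauss(0.0, sd) for _ in range(d)] for _ in range(n - half)]
    return first + second


def homogeneous_sample(n, d, rng):
    """Sample with no transition, used to count false detections."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
```

A seeded generator such as `random.Random(0)` can be passed as `rng` to make the samples reproducible across criteria.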
By looking at the two upper plots, which correspond to dimension $d = 1$, the two best performing criteria are [...] and [...]. MML follows with a lower recall, which gets closer to the best methods as the task gets easier. With a slightly better precision, but much lower recall, [...], T2 (just for mean shifting), and [...] follow in that order. CAICF often keeps abreast of the best methods in terms of recall, but scores much lower in terms of precision.

Results change significantly by looking at the two higher dimensions: [...] performs significantly better than [...], especially for the mean shifting case. MML worsens significantly and gets close to the best methods just in the easiest variance scaling case. T2 provides the best precision/recall trade-off on the mean shifting conditions. [...] shows a good precision-recall trade-off on both dimensions and conditions. In particular, [...] shows the highest precision on the most difficult conditions (leftmost plot slices).

6. EXPERIMENTS ON REAL DATA

Experiments with all the segmentation criteria were performed on audio data coming from a broadcast news database. The aim is to detect spectral changes that occur within the signal and that are mainly due to channel and source switches.

The IBNC Corpus

For testing purposes, data from the IBNC (Italian Broadcast News Corpus) database, developed at ITC-irst [8], were employed. The IBNC consists of 3 hours of radio news recordings, which were manually transcribed, segmented and labeled. The test set consists of six radio news programs (about [...] minutes of audio signal) that were selected as a representative sample of the whole corpus, with respect to all the issues concerning automatic broadcast news transcription [4]. Table 2 reports statistics on the test set regarding segments. A segment is defined as a contiguous portion of audio signal, homogeneous in terms of acoustic source and channel.

                  #     average duration (s)
music segments    17    2.
speech segments   201   22.3

Table 2: Statistics of segments in the test set.
The test set contains a total of 212 SCs (218 segments distributed among six news programs).

Experimental Conditions

Multivariate observations of dimension 13 were used, i.e. 12 mel-scaled cepstral coefficients and the log-energy. SC detection was performed by using a tolerance value of 5ms. Multiple SC detection was performed by means of the algorithm shown in Figure 1. Moreover, in order to compute a precision/recall operating curve of each method, an empirical threshold was introduced in the decision criteria (9) and (12). In fact, the threshold can be seen as an empirically estimated additional penalty to the method. Different values of the threshold were tested and the resulting precision/recall statistics were computed.

Experimental Results

Precision vs. recall points of each method are shown in Figure 3. As a reference, complete curves are plotted for the [...] and T2 methods. The leftmost points of all the model selection criteria correspond to setting the threshold to the original value, i.e. zero. By looking at Figure 3 the following can be observed:
  - straightforward application of the methods on audio data provides high recall but very low precision;
  - by suitably tuning the threshold value, on each single method, much better performance can be achieved;
  - optimal values of the threshold make all methods, with the exception of T2, perform comparably well;
  - T2 performs significantly worse than all other methods. Moreover, no improvement was achieved even by using a universal pooled variance estimated as suggested in [15];
  - [...], [...], and [...] are confirmed to be among the best performing methods;
  - the pure empirically tuned ML method performs as well as the best model selection methods.

7. CONCLUSIONS

Several model selection methods for acoustic segmentation were presented and tested, both on synthetic and real audio data. Tangible differences among the methods appeared in experiments performed under controlled conditions. In particular, methods with simple penalty functions performed better with multivariate data. Methods based on the Fisher information (i.e. MML and CAICF) did not prove competitive against simpler methods, at least on the segmentation problems considered here.

Application of any method on real audio data requires introducing an empirical threshold on the decision criterion. Tuning the threshold on each method makes it possible to achieve significantly better retrieval performance. Almost all the considered methods reached very similar optimal performance. Besides, methods which performed best on the synthetic data sets also worked well on the audio data.

To conclude, a major point in applying the considered methods on audio data concerns their robustness with respect to the normality assumption on the data source. With the introduction of an empirical threshold in the decision criterion, all the tested selection criteria proved to be reasonably robust.
Future work will be devoted to the development and evaluation of non-parametric methods for the acoustic segmentation problem.

8. ACKNOWLEDGMENTS

The work presented here was carried out within the European project CORETEX (IST-1999-11876). The authors thank R. A. Baxter and D. Giuliani for their help and useful suggestions.

REFERENCES

[1] H. Akaike. On entropy maximization principle. In P. R. Krishnaiah, editor, Applications of Statistics, pages 27-41. North-Holland, Amsterdam, Netherlands, 1977.
[2] R. A. Baxter. Minimum Message Length Inference: Theory and Applications. PhD thesis, Department of Computer Science, Monash University, Clayton, Victoria, Australia, 1996.
[3] H. Bozdogan. Model selection and the Akaike's information criterion (AIC): the general theory and its analytical extensions. Psychometrika, 52(3):345-370, 1987.
[4] F. Brugnara, M. Cettolo, M. Federico, and D. Giuliani. A system for the segmentation and transcription of Italian radio news. In Proceedings of RIAO Content-Based Multimedia Information Access, Paris, France, 2000.
[5] M. Cettolo. Segmentation, classification and clustering of an Italian broadcast news corpus. In Proceedings of the RIAO International Conference, Paris, France, 2000.
[6] J. H. Conway and N. J. A. Sloane. Sphere Packing, Lattices and Groups. Springer Verlag, Berlin, Germany, 1988.
[7] P. Delacourt, D. Kryze, and C. J. Wellekens. Speaker-based segmentation for audio data indexing. In Proceedings of the ESCA ETRW workshop Accessing Information in Spoken Audio, Cambridge, UK, 1999.
[8] M. Federico, D. Giordani, and P. Coletti. Development and Evaluation of an Italian Broadcast News Corpus. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, 2000.
[9] D. Liu and F. Kubala. Fast speaker change detection for broadcast news transcription and indexing. In Proceedings of the 6th European Conference on Speech Communication and Technology, pages 1031-1034, Budapest, Hungary, 1999.
[10] J. Rissanen. Stochastic complexity. Journal of the Royal Statistical Society, B, 49(3):223-239, 1987.
[11] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464, 1978.
[12] G. A. F. Seber. Multivariate Observations. John Wiley & Sons, New York, NY, 1984.
[13] M. S. Srivastava and E. M. Carter. An Introduction to Applied Multivariate Statistics. North-Holland, New York, NY, 1988.
[14] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society, B, 49(3):240-265, 1987.
[15] S. Wegmann, P. Zhan, and L. Gillick. Progress in broadcast news transcription at Dragon Systems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, pages 33-36, Phoenix, AZ, 1999.

Figure 2: Results of experiments under controlled conditions. [Plots not reproduced: PRECISION versus RECALL for each criterion; MEAN SHIFTING panels (α = .1 to .5) on the left and VARIANCE SCALING panels (α = .9 to .5) on the right, one row of panels per dimension d.]

Figure 3: Precision vs. recall curve by different methods on an audio segmentation task. [Plot not reproduced: RECALL versus PRECISION; legend: aic, bic, caic, caicf, mdl, ml, mml, t2.]