Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors

Kazumasa Yamamoto
Department of Computer Science, Chubu University, Kasugai, Aichi, Japan
Email: yamamoto@cs.chubu.ac.jp

Chikara Ishikawa, Koya Sahashi, Seiichi Nakagawa
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan
Email: {c145302, k143331}@edu.tut.ac.jp, nakagawa@tut.jp

Abstract: Acoustic event detection plays an important role in computational acoustic scene analysis. Although sound overlap is common in real situations, conventional methods do not address it sufficiently. In this paper, we propose a new overlapped acoustic event detection technique that combines a source separation technique, non-negative matrix factorization (NMF) with shared basis vectors, with a deep neural network based acoustic model to improve detection performance. Our approach achieved a frame-based F-measure 20.0% (absolute) higher than the best result of the D-CASE 2012 challenge.

I. INTRODUCTION

Acoustic Event Detection (AED) plays an important role in Computational Acoustic Scene Analysis (CASA) [1]. Applications such as lifelog tagging, security systems combined with image processing, and noise pollution detection have been considered for this technique [2], [3], [4], [5]. To detect an acoustic event, two general approaches have mainly been used. The first uses acoustic models and features as in an automatic speech recognition (ASR) system [6]. ASR systems usually use Gaussian Mixture Model (GMM) or Deep Neural Network (DNN) based likelihood calculators with Hidden Markov Models (HMMs) as the acoustic model, and Mel-frequency cepstrum coefficients (MFCC) or log Mel-filterbank outputs (FBANK) as the acoustic features. The second is a source separation technique such as non-negative matrix factorization (NMF) [7].
The IEEE D-CASE 2012 workshop [8] was held in 2012 for CASA tasks. The challenge comprised two tasks: Acoustic Scene Classification and Acoustic Event Detection. AED had two subtasks, the Office Live (OL) task and the Office Synthetic (OS) task. Both subtasks assumed an office environment with 16 acoustic event classes. In the OL task, development and test tracks were recorded in a real room and contained no overlapping acoustic event segments. In the OS task, D-CASE provided development and test tracks that included overlapping event segments, produced by artificial synthesis. For these AED tasks, similarly to ASR methods, Vuegen et al. [9] proposed an MFCC-GMM based system to detect acoustic events for OL and OS. This system achieved a frame-based F-measure of 43.4% for OL and 13.5% for OS. Gemmeke et al. [10] took an NMF-HMM based approach, which gave the best performance on OS in the challenge: a frame-based F-measure of 31.4% for OL and 21.3% for OS. The performance of these detection methods was still low, and, as with ASR systems, the detection accuracy (F-measure) decreased in event-overlapping situations. The ability to detect multiple acoustic events in overlapping segments, which often occur in real applications, is an important factor. To improve detection accuracy in overlapping segments, several points must be considered. One is the complexity of an event's acoustic features when other acoustic events or background noise overlap it; this leads to mismatches between acoustic features and acoustic models and makes the event hard to detect. Another is the variety of acoustic events: this task includes many sound source classes, not only human speech. Some sound classes are partly very similar and therefore have similar basis vectors in their NMF basis matrices.
Therefore, it is hard for NMF to discriminate the particular characteristics of each class. In this paper, we propose a shared basis vector method for NMF to detect overlapped acoustic events. Conventional NMF works with a basis (dictionary) matrix and an activation matrix. Ideally, the basis vectors of different classes should differ from one another. In practice, however, the basis matrix contains many similar component bases across event classes, as described above, and the NMF process tends either to spread low activation weights equally over these bases or to give a high activation weight to only one of them, which leads to misdetections, especially for overlapped acoustic events. In such cases, a basis vector in a class-independent common basis matrix, shared among suitable classes, can be a better basis representation. We believe this helps the NMF process assign appropriate activation weights to the bases in overlapping segments. This paper is organized in four sections: the next section describes the conventional and proposed basis methods and our detection system, Section III shows the experimental results, and Section IV offers some conclusions.

II. ACOUSTIC EVENT DETECTION FRAMEWORK

Figure 1 shows the block diagram of our AED system. As preprocessing, the input signal (a mixed sound source signal) is separated into a signal for each event class by NMF. After separation, each sound spectrum is converted into acoustic features (MFCC or FBANK in this paper). To calculate a likelihood score for each event, the converted acoustic features are fed to a deep neural network acoustic model. The scores are compared with pre-defined thresholds, and an acoustic event is detected when its score exceeds the threshold.

A. Conventional NMF

Non-negative matrix factorization (NMF) is an effective tool for separating acoustic events in an overlapping segment [11].
With NMF, a time-frequency spectrogram matrix S is approximated by the product of a class basis matrix W and an activation matrix H, i.e. S ≈ Ŝ = WH. Denoting the number of frequency bins by L, the number of frames by T, and the total number of basis vectors by C, the matrices S and Ŝ are of size L × T, W is of size L × C, and H is of size C × T. When Wn and
978-1-5090-4045-2/17/$31.00 ©2017 IEEE
Fig. 1. Block diagram of our acoustic event detection system

Hn denote the basis matrix and the activation matrix for event class n (with N the number of classes), W and H can also be written as W = [W1, W2, ..., WN] and H = [H1^t, H2^t, ..., HN^t]^t (t denotes matrix transposition). The observed spectrogram can then be represented by W and H as the summation S ≈ Σ_{n=1}^{N} Wn Hn. In this paper, to separate a mixed signal into per-class signals, we used a Wiener-like separation filter:

Sn ≈ Ŝn = S · (Wn Hn) / (Σ_{m=1}^{N} Wm Hm),  (1)

where the multiplication and division are element-wise. In this paper, the LBG algorithm (i.e. vector quantization) was used to build a basis matrix representing a set of spectral bases for each event class [12].

B. NMF with shared basis vectors

The conventional NMF method assumes source identity, which is a very important factor for separating a mixed signal into target class signals. However, it does not work well when there are many target classes, because of basis similarity across classes. Indeed, when we applied a DNN acoustic model with conventional NMF to this task, we obtained worse results than the DNN without any separation method (see Section III).

Fig. 2. Distance between basis vectors within class and over classes

To examine the conventional basis vectors, Figure 2 visualizes the distances between basis vectors within each class and across classes. Each cell represents the Euclidean distance between two basis vectors; the basis matrix used here has 4 basis vectors per class. In the figure, color represents distance: red means far and blue means near. Unsurprisingly, the diagonal and the surrounding within-class 4 × 4 blocks are blue, since those vectors are identical or very close. However, many of the distances between bases are small even across classes. This similarity of basis vectors leads to inappropriate activation estimates for the individual bases.
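To make the conventional separation pipeline concrete, the activation estimation and the Wiener-like filter of Eq. (1) can be sketched as follows. This is a minimal NumPy sketch under our own simplifications: toy matrix sizes, standard KL-divergence multiplicative updates with a fixed basis, and illustrative function names.

```python
import numpy as np

def estimate_activations(S, W, n_iter=100, eps=1e-10):
    """Estimate the activation matrix H for a fixed basis matrix W by
    KL-divergence multiplicative updates, so that S ~= W @ H.
    S: (L, T) spectrogram, W: (L, C) concatenated class bases."""
    rng = np.random.default_rng(0)
    H = rng.random((W.shape[1], S.shape[1])) + eps
    for _ in range(n_iter):
        V = W @ H + eps                                   # current reconstruction
        H *= (W.T @ (S / V)) / (W.sum(axis=0)[:, None] + eps)
    return H

def wiener_separate(S, WH_per_class, eps=1e-10):
    """Eq. (1): per-class reconstruction with a Wiener-like mask.
    WH_per_class: list of (Wn @ Hn) arrays, one per event class."""
    total = sum(WH_per_class) + eps                       # denominator of Eq. (1)
    return [S * WH / total for WH in WH_per_class]
```

Splitting the jointly estimated H at the class boundaries gives the per-class Wn @ Hn terms; since the masks sum to one, the separated spectrograms add back up to S.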
Therefore, it is hard to obtain sufficient separation performance with conventional NMF for such data. To solve this problem, we propose NMF with shared basis vectors. This method uses a common basis matrix built by the LBG algorithm from the data of all classes. A basis vector in the common basis matrix is linked to basis vectors in the class-wise basis matrices, and the activation of a common basis vector is shared by all the classes linked to it. To link a basis vector in the common basis matrix with plural classes, we use the Euclidean distance between the common basis vector and a basis vector from the class-wise basis matrix. We compare two linking criteria between a common basis vector and a class-wise basis vector:

(a) Threshold selection (Figure 3): choose a common basis vector when the Euclidean distance between it and a class-wise basis vector is lower than a pre-defined threshold. A basis in the common basis matrix can thus be shared by several classes.

(b) Constant selection (Figure 4): choose the k nearest basis vectors in the common basis matrix for each class based on the Euclidean distance. The number of assigned basis vectors k is constant across all classes.

Fig. 3. Shared bases (a) - The number of bases for each class is variable depending on the threshold.
Fig. 4. Shared bases (b) - The number of bases for each class is constant.

In [13], [14], Komatsu et al. reported a way to build an improved basis matrix for NMF with the same motivation. However, we believe our method can construct more efficient basis vectors explicitly.
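The two linking criteria can be sketched as follows. This is a minimal NumPy sketch; basis vectors are stored as matrix columns, and reading criterion (b) as picking the k common bases nearest to any class-wise basis is our own interpretation, with illustrative function names.

```python
import numpy as np

def pairwise_dist(common_W, class_W):
    """Euclidean distances between common bases (columns of common_W) and
    class-wise bases (columns of class_W): shape (n_common, n_class)."""
    return np.linalg.norm(common_W[:, :, None] - class_W[:, None, :], axis=0)

def link_threshold(common_W, class_W, threshold):
    """Criterion (a): link every common basis whose distance to some
    class-wise basis is below the threshold (variable count per class)."""
    d = pairwise_dist(common_W, class_W)
    return sorted(set(np.where(d < threshold)[0].tolist()))

def link_constant(common_W, class_W, k):
    """Criterion (b): link the k common bases nearest to any class-wise
    basis (constant count per class)."""
    d = pairwise_dist(common_W, class_W)
    return np.argsort(d.min(axis=1))[:k].tolist()
```

With a low threshold, criterion (a) yields few links per class; criterion (b) always yields exactly k.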
Fig. 5. Score Plot (black line: silence, blue line: event (keys), red line: event (pendrop), top broken line: silence threshold, bottom broken line: event threshold)

C. Event detection

We use DNN output scores to detect acoustic events. The DNN score o_j for output unit (class) j is calculated as

o_j = Σ_{i=1}^{I} w_ij h_i + b_j,  (2)

where h_i is the value of unit i in the last hidden layer, w_ij the weight between output unit j and hidden unit i, b_j the bias of output unit j, and I the number of units in the last hidden layer. From o_j, the posterior p_j of unit j is calculated as

p_j = exp(o_j) / Σ_{j'=1}^{J} exp(o_j'),  (3)

where J is the number of output units (i.e. the number of classes). In this paper, we use p_j for silence detection and o_j for acoustic event detection. As an example, Figure 5 shows a score chart: the first vertical axis is the posterior p_j for silence, the second vertical axis is the DNN score o_j for the event classes, and the horizontal axis is the frame index. To detect acoustic events, we define two thresholds, one for silence and one for the event classes. We use the posterior only for silence detection, since posteriors vary depending on the number of simultaneous events. Additionally, we take a moving average of each class's score values to smooth out rapid score changes. Based on a preliminary experiment with the OL (Office Live) development tracks, the moving average window was set to ±9 frames. For comparison, we also evaluated NMF only and DNN only. For NMF only, we detected events by comparing the sum of the estimated activation weights of all basis vectors in each class with pre-defined thresholds. For DNN only, we used the DNN output values of Equation 2 for each class with pre-defined thresholds, without any source separation.

III. EXPERIMENTS

A. Experimental setup

We evaluated our method on IEEE D-CASE 2012 Task 2 OS (Office Synthetic) [8].
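The smoothing-and-thresholding decision step can be sketched as follows. This is a minimal NumPy sketch; the ±9-frame half-window matches the paper, but the function names, and the scores and thresholds in the usage example, are toy values of ours.

```python
import numpy as np

def smooth(scores, half_win=9):
    """Moving average over +/- half_win frames (shorter at the edges)."""
    T = len(scores)
    return np.array([scores[max(0, t - half_win):t + half_win + 1].mean()
                     for t in range(T)])

def detect_events(class_scores, thresholds, half_win=9):
    """Flag class j as active in frame t when its smoothed DNN score o_j
    exceeds the per-class threshold.
    class_scores: dict mapping class name -> 1-D array of frame scores."""
    return {name: smooth(s, half_win) > thresholds[name]
            for name, s in class_scores.items()}
```

For example, `detect_events({"keys": scores}, {"keys": 1.0})` returns a per-frame boolean activity mask for the "keys" class.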
The OS test tracks contain sequences of acoustic events, including overlapping events. In this task, 16 event classes are defined: alert, clear-throat, cough, door-slam, drawer, keyboard, keys, knock, laughter, mouse, page-turn, pen-drop, phone, printer, speech, and switch. The D-CASE 2012 challenge provides training, development, and test sets. The sound samples were recorded in a single channel at a rate of 44.1 kHz with 24-bit quantization. There are 20 training samples per class, with a total duration of about 15 minutes. As test tracks, the D-CASE 2012 challenge provides 12 tracks, each 2 minutes long. The number of reference frames is 99,981, of which 15,180 are overlapped event frames.

1) NMF: The basis matrix for conventional NMF was made of normalized amplitude spectra. We first applied the Fourier transform to obtain linear amplitude spectra, using a 20 ms Hamming analysis window with 50% overlap. From the spectra, we obtained a basis matrix for each class using the LBG algorithm, with the Euclidean distance as the distance measure between vectors. We adopted the KL divergence as the cost function for NMF [15]. In this experiment, we used four basis vectors per class. For our proposed basis matrix, we built a common basis matrix from the data of all classes. The number of basis vectors in this matrix was fixed at 64, and the number of basis vectors per class was set to 4 or 8. Using methods (a) and (b) described in Section II-B, we made links between basis vectors in the common basis matrix and basis vectors in each class basis matrix using the Euclidean distance. To find the optimum links, we used the following setups: (a) Threshold: 0.35, 0.40, 0.45, 0.50 (with 4 basis vectors per class) or 0.40, 0.45, 0.50, 0.55 (with 8 basis vectors per class). (b) Constant selection: select 4 or 5 basis vectors for each class.
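The LBG codebook construction used for both the class-wise and common basis matrices can be sketched as binary-split k-means. This is our own minimal sketch; the function and parameter names are illustrative, and n_codes is assumed to be a power of two.

```python
import numpy as np

def lbg_codebook(X, n_codes, n_iter=20, eps=1e-3):
    """LBG vector quantization: learn n_codes spectral basis vectors from
    training frames X of shape (n_frames, n_bins) by repeatedly splitting
    each centroid into a perturbed pair and refining with k-means."""
    codebook = X.mean(axis=0, keepdims=True)
    while codebook.shape[0] < n_codes:
        # split every centroid into a slightly perturbed pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            # assign each frame to its nearest centroid (Euclidean distance)
            d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            for c in range(codebook.shape[0]):
                members = X[assign == c]
                if len(members):
                    codebook[c] = members.mean(axis=0)
    return codebook
```

Running this on the pooled frames of all classes with n_codes = 64 would give the common basis matrix; running it per class with 4 or 8 codes gives the class-wise matrices.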
2) Acoustic features: Before feature extraction, spectral subtraction (SS) was applied to suppress background noise [16]. The subtraction coefficient was set to 2.0 and the flooring coefficient to 0.01. To estimate the noise spectrum, we used the noise regions annotated in the labels during training, and the first 100 frames during testing. As acoustic features, we used MFCC and FBANK, extracted with the same 20 ms Hamming analysis window with 50% overlap as for NMF. For MFCC, the number of Mel-filterbank filters was set to 33; we used 12-dimensional MFCCs with log power plus their temporal Δ and ΔΔ coefficients, giving a final feature vector of 39 dimensions per frame. For FBANK, the number of channels was set to 45; with Δ and ΔΔ coefficients computed as for MFCC, the final feature vector had 135 dimensions.

3) Acoustic model: As the acoustic model, we used a DNN. Since the original training data is not large enough to train a DNN acoustic model, we produced multi-condition training data by adding noise, which we recorded ourselves in several office rooms, to the original training data at SNRs of 20, 15, and 10 dB. We also added the OL development tracks to the multi-condition training set. We used a ±3 frame context as DNN input, so the input layer had 273 units for MFCC or 945 units for FBANK. Our DNN had 5 hidden layers with 512, 256, 128, 64, and 32 units, respectively. The output layer had 17 units corresponding to the 16 acoustic event classes and silence. We used the rectified linear function as the DNN unit activation and trained the DNN in a supervised manner without pre-training. We also evaluated a score combination of the two DNN acoustic models, for MFCC and FBANK.
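The noise suppression step can be sketched as follows. This is a minimal spectral subtraction sketch using the paper's coefficients (2.0 and 0.01) and a 100-frame noise estimate; the function name and the power-domain formulation are our own assumptions.

```python
import numpy as np

def spectral_subtraction(power_spec, noise_frames=100, alpha=2.0, beta=0.01):
    """Subtract alpha times the noise power (estimated from the leading
    noise_frames frames) from every frame, flooring the result at beta
    times the noise power. power_spec: (n_frames, n_bins)."""
    noise = power_spec[:noise_frames].mean(axis=0)   # per-bin noise estimate
    return np.maximum(power_spec - alpha * noise, beta * noise)
```

The flooring keeps the output strictly positive, which matters for the subsequent log-domain features (MFCC and FBANK).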
This method (called FUSION here) simply takes the maximum score from the two acoustic models.
We hope to improve event detection performance by reflecting the characteristics of the different acoustic models in the scores.

4) Evaluation metric: We followed the evaluation metrics of the D-CASE 2012 challenge, the frame-based Recall, Precision, and F-measure, judged every 10 ms. With C the number of correctly detected event frames, E the total number of detected event frames, and GT the number of event frames in the reference labels, the metrics are defined as:

Precision[%] = (C / E) × 100  (4)
Recall[%] = (C / GT) × 100  (5)
F[%] = 2 × Precision × Recall / (Precision + Recall)  (6)

We also used the R_overlap measure, which is the Recall computed only over the event-overlapping frames. We compared the proposed method with the previous results of the D-CASE 2012 challenge.

B. Experimental Results

TABLE I
D-CASE 2012 CHALLENGE RESULTS FOR OS

Measure  Baseline [8]  DHV [6]  GVV [10]  VVK [9]
F [%]    12.8          18.7     21.3      13.5

TABLE II
NMF-ONLY AND DNN-ONLY RESULTS

                         DNN only
Measure        NMF only  MFCC   FBANK  FUSION
R_overlap [%]  9.5       14.8   30.0   28.3
F [%]          14.4      27.3   33.0   36.5

Tables I and II show the D-CASE 2012 results and our NMF-only and DNN-only results for OS. Using the DNN improved the F-measure over the D-CASE 2012 challenge results, especially with FBANK and FUSION. With NMF only, however, we obtained an F-measure of only 14.4%, and when we combined conventional NMF with the DNN, the result was worse than the DNN alone (shown in Table IV). These results show that source separation based on conventional NMF does not work well here. Tables III and IV show the results of our proposed method. At their respective best settings, the proposed method achieved F-measures of 37.6% with MFCC, 37.4% with FBANK, and 41.3% with FUSION, and R_overlap was improved as well. As shown in Table IV, the shared basis method remarkably improved the F-measure over NMF-DNN without sharing, by 27.2% (absolute) for MFCC and 16.8% for FBANK.
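Eqs. (4)-(6) translate directly into code. This is a minimal sketch of ours in which per-frame activity is represented as boolean masks of shape (frames, classes); the function name is illustrative.

```python
import numpy as np

def frame_metrics(ref, hyp):
    """Frame-based Precision, Recall, and F-measure, Eqs. (4)-(6).
    ref, hyp: boolean activity masks of shape (n_frames, n_classes)."""
    C = np.logical_and(ref, hyp).sum()   # correctly detected event frames
    E = hyp.sum()                        # total detected event frames
    GT = ref.sum()                       # event frames in the reference
    precision = 100.0 * C / E if E else 0.0
    recall = 100.0 * C / GT if GT else 0.0
    f = 2 * precision * recall / (precision + recall) if C else 0.0
    return precision, recall, f
```

Restricting both masks to the overlapped frames before calling `frame_metrics` and reading off the Recall would give the R_overlap measure.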
The best result was obtained by FUSION with threshold selection (threshold = 0.45, 8 basis vectors per class). As shown in Table III, the threshold selection method outperforms the constant selection method. Among the shared basis vectors, the laughter class had the most links to common basis vectors, 17, while alert and switch shared only 5. On average, a class was linked to 11 common basis vectors, fewer links than in the constant selection case. We surmise that the thresholding avoids making links to unneeded basis vectors.

TABLE III
RESULTS OF THE SHARING BASIS METHOD (NMF-DNN), F-MEASURE [%]

Threshold selection
Threshold  #basis  MFCC  FBANK
0.35       4       36.7  35.9
0.40       4       36.2  35.4
0.45       4       35.7  33.8
0.50       4       36.3  36.4
0.40       8       29.6  35.3
0.45       8       34.2  37.4
0.50       8       32.0  36.0
0.55       8       37.6  36.5

Constant selection
Selection  #basis  MFCC  FBANK
4          4       30.7  31.2
5          4       29.6  34.0
4          8       30.7  31.2
5          8       28.1  24.8

TABLE IV
PERFORMANCE COMPARISON BETWEEN THE CONVENTIONAL AND PROPOSED METHODS

Method                     Feature  R_overlap [%]  F [%]
DNN                        FUSION   28.3           36.5
NMF-DNN w/o sharing basis  MFCC     8.76           10.4
                           FBANK    23.1           20.6
                           FUSION   32.1           17.2
Sharing basis              MFCC     31.4           34.2
(Threshold = 0.45,         FBANK    48.7           37.4
 #basis = 8)               FUSION   39.9           41.3
Sharing basis              MFCC     33.4           37.6
(Threshold = 0.55,         FBANK    36.3           36.5
 #basis = 8)               FUSION   39.6           40.5

IV. CONCLUSION

In this paper, we proposed shared basis vectors to improve NMF separation and AED performance for event-overlapping segments, and evaluated the proposed method on the D-CASE 2012 challenge. Compared with the previous results, we obtained a frame-based F-measure of 41.3% at best, an absolute improvement of 20% over the best previous challenge result. As future work, we are interested in other spectral reconstruction methods that use the original class-wise basis matrices, instead of the common basis matrix, to reconstruct the spectrum after performing the shared basis NMF. We also plan to evaluate our method on the newer AED challenge, D-CASE 2016.

REFERENCES

[1] D.
Wang and G. Brown, Eds., "Computational Auditory Scene Analysis: Principles, Algorithms and Applications." Hoboken, NJ: J. Wiley & Sons, 2006. [Online]. Available: http://opac.inria.fr/record=b1120769
[2] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound classification," in Proc. IEEE ICASSP 2015, 2015, pp. 171-175.
[3] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," in Proc. 2005 IEEE WASPAA, 2005, pp. 158-161.
[4] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Acoustic event detection in speech overlapping scenarios based on high-resolution spectral input and deep learning," IEICE Transactions on Information and Systems, vol. E98-D, 2015, pp. 1799-1807.
[5] K. Yamamoto and K. Itou, "Browsing audio life-log data using acoustic and location information," in Proc. IEEE UBICOMM '09, 2009, pp. 96-101.
[6] A. Diment, T. Heittola, and T. Virtanen, "Sound event detection for office live and office synthetic AASP challenge," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[7] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in Proc. IEEE ICASSP 2015, 2015, pp. 151-155.
[8] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, 2015, pp. 1733-1746.
[9] L. Vuegen, B. V. D. Broeck, P. Karsmakers, J. F. Gemmeke, B. Vanrumste, and H. Van hamme, "An MFCC-GMM approach for event detection and classification," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[10] J. F. Gemmeke, L. Vuegen, P. Karsmakers, and H. Van hamme, "An exemplar-based NMF approach to audio event detection," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[11] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2000, pp. 556-562.
[12] S. Nakano, K. Yamamoto, and S. Nakagawa, "Speech recognition in mixed sound of speech and music based on vector quantization and non-negative matrix factorization," in Proc. INTERSPEECH 2011, 2011, pp. 1781-1784.
[13] T. Komatsu, Y. Senda, and R. Kondo, "Acoustic event detection based on non-negative matrix factorization with mixtures of local dictionaries and activation aggregation," in Proc. IEEE ICASSP 2016, 2016, pp. 2259-2263.
[14] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, "Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries," DCASE2016 Challenge, Tech. Rep., September 2016.
[15] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS 2000, 2000, pp. 556-562.
[16] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979, pp. 113-120.