Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors


Kazumasa Yamamoto, Department of Computer Science, Chubu University, Kasugai, Aichi, Japan. Email: yamamoto@cs.chubu.ac.jp
Chikara Ishikawa, Koya Sahashi, Seiichi Nakagawa, Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan. Email: {c145302, k143331}@edu.tut.ac.jp, nakagawa@tut.jp

Abstract: Acoustic event detection plays an important role in computational acoustic scene analysis. Although overlapping sounds are common in real situations, conventional methods do not address this problem sufficiently. In this paper, we propose a new overlapped acoustic event detection technique that combines a source separation technique, non-negative matrix factorization (NMF) with shared basis vectors, with a deep neural network based acoustic model to improve detection performance. Our approach achieved a frame-based F-measure 20.0% (absolute) higher than the best result of the D-CASE 2012 challenge.

I. INTRODUCTION

Acoustic event detection (AED) plays an important role in computational acoustic scene analysis (CASA) [1]. Applications such as lifelog tagging, security systems combined with image processing, and noise pollution detection have been considered for this technique [2], [3], [4], [5].

To detect an acoustic event, two general approaches have mainly been used. The first uses acoustic models and features as in an automatic speech recognition (ASR) system [6]. ASR systems usually use Gaussian mixture models (GMMs) or a deep neural network (DNN) based likelihood calculator with hidden Markov models (HMMs) as the acoustic model, and Mel-frequency cepstral coefficients (MFCC) or log Mel-filterbank outputs (FBANK) as the acoustic features. The second is a source separation technique such as non-negative matrix factorization (NMF) [7].

The IEEE D-CASE 2012 workshop [8] was held for the CASA task in 2012. It comprised two tasks: acoustic scene classification and acoustic event detection. The AED task had two subtasks, the Office Live (OL) task and the Office Synthetic (OS) task, both assuming an office environment and 16 acoustic event classes. For the OL task, development and test tracks were recorded in a real room; these tracks contained no overlapped segments of acoustic events. For the OS task, D-CASE provided artificially synthesized development and test tracks that included overlapped segments of acoustic events.

For these AED tasks, following ASR methods, Vuegen et al. [9] proposed an MFCC-GMM based system for OL and OS. Its detection performance was 43.4% frame-based F-measure for OL and 13.5% for OS. Gemmeke et al. [10] took an NMF-HMM based approach, which gave the best OS performance in the challenge: 31.4% frame-based F-measure for OL and 21.3% for OS. The performance of these detection methods was still low, and, as in ASR systems, the detection accuracy (F-measure) decreased in event-overlapping situations. The ability to detect multiple acoustic events in overlapping segments, which often appear in real applications, is an important factor.

To improve the detection accuracy in event-overlapping segments, several points must be considered. One is the complexity of an event's acoustic features caused by overlap with other acoustic events or background noises.
Such overlap leads to mismatch between the acoustic features and the acoustic models, making the event difficult to detect. Another point is the variation of acoustic events: this task includes many sound source classes beyond human speech or voice. Some sound classes are partly very similar and therefore have similar basis vectors in their NMF basis matrices, which makes it hard for NMF to discriminate the particular characteristics of each class.

In this paper, we propose a shared basis vector method for NMF to detect overlapped acoustic events. Conventional NMF works with a basis (dictionary) matrix and an activation matrix. Ideally, the basis vectors of different classes would differ from each other. In practice, however, the basis matrix contains many similar component bases across event classes, as described above, and the NMF process tends either to spread low activation weights equally over these bases or to give a high activation weight to only one of them, which leads to misdetections, especially for overlapped acoustic events. In such a case, a basis vector in a common basis matrix over classes, shared among suitable classes, can be a better basis representation. We believe this helps the NMF process assign appropriate activation weights to the bases in overlapping segments.

This paper is organized in four sections. The next section describes the conventional and proposed basis methods and our detection system. Section III shows the experimental results. Finally, Section IV offers some conclusions.

II. ACOUSTIC EVENT DETECTION FRAMEWORK

Figure 1 shows the block diagram of our AED system. As preprocessing, the input signal (a mixed sound source signal) is separated into per-class sound signals by NMF. After the separation, each sound spectrum is converted into acoustic features (MFCC or FBANK in this paper). To calculate a likelihood score for each event, the converted acoustic features are fed to a deep neural network acoustic model. The scores are compared with pre-defined thresholds, and an acoustic event is detected when its score exceeds the threshold.

Fig. 1. Block diagram of our acoustic event detection system.

A. Conventional NMF

Non-negative matrix factorization (NMF) is an effective tool for separating acoustic events in an overlapping segment [11]. With NMF, a time-frequency spectrogram matrix $S$ is approximated by the product of a class basis matrix $W$ and an activation matrix $H$, i.e., $S \approx \hat{S} = WH$. Denoting the number of frequency bins by $L$, the number of frames by $T$, and the total number of basis vectors by $C$, the matrices $S$ and $\hat{S}$ are of size $L \times T$, $W$ is $L \times C$, and $H$ is $C \times T$. When $W_n$ and $H_n$ are the basis matrix and the activation matrix for event class $n$ (with $N$ the number of classes), $W$ and $H$ can be written as $W = [W_1, W_2, \ldots, W_N]$ and $H = [H_1^t, H_2^t, \ldots, H_N^t]^t$ ($t$ denotes matrix transpose). The observed spectrogram can then also be represented as the summation $S \approx \sum_{n=1}^{N} W_n H_n$. In this paper, to separate a mixed signal into per-class signals, we used a Wiener-like separation filter:

$$S_n \approx \hat{S}_n = S \odot \frac{W_n H_n}{\sum_{m=1}^{N} W_m H_m}, \qquad (1)$$

where $\odot$ and the fraction denote element-wise multiplication and division. The LBG algorithm (i.e., vector quantization) was used to build each class's basis matrix, which represents a set of spectral bases of the event class [12].
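As a concrete illustration of Section II-A, the following is a minimal Python sketch. It assumes the class basis matrices have already been built (e.g., with the LBG algorithm) and concatenated into a fixed dictionary; the KL-divergence cost matches the paper [15], but the function names, iteration count, and use of the standard Lee-Seung multiplicative update for the activations are our assumptions, not details given in the paper.

```python
import numpy as np

# W: fixed concatenated dictionary (L x C); S: magnitude spectrogram (L x T).
# Only H is updated, since the basis matrices are precomputed per class.

def estimate_activations(S, W, n_iter=100, eps=1e-10):
    H = np.random.rand(W.shape[1], S.shape[1])  # random non-negative init
    for _ in range(n_iter):
        WH = W @ H + eps
        # Lee-Seung KL update: H <- H * (W^T (S / WH)) / (W^T 1)
        H *= (W.T @ (S / WH)) / (W.sum(axis=0)[:, None] + eps)
    return H

def separate(S, W_blocks, H_blocks, eps=1e-10):
    """Eq. (1): Wiener-like filtering of S using each class's partial
    reconstruction W_n H_n over the total reconstruction."""
    total = sum(W @ H for W, H in zip(W_blocks, H_blocks)) + eps
    return [S * (W @ H) / total for W, H in zip(W_blocks, H_blocks)]
```

Here `W_blocks` and `H_blocks` would hold the per-class slices $W_n$ of the dictionary and the corresponding rows $H_n$ of the estimated activation matrix.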

B. NMF with shared basis vectors

The conventional NMF method assumes source identity, which is essential for separating a mixed signal into target class signals. However, it does not work well when there are many target classes, because of the basis similarity across classes. Indeed, when we applied a conventional NMF front end with a DNN acoustic model to this task, we obtained worse results than the DNN without any separation method (see Section III).

To examine the conventional basis vectors, Figure 2 visualizes the distances between basis vectors within each class and across classes. Each cell represents the Euclidean distance between two basis vectors; the basis matrix with four basis vectors per class is used for this figure. The color encodes the distance: red means a long distance, blue a short one. Unsurprisingly, the diagonal elements are blue and their surroundings within each 4x4 block are also close to blue, since these are the same or very similar vectors. However, many of the distances between bases are small even across classes. This similarity of basis vectors causes an inappropriate activation matrix to be generated for each basis, so it is hard to obtain sufficient separation performance with conventional NMF on such data.

Fig. 2. Distance between basis vectors within class and over classes.

To solve this problem, we propose NMF with shared basis vectors. This method uses a common basis matrix built by the LBG algorithm from the data of all classes. A basis vector in the common basis matrix is linked to basis vectors in the class-wise basis matrices, and the activation of a common basis vector is shared by all classes linked to it. To link a basis vector in the common basis matrix with plural classes, we use the Euclidean distance between a common basis vector and a class-wise basis vector. We compare two linking criteria:

(a) Threshold selection (Figure 3): a common basis vector is linked to a class when its Euclidean distance to a class-wise basis vector is lower than a pre-defined threshold. A basis in the common basis matrix can thus be shared by several classes.

(b) Constant selection (Figure 4): the k nearest basis vectors in the common basis matrix are chosen for each class based on the Euclidean distance. The number of assigned basis vectors k is constant across all classes.

Fig. 3. Shared bases (a): the number of bases for each class varies depending on the threshold.

Fig. 4. Shared bases (b): the number of bases for each class is constant.

In [13], [14], Komatsu et al. reported, with the same motivation, how to build an improved basis matrix for NMF.
However, we believe that our method constructs more effective basis vectors explicitly; a sketch of the two linking criteria follows.
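The sketch below implements the two linking criteria of Section II-B, assuming NumPy arrays with basis vectors as columns. The function names, and the reading of criterion (b) as "k nearest common bases per class", are our assumptions.

```python
import numpy as np

# common_W: common basis matrix (L x C_common), built from all classes.
# class_W:  list of class-wise basis matrices, each (L x C_n).

def _pairwise_dist(common_W, W_n):
    # (C_common x C_n) Euclidean distances between basis vectors
    return np.linalg.norm(common_W[:, :, None] - W_n[:, None, :], axis=0)

def link_by_threshold(common_W, class_W, threshold):
    """Criterion (a): link a common basis vector to a class when its
    distance to some class-wise basis vector is below the threshold."""
    links = []
    for W_n in class_W:
        d = _pairwise_dist(common_W, W_n)
        links.append(np.where((d < threshold).any(axis=1))[0])
    return links  # per class: indices of linked common basis vectors

def link_k_nearest(common_W, class_W, k):
    """Criterion (b): link each class to its k nearest common basis
    vectors, so every class gets the same number of bases."""
    links = []
    for W_n in class_W:
        d = _pairwise_dist(common_W, W_n)
        links.append(np.argsort(d.min(axis=1))[:k])
    return links
```

With criterion (a), a common basis vector whose index appears in several classes' lists is exactly a shared basis: its activation contributes to all of those classes.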

C. Event detection

We use DNN output scores to detect acoustic events. The DNN score $o_j$ of output unit (class) $j$ is calculated as

$$o_j = \sum_{i=1}^{I} w_{ij} h_i + b_j, \qquad (2)$$

where $h_i$ is the value of unit $i$ in the last hidden layer, $w_{ij}$ the weight between output unit $j$ and hidden unit $i$, $b_j$ the bias of output unit $j$, and $I$ the number of units in the last hidden layer. From $o_j$, the posterior $p_j$ of unit $j$ is calculated as

$$p_j = \frac{\exp(o_j)}{\sum_{j'=1}^{J} \exp(o_{j'})}, \qquad (3)$$

where $J$ is the number of output units (i.e., the number of classes). In this paper, we use $p_j$ for silence detection and $o_j$ for acoustic event detection.

As an example, Figure 5 shows a score chart. The first vertical axis is the posterior $p_j$ for silence, the second vertical axis is the DNN score $o_j$ for the event classes, and the horizontal axis is the frame index.

Fig. 5. Score plot (black line: silence, blue line: event (keys), red line: event (pendrop), top broken line: silence threshold, bottom broken line: event threshold).

To detect acoustic events, we define two thresholds, one for silence and one for the event classes. We use the posterior only for silence detection, since a posterior varies depending on the number of simultaneous events. Additionally, we smooth each class's score values with a moving average to avoid rapid score changes. Based on a preliminary experiment with the OL (Office Live) development tracks, the moving average window size was set to ±9 frames.

For comparison, we also used NMF alone and DNN alone. With NMF alone, events were detected by comparing the summation of the estimated activation weights over all basis vectors of each class against pre-defined thresholds. With DNN alone, the DNN output values of Equation (2) were compared against pre-defined thresholds for each class, without any source separation.
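Below is a minimal sketch of the smoothing and thresholding step of Section II-C. It assumes `scores` is a (T x J) array of per-frame class scores $o_j$, with the silence column already converted to the posterior $p_j$ of Eq. (3); the array layout and function names are our assumptions, while the ±9-frame window follows the paper.

```python
import numpy as np

def smooth(scores, half_window=9):
    """Per-class moving average over +/-half_window frames (the paper
    chose +/-9 from preliminary OL development experiments)."""
    kernel = np.ones(2 * half_window + 1) / (2 * half_window + 1)
    return np.stack([np.convolve(scores[:, j], kernel, mode="same")
                     for j in range(scores.shape[1])], axis=1)

def detect(scores, thresholds):
    """A frame is assigned every class whose smoothed score exceeds its
    threshold, so multiple overlapping events can fire at once."""
    return smooth(scores) > np.asarray(thresholds)[None, :]  # (T x J) bool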
III. EXPERIMENTS

A. Experimental setup

We evaluated our method on the IEEE D-CASE 2012 Task 2 OS (Office Synthetic) task [8]. The OS test tracks include acoustic event sequences with overlapped events. Sixteen event classes are defined: alert, clear-throat, cough, door-slam, drawer, keyboard, keys, knock, laughter, mouse, page-turn, pen-drop, phone, printer, speech, and switch. The D-CASE 2012 challenge provides training, development, and test sets. The sound samples were recorded in a single channel at 44.1 kHz with 24-bit quantization. There are 20 training samples per class, with a total duration of about 15 minutes. As test tracks, the challenge provides 12 tracks of 2 minutes each. The number of reference frames is 99,981, of which 15,180 are overlapped event frames.

1) NMF: The basis matrix for conventional NMF was built from normalized amplitude spectra. We first applied the Fourier transform to obtain linear amplitude spectra, using a 20 ms Hamming analysis window with 50% overlap. From these spectra, we obtained a basis matrix for each class with the LBG algorithm, using the Euclidean distance as the distance measure between vectors. We adopted the KL divergence as the cost function for NMF [15]. In this experiment, we used four basis vectors per class. For our proposed method, we built a common basis matrix from the data of all classes; the number of basis vectors in this matrix was fixed at 64, and the number of basis vectors per class was set to 4 or 8. Using methods (a) and (b) described in Section II-B, we linked basis vectors in the all-class basis matrix to basis vectors in each class-wise basis matrix by Euclidean distance. To find the optimum links, we used the following setups: (a) thresholds of 0.35, 0.40, 0.45, or 0.50 (with 4 basis vectors per class) and 0.40, 0.45, 0.50, or 0.55 (with 8 basis vectors per class); (b) constant selection of 4 or 5 basis vectors per class.

2) Acoustic features: Before feature extraction, spectral subtraction (SS) was applied to suppress the background noise [16]; a sketch of this step follows the acoustic model description below. The subtraction coefficient was set to 2.0 and the flooring coefficient to 0.01. To estimate the noise spectrum, we used the noise regions given by the labels in training, and the first 100 frames in testing. As acoustic features, we used MFCC and FBANK, extracted with the same 20 ms Hamming window and 50% overlap as for NMF. For MFCC, the Mel filterbank had 33 filters; we used 12-dimensional MFCCs with log power, plus their temporal derivatives (deltas and delta-deltas), giving a final feature vector of 39 dimensions per frame. For FBANK, the number of channels was set to 45; with deltas and delta-deltas, the final feature vector had 135 dimensions.

3) Acoustic model: We used a DNN as the acoustic model. Since the original training data is not large enough to train a DNN, we produced multi-condition training data by corrupting the original training data with noise, originally recorded in several office rooms, at SNRs of 20, 15, and 10 dB. We also added the OL development tracks to the multi-condition training set. We used a ±3-frame context as DNN input, so the input layer had 273 units for MFCC or 945 units for FBANK. The DNN had five hidden layers of 512, 256, 128, 64, and 32 units, respectively, and an output layer of 17 units corresponding to the 16 acoustic event classes and silence. We used the rectified linear function as the unit activation and trained the DNN in a supervised manner without pre-training. We also evaluated a score combination of the two DNN acoustic models for MFCC and FBANK (called FUSION here), which simply takes the maximum score of the two models. We hope to improve detection performance by reflecting the different characteristics of the two acoustic models in the scores.
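The following is a minimal sketch of the spectral subtraction preprocessing referenced above [16]. The coefficients (2.0, 0.01) and the noise estimate from the first 100 test frames follow the paper; operating on magnitude spectra and flooring against the observed spectrum are our assumptions, since the paper does not specify these details.

```python
import numpy as np

# spec: (T x L) magnitude spectrogram; in training, the noise estimate
# would instead come from the labeled noise regions.

def spectral_subtraction(spec, noise_frames=100, alpha=2.0, beta=0.01):
    noise = spec[:noise_frames].mean(axis=0)  # average noise spectrum
    clean = spec - alpha * noise              # over-subtraction (alpha = 2.0)
    return np.maximum(clean, beta * spec)     # flooring avoids negative values
```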

4) Evaluation metrics: We followed the evaluation metrics of the D-CASE 2012 challenge, i.e., the frame-based Precision, Recall, and F-measure, judged every 10 ms. With $C$ the number of correctly detected event frames, $E$ the total number of detected event frames, and $GT$ the number of event frames in the reference labels, the metrics are defined as:

$$Precision[\%] = \frac{C}{E} \times 100, \qquad (4)$$

$$Recall[\%] = \frac{C}{GT} \times 100, \qquad (5)$$

$$F[\%] = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. \qquad (6)$$

We also used the $R_{overlap}$ measure, which is the Recall computed only over the event-overlapping frames. We compared the proposed method with the previous results of the D-CASE 2012 challenge.
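The frame-based metrics of Eqs. (4)-(6) can be computed directly from boolean frame-label arrays, as in this short sketch; the (T x J) array layout and names are our assumptions.

```python
import numpy as np

def frame_based_scores(ref, hyp):
    """ref, hyp: boolean (T x J) arrays of reference and detected event
    labels, one column per class, one row per 10 ms frame."""
    C = np.logical_and(ref, hyp).sum()  # correctly detected event frames
    E = hyp.sum()                       # total detected event frames
    GT = ref.sum()                      # event frames in the reference
    precision = 100.0 * C / E
    recall = 100.0 * C / GT
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```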
B. Experimental Results

Tables I and II show the D-CASE 2012 challenge results and our NMF-only and DNN-only results for OS.

TABLE I
D-CASE 2012 CHALLENGE RESULTS FOR OS

  Measure   Baseline [8]   DHV [6]   GVV [10]   VVK [9]
  F [%]     12.8           18.7      21.3       13.5

TABLE II
NMF-ONLY AND DNN-ONLY RESULTS

  Measure          NMF only   DNN (MFCC)   DNN (FBANK)   DNN (FUSION)
  R_overlap [%]    9.5        14.8         30.0          28.3
  F [%]            14.4       27.3         33.0          36.5

Using the DNN gave an F-measure improvement over the D-CASE 2012 challenge results, especially with FBANK and FUSION. With NMF only, however, we obtained just 14.4% F-measure, and when we combined conventional NMF with the DNN, the result was worse than the DNN alone (shown in Table IV). These results show that source separation based on conventional NMF does not work well here.

Tables III and IV show the results of our proposed method. At best, it achieves 37.6% F-measure with MFCC, 37.4% with FBANK, and 41.3% with FUSION, and R_overlap improved as well. As shown in Table IV, the sharing basis method improved the F-measure remarkably over NMF-DNN without sharing: by 27.2% absolute for MFCC and 16.8% for FBANK. The best result came from FUSION with threshold selection (threshold = 0.45, 8 basis vectors per class). As Table III shows, the threshold selection method is better than constant selection. Among the shared basis vectors, the laughter class had the most links to common bases (17), while alert and switch had only 5; on average, a class was linked to 11 common basis vectors, fewer links than in the constant selection case. We surmise that the thresholding avoids links to unneeded basis vectors.

TABLE III
RESULTS OF THE SHARING BASIS METHOD (NMF-DNN), F-MEASURE [%]

  Threshold selection
  Threshold   #basis   MFCC   FBANK
  0.35        4        36.7   35.9
  0.40        4        36.2   35.4
  0.45        4        35.7   33.8
  0.50        4        36.3   36.4
  0.40        8        29.6   35.3
  0.45        8        34.2   37.4
  0.50        8        32.0   36.0
  0.55        8        37.6   36.5

  Constant selection
  Selection   #basis   MFCC   FBANK
  4           4        30.7   31.2
  5           4        29.6   34.0
  4           8        30.7   31.2
  5           8        28.1   24.8

TABLE IV
PERFORMANCE COMPARISON BETWEEN THE CONVENTIONAL AND PROPOSED METHODS

  Method                                          Feature   R_overlap [%]   F [%]
  DNN                                             FUSION    28.3            36.5
  NMF-DNN w/o sharing basis                       MFCC      8.76            10.4
                                                  FBANK     23.1            20.6
                                                  FUSION    32.1            17.2
  Sharing basis (threshold = 0.45, #basis = 8)    MFCC      31.4            34.2
                                                  FBANK     48.7            37.4
                                                  FUSION    39.9            41.3
  Sharing basis (threshold = 0.55, #basis = 8)    MFCC      33.4            37.6
                                                  FBANK     36.3            36.5
                                                  FUSION    39.6            40.5

IV. CONCLUSION

In this paper, we proposed the sharing basis vector method to improve NMF separation and AED performance in event-overlapping segments, and evaluated it on the D-CASE 2012 challenge. Compared with the previous results, we obtained a frame-based F-measure of 41.3% at best, an absolute improvement of 20% over the best previous challenge result. As future work, we are interested in spectral reconstruction methods that, after performing the sharing basis NMF, use the original class-wise basis matrices instead of the common basis matrix to reconstruct the spectrum. We also plan to evaluate our method on the newer AED challenge, D-CASE 2016.

REFERENCES

[1] D. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken, NJ: J. Wiley & Sons, 2006. [Online]. Available: http://opac.inria.fr/record=b1120769
[2] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound classification," in Proc. IEEE ICASSP 2015, 2015, pp. 171-175.
[3] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," in Proc. 2005 IEEE WASPAA, 2005, pp. 158-161.
[4] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Acoustic event detection in speech overlapping scenarios based on high-resolution spectral input and deep learning," IEICE Transactions on Information and Systems, vol. E98-D, pp. 1799-1807, 2015.
[5] K. Yamamoto and K. Itou, "Browsing audio life-log data using acoustic and location information," in Proc. UBICOMM '09, 2009, pp. 96-101.
[6] A. Diment, T. Heittola, and T. Virtanen, "Sound event detection for office live and office synthetic AASP challenge," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[7] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in Proc. IEEE ICASSP 2015, 2015, pp. 151-155.
[8] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, pp. 1733-1746, 2015.

[9] L. Vuegen, B. Van Den Broeck, P. Karsmakers, J. F. Gemmeke, B. Vanrumste, and H. Van hamme, "An MFCC-GMM approach for event detection and classification," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[10] J. F. Gemmeke, L. Vuegen, P. Karsmakers, and H. Van hamme, "An exemplar-based NMF approach to audio event detection," in IEEE AASP Challenge: Detection of Acoustic Scenes and Events, 2013, pp. 1-3.
[11] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS, 2000, pp. 556-562.
[12] S. Nakano, K. Yamamoto, and S. Nakagawa, "Speech recognition in mixed sound of speech and music based on vector quantization and non-negative matrix factorization," in Proc. INTERSPEECH 2011, 2011, pp. 1781-1784.
[13] T. Komatsu, Y. Senda, and R. Kondo, "Acoustic event detection based on non-negative matrix factorization with mixtures of local dictionaries and activation aggregation," in Proc. IEEE ICASSP 2016, 2016, pp. 2259-2263.
[14] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, "Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries," DCASE2016 Challenge, Tech. Rep., September 2016.
[15] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in NIPS 2000, 2000, pp. 556-562.
[16] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 113-120, 1979.