Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors


Kazumasa Yamamoto
Department of Computer Science, Chubu University, Kasugai, Aichi, Japan

Chikara Ishikawa, Koya Sahashi, Seiichi Nakagawa
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Aichi, Japan

Abstract—Acoustic event detection plays an important role in computational acoustic scene analysis. Although sounds frequently overlap in real situations, conventional methods do not adequately address this problem. In this paper, we propose a new overlapped acoustic event detection technique that combines a source separation technique, non-negative matrix factorization with shared basis vectors, with a deep neural network based acoustic model to improve detection performance. Our approach achieves a frame-based F-measure 20.0% (absolute) higher than the best result of the D-CASE 2012 challenge.

I. INTRODUCTION

Acoustic event detection (AED) plays an important role in computational acoustic scene analysis (CASA) [1]. Applications such as lifelog tagging, security systems combined with image processing, and noise pollution detection have been considered for this technique [2], [3], [4], [5]. To detect an acoustic event, two general approaches have mainly been used. One is the use of acoustic models and features as in an automatic speech recognition (ASR) system [6]. ASR systems usually use Gaussian mixture model (GMM) or deep neural network (DNN) based likelihood calculators with hidden Markov models (HMMs) as the acoustic model, and Mel-frequency cepstral coefficients (MFCC) or log Mel-filterbank outputs (FBANK) as the acoustic features. The other is a source separation technique such as non-negative matrix factorization (NMF) [7].
The IEEE D-CASE 2012 workshop [8] was held for the CASA task. The challenge comprised two tasks: acoustic scene classification and acoustic event detection. AED had two subtasks, the Office Live (OL) task and the Office Synthetic (OS) task. For both subtasks, an office environment with 16 acoustic event classes was assumed as the acoustic condition. In the OL task, development and test tracks were recorded in a real room and contained no overlapped segments of acoustic events. In the OS task, D-CASE provided development and test tracks that included overlapped segments of acoustic events, created by artificially synthesizing the tracks.

For these AED tasks, following ASR methods, Vuegen et al. [9] proposed an MFCC-GMM based system for OL and OS, which achieved frame-based F-measures of 43.4% for OL and 13.5% for OS. Gemmeke et al. [10] took an NMF-HMM based approach, which gave the best OS performance in the challenge: 31.4% frame-based F-measure for OL and 21.3% for OS. The performance of these detection methods was still low, and, as in ASR systems, the detection accuracy (F-measure) degraded in event-overlapping situations. The ability to detect multiple acoustic events in overlapping segments, which often occur in real applications, is therefore an important factor.

To improve the detection accuracy in event-overlapping segments, several points must be considered. One is the complexity of an event's acoustic features when it overlaps with other acoustic events or background noise, which causes mismatch between the acoustic features and the acoustic models and makes the event difficult to detect. Another is the variety of acoustic events: this task includes many sound source classes, not only human speech or voice. Some sound classes are partly very similar and have similar basis vectors in their NMF basis matrices.
Therefore, it is hard for NMF to discriminate the particular characteristics of each class. In this paper, we propose a shared basis vector method for NMF to detect overlapped acoustic events. Conventional NMF works with a basis (dictionary) matrix and an activation matrix. Ideally, the basis vectors of different classes would be distinct from one another. In practice, however, the basis matrix contains many similar component bases across event classes, as described above, and the NMF process tends either to spread low activation weights equally over these bases or to give a high activation weight to only one of them, which leads to misdetections, especially for overlapped acoustic events. In such a case, a basis vector in a common basis matrix that is shared among suitable classes can be a better basis representation. We believe this method helps the NMF process assign appropriate activation weights to the bases in overlapping segments.

This paper is organized in four sections. The next section describes the conventional and proposed basis methods and our detection system. Section III presents the experimental results. Finally, Section IV offers conclusions.

II. ACOUSTIC EVENT DETECTION FRAMEWORK

Figure 1 shows the block diagram of our AED system. As preprocessing, the input signal (a mixed sound source signal) is separated into a sound signal for each event class by NMF. After separation, each sound spectrum is converted into acoustic features (MFCC or FBANK in this paper). To calculate a likelihood score for each event, the converted acoustic features are fed to a deep neural network acoustic model. An acoustic event is detected when its score exceeds a pre-defined threshold.

A. Conventional NMF

Non-negative matrix factorization (NMF) is an effective tool for separating acoustic events in overlapping segments [11].
By using NMF, a time-frequency spectrogram matrix S can be approximated by the product of a class basis matrix W and an activation matrix H, i.e., S ≈ Ŝ = WH. With L frequency bins, T frames, and C total basis vectors, S and Ŝ are L × T, W is L × C, and H is C × T.

Fig. 1. Block diagram of our acoustic event detection system

When W_n and H_n are the basis matrix and the activation matrix for event class n (N denotes the number of classes), W and H can also be written as W = [W_1, W_2, ..., W_N] and H = [H_1^t, H_2^t, ..., H_N^t]^t (t denotes matrix transpose). The observed spectrogram can then be represented as the sum S ≈ Σ_{n=1}^{N} W_n H_n. In this paper, to separate a mixed signal into class signals, we used a Wiener-like separation filter:

    Ŝ_n = S ⊗ (W_n H_n) / (Σ_{m=1}^{N} W_m H_m),    (1)

where ⊗ and the fraction denote element-wise multiplication and division. The LBG algorithm (i.e., vector quantization) was used to build a basis matrix representing the set of spectral bases of each event class [12].

B. NMF with shared basis vectors

The conventional NMF method assumes source identity, which is a very important factor for separating a mixed signal into target class signals. However, it does not work well when there are many target classes, because basis vectors are similar across classes. Indeed, when we applied a DNN acoustic model with conventional NMF on this task, we obtained worse results than the DNN without any separation method (see Section III). To examine the conventional basis vectors, Figure 2 visualizes the Euclidean distances between basis vectors within each class and across classes, using a basis matrix with 4 basis vectors per class. In the figure, color encodes distance: red means far and blue means near. It is no surprise that the diagonal is blue and that the surrounding cells within each 4 × 4 block are also close to blue, since these are the same or very similar vectors. However, many of the distances between bases are small even across classes. This similarity of basis vectors leads to inappropriate activation weights for each basis.

Fig. 2. Distance between basis vectors within class and over classes
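A minimal sketch of the supervised separation of Section II-A, assuming toy dimensions and random non-negative bases in place of the LBG-trained ones; only the activations are estimated, with the standard multiplicative update for the KL divergence, and the class spectrograms are recovered with the Wiener-like filter of Eq. (1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: L frequency bins, T frames, N classes, K bases per class.
L, T, N, K = 16, 30, 2, 4

# Fixed per-class basis matrices W_n (random here; the paper trains them
# with the LBG algorithm on each class's spectra).
W = [np.abs(rng.normal(size=(L, K))) + 1e-3 for _ in range(N)]
W_all = np.hstack(W)                          # L x (N*K)

S = np.abs(rng.normal(size=(L, T))) + 1e-3    # mixed magnitude spectrogram

# Estimate activations H with multiplicative KL-divergence updates,
# keeping the bases fixed (supervised NMF).
H = rng.random((N * K, T)) + 1e-3
for _ in range(100):
    V = W_all @ H
    H *= (W_all.T @ (S / V)) / W_all.T.sum(axis=1, keepdims=True)

# Wiener-like separation filter (Eq. 1): scale the mixture by the ratio
# of each class's reconstruction to the total reconstruction.
recon = [W[n] @ H[n * K:(n + 1) * K] for n in range(N)]
total = sum(recon)
S_hat = [S * r / total for r in recon]        # class-wise spectrograms
```

By construction, the class-wise estimates sum back to the observed spectrogram, which is the defining property of this kind of Wiener-like mask.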
Therefore, it is hard to obtain sufficient separation performance with conventional NMF on such data. To solve this problem, we propose NMF with shared basis vectors. This method uses a common basis matrix built by the LBG algorithm from the data of all classes. A basis vector in the common basis matrix is linked to basis vectors in the class-wise basis matrices, and the activation of a common basis vector is shared by all the classes linked to it. To link a basis vector in the common basis matrix with plural classes, we use the Euclidean distance between that vector and the basis vectors of the class-wise basis matrices. We compare two linking criteria between a common basis vector and a class-wise basis vector:

(a) Threshold selection (Figure 3): a common basis vector is linked to a class when its Euclidean distance to one of that class's basis vectors is lower than a pre-defined threshold. A basis vector in the common basis matrix can thus be shared by several classes.

(b) Constant selection (Figure 4): the k nearest basis vectors in the common basis matrix are selected for each class based on the Euclidean distance. The number of assigned basis vectors k is constant across all classes.

Fig. 3. Shared bases (a) - the number of bases for each class varies with the threshold.

Fig. 4. Shared bases (b) - the number of bases for each class is constant.

In [13], [14], Komatsu et al. reported a way to build an improved basis matrix for NMF with the same motivation. However, we believe that our method builds more effective basis vectors explicitly.
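As an illustrative sketch of the two linking criteria described above (toy sizes; random vectors stand in for the LBG-trained class-wise and common bases):

```python
import numpy as np

rng = np.random.default_rng(1)

L, N, K, C = 16, 3, 4, 8   # bins, classes, bases per class, common-basis size

# Class-wise basis matrices and a common basis matrix built from all data
# (both come from the LBG algorithm in the paper; random here).
class_bases = rng.random((N, K, L))
common = rng.random((C, L))

def link_by_threshold(class_bases, common, thr):
    """Criterion (a): link a common basis vector to a class when it lies
    within Euclidean distance `thr` of any of that class's basis vectors."""
    links = []
    for cb in class_bases:
        d = np.linalg.norm(cb[:, None, :] - common[None, :, :], axis=2)
        links.append(sorted(set(np.where(d.min(axis=0) < thr)[0])))
    return links

def link_k_nearest(class_bases, common, k):
    """Criterion (b): link each class to its k nearest common basis vectors."""
    links = []
    for cb in class_bases:
        d = np.linalg.norm(cb[:, None, :] - common[None, :, :], axis=2)
        links.append(sorted(np.argsort(d.min(axis=0))[:k].tolist()))
    return links

links_a = link_by_threshold(class_bases, common, thr=1.5)  # variable count
links_b = link_k_nearest(class_bases, common, k=4)         # constant count
```

Under criterion (a) the number of linked bases varies per class with the threshold, while under criterion (b) every class receives exactly k links, mirroring Figures 3 and 4.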

Fig. 5. Score plot (black line: silence; blue line: event (keys); red line: event (pendrop); top broken line: silence threshold; bottom broken line: event threshold)

C. Event detection

We use DNN output scores to detect acoustic events. The DNN score o_j of output unit (class) j is calculated as

    o_j = Σ_{i=1}^{I} w_ij h_i + b_j,    (2)

where h_i is the value of unit i in the last hidden layer, w_ij the weight between output unit j and hidden unit i, b_j the bias of output unit j, and I the number of units in the last hidden layer. From o_j, the posterior p_j of unit j is calculated as

    p_j = exp(o_j) / Σ_{j'=1}^{J} exp(o_{j'}),    (3)

where J is the number of output units (i.e., the number of classes). In this paper, we use p_j for silence detection and o_j for acoustic event detection. As an example, Figure 5 shows a score chart: the first vertical axis is the posterior p_j for silence, the second vertical axis is the DNN score o_j for the event classes, and the horizontal axis is the frame index. To detect acoustic events, we define two thresholds, one for the silence class and one for the event classes. We use the posterior only for silence detection, since posteriors vary with the number of simultaneous events. Additionally, we smooth each class score with a moving average to avoid rapid score changes. Based on a preliminary experiment on the OL (Office Live) development tracks, the moving average window size was set to ±9 frames.

For comparison, we also used NMF only or DNN only. In the NMF-only case, we detected events by thresholding the sum of estimated activation weights over all basis vectors of each class. In the DNN-only case, we thresholded the DNN output values of Equation 2 for each class, without any source separation processing.

III. EXPERIMENTS

A. Experimental setup

We evaluate our method on the IEEE D-CASE 2012 Task 2 OS (Office Synthetic) subtask [8].
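The detection logic of Section II-C can be sketched as follows, with random scores standing in for DNN outputs and illustrative threshold values (the paper's thresholds are tuned per class and not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

T, J = 200, 17                      # frames; 16 event classes + silence
scores = rng.normal(size=(T, J))    # stand-in for DNN output scores o_j

def softmax(o):
    """Posteriors p_j of Eq. (3), computed row-wise with a stability shift."""
    e = np.exp(o - o.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def smooth(x, half_win=9):
    """Moving average over +/- half_win frames (shorter at the edges)."""
    out = np.empty_like(x)
    for t in range(len(x)):
        lo, hi = max(0, t - half_win), min(len(x), t + half_win + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out

post = smooth(softmax(scores))      # posteriors p_j, used for silence
raw = smooth(scores)                # raw scores o_j, used for events

SIL_THR, EVT_THR = 0.5, 0.8         # illustrative thresholds
silence = post[:, 0] > SIL_THR      # column 0 taken as the silence class
events = raw[:, 1:] > EVT_THR       # several events may fire per frame
```

Thresholding the raw scores per class, rather than the posteriors, lets multiple overlapping events exceed their thresholds in the same frame.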
The test tracks of OS include acoustic event sequences with overlapped events. In this task, 16 event classes are defined: alert, clear-throat, cough, door-slam, drawer, keyboard, keys, knock, laughter, mouse, page-turn, pen-drop, phone, printer, speech, and switch. The D-CASE 2012 challenge provides training, development, and test sets. The sound samples were recorded in a single channel at a 44.1 kHz sampling rate with 24-bit quantization. There are 20 training samples per class, with a total duration of about 15 minutes. As test tracks, the challenge provides 12 tracks of 2 minutes each. The number of reference frames is 99,981, of which 15,180 are overlapped event frames.

1) NMF: The basis matrix for conventional NMF was made of normalized amplitude spectra. We first applied the Fourier transform to obtain linear amplitude spectra, using a 20 ms Hamming analysis window with 50% overlap. From the spectra, we built a basis matrix for each class using the LBG algorithm, with the Euclidean distance as the distance measure. We adopted the KL divergence as the NMF cost function [15]. In this experiment, we used four basis vectors per class.

For our proposed basis matrix, we built a common basis matrix from the data of all classes. The number of basis vectors in this matrix was fixed at 64, and the number of basis vectors per class was set to 4 or 8. Using methods (a) and (b) described in Section II-B, we made links between basis vectors in the common (all-class) basis matrix and basis vectors in each class-wise basis matrix using the Euclidean distance. To find the optimum links, we used the following setups:
(a) Threshold: 0.35, 0.40, 0.45, 0.50 (with 4 basis vectors per class) or 0.40, 0.45, 0.50, 0.55 (with 8 basis vectors per class).
(b) Constant selection: select 4 or 5 basis vectors per class.
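The LBG codebook training used above to build the basis matrices can be sketched as binary-split vector quantization under the Euclidean distance; sizes and data here are toy values:

```python
import numpy as np

rng = np.random.default_rng(3)

def lbg(data, n_codes, n_iter=20, eps=1e-3):
    """LBG vector quantization: grow the codebook by binary splitting,
    refining with nearest-neighbour (Euclidean) reassignment each time.
    Terminates exactly when n_codes is a power of two."""
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < n_codes:
        # Split every codeword into a slightly perturbed pair, then refine.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):
            d = np.linalg.norm(data[:, None] - codebook[None], axis=2)
            assign = d.argmin(axis=1)
            for c in range(len(codebook)):
                members = data[assign == c]
                if len(members):
                    codebook[c] = members.mean(axis=0)
    return codebook

spectra = np.abs(rng.normal(size=(200, 16)))   # toy training spectra
basis = lbg(spectra, n_codes=4)                # 4 basis vectors for one class
```

Running this per class yields the class-wise basis matrices; running it on the pooled spectra of all classes yields the 64-vector common basis matrix.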
2) Acoustic feature: Before feature extraction, spectral subtraction (SS) was applied to suppress background noise [16]. The subtraction coefficient was set to 2.0, with a small flooring coefficient. To estimate the noise spectrum, we used the noise regions denoted in the labels during training, and the first 100 frames during testing. As acoustic features, we used MFCC and FBANK, extracted with a 20 ms Hamming analysis window and 50% overlap, the same as for NMF. For MFCC, the number of Mel-filterbank filters was set to 33; we used 12-dimensional MFCC with log power plus their temporal deltas and delta-deltas, giving a final feature vector of 39 dimensions per frame. For FBANK, the number of channels was set to 45; with deltas and delta-deltas appended as for MFCC, the feature vector had 135 dimensions.

3) Acoustic model: As the acoustic model, we used a DNN. Since the original training data is not large enough to train a DNN acoustic model, we produced multi-condition training data by corrupting the original training data with noise that we recorded in several office rooms, at SNRs of 20, 15, and 10 dB. We also added the OL development tracks to the multi-condition training set. We used a ±3 frame context as DNN input, so the input layer had 273 units for MFCC or 945 units for FBANK. Our DNN had 5 hidden layers: the first had 512 units, the second, third, and fourth had 256, 128, and 64 units respectively, and the last had 32 units. The output layer had 17 units, corresponding to the 16 acoustic event classes plus silence. We used the rectified linear function for the DNN unit activations. The DNN was trained with supervised learning without pre-training. We also evaluate a score combination of the two DNN acoustic models, MFCC and FBANK.
This method (which we call FUSION here) simply takes the maximum score of the two acoustic models.
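The spectral subtraction front end described in Section III-A.2 can be sketched as follows; the subtraction coefficient 2.0 follows the paper, while the flooring coefficient value here is an assumption, and the spectrogram is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(4)

def spectral_subtraction(power, noise, alpha=2.0, beta=0.01):
    """Power spectral subtraction [16]: subtract alpha times the noise
    estimate, flooring the result at beta times the noisy power.
    alpha=2.0 follows the paper; beta=0.01 is an assumed flooring value."""
    clean = power - alpha * noise
    return np.maximum(clean, beta * power)

power = rng.random((500, 161)) + 0.5    # toy power spectrogram (frames x bins)
noise = power[:100].mean(axis=0)        # noise estimate: first 100 frames
enhanced = spectral_subtraction(power, noise)
```

The flooring term keeps the enhanced spectrum strictly positive, which matters for the log compression in the MFCC and FBANK pipelines that follow.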

We hope to improve event detection performance by reflecting the characteristics of the two different acoustic models in the scores.

4) Evaluation metric: We followed the evaluation metrics of the D-CASE 2012 challenge, the frame-based Precision, Recall, and F-measure, judged every 10 ms. With C the number of correctly detected event frames, E the total number of detected event frames, and GT the number of event frames in the reference labels, the metrics are defined as:

    Precision [%] = 100 C / E    (4)
    Recall [%] = 100 C / GT    (5)
    F [%] = 2 Precision · Recall / (Precision + Recall)    (6)

We also used the R_overlap measure, the Recall computed only on event-overlapping frames. We compared the proposed method with the previous results of the D-CASE 2012 challenge.

B. Experimental Results

TABLE I. D-CASE 2012 challenge results for OS: frame-based F [%] of the Baseline [8], DHV [6], GVV [10], and VVK [9] systems.

TABLE II. NMF-only and DNN-only results: R_overlap [%] and F [%], with the DNN features MFCC, FBANK, and FUSION.

Tables I and II show the D-CASE 2012 results and our NMF-only and DNN-only results for OS. Using a DNN improved the F-measure over the D-CASE 2012 challenge results, especially with FBANK and FUSION. With NMF only, however, we obtained just 14.4% F-measure. Moreover, when we combined the conventional NMF with the DNN, the result was worse than the DNN alone (shown in Table IV). These results show that source separation based on conventional NMF does not work well here.

Tables III and IV show the results of our proposed method. It achieves, at best, 37.6% F-measure with MFCC, 37.4% with FBANK, and 41.3% with FUSION, and R_overlap improved as well. As shown in Table IV, the sharing basis method improved performance remarkably, by 27.2% absolute for MFCC and 16.8% for FBANK over NMF-DNN. The best result came from FUSION with threshold selection (threshold = 0.45, 8 basis vectors per class).
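The frame-based metrics of Eqs. (4)-(6) can be computed as in this sketch, over toy reference and hypothesis sets of (frame, event) pairs:

```python
def frame_metrics(ref, hyp):
    """Frame-based Precision, Recall and F-measure (Eqs. 4-6).
    `ref` and `hyp` are sets of (frame_index, event_label) pairs, so a
    frame with several overlapping events contributes several pairs."""
    correct = len(ref & hyp)
    precision = 100.0 * correct / len(hyp) if hyp else 0.0
    recall = 100.0 * correct / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Toy example: frame 1 holds two overlapping reference events.
ref = {(0, "speech"), (1, "speech"), (1, "keys"), (2, "keys")}
hyp = {(0, "speech"), (1, "speech"), (2, "cough")}
p, r, f = frame_metrics(ref, hyp)
```

Counting (frame, event) pairs rather than frames is what allows overlapped events to be credited (or penalized) independently; R_overlap is simply the recall restricted to pairs whose frame holds more than one reference event.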
The threshold selection method performs better than constant selection, as shown in Table III. With shared basis vectors, the laughter class had the most links to the common basis matrix (17), while alert and switch had only 5. The average number of links per class was 11 basis vectors, fewer than in the constant selection case. We conjecture that thresholding avoids creating links to unneeded basis vectors.

TABLE III. F-measure [%] of the sharing basis method (NMF-DNN), for threshold selection (by threshold and number of bases) and constant selection (by selection and number of bases), each with MFCC and FBANK.

TABLE IV. Performance comparison (R_overlap [%] and F [%]) between the conventional methods (DNN FUSION; NMF-DNN without sharing basis) and the proposed sharing basis method (threshold = 0.45 or 0.55, 8 bases per class), each with MFCC, FBANK, and FUSION.

IV. CONCLUSION

In this paper, we proposed sharing basis vectors to improve NMF separation and AED performance on event-overlapping segments. We evaluated the proposed method on the D-CASE 2012 challenge. Compared with the previous results, we obtained 41.3% frame-based F-measure at best, an absolute 20% improvement over the previous challenge's best result. As future work, we are interested in other spectral reconstruction methods that use the original class-wise basis matrices, rather than the common basis matrix, to reconstruct the spectrum after performing the sharing basis NMF. We also plan to evaluate our method on newer D-CASE AED challenges.

REFERENCES

[1] D. Wang and G. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms and Applications. Hoboken, NJ: J. Wiley & Sons.
[2] J. Salamon and J. P. Bello, "Unsupervised feature learning for urban sound classification," in Proc. IEEE ICASSP 2015, 2015.
[3] R. Radhakrishnan, A. Divakaran, and P. Smaragdis, "Audio analysis for surveillance applications," in Proc. IEEE WASPAA, 2005.
[4] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, "Acoustic event detection in speech overlapping scenarios based on high-resolution spectral input and deep learning," IEICE Transactions on Information and Systems, vol. E98-D, 2015.
[5] K. Yamamoto and K. Itou, "Browsing audio life-log data using acoustic and location information," in Proc. UBICOMM '09, 2009.
[6] A. Diment, T. Heittola, and T. Virtanen, "Sound event detection for office live and office synthetic AASP challenge," in IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events, 2013.
[7] A. Mesaros, T. Heittola, O. Dikmen, and T. Virtanen, "Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations," in Proc. IEEE ICASSP 2015, 2015.
[8] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley, "Detection and classification of acoustic scenes and events," IEEE Transactions on Multimedia, vol. 17, 2015.

[9] L. Vuegen, B. Van Den Broeck, P. Karsmakers, J. F. Gemmeke, B. Vanrumste, and H. Van hamme, "An MFCC-GMM approach for event detection and classification," in IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events, 2013.
[10] J. F. Gemmeke, L. Vuegen, P. Karsmakers, and H. Van hamme, "An exemplar-based NMF approach to audio event detection," in IEEE AASP Challenge: Detection and Classification of Acoustic Scenes and Events, 2013.
[11] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2000.
[12] S. Nakano, K. Yamamoto, and S. Nakagawa, "Speech recognition in mixed sound of speech and music based on vector quantization and non-negative matrix factorization," in Proc. INTERSPEECH 2011, 2011.
[13] T. Komatsu, Y. Senda, and R. Kondo, "Acoustic event detection based on non-negative matrix factorization with mixtures of local dictionaries and activation aggregation," in Proc. IEEE ICASSP 2016, 2016.
[14] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, "Acoustic event detection method using semi-supervised non-negative matrix factorization with a mixture of local dictionaries," DCASE2016 Challenge, Tech. Rep., September 2016.
[15] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2000.
[16] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1979.


More information

Model-based unsupervised segmentation of birdcalls from field recordings

Model-based unsupervised segmentation of birdcalls from field recordings Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:

More information

Exemplar-based voice conversion using non-negative spectrogram deconvolution

Exemplar-based voice conversion using non-negative spectrogram deconvolution Exemplar-based voice conversion using non-negative spectrogram deconvolution Zhizheng Wu 1, Tuomas Virtanen 2, Tomi Kinnunen 3, Eng Siong Chng 1, Haizhou Li 1,4 1 Nanyang Technological University, Singapore

More information

THE task of identifying the environment in which a sound

THE task of identifying the environment in which a sound 1 Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Victor Bisot, Romain Serizel, Slim Essid, and Gaël Richard Abstract In this paper, we study the usefulness of various

More information

Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann Lehrstuhl für Multimediakommunikation

More information

Voice Activity Detection Using Pitch Feature

Voice Activity Detection Using Pitch Feature Voice Activity Detection Using Pitch Feature Presented by: Shay Perera 1 CONTENTS Introduction Related work Proposed Improvement References Questions 2 PROBLEM speech Non speech Speech Region Non Speech

More information

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers

Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Singer Identification using MFCC and LPC and its comparison for ANN and Naïve Bayes Classifiers Kumari Rambha Ranjan, Kartik Mahto, Dipti Kumari,S.S.Solanki Dept. of Electronics and Communication Birla

More information

AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE. Qiuqiang Kong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley

AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE. Qiuqiang Kong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley AUDIO SET CLASSIFICATION WITH ATTENTION MODEL: A PROBABILISTIC PERSPECTIVE Qiuqiang ong*, Yong Xu*, Wenwu Wang, Mark D. Plumbley Center for Vision, Speech and Signal Processing, University of Surrey, U

More information

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION

SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION SPEECH ENHANCEMENT USING PCA AND VARIANCE OF THE RECONSTRUCTION ERROR IN DISTRIBUTED SPEECH RECOGNITION Amin Haji Abolhassani 1, Sid-Ahmed Selouani 2, Douglas O Shaughnessy 1 1 INRS-Energie-Matériaux-Télécommunications,

More information

ACOUSTIC SCENE CLASSIFICATION WITH MATRIX FACTORIZATION FOR UNSUPERVISED FEATURE LEARNING. Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard

ACOUSTIC SCENE CLASSIFICATION WITH MATRIX FACTORIZATION FOR UNSUPERVISED FEATURE LEARNING. Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard ACOUSTIC SCENE CLASSIFICATION WITH MATRIX FACTORIZATION FOR UNSUPERVISED FEATURE LEARNING Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard LTCI, CNRS, Télćom ParisTech, Université Paris-Saclay, 75013,

More information

Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification

Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Feature Learning with Matrix Factorization Applied to Acoustic Scene Classification Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard To cite this version: Victor Bisot, Romain Serizel, Slim Essid,

More information

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 I am v Master Student v 6 months @ Music Technology Group, Universitat Pompeu Fabra v Deep learning for acoustic source separation v

More information

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms

Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Nonnegative Matrix Factorization with Markov-Chained Bases for Modeling Time-Varying Patterns in Music Spectrograms Masahiro Nakano 1, Jonathan Le Roux 2, Hirokazu Kameoka 2,YuKitano 1, Nobutaka Ono 1,

More information

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1

EEL 851: Biometrics. An Overview of Statistical Pattern Recognition EEL 851 1 EEL 851: Biometrics An Overview of Statistical Pattern Recognition EEL 851 1 Outline Introduction Pattern Feature Noise Example Problem Analysis Segmentation Feature Extraction Classification Design Cycle

More information

Deep NMF for Speech Separation

Deep NMF for Speech Separation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Deep NMF for Speech Separation Le Roux, J.; Hershey, J.R.; Weninger, F.J. TR2015-029 April 2015 Abstract Non-negative matrix factorization

More information

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure

Correspondence. Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure Correspondence Pulse Doppler Radar Target Recognition using a Two-Stage SVM Procedure It is possible to detect and classify moving and stationary targets using ground surveillance pulse-doppler radars

More information

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS

PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS PHONEME CLASSIFICATION OVER THE RECONSTRUCTED PHASE SPACE USING PRINCIPAL COMPONENT ANALYSIS Jinjin Ye jinjin.ye@mu.edu Michael T. Johnson mike.johnson@mu.edu Richard J. Povinelli richard.povinelli@mu.edu

More information

Multi-level Attention Model for Weakly Supervised Audio Classification

Multi-level Attention Model for Weakly Supervised Audio Classification Multi-level Attention Model for Weakly Supervised Audio Classification Changsong Yu, Karim Said Barsim, Qiuqiang Kong and Bin Yang Institute of Signal Processing and System Theory, University of Stuttgart,

More information

Comparing linear and non-linear transformation of speech

Comparing linear and non-linear transformation of speech Comparing linear and non-linear transformation of speech Larbi Mesbahi, Vincent Barreaud and Olivier Boeffard IRISA / ENSSAT - University of Rennes 1 6, rue de Kerampont, Lannion, France {lmesbahi, vincent.barreaud,

More information

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring

Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Modeling Prosody for Speaker Recognition: Why Estimating Pitch May Be a Red Herring Kornel Laskowski & Qin Jin Carnegie Mellon University Pittsburgh PA, USA 28 June, 2010 Laskowski & Jin ODYSSEY 2010,

More information

2D Spectrogram Filter for Single Channel Speech Enhancement

2D Spectrogram Filter for Single Channel Speech Enhancement Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 007 89 D Spectrogram Filter for Single Channel Speech Enhancement HUIJUN DING,

More information

Dominant Feature Vectors Based Audio Similarity Measure

Dominant Feature Vectors Based Audio Similarity Measure Dominant Feature Vectors Based Audio Similarity Measure Jing Gu 1, Lie Lu 2, Rui Cai 3, Hong-Jiang Zhang 2, and Jian Yang 1 1 Dept. of Electronic Engineering, Tsinghua Univ., Beijing, 100084, China 2 Microsoft

More information

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus

More information

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION

FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION FEATURE SELECTION USING FISHER S RATIO TECHNIQUE FOR AUTOMATIC SPEECH RECOGNITION Sarika Hegde 1, K. K. Achary 2 and Surendra Shetty 3 1 Department of Computer Applications, NMAM.I.T., Nitte, Karkala Taluk,

More information

This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail.

This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail. Powered by TCPDF (www.tcpdf.org) This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail. Author(s): Title: Heikki Kallasjoki,

More information

Exploring the Relationship between Conic Affinity of NMF Dictionaries and Speech Enhancement Metrics

Exploring the Relationship between Conic Affinity of NMF Dictionaries and Speech Enhancement Metrics Interspeech 2018 2-6 September 2018, Hyderabad Exploring the Relationship between Conic Affinity of NMF Dictionaries and Speech Enhancement Metrics Pavlos Papadopoulos, Colin Vaz, Shrikanth Narayanan Signal

More information

arxiv: v1 [cs.sd] 29 Apr 2016

arxiv: v1 [cs.sd] 29 Apr 2016 LEARNING COMPACT STRUCTURAL REPRESENTATIONS FOR AUDIO EVENTS USING REGRESSOR BANKS Huy Phan, Marco Maass, Lars Hertel, Radoslaw Mazur, Ian McLoughlin, and Alfred Mertins Institute for Signal Processing,

More information

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,

More information

A State-Space Approach to Dynamic Nonnegative Matrix Factorization

A State-Space Approach to Dynamic Nonnegative Matrix Factorization 1 A State-Space Approach to Dynamic Nonnegative Matrix Factorization Nasser Mohammadiha, Paris Smaragdis, Ghazaleh Panahandeh, Simon Doclo arxiv:179.5v1 [cs.lg] 31 Aug 17 Abstract Nonnegative matrix factorization

More information

FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION?

FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION? FACTORS IN FACTORIZATION: DOES BETTER AUDIO SOURCE SEPARATION IMPLY BETTER POLYPHONIC MUSIC TRANSCRIPTION? Tiago Fernandes Tavares, George Tzanetakis, Peter Driessen University of Victoria Department of

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

A METHOD OF ICA IN TIME-FREQUENCY DOMAIN

A METHOD OF ICA IN TIME-FREQUENCY DOMAIN A METHOD OF ICA IN TIME-FREQUENCY DOMAIN Shiro Ikeda PRESTO, JST Hirosawa 2-, Wako, 35-98, Japan Shiro.Ikeda@brain.riken.go.jp Noboru Murata RIKEN BSI Hirosawa 2-, Wako, 35-98, Japan Noboru.Murata@brain.riken.go.jp

More information

Monaural speech separation using source-adapted models

Monaural speech separation using source-adapted models Monaural speech separation using source-adapted models Ron Weiss, Dan Ellis {ronw,dpwe}@ee.columbia.edu LabROSA Department of Electrical Enginering Columbia University 007 IEEE Workshop on Applications

More information

Bayesian Hierarchical Modeling for Music and Audio Processing at LabROSA

Bayesian Hierarchical Modeling for Music and Audio Processing at LabROSA Bayesian Hierarchical Modeling for Music and Audio Processing at LabROSA Dawen Liang (LabROSA) Joint work with: Dan Ellis (LabROSA), Matt Hoffman (Adobe Research), Gautham Mysore (Adobe Research) 1. Bayesian

More information

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm

Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic Approximation Algorithm EngOpt 2008 - International Conference on Engineering Optimization Rio de Janeiro, Brazil, 0-05 June 2008. Noise Robust Isolated Words Recognition Problem Solving Based on Simultaneous Perturbation Stochastic

More information

Constrained Nonnegative Matrix Factorization with Applications to Music Transcription

Constrained Nonnegative Matrix Factorization with Applications to Music Transcription Constrained Nonnegative Matrix Factorization with Applications to Music Transcription by Daniel Recoskie A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

NONNEGATIVE FEATURE LEARNING METHODS FOR ACOUSTIC SCENE CLASSIFICATION

NONNEGATIVE FEATURE LEARNING METHODS FOR ACOUSTIC SCENE CLASSIFICATION NONNEGATIVE FEATURE LEARNING METHODS FOR ACOUSTIC SCENE CLASSIFICATION Victor Bisot, Romain Serizel, Slim Essid, Gaël Richard LTCI, Télécom ParisTech, Université Paris Saclay, F-75013, Paris, France Université

More information

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES

REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES REVIEW OF SINGLE CHANNEL SOURCE SEPARATION TECHNIQUES Kedar Patki University of Rochester Dept. of Electrical and Computer Engineering kedar.patki@rochester.edu ABSTRACT The paper reviews the problem of

More information

Support Vector Machines using GMM Supervectors for Speaker Verification

Support Vector Machines using GMM Supervectors for Speaker Verification 1 Support Vector Machines using GMM Supervectors for Speaker Verification W. M. Campbell, D. E. Sturim, D. A. Reynolds MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420 Corresponding author e-mail:

More information

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,

More information

Diffuse noise suppression with asynchronous microphone array based on amplitude additivity model

Diffuse noise suppression with asynchronous microphone array based on amplitude additivity model Diffuse noise suppression with asynchronous microphone array based on amplitude additivity model Yoshikazu Murase, Hironobu Chiba, Nobutaka Ono, Shigeki Miyabe, Takeshi Yamada, and Shoji Makino University

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition

Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden

More information

CS229 Project: Musical Alignment Discovery

CS229 Project: Musical Alignment Discovery S A A V S N N R R S CS229 Project: Musical Alignment iscovery Woodley Packard ecember 16, 2005 Introduction Logical representations of musical data are widely available in varying forms (for instance,

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT

CONVOLUTIVE NON-NEGATIVE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT CONOLUTIE NON-NEGATIE MATRIX FACTORISATION WITH SPARSENESS CONSTRAINT Paul D. O Grady Barak A. Pearlmutter Hamilton Institute National University of Ireland, Maynooth Co. Kildare, Ireland. ABSTRACT Discovering

More information

Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs

Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Non-negative Matrix Factor Deconvolution; Extraction of Multiple Sound Sources from Monophonic Inputs Paris Smaragdis TR2004-104 September

More information

Modifying Voice Activity Detection in Low SNR by correction factors

Modifying Voice Activity Detection in Low SNR by correction factors Modifying Voice Activity Detection in Low SNR by correction factors H. Farsi, M. A. Mozaffarian, H.Rahmani Department of Electrical Engineering University of Birjand P.O. Box: +98-9775-376 IRAN hfarsi@birjand.ac.ir

More information

IMISOUND: An Unsupervised System for Sound Query by Vocal Imitation

IMISOUND: An Unsupervised System for Sound Query by Vocal Imitation IMISOUND: An Unsupervised System for Sound Query by Vocal Imitation Yichi Zhang and Zhiyao Duan Audio Information Research (AIR) Lab Department of Electrical and Computer Engineering University of Rochester

More information

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation CENTER FOR COMPUTER RESEARCH IN MUSIC AND ACOUSTICS DEPARTMENT OF MUSIC, STANFORD UNIVERSITY REPORT NO. STAN-M-4 Design Criteria for the Quadratically Interpolated FFT Method (I): Bias due to Interpolation

More information

Deep Neural Networks

Deep Neural Networks Deep Neural Networks DT2118 Speech and Speaker Recognition Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 45 Outline State-to-Output Probability Model Artificial Neural Networks Perceptron Multi

More information

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Jan Vaněk, Lukáš Machlica, Josef V. Psutka, Josef Psutka University of West Bohemia in Pilsen, Univerzitní

More information

Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints

Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints Discovering Convolutive Speech Phones using Sparseness and Non-Negativity Constraints Paul D. O Grady and Barak A. Pearlmutter Hamilton Institute, National University of Ireland Maynooth, Co. Kildare,

More information

Proc. of NCC 2010, Chennai, India

Proc. of NCC 2010, Chennai, India Proc. of NCC 2010, Chennai, India Trajectory and surface modeling of LSF for low rate speech coding M. Deepak and Preeti Rao Department of Electrical Engineering Indian Institute of Technology, Bombay

More information

Detection-Based Speech Recognition with Sparse Point Process Models

Detection-Based Speech Recognition with Sparse Point Process Models Detection-Based Speech Recognition with Sparse Point Process Models Aren Jansen Partha Niyogi Human Language Technology Center of Excellence Departments of Computer Science and Statistics ICASSP 2010 Dallas,

More information

A Generative Model Based Kernel for SVM Classification in Multimedia Applications

A Generative Model Based Kernel for SVM Classification in Multimedia Applications Appears in Neural Information Processing Systems, Vancouver, Canada, 2003. A Generative Model Based Kernel for SVM Classification in Multimedia Applications Pedro J. Moreno Purdy P. Ho Hewlett-Packard

More information

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji

Dynamic Data Modeling, Recognition, and Synthesis. Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Dynamic Data Modeling, Recognition, and Synthesis Rui Zhao Thesis Defense Advisor: Professor Qiang Ji Contents Introduction Related Work Dynamic Data Modeling & Analysis Temporal localization Insufficient

More information

Harmonic Structure Transform for Speaker Recognition

Harmonic Structure Transform for Speaker Recognition Harmonic Structure Transform for Speaker Recognition Kornel Laskowski & Qin Jin Carnegie Mellon University, Pittsburgh PA, USA KTH Speech Music & Hearing, Stockholm, Sweden 29 August, 2011 Laskowski &

More information

An Evolutionary Programming Based Algorithm for HMM training

An Evolutionary Programming Based Algorithm for HMM training An Evolutionary Programming Based Algorithm for HMM training Ewa Figielska,Wlodzimierz Kasprzak Institute of Control and Computation Engineering, Warsaw University of Technology ul. Nowowiejska 15/19,

More information

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION

OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION OVERLAPPING ANIMAL SOUND CLASSIFICATION USING SPARSE REPRESENTATION Na Lin, Haixin Sun Xiamen University Key Laboratory of Underwater Acoustic Communication and Marine Information Technology, Ministry

More information

SUPERVISED NON-EUCLIDEAN SPARSE NMF VIA BILEVEL OPTIMIZATION WITH APPLICATIONS TO SPEECH ENHANCEMENT

SUPERVISED NON-EUCLIDEAN SPARSE NMF VIA BILEVEL OPTIMIZATION WITH APPLICATIONS TO SPEECH ENHANCEMENT SUPERVISED NON-EUCLIDEAN SPARSE NMF VIA BILEVEL OPTIMIZATION WITH APPLICATIONS TO SPEECH ENHANCEMENT Pablo Sprechmann, 1 Alex M. Bronstein, 2 and Guillermo Sapiro 1 1 Duke University, USA; 2 Tel Aviv University,

More information

CEPSTRAL analysis has been widely used in signal processing

CEPSTRAL analysis has been widely used in signal processing 162 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 2, MARCH 1999 On Second-Order Statistics and Linear Estimation of Cepstral Coefficients Yariv Ephraim, Fellow, IEEE, and Mazin Rahim, Senior

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

Non-negative Matrix Factorization: Algorithms, Extensions and Applications

Non-negative Matrix Factorization: Algorithms, Extensions and Applications Non-negative Matrix Factorization: Algorithms, Extensions and Applications Emmanouil Benetos www.soi.city.ac.uk/ sbbj660/ March 2013 Emmanouil Benetos Non-negative Matrix Factorization March 2013 1 / 25

More information

Shankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms

Shankar Shivappa University of California, San Diego April 26, CSE 254 Seminar in learning algorithms Recognition of Visual Speech Elements Using Adaptively Boosted Hidden Markov Models. Say Wei Foo, Yong Lian, Liang Dong. IEEE Transactions on Circuits and Systems for Video Technology, May 2004. Shankar

More information

Pattern Recognition Applied to Music Signals

Pattern Recognition Applied to Music Signals JHU CLSP Summer School Pattern Recognition Applied to Music Signals 2 3 4 5 Music Content Analysis Classification and Features Statistical Pattern Recognition Gaussian Mixtures and Neural Nets Singing

More information

EXPLOITING LONG-TERM TEMPORAL DEPENDENCIES IN NMF USING RECURRENT NEURAL NETWORKS WITH APPLICATION TO SOURCE SEPARATION

EXPLOITING LONG-TERM TEMPORAL DEPENDENCIES IN NMF USING RECURRENT NEURAL NETWORKS WITH APPLICATION TO SOURCE SEPARATION EXPLOITING LONG-TERM TEMPORAL DEPENDENCIES IN NMF USING RECURRENT NEURAL NETWORKS WITH APPLICATION TO SOURCE SEPARATION Nicolas Boulanger-Lewandowski Gautham J. Mysore Matthew Hoffman Université de Montréal

More information

Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization

Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING Supervised and Unsupervised Speech Enhancement Using Nonnegative Matrix Factorization Nasser Mohammadiha*, Student Member, IEEE, Paris Smaragdis,

More information