Non-Negative Matrix Factorization And Its Application to Audio Tuomas Virtanen Tampere University of Technology tuomas.virtanen@tut.fi

2 Contents
- Introduction to audio signals
- Spectrogram representation
- Sound source separation
- Non-negative matrix factorization
- Application to sound source separation
- Algorithms
- Probabilistic formulation
- Bayesian extensions
- Supervised NMF
- Further analysis of the NMF components
- Applications & extensions of NMF

3 Introduction to audio signals
- Audio signal: a representation of sound
- Can exist in different forms:
  - Acoustic (that's how we hear it and often produce it)
  - Electrical voltage (output of a microphone, input of a loudspeaker)
  - Digital (mp3 files, compact discs, mobile phones)

4 Representations of audio signals
- The amplitude as a function of time is a natural representation of audio signals
- Describes the variation of the sound pressure around the DC level
- Easy to record using a microphone and to reproduce with a loudspeaker
- Digital signals: a sampling frequency of 44.1 kHz is commonly used
  - Allows representing frequencies 0-22.05 kHz
  - Humans can hear frequencies of roughly 20 Hz-20 kHz
  - Lower / higher sampling frequencies are also used
  - Most of the information is in the low frequencies

Spectrum of a sound 5
- Obtained e.g. by calculating the DFT of the signal
- Perceptual properties of a sound are more clearly visible in the spectrum
- Amplitude in dB is closer to the perception of loudness
- Phases are less meaningful: often only the magnitudes are used

6 Spectrogram representation
- Represents the intensity of a sound as a function of time and frequency
- Obtained by calculating the spectrum in short frames (typically 10-50 ms in the case of audio)

Linear superposition 7 When multiple sound sources are present, the signals add linearly

Spectrogram of polyphonic music 8 Mid-level representation suitable for audio analysis (Ellis & Rosenthal 1998) The rhythmic structure is still visible

9 Source separation
- In practical situations, other sounds interfere with the target sound
- Automatic recognition / processing of sounds within mixtures is extremely difficult
- Applications:
  - Robust speech recognition
  - Speech enhancement
  - Music content analysis (transcription, instrument identification, singer identification, lyrics transcription)
  - Audio manipulation
  - Object-based coding
- Very important in many other fields as well

10 How to separate
- Prior information about the sources
- General assumptions: statistical independence, etc.
- Multiple microphones: direction of arrival
- How does the human auditory system separate sources?

11 Blind source separation
- No prior information about the sources
- Only generic assumptions that are valid for all possible sources, e.g. statistical independence
- Involves unsupervised learning
- In many practical situations we have fewer sensors than sources: how to estimate multiple signals from a smaller number of observations?

12 Sparseness in a broad sense
- Assumption: a source signal can be described using a small number of parameters in some domain
- One possible approach: latent variable decompositions

Example signal 13 Notes C4 and G4 played on a guitar, first separately and then together

Sparseness of the time-domain signal 14 Five frames of the first note:

Sparseness of the magnitude spectrum 15 Five magnitude spectra of the first note: a phase-invariant representation leads to much more compact models

Mixture spectrogram 16

Linear model for the mixture 17
- The spectrum vector x_t is decomposed into a weighted sum of frequency basis vectors a_1 and a_2:
    x_t = s_{1t} a_1 + s_{2t} a_2
- a_1 and a_2 represent the spectra of note 1 and note 2, respectively
- s_{1t} and s_{2t} represent the gains of the notes over time
- The model in vector-matrix form:
    [x_{1t}; x_{2t}; ...; x_{Ft}] = [a_1 a_2] [s_{1t}; s_{2t}]
  i.e.
    x_t = A s_t
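
As a toy illustration of the model above (not from the slides; all numbers below are made up), a short Python/NumPy sketch that builds a mixture spectrogram X column by column as x_t = A s_t:

```python
import numpy as np

# Hypothetical toy example of the linear mixture model x_t = A s_t:
# two basis spectra (columns of A) and their time-varying gains (rows of S).
a1 = np.array([1.0, 0.8, 0.1, 0.4, 0.05, 0.2])   # spectrum of note 1 (made up)
a2 = np.array([0.1, 0.3, 1.0, 0.2, 0.7, 0.05])   # spectrum of note 2 (made up)
A = np.column_stack([a1, a2])                     # F x 2 basis matrix
S = np.array([[1.0, 1.0, 0.0, 0.5],               # gains of note 1 over time
              [0.0, 0.0, 1.0, 0.5]])              # gains of note 2 over time
X = A @ S                                         # F x T mixture; column t equals A @ S[:, t]
print(X.shape)                                    # (6, 4)
```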

18 ICA on spectrogram
- The model matches the ICA model: each frequency bin is a sensor, and the mixture weights over time are the sources
- Let us try to use ICA to separate the notes
- ICA on a spectrogram: independent subspace analysis, ISA (Casey & Westner 2000)

19 Results with ICA Weights over time Negative weights(!) Both weights seem to represent the first note

Spectral basis vectors obtained with ICA 20 ICA estimate (upper panel) vs. original (lower panel) Both components represent a combination of the notes Negative values

21 What goes wrong?
- Negative weights: subtraction of spectral basis vectors
- Negative values in the spectral basis vectors
- Subtraction of magnitude or power spectra is physically unrealistic
- Are the notes statistically independent? Are the modeling assumptions correct? Is independence as defined in ICA a good assumption in this case?

22 Non-negativity restrictions
- Non-negativity restrictions are difficult to incorporate into ICA
- It has been shown that with non-negativity restrictions, PCA leads to independent components (Plumbley 2002, Wilson & Raj 2010)

Non-negativity restrictions alone 23 What if we seek a representation x_t ≈ A s_t while restricting the basis vectors and weights to non-negative values?

Model for multiple frames 24
The model

    x_t ≈ A s_t,   t = 1, ..., T

written for all the frames in matrix form:

    [x_1 x_2 ... x_T] ≈ A [s_1 s_2 ... s_T]

and using matrices only:

    X ≈ AS

25 Non-negative matrix factorization NMF: minimize the error of the approximation X ≈ AS, while restricting A and S to non-negative values (Lee & Seung, 1999 & 2001)

Guitar example 26

Spectral basis vectors obtained with NMF 27 NMF estimate (upper panel) vs. original (lower panel) Bases correspond to individual notes Permutation ambiguity

28 Weights obtained with NMF The green basis partly represents the onset of the second note Good separation of the notes

Why does NMF work? 29 By representing signals as a sum of purely additive, non-negative sources, we get a parts-based representation (Lee & Seung, 1999)

Vector quantization on face data (from Lee & Seung, Nature 1999) 30

PCA on face data 31

NMF of face data 32

33 NMF on complex polyphonic music
- NMF represents the parts of the signal that fit the model (Virtanen, 2007):
  - Individual drum instruments
  - Repeating chords
  - Any repetitive structure in the signal

Polyphonic example 34 Original 20 separated components:

NMF algorithms 35
- NMF minimizes the error between X and AS while restricting A and S to be entry-wise non-negative
- Two commonly used distance measures (Lee & Seung 2001):
  - Euclidean distance / squared L2 norm:
      d_euc(X, AS) = ||X - AS||_F^2
  - Generalized Kullback-Leibler divergence:
      d_div(X, AS) = sum_{f,t} ( X_ft log(X_ft / [AS]_ft) - X_ft + [AS]_ft )
- Many other measures
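
For concreteness, a minimal Python/NumPy sketch (my own, with assumed shapes and a small epsilon added to avoid taking the log of zero) that evaluates both cost functions for given non-negative X, A and S:

```python
import numpy as np

# Hypothetical sketch: computing the two NMF cost functions above.
rng = np.random.default_rng(0)
X = rng.random((5, 8))            # F x T non-negative data
A = rng.random((5, 3))            # F x K basis vectors
S = rng.random((3, 8))            # K x T gains
V = A @ S                         # the model [AS]

d_euc = np.sum((X - V) ** 2)      # squared Frobenius norm ||X - AS||_F^2
eps = 1e-12                       # guard against log(0) and division by zero
d_div = np.sum(X * np.log((X + eps) / (V + eps)) - X + V)   # generalized KL divergence
print(d_euc, d_div)
```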

Multiplicative update rules 36
- Update rules under which the cost is guaranteed to be non-increasing
- Easy to implement and to extend
- Euclidean distance:
    A = A .* (X S^T) ./ ((AS) S^T)
    S = S .* (A^T X) ./ (A^T (AS))
- KL divergence:
    A = A .* ((X ./ (AS)) S^T) ./ (1 S^T)
    S = S .* (A^T (X ./ (AS))) ./ (A^T 1)
  where .* and ./ denote element-wise multiplication and division, and 1 is an all-ones matrix of the same size as X

37 Optimization procedure
1. Initialize the entries of A and S with random positive values
2. Update A
3. Update S
4. Iterate steps 2 and 3
Also other optimization algorithms exist (e.g. projected steepest descent, Hoyer 2004)
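
The whole procedure with the KL-divergence updates from the previous slide fits in a few lines of NumPy. The sketch below is illustrative only; the function name, iteration count and the small epsilon guarding against division by zero are my assumptions:

```python
import numpy as np

def nmf_kl(X, n_components, n_iter=200, eps=1e-12, seed=0):
    """Minimal sketch of NMF with the generalized KL divergence and the
    multiplicative updates above (Lee & Seung 2001)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    A = rng.random((F, n_components)) + eps    # step 1: random positive initialization
    S = rng.random((n_components, T)) + eps
    ones = np.ones_like(X)                     # all-ones matrix of the size of X
    for _ in range(n_iter):
        V = A @ S
        A *= ((X / (V + eps)) @ S.T) / (ones @ S.T + eps)   # step 2: update A
        V = A @ S
        S *= (A.T @ (X / (V + eps))) / (A.T @ ones + eps)   # step 3: update S
    return A, S

# Usage on a random non-negative "spectrogram":
X = np.abs(np.random.default_rng(1).standard_normal((257, 100)))
A, S = nmf_kl(X, n_components=4)
print(A.shape, S.shape)   # (257, 4) (4, 100)
```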

38 NMF for audio in practice
- Calculate the magnitude spectrogram:
  - Obtain each frame by multiplying the signal with a window function (for example a 40 ms Hamming window), using a frame shift of 50% or less
  - Calculate the DFT in each frame t
  - Assign the absolute values of the DFT to X_ft and store the original phases
- Apply NMF (see previous slide) to obtain A and S
- The magnitude spectrogram of component k is obtained as A(:,k) * S(k,:), or as X .* (A(:,k) * S(k,:)) ./ (A*S) (Matlab notation)
- Synthesis:
  - Assign the phases of the original mixture spectrogram to the separated component
  - Get the time-domain frame by IDFT
  - Combine frames using overlap-add
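
An end-to-end sketch of this recipe in Python is given below, reusing nmf_kl() from the earlier example. The SciPy STFT/ISTFT calls stand in for the framing, DFT and overlap-add steps; the signal, sample rate and parameter values are placeholders, not values from the slides:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.random.default_rng(2).standard_normal(2 * fs)          # stand-in for a real recording

nperseg = int(0.040 * fs)                                      # 40 ms frames
_, _, Z = stft(x, fs=fs, window='hamming', nperseg=nperseg,
               noverlap=nperseg // 2)                          # 50% frame shift
X, phase = np.abs(Z), np.angle(Z)                              # magnitudes and original phases

A, S = nmf_kl(X, n_components=4)                               # apply NMF to the magnitudes
V = A @ S + 1e-12

k = 0                                                          # reconstruct component k
mask = np.outer(A[:, k], S[k, :]) / V                          # X .* (A(:,k)*S(k,:)) ./ (A*S)
component_mag = X * mask
_, y_k = istft(component_mag * np.exp(1j * phase), fs=fs,      # reuse the mixture phases
               window='hamming', nperseg=nperseg,
               noverlap=nperseg // 2)                          # IDFT + overlap-add
```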

NMF distance measures 39
- The distance measure should be chosen according to the properties of the data
- NMF can be viewed as maximum likelihood estimation
- The Euclidean distance assumes additive Gaussian noise:
    p(X | A, S) = prod_{f,t} N( X_ft ; [AS]_ft , σ² )
- The KL divergence assumes a Poisson observation model (the variance scales linearly with the model):
    p(X | A, S) = prod_{f,t} Po( X_ft ; [AS]_ft ) = prod_{f,t} [AS]_ft^{X_ft} e^{-[AS]_ft} / X_ft!
  Equivalent to the multinomial model of PLSA

Bayesian approach (Virtanen and Cemgil 2008) 40
- Bayes rule: p(A,S | X) = p(X | A,S) p(A,S) / p(X)
- Allows us to place priors on A and S -> maximum a posteriori estimation
- Typically a sparse prior for the mixture weights, e.g. an exponential prior:
    p(S) ∝ prod_{k,t} e^{-λ S_kt}
- The objective to be minimized then becomes (for example with the Gaussian model):
    ||X - AS||_F^2 + λ sum_{k,t} S_kt
  -> non-negative sparse coding
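
One possible multiplicative-update realization of this non-negative sparse coding objective (Euclidean reconstruction error plus an L1 penalty on S) is sketched below; the penalty simply appears in the denominator of the update for S. The function name, defaults and the normalization of the basis vectors are my assumptions, not an algorithm given on the slides:

```python
import numpy as np

def sparse_nmf_euc(X, n_components, lam=0.1, n_iter=200, eps=1e-12, seed=0):
    """Sketch of non-negative sparse coding: Euclidean NMF with penalty lam * sum(S)."""
    rng = np.random.default_rng(seed)
    F, T = X.shape
    A = rng.random((F, n_components)) + eps
    S = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        A *= (X @ S.T) / (A @ S @ S.T + eps)            # standard Euclidean update for A
        A /= np.linalg.norm(A, axis=0, keepdims=True)   # fix scaling so the penalty is meaningful
        S *= (A.T @ X) / (A.T @ A @ S + lam + eps)      # penalty lam enters the denominator
    return A, S
```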

41 Regularization in NMF
- Additional cost terms can be added to the reconstruction error measure:
  - Sparseness, temporal continuity (Virtanen 2007)
  - Correlation of weights (Wilson et al. 2008) or spectra (Virtanen & Cemgil 2009)
  - Correlation of components (Wilson & Raj 2010)
- Optimization may become more difficult

42 Connection to PLSA Unlike PLSA, NMF does not require normalization Slightly different probabilistic model formulation

43 Supervised NMF
- Prior information is easy to include by training the spectral basis vectors in advance
- Source separation scenario (a sketch follows below):
  - Isolated training material of source 1 and source 2
  - Use NMF to train basis spectra for both sources separately
  - Combine the basis vector sets
  - Apply NMF with the combined basis vector set: keep the basis vectors fixed while updating the mixing weights
  - Synthesize source 1 by using its basis vectors only
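
The supervised scenario above could look as follows in NumPy, reusing nmf_kl() from the earlier example; the helper that keeps the basis vectors fixed and the placeholder training/mixture spectrograms are assumptions for illustration:

```python
import numpy as np

def nmf_kl_fixed_basis(X, A, n_iter=200, eps=1e-12, seed=0):
    """Update only the weights S while keeping the basis vectors A fixed."""
    rng = np.random.default_rng(seed)
    S = rng.random((A.shape[1], X.shape[1])) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        V = A @ S
        S *= (A.T @ (X / (V + eps))) / (A.T @ ones + eps)
    return S

rng = np.random.default_rng(3)
X1_train = rng.random((257, 300))          # isolated training material, source 1
X2_train = rng.random((257, 300))          # isolated training material, source 2
X_mix = rng.random((257, 100))             # mixture to be separated

A1, _ = nmf_kl(X1_train, n_components=10)  # train basis spectra separately
A2, _ = nmf_kl(X2_train, n_components=10)
A = np.hstack([A1, A2])                    # combine the basis vector sets
S = nmf_kl_fixed_basis(X_mix, A)           # only the mixing weights are updated
source1_mag = A1 @ S[:A1.shape[1], :]      # synthesize source 1 from its bases only
```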

44 Further analysis
- In practice a source can be represented with more than one component:
  - Cluster the components into sources
  - Supervised classification of components (train a classifier)
  - Example: separation of drums from polyphonic music by classifying NMF components with an SVM (Helén & Virtanen 2005)
- Basis vectors are spectra: pitch estimation (Vincent et al. 2007)
- Onset detection from the mixture weights: suits well for automatic drum transcription (Paulus & Virtanen 2005, Vincent et al. 2007)

45 Extensions of NMF
- Convolution in frequency (Virtanen 2006):
  - Translation of a basis vector in frequency, with a weight for each translation
  - With a constant-Q spectral transformation this allows modeling different pitches with a single basis vector
- Convolution in time (Smaragdis 2007, Virtanen 2004):
  - Basis vector extended to cover multiple adjacent frames -> time-varying spectra
  - Transposing the spectrogram makes this equivalent to convolution in frequency
- Excitation-filter model (Heittola et al. 2009):
  - Each basis vector is modeled as the product of an excitation and a filter
- Harmonic bases (Vincent et al. 2007):
  - Each basis vector is modeled as a weighted sum of harmonic combs with limited frequency support
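
To make the convolution-in-time idea concrete, here is a small sketch of the reconstruction used in non-negative matrix factor deconvolution, where each component has a sequence of spectra spanning several adjacent frames; the shapes and names are assumptions, and the update rules are omitted:

```python
import numpy as np

def nmfd_reconstruct(A, S):
    """A: (n_shifts, F, K) time-varying basis spectra, S: (K, T) gains.
    Returns the (F, T) model sum over tau of A[tau] @ shift(S, tau)."""
    n_shifts, F, K = A.shape
    _, T = S.shape
    V = np.zeros((F, T))
    for tau in range(n_shifts):
        S_shift = np.zeros_like(S)
        S_shift[:, tau:] = S[:, :T - tau]     # shift the gains tau frames to the right
        V += A[tau] @ S_shift
    return V
```

Fitting A and S would use multiplicative updates analogous to the ones above, applied per shift tau (Smaragdis 2007).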

46 Voice separation demonstrations
Demonstrations are also available at http://www.cs.tut.fi/~tuomasv/
Audio examples: mixture, NMF-enhanced mixture, sinusoidal model, binary mask, proposed method

References 47
- Casey, M. and Westner, A., "Separation of Mixed Audio Sources by Independent Subspace Analysis", in Proceedings of the International Computer Music Conference, ICMA, Berlin, 2000.
- M. Plumbley, "Conditions for non-negative independent component analysis", IEEE Signal Processing Letters, vol. 9, no. 6, pp. 177-180, 2002.
- K. W. Wilson and B. Raj, "Spectrogram dimensionality reduction with independence constraints", in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Dallas, USA, 2010.
- D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization", Advances in Neural Information Processing Systems 13, pp. 556-562, 2001.
- D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization", Nature 401, pp. 788-791, 1999.
- T. Virtanen, "Monaural Sound Source Separation by Non-Negative Matrix Factorization with Temporal Continuity and Sparseness Criteria", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, March 2007.
- P. O. Hoyer, "Non-negative Matrix Factorization with Sparseness Constraints", Journal of Machine Learning Research 5, pp. 1457-1469, 2004.
- Helén, M. and Virtanen, T., "Separation of Drums From Polyphonic Music Using Non-Negative Matrix Factorization and Support Vector Machine", in Proc. 13th European Signal Processing Conference, Antalya, Turkey, 2005.
- Paulus, J. and Virtanen, T., "Drum Transcription with Non-negative Spectrogram Factorisation", in Proc. 13th European Signal Processing Conference, Antalya, Turkey, 2005.

References (2) 48
- E. Vincent, N. Bertin, and R. Badeau, "Two Nonnegative Matrix Factorization Methods for Polyphonic Pitch Transcription", in Proc. of the International Conference on Music Information Retrieval (ISMIR), Vienna, 2007.
- T. Virtanen, A. T. Cemgil, and S. J. Godsill, "Bayesian Extensions to Non-negative Matrix Factorisation for Audio Signal Modelling", in Proc. ICASSP 2008.
- Wilson, K. W., Raj, B., and Smaragdis, P., "Regularized Non-Negative Matrix Factorization with Temporal Dependencies for Speech Denoising", in Proc. Interspeech 2008, Brisbane, Australia, September 2008.
- T. Virtanen and A. T. Cemgil, "Mixtures of Gamma Priors for Non-Negative Matrix Factorization Based Speech Separation", in Proc. ICA 2009, Paraty, Brazil, 2009.
- T. Virtanen, "Sound Source Separation in Monaural Music Signals", PhD Thesis, Tampere University of Technology, 2006.
- T. Virtanen, "Separation of Sound Sources by Convolutive Sparse Coding", in ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), 2004.
- T. Heittola, A. Klapuri, and T. Virtanen, "Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation", in Proc. 10th Int. Society for Music Information Retrieval Conference (ISMIR 2009), Kobe, Japan, 2009.
- Smaragdis, P., "Convolutive Speech Bases and their Application to Speech Separation", IEEE Transactions on Audio, Speech, and Language Processing, January 2007.
- D. Ellis and D. F. Rosenthal, "Mid-level representations for Computational Auditory Scene Analysis", Chapter 17 in Computational Auditory Scene Analysis, D. F. Rosenthal and H. Okuno, eds., Lawrence Erlbaum, pp. 257-272, 1998.