Analysis of polyphonic audio using source-filter model and non-negative matrix factorization

Tuomas Virtanen and Anssi Klapuri
Tampere University of Technology, Institute of Signal Processing
Korkeakoulunkatu 1, FI-33720 Tampere, Finland
tuomas.virtanen@tut.fi, anssi.klapuri@tut.fi

(This work was supported by the Academy of Finland, project No. 5213462, Finnish Centre of Excellence Programme 2006-2011.)

Abstract

This paper proposes a method for analysing polyphonic audio signals based on a signal model where an input spectrogram is modelled as a linear sum of basis functions with time-varying gains. Each basis is further represented as a product of a source spectrum and the magnitude response of a filter. This formulation reduces the number of free parameters needed to represent realistic audio signals and leads to more reliable parameter estimation. Two novel estimation algorithms are proposed, one extended from non-negative matrix factorization and the other from non-negative matrix deconvolution. In preliminary experiments with singing signals, both algorithms were found to converge towards meaningful analysis results.

1 Introduction

Model-based analysis of audio signals has received increasing attention in recent years, and even quite complex generative models have been proposed [2, 3]. Applications include, for example, sound separation, audio coding, music transcription, and sound source recognition. This paper proposes a method for analysing the component sounds in a polyphonic audio signal based on the source-filter model of sound production. Here, source refers to a vibrating object such as a guitar string, and filter represents the resonance structure of the rest of the instrument, which colors the produced sound. The source varies with pitch and the degree of periodicity, whereas the filter varies with timbre. This framework has been used for decades in speech coding [7] and sound synthesis [9], but has not been properly adopted in signal analysis and classification problems. Usually a less structured approach is employed, modeling sound spectra directly with the Fourier transform or with Mel-frequency cepstral coefficients (MFCCs). These have certain limitations that are addressed in this paper.

We have recently proposed an algorithm for modeling the time-varying spectral energy distribution of musical sounds presented in isolation [5]. In that work, the source-filter model was found to lead to a clear improvement over MFCC-like models. Here the analysis is performed using a completely different estimation algorithm which is suitable for the analysis of polyphonic signals. Also, several different filters are allowed per sound source, which makes the model more suitable for singing, for example, where the vocal tract filter varies over time.

As a framework for polyphonic signal analysis we use a linear signal model for the magnitude spectrum $x_t(k)$ of the mixture signal, where $k$ is the frequency index and $t$ is the frame index. The signal $x_t(k)$ is modelled as a sum of basis functions

$$\hat{x}_t(k) = \sum_{n=1}^{N} g_{n,t} b_n(k) \qquad (1)$$

where $g_{n,t}$ is the gain of basis function $n$ in frame $t$, and $b_n(k)$, $n = 1, \ldots, N$ are the bases. The bases represent the magnitude spectra of different musical tones. Several active bases are allowed at each time, making the model suitable for polyphonic signals. Existing unsupervised learning methods for estimating the basis functions and gains include independent component analysis (ICA) [1], sparse coding [10], and non-negative matrix factorization (NMF) [8]. When used for sound separation, currently the best results have been obtained using NMF [11].
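For readers who want to experiment, model (1) is simply a matrix product between a basis matrix and a gain matrix. The following minimal NumPy sketch makes that explicit; all sizes and variable names are illustrative choices, not taken from the paper.

```python
import numpy as np

# Model (1): x_hat_t(k) = sum_n g_{n,t} b_n(k), i.e. X_hat = B @ G.
# Dimensions are arbitrary placeholders.
K, T, N = 513, 200, 20          # frequency bins, frames, number of bases
B = np.random.rand(K, N)        # basis spectra b_n(k); column n is one basis
G = np.random.rand(N, T)        # time-varying gains g_{n,t}
X_hat = B @ G                   # modelled magnitude spectrogram, shape (K, T)
```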

2 Proposed signal model

A problem with the model (1) is that the mixture signal is represented as a sum of fixed spectra: each pitch value of each instrument requires a distinct basis function. The large number of free parameters makes the estimation less reliable, and clustering the estimated bases to sound sources is difficult.

In the proposed signal model, each basis $b_n(k)$ is described as a product of the magnitude spectra of an excitation (source) $e_i(k)$ and a filter $h_j(k)$. This leads to the model

$$\hat{x}_t(k) = \sum_{i,j} g_{i,j,t} e_i(k) h_j(k) \qquad (2)$$

which assigns one excitation per pitch value and one filter per instrument (or phoneme in singing). A polyphonic signal consists of several excitation and filter combinations occurring simultaneously or in sequence. We denote the number of excitations by $I$ and the number of filters by $J$. Note that the number of bases, $I \times J$, is significantly larger than the number of excitations or filters. As a consequence, the number of parameters to estimate is smaller (the bases are restricted to $b_n(k) = e_i(k) h_j(k)$), which makes the estimation more reliable. Also, the proposed signal model associates components having the same timbre (resp. pitch), leading to an automatic clustering of bases to sound sources (resp. musical notes).

In Section 4, we also introduce a signal model which allows the pitch of an individual excitation signal to vary. This makes it more suitable for singing, where pitch values are not quantized and therefore cannot be well represented with a countable set of excitations. In that latter model, a singing signal can be represented with a single pitch-varying excitation and multiple filters (one per phoneme).
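To make the parameter saving concrete, the sketch below (our illustration, with made-up sizes) evaluates model (2) directly: the $I \times J$ effective bases are parameterized by only $K(I + J)$ spectral values plus the gains, rather than $K \cdot I \cdot J$ in the unconstrained model (1).

```python
import numpy as np

# Model (2): x_hat_t(k) = sum_{i,j} g_{i,j,t} e_i(k) h_j(k).
K, T, I, J = 513, 200, 12, 5    # illustrative sizes
E = np.random.rand(K, I)        # excitation spectra e_i(k)
H = np.random.rand(K, J)        # filter magnitude responses h_j(k)
G = np.random.rand(I, J, T)     # gains g_{i,j,t}

# The I*J implicit bases b_n(k) = e_i(k) h_j(k) are described by K*(I + J)
# spectral parameters instead of K*I*J fixed basis spectra.
X_hat = np.einsum('ki,kj,ijt->kt', E, H, G)
```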

3 Non-negative matrix factorization algorithm for parameter estimation

When the bases are magnitude or power spectra, it is natural to restrict them to be entrywise non-negative. Furthermore, the model can be restricted to be purely additive by limiting the gains to be non-negative. NMF estimates the bases and their gains by minimizing the reconstruction error between the observed spectrogram and the model while restricting the parameters to non-negative values. It has turned out that the non-negativity restrictions alone are sufficient for sound source separation [8]. Commonly used measures for the reconstruction error are the Euclidean distance and the divergence $d$, defined as

$$d(x, \hat{x}) = \sum_{k,t} \left[ x_t(k) \log \frac{x_t(k)}{\hat{x}_t(k)} - x_t(k) + \hat{x}_t(k) \right] \qquad (3)$$

The divergence is always non-negative, and zero only when $x_t(k) = \hat{x}_t(k)$ for all $k$ and $t$. It can be minimized, for example, using the multiplicative updates proposed by Lee and Seung [6]: the parameters are initialized to random non-negative values and updated by applying multiplicative update rules iteratively. Each update decreases the value of the divergence, until the algorithm converges.

We propose an augmented NMF algorithm for estimating the parameters of the model (2). Multiplicative updates which minimize the divergence (3) are given by

$$g_{i,j,t} \leftarrow g_{i,j,t} \frac{\sum_k r_t(k)\, e_i(k)\, h_j(k)}{\sum_k e_i(k)\, h_j(k)}, \qquad (4)$$

$$e_i(k) \leftarrow e_i(k) \frac{\sum_{j,t} r_t(k)\, g_{i,j,t}\, h_j(k)}{\sum_{j,t} g_{i,j,t}\, h_j(k)}, \qquad (5)$$

and

$$h_j(k) \leftarrow h_j(k) \frac{\sum_{i,t} r_t(k)\, g_{i,j,t}\, e_i(k)}{\sum_{i,t} g_{i,j,t}\, e_i(k)}, \qquad (6)$$

where $r_t(k) = x_t(k)/\hat{x}_t(k)$ is evaluated using (2) before each update. The overall estimation algorithm is as follows:

1. Choose the number of excitations and filters. Initialize each parameter $g_{i,j,t}$, $e_i(k)$, and $h_j(k)$ with random positive values.
2. Update the gains using (4).
3. Update the excitations using (5).
4. Update the filters using (6).
5. Repeat steps 2-4 until the algorithm converges.

It can be shown that the divergence (3) is non-increasing under each update. When prior knowledge about the sources is available, it can be used to initialize the excitations and filters.
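A minimal NumPy sketch of this algorithm is given below. It implements updates (4)-(6) with $r_t(k) = x_t(k)/\hat{x}_t(k)$ recomputed before each update; the function name, the fixed iteration count in place of a convergence test, and the small epsilon added for numerical safety are our own choices, not part of the paper.

```python
import numpy as np

def model(E, H, G):
    """Model (2): x_hat[k, t] = sum_{i,j} g[i,j,t] * e_i(k) * h_j(k)."""
    return np.einsum('ki,kj,ijt->kt', E, H, G)

def source_filter_nmf(X, I, J, n_iter=200, eps=1e-12, seed=0):
    """Estimate E (K x I), H (K x J), G (I x J x T) from a non-negative
    magnitude spectrogram X (K x T) with the multiplicative updates (4)-(6)."""
    rng = np.random.default_rng(seed)
    K, T = X.shape
    E = rng.random((K, I)) + eps        # excitation spectra e_i(k)
    H = rng.random((K, J)) + eps        # filter responses h_j(k)
    G = rng.random((I, J, T)) + eps     # gains g_{i,j,t}
    for _ in range(n_iter):
        # Gains, update (4); r_t(k) is re-evaluated before each update.
        R = X / (model(E, H, G) + eps)
        G *= (np.einsum('kt,ki,kj->ijt', R, E, H)
              / (np.einsum('ki,kj->ij', E, H)[:, :, None] + eps))
        # Excitations, update (5).
        R = X / (model(E, H, G) + eps)
        E *= (np.einsum('kt,ijt,kj->ki', R, G, H)
              / (np.einsum('ijt,kj->ki', G, H) + eps))
        # Filters, update (6).
        R = X / (model(E, H, G) + eps)
        H *= (np.einsum('kt,ijt,ki->kj', R, G, E)
              / (np.einsum('ijt,ki->kj', G, E) + eps))
    return E, H, G

# Example use: E, H, G = source_filter_nmf(np.abs(S), I=12, J=5)
# where S is a hypothetical K x T STFT magnitude matrix.
```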

4 Representing several pitch values with a single excitation

The linear model (1) requires multiple basis functions to represent tones with different pitch values. This limitation has been addressed, for example, by FitzGerald [4] and Virtanen [12, pp. 57-65], who translated basis functions on a logarithmic frequency axis to produce different fundamental frequency values. In this approach, each gain $g_{n,t}$ in (1) is replaced by a gain $g_{n,t,\tau}$, which denotes the contribution of the $n$th basis function translated by $\tau$ frequency bins. The model can be written as

$$\hat{x}_t(k) = \sum_{n,\tau} g_{n,t,\tau}\, b_n(k - \tau). \qquad (7)$$

In this model the translation affects the entire basis function, i.e., the product of the excitation and the filter, and therefore the filter becomes translated, too. A more realistic model is obtained by producing different fundamental frequency values by translating a single harmonic excitation while keeping the filter fixed. When the spectrum is modeled as a product of excitation $e_i(k)$ and filter $h_j(k)$, the model can be written as

$$\hat{x}_t(k) = \sum_{i,j,\tau} g_{i,j,t,\tau}\, e_i(k - \tau)\, h_j(k). \qquad (8)$$

Parameters of the model (7) have been estimated by algorithms extended from NMF. A similar approach can be used to estimate the parameters of the proposed model (8). Multiplicative updates which minimize the divergence (3) for the model (8) are given by

$$g_{i,j,t,\tau} \leftarrow g_{i,j,t,\tau} \frac{\sum_k r_t(k)\, e_i(k-\tau)\, h_j(k)}{\sum_k e_i(k-\tau)\, h_j(k)}, \qquad (9)$$

$$e_i(k) \leftarrow e_i(k) \frac{\sum_{j,t,z} r_t(k+z)\, g_{i,j,t,z}\, h_j(k+z)}{\sum_{j,t,z} g_{i,j,t,z}\, h_j(k+z)}, \qquad (10)$$

and

$$h_j(k) \leftarrow h_j(k) \frac{\sum_{i,t,z} r_t(k)\, g_{i,j,t,z}\, e_i(k-z)}{\sum_{i,t,z} g_{i,j,t,z}\, e_i(k-z)}. \qquad (11)$$

The overall estimation is similar to the NMF algorithm above: the parameters are initialized with random positive values and updated sequentially using Equations (9)-(11). The model is especially suitable for the singing voice, since only a single excitation is required to model all harmonic tones, and different phonemes can be modeled using different filters.
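As a rough illustration, the sketch below (our own, not from the paper) evaluates the shifted model (8) and applies one pass of the gain update (9); the excitation and filter updates (10)-(11) follow the same pattern with the sums rearranged. Zero-padding at the spectrum edges when translating, and all sizes and names, are implementation choices the paper does not specify.

```python
import numpy as np

def shift(M, tau):
    """Translate rows of M by tau bins: out[k] = M[k - tau], zeros elsewhere."""
    out = np.zeros_like(M)
    if tau >= 0:
        out[tau:] = M[:M.shape[0] - tau]
    else:
        out[:tau] = M[-tau:]
    return out

def model8(E, H, G, taus):
    """Model (8): x_hat[k,t] = sum_{i,j,tau} g[i,j,t,tau] e_i(k-tau) h_j(k)."""
    Es = np.stack([shift(E, tau) for tau in taus])      # shape (Z, K, I)
    return np.einsum('zki,kj,ijtz->kt', Es, H, G)

# Illustrative sizes; a single excitation (I = 1) suffices for harmonic tones,
# and G gains a translation axis of length Z = len(taus).
K, T, I, J = 513, 200, 1, 5
taus = list(range(-24, 25))                             # allowed translations
rng = np.random.default_rng(0)
X = rng.random((K, T))                                  # observed spectrogram
E = rng.random((K, I))
H = rng.random((K, J))
G = rng.random((I, J, T, len(taus)))

# One application of the gain update (9), with r_t(k) = x_t(k) / x_hat_t(k).
eps = 1e-12
R = X / (model8(E, H, G, taus) + eps)
Es = np.stack([shift(E, tau) for tau in taus])
G *= (np.einsum('kt,zki,kj->ijtz', R, Es, H)
      / (np.einsum('zki,kj->ijz', Es, H)[:, :, None, :] + eps))
```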

5 Conclusions

Estimation algorithms were proposed for the two signal models (2) and (8), both of which aim at reducing the number of free parameters needed to represent realistic audio signals. In preliminary experiments with singing signals, both algorithms were found to converge and to find more meaningful bases than conventional NMF using the signal model (1).

References

[1] M. A. Casey and A. Westner. Separation of mixed audio sources by independent subspace analysis. In International Computer Music Conference, Berlin, Germany, 2000.
[2] A. T. Cemgil, B. Kappen, and D. Barber. A generative model for music transcription. IEEE Transactions on Speech and Audio Processing, 13(6), 2005.
[3] M. Davy, S. Godsill, and J. Idier. Bayesian analysis of polyphonic Western tonal music. Journal of the Acoustical Society of America, 119(4):2498-2517, 2006.
[4] D. FitzGerald, M. Cranitch, and E. Coyle. Generalised prior subspace analysis for polyphonic pitch transcription. In Proceedings of the International Conference on Digital Audio Effects, Madrid, Spain, 2005.
[5] A. Klapuri. Analysis of musical instrument sounds by source-filter-decay model. In International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007. Submitted for review.
[6] D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Neural Information Processing Systems, pages 556-562, Denver, USA, 2001.
[7] M. R. Schroeder and B. S. Atal. Code-excited linear prediction (CELP): High-quality speech at very low bit rates. In IEEE ICASSP, pages 937-940, Tampa, Florida, 1985.
[8] P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, 2003.
[9] V. Välimäki, J. Pakarinen, C. Erkut, and M. Karjalainen. Discrete-time modelling of musical instruments. Reports on Progress in Physics, 69:1-78, 2006.
[10] T. Virtanen. Sound source separation using sparse coding with temporal continuity objective. In International Computer Music Conference, Singapore, 2003.
[11] T. Virtanen. Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 2006. Accepted for publication.
[12] T. Virtanen. Sound Source Separation in Monaural Music Signals. PhD thesis, Tampere University of Technology, 2006. Available at http://www.cs.tut.fi/~tuomasv.