
Single Channel Signal Separation Using MAP-based Subspace Decomposition

Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh

Spoken Language Laboratory, Department of Computer Science, KAIST, 373-1 Gusong-dong, Usong-gu, Daejon 305-701, South Korea. Phone: +82-42-869-3556, Fax: +82-42-869-3510. Email: {jangbal,yhoh}@speech.kaist.ac.kr

Institute for Neural Computation, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093-0523, USA. Email: tewon@ucsd.edu

An algorithm for single channel signal separation is presented. The algorithm projects the observed signal onto given subspaces and recovers the original sources by probabilistically weighting and recombining the subspace signals. Results of separating mixtures of two different natural sounds are reported.

Introduction: Extracting multiple source signals from a single channel mixture is a challenging research field with numerous applications. Conventional methods are mostly based on splitting a mixture, observed as a single stream, into different acoustic objects by building an auditory scene analysis system for the acoustic events that occur simultaneously in the same spectro-temporal regions. Recently, Roweis presented a refiltering technique that estimates time-varying masking filters to localize sound streams in a spectro-temporal region [1]. In his work the sources are assumed to be disjoint in the spectrogram, and a binary mask, taking values 0 or 1, divides the mixed streams exclusively and completely. Our work, while motivated by the concept of spectral masking, is free of the assumption that the spectrograms are disjoint. The main novelty of the proposed method is that the masking filters can take any real value in [0, 1], and that the filtering is done in more discriminative, statistically independent subspaces obtained by independent component analysis (ICA). The

algorithm recovers the original auditory streams by maximizing the log likelihood of the separated signals, computed from the pdfs (probability density functions) of the projections onto the subspaces. Empirical observations show that the projection histograms are extremely sparse, and that generalized Gaussian distributions [2] yield a good approximation.

Subspace Decomposition: Consider the monaural separation of a mixture of two signals observed in a single channel, such that the observation is given by

y(t) = x_1(t) + x_2(t),  t ∈ [1, T],  (1)

where x_i(t) is the t-th sample of the i-th source. It is convenient to assume that all sources have zero mean and unit variance. The goal is to recover all x_i(t) given only the single sensor input y(t). The problem is too ill-conditioned to be mathematically tractable, since there are 2T unknowns given only T observations. Our approach, illustrated in Fig. 1, begins by decomposing the mixture signal into N disjoint subspace projections v_k(t), each filtered to contain only energy from a small portion of the whole space:

v_k(t) = P(y(t); w_k, d_k) = Σ_{n=1}^{N} w_{kn} y(t − d_k + n),  (2)

where P is a projection operator, N is the number of subspaces, and w_{kn} is the n-th coefficient of the k-th coordinate vector w_k, whose lag is d_k. Suppose the appropriate subparts of an audio signal lie on a specific subspace over short times. The separation is then equivalent to searching for subspaces that are close to the individual source signals. More generally, the projection u_{ik}(t) of source i onto subspace k is approximated by modulating the mixed projections v_k(t):

u_{1k}(t) = λ_k v_k(t),  u_{2k}(t) = (1 − λ_k) v_k(t),  (3)

where the latent variable λ_k is a weight on the projection onto subspace k, fixed over time. We can adapt the weights to bring projections in and out of a source as needed. The original sources x_i(t) are then reconstructed by recombining {u_{ik}(t) | k = 1, ..., N} and performing the inverse transform of the projection.

Proper choices of the weights λ_k enable the isolation of a single source from the input signal and the suppression of all other sources and background noises. A set of subspaces that effectively splits independent streams is essential to the success of the separation algorithm. Fig. 2 shows an example of desired subspaces. The two ellipses represent two different source distributions, whose energy concentrations are directed by the arrows. If we project the

mixture onto the arrows (one-dimensional subspaces), the original sources can be recovered with error minimized by the principle of orthogonality. To obtain an optimal basis we adopt ICA, which adapts the basis vectors so that the resulting coordinates are statistically as independent as possible [3].

Estimating Source Signals: The estimation of λ_k can be accomplished by simply finding the values that maximize the probability of the subspace projections. The success of the separation algorithm depends heavily on how closely the ICA density model captures the true source coefficient density. Histograms of natural sounds reveal that p(u_{ik}) is highly super-Gaussian [3]. We therefore model the underlying distribution of the source coefficients with a generalized Gaussian prior [2], which provides an accurate estimate for symmetric non-Gaussian distributions with a varying degree of normality, in the general form p(u) ∝ exp(−|u|^q). According to eq. (3), the log probability densities of the projections are approximately

log p(u_{1k}) ∝ −|u_{1k}|^{q_{1k}} = −λ_k^{q_{1k}} |v_k|^{q_{1k}},
log p(u_{2k}) ∝ −|u_{2k}|^{q_{2k}} = −(1 − λ_k)^{q_{2k}} |v_k|^{q_{2k}}.  (4)

We define the objective function Ψ_k of subspace k as the sum of the joint log probability density of u_{1k}(t) and u_{2k}(t) over the time axis:

Ψ_k := Σ_t log p(u_{1k}(t), u_{2k}(t)) = −λ_k^{q_{1k}} Σ_t |v_k(t)|^{q_{1k}} − (1 − λ_k)^{q_{2k}} Σ_t |v_k(t)|^{q_{2k}}.  (5)

The problem is equivalent to a constrained maximization over the closed interval [0, 1]: the unique maximizer λ_k lies either at a boundary (0 or 1) or at a local maximum found by Newton's method.

Evaluation: We have tested the performance of the proposed method on single channel mixtures of four different sound types: monaural signals of rock music, jazz music, male speech, and female speech. Audio files for all the experiments are accessible at http://speech.kaist.ac.kr/~jangbal/rbss1el/.
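The decomposition, weighting, and reconstruction of eqs. (2)-(5) can be condensed into a short sketch. This is a simplified, frame-based reading (the lags d_k are taken as frame offsets, and a grid search over [0, 1] stands in for the boundary check and Newton step of the paper); the basis W, the exponents q1 and q2, and all names here are illustrative, not the authors' code.

```python
import numpy as np

def separate_block(frames, W, q1, q2):
    """Weight subspace projections of one mixture block, eqs. (2)-(5).

    frames : (N, T) array, columns are length-N windows of the mixture y(t)
    W      : (N, N) analysis matrix, rows are the learned basis filters w_k
    q1, q2 : (N,) generalized-Gaussian exponents q_1k, q_2k per subspace
    """
    V = W @ frames                                  # projections v_k(t), eq. (2)
    A = (np.abs(V) ** q1[:, None]).sum(axis=1)      # sum_t |v_k(t)|^{q_1k}
    B = (np.abs(V) ** q2[:, None]).sum(axis=1)      # sum_t |v_k(t)|^{q_2k}

    grid = np.linspace(0.0, 1.0, 101)               # includes both boundaries
    lam = np.empty(len(V))
    for k in range(len(V)):
        # objective Psi_k of eq. (5), evaluated on the grid
        psi = -(grid ** q1[k]) * A[k] - ((1.0 - grid) ** q2[k]) * B[k]
        lam[k] = grid[np.argmax(psi)]

    U1 = lam[:, None] * V                           # u_1k(t), eq. (3)
    U2 = (1.0 - lam[:, None]) * V                   # u_2k(t)
    Winv = np.linalg.inv(W)
    return Winv @ U1, Winv @ U2                     # reconstructed source frames
```

By construction the two estimates sum back to the input block, so any energy the weights assign to one source is removed from the other.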
We used different sets of sound signals for generating the mixtures and for learning the ICA subspaces (w_k) and estimating the generalized Gaussian parameters (q_{ik}, variances, etc.) that model the subspace component pdfs. While learning the subspaces, all the training data of source 1 and source 2 are used, so that the statistical properties of both sound sources are reflected in the resulting subspaces. The pdf parameters are estimated separately for each source. The weighting factors are computed block-wise; that is, we chop the input signals into blocks of fixed length and assign different weighting filters to the individual blocks. The weighting filter at each block is computed independently of the other blocks, so the weighting becomes more accurate as the block length shrinks. However, if the block length is too short, the computation becomes unreliable. The optimal block length was 25 ms in our experiments.

From the testing data set, two of the four sources are selected and added sample-by-sample to generate a mixture signal. The proposed separation algorithm was then applied to recover the original sources. Note that the proposed method deals with mixtures of two sources only. Table 1 reports the separation results when the Fourier basis and the learned ICA basis are used. Performance is measured by signal-to-noise ratio (SNR). With learned ICA subspaces, performance improved by more than 1 dB on average compared to the Fourier basis. In terms of sound source types, mixtures containing music were generally recovered more cleanly than the male-female speech mixture, for both bases.

Conclusion: The algorithm can separate single channel mixture signals of two different sound sources. The original source signals are recovered by projecting the input mixture onto the given subspaces, modulating the projections, and recombining the projected signals. The subspaces learned by the ICA algorithm achieve good separation performance. Experimental results showed successful separation of simulated mixtures of rock and jazz music and of male and female speech signals. The proposed method has additional potential applications, including suppression of environmental noise for communication systems and hearing aids, enhancement of corrupted recordings, and preprocessing for speech recognition systems.

References
[1] S. T. Roweis, "One microphone source separation," Advances in Neural Information Processing Systems, vol. 13, pp. 793-799, 2001.
[2] T.-W. Lee and M. S. Lewicki, "The generalized Gaussian mixture model using ICA," in International Workshop on Independent Component Analysis (ICA '00), Helsinki, Finland, pp. 239-244, June 2000.
[3] A. J. Bell and T. J. Sejnowski, "Learning the higher-order structures of a natural sound," Network: Computation in Neural Systems, vol. 7, pp. 261-266, July 1996.
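The evaluation above fits generalized Gaussian parameters to the training projections; the paper follows Lee and Lewicki [2] for this. As a rough stand-in, the exponent q can also be matched from moments, using the fact that E|u| / sqrt(E u^2) equals Γ(2/q) / sqrt(Γ(1/q) Γ(3/q)) under the density p(u) ∝ exp(−|u|^q). The helper below is a moment-matching sketch under that assumption, not the authors' estimator.

```python
import numpy as np
from scipy.special import gammaln

def fit_ggd_exponent(u, lo=0.1, hi=4.0, iters=60):
    """Moment-matching estimate of the generalized-Gaussian exponent q.

    Matches the sample ratio E|u| / sqrt(E u^2) to its closed form
    Gamma(2/q) / sqrt(Gamma(1/q) * Gamma(3/q)); q = 2 is Gaussian,
    q = 1 is Laplacian, q < 1 is sparser (super-Gaussian).
    """
    u = np.asarray(u, dtype=float)
    target = np.mean(np.abs(u)) / np.sqrt(np.mean(u ** 2))

    def ratio(q):
        # computed in log space via gammaln for numerical stability
        return np.exp(gammaln(2.0 / q) - 0.5 * (gammaln(1.0 / q) + gammaln(3.0 / q)))

    # ratio(q) increases monotonically with q, so bisection recovers q
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```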

Figure 1: Block diagram of subspace weighting. (A) The input signal y(t) is projected onto N one-dimensional subspaces. (B) The projections v_k(t) are modulated by weighting factors λ_k, 1 − λ_k ∈ [0, 1]. (C) The separation process terminates by summing the N modulated signals.

Figure 2: Illustration of desired subspaces. The ellipses represent the distributions of two different classes. The arrows are the one-dimensional subspaces along the maximum energy concentrations of the classes. Projecting the mixtures onto each subspace provides minimum-error separation.

Table 1: Computed SNRs (dB) of the separation results. The column headers give the symbols of the two sources mixed into the input; R, J, M, F stand for rock music, jazz music, male speech, and female speech. The last column is the average SNR. Audio files for all the results are accessible at http://speech.kaist.ac.kr/~jangbal/rbss1el/.

basis     RJ    RM    RF    JM    JF    MF    Avg.
Fourier   8.3   3.4   5.3   7.3   6.4   3.8   5.74
ICA      10.2   4.6   6.1   8.2   6.7   5.1   6.82
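The SNR figures of Table 1 can be reproduced for any estimate with a small helper. The text does not spell out the exact formula, so the standard definition (source energy over residual energy, in dB) is assumed here.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB of an estimated source against the true source,
    treating the estimation error as noise."""
    reference = np.asarray(reference, dtype=float)
    noise = reference - np.asarray(estimate, dtype=float)
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))
```

For example, an estimate with a constant error of 0.1 against a unit-amplitude reference scores exactly 20 dB.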