FUNDAMENTAL LIMITATION OF FREQUENCY DOMAIN BLIND SOURCE SEPARATION FOR CONVOLVED MIXTURE OF SPEECH

Shoko Araki, Shoji Makino, Ryo Mukai (NTT Communication Science Laboratories, 2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan; Email: shoko@cslab.kecl.ntt.co.jp)
Tsuyoki Nishikawa, Hiroshi Saruwatari (Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara, Japan)

ABSTRACT

Despite several recent proposals to achieve blind source separation (BSS) for realistic acoustic signals, the separation performance is still not good enough. In particular, when the impulse responses are long, the performance is highly limited. In this paper, we consider the reason for the poor performance of BSS in an environment with long reverberation. First, we show that it is useless to be constrained by the condition P <= T, where T is the frame size of the FFT and P is the length of the room impulse response. We also discuss the limitation of frequency domain BSS by showing that the frequency domain BSS framework is equivalent to two sets of frequency domain adaptive beamformers.

1. INTRODUCTION

Blind source separation (BSS) is an approach for estimating original source signals s_i(t) using only the information of the mixed signals x_j(t) observed at each input channel. This technique is applicable to the realization of noise-robust speech recognition and high-quality hands-free telecommunication systems. It may also become a cue for auditory scene analysis. To achieve BSS of convolutive mixtures, several methods have been proposed [1, 2]. In this paper, we consider the BSS of convolutive mixtures of speech in the frequency domain [3, 4], for the sake of mathematical simplicity and reduced computational complexity. There have been many proposals to achieve BSS in a realistic room environment; however, the separation performance is still not good enough. In this paper, we consider the reason for the poor performance of BSS in an environment with long reverberation.
First, we discuss the frame size of the FFT used in frequency domain BSS. It is commonly believed that the frame size T must satisfy P <= T to estimate an unmixing matrix for a P-point room impulse response [5, 6]. We point out that this is not the case for BSS, and show that a smaller frame size is much better, even for long room reverberation. Next, we discuss the limitation of frequency domain BSS by showing the equivalence between the frequency domain BSS framework and two sets of beamformers.

[Figure 1: BSS system configuration: sources S1 and S2, mixing system H, two microphones x1 and x2, and unmixing system W with outputs Y1 and Y2.]

To date, signal separation using a noise cancellation framework with signal leakage into the noise reference has been discussed in [7, 8]. In these papers, it is shown that the least squares criterion is equivalent to the decorrelation criterion of a noise-free signal estimate and a signal-free noise estimate. Inspired by their discussions, but apart from the noise cancellation framework, we attempt to view the frequency domain BSS problem in the framework of a frequency domain adaptive microphone array, i.e., an adaptive beamformer (ABF).

2. FREQUENCY DOMAIN BSS OF CONVOLUTIVE MIXTURES OF SPEECH

The signals recorded by M microphones are given by

x_j(n) = Σ_{i=1}^{N} Σ_{p=1}^{P} h_ji(p) s_i(n − p + 1)   (j = 1, ..., M),   (1)

where s_i is the source signal from source i, x_j is the signal received by microphone j, and h_ji is the P-point impulse response from source i to microphone j. In this paper, we consider a two-input, two-output convolutive BSS problem, i.e., N = M = 2 (Fig. 1). The frequency domain approach transforms the convolutive mixture into an instantaneous BSS problem in each frequency bin [3, 4]. Using a T-point short-time Fourier transform for (1), we obtain

X(ω, m) = H(ω) S(ω, m).   (2)
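As a sanity check on the convolutive mixing model and its per-bin instantaneous approximation X(ω, m) ≈ H(ω)S(ω, m) described above, the following numpy sketch (not from the paper; the filter length, frame sizes, and random signals are illustrative assumptions) mixes two sources with short random impulse responses and measures how well a single rectangular-windowed frame obeys the per-bin model:

```python
import numpy as np

rng = np.random.default_rng(0)
P = 8                                      # impulse response length in taps (illustrative)
n = 4096
s = rng.standard_normal((2, n))            # two source signals s_i(n)
h = 0.5 * rng.standard_normal((2, 2, P))   # impulse responses h_ji(p)

# Convolutive mixing as in eq. (1)
x = np.zeros((2, n))
for j in range(2):
    for i in range(2):
        x[j] += np.convolve(s[i], h[j, i])[:n]

def model_error(T, n0=2048):
    """Relative error of the per-bin model X ~= H S for frame size T."""
    S = np.fft.rfft(s[:, n0:n0 + T], axis=1)   # S(omega, m) for one frame
    X = np.fft.rfft(x[:, n0:n0 + T], axis=1)   # X(omega, m)
    Hf = np.fft.rfft(h, n=T, axis=2)           # H(omega), shape (2, 2, T//2+1)
    X_model = np.einsum('jiw,iw->jw', Hf, S)   # per-bin product H(omega) S(omega)
    return np.linalg.norm(X - X_model) / np.linalg.norm(X)

e_short, e_long = model_error(16), model_error(1024)
print(e_short, e_long)
```

The residual comes from the frame edges (linear versus circular convolution), so it shrinks as T grows relative to P; this is the sense in which the frame must "cover" the impulse response.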

[Figure 2: Layout of the room used in the experiments (two microphones and two loudspeakers).]

where S(ω, m) = [S1(ω, m), S2(ω, m)]^T. We assume that the 2×2 mixing matrix H(ω) is invertible and that H_ji(ω) ≠ 0. The unmixing process can be formulated in each frequency bin ω as

Y(ω, m) = W(ω) X(ω, m),   (3)

where X(ω, m) = [X1(ω, m), X2(ω, m)]^T is the observed signal at frequency bin ω, Y(ω, m) = [Y1(ω, m), Y2(ω, m)]^T is the estimated source signal, and W(ω) represents a 2×2 unmixing matrix. W(ω) is determined so that Y1(ω, m) and Y2(ω, m) become mutually independent. The above calculation is carried out in each frequency bin independently.

3. EXPERIMENTS

It is commonly believed that the frame size T must satisfy P <= T to estimate an unmixing matrix for a P-point room impulse response [5, 6]. In this section, we investigate this point for BSS, and show that there is an optimal frame size for BSS [9].

3.1. Conditions for experiments

3.1.1. Learning algorithm

For the calculation of the unmixing matrix W(ω) in (3), we use an algorithm based on the minimization of the Kullback-Leibler divergence [3, 4]. The optimal W(ω) is obtained by using the following iterative equation:

W_{i+1}(ω) = W_i(ω) + η [ diag(⟨Φ(Y)Y^H⟩) − ⟨Φ(Y)Y^H⟩ ] W_i(ω),   (4)

where Y = Y(ω, m), ⟨·⟩ denotes the averaging operator, i expresses the value at the i-th step of the iteration, and η is the step size parameter. In addition, we define the nonlinear function Φ(·) as

Φ(Y) = 1 / (1 + exp(−Y^(R))) + j · 1 / (1 + exp(−Y^(I))),   (5)

where Y^(R) and Y^(I) are the real part and the imaginary part of Y, respectively.

[Figure 3: An example of a measured impulse response h used in the experiments (T_R = 300 ms).]

3.1.2. Conditions for experiments

Separation experiments were conducted using speech data convolved with impulse responses recorded in three environments with different reverberation times: T_R = 0 ms, 150 ms (P = 1200), and 300 ms (P = 2400).
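The KL-divergence-based update rule with the split-complex sigmoid Φ described above can be sketched for a single frequency bin as follows (illustrative, not the authors' code; the Laplacian source model, mixing matrix, step size, and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5000                                   # number of frames in this bin (illustrative)

# Two complex, non-Gaussian per-bin "sources" (assumed Laplacian real/imag parts)
S = rng.laplace(size=(2, m)) + 1j * rng.laplace(size=(2, m))
H = np.array([[1.0, 0.5 + 0.2j],
              [0.3 - 0.1j, 1.0]])          # assumed mixing matrix for this bin
X = H @ S

def phi(Y):
    # Split-complex sigmoid nonlinearity, eq. (5)
    return 1 / (1 + np.exp(-Y.real)) + 1j / (1 + np.exp(-Y.imag))

def offdiag_norm(C):
    return np.abs(C[0, 1]) + np.abs(C[1, 0])

W = np.eye(2, dtype=complex)
eta = 0.1                                  # assumed step size
C0 = None
for it in range(500):
    Y = W @ X
    C = (phi(Y) @ Y.conj().T) / m          # <Phi(Y) Y^H>
    if C0 is None:
        C0 = C                             # nonlinear correlation before learning
    W = W + eta * (np.diag(np.diag(C)) - C) @ W   # update of eq. (4)

Y = W @ X
C = (phi(Y) @ Y.conj().T) / m
print(offdiag_norm(C0), offdiag_norm(C))
```

At the fixed point of (4), the off-diagonal entries of ⟨Φ(Y)Y^H⟩ vanish, so the printed value should shrink substantially over the iterations.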
The layout of the room used to measure the impulse responses is shown in Fig. 2. We used a two-element array with an inter-element spacing of 4 cm. The speech signals arrived from two different directions. An example of a measured room impulse response used in our experiments is shown in Fig. 3. In these experiments, we changed the frame size T from 32 to 2048 and investigated the performance under each condition. The sampling rate was 8 kHz, the frame shift was half the frame size, and the analysis window was a Hamming window. To solve the permutation problem, we used the blind beamforming algorithm proposed by Kurita et al. [10].

3.1.3. Evaluation measure

In order to evaluate the performance for different frame sizes T with different reverberation times T_R, we used the noise reduction rate (NRR), defined as the output signal-to-noise ratio (SNR) in dB minus the input SNR in dB:

NRR_i = SNR_Oi − SNR_Ii,
SNR_Oi = 10 log10 [ Σ_ω |A_ii(ω) S_i(ω)|² / Σ_ω |A_ij(ω) S_j(ω)|² ],   (6)
SNR_Ii = 10 log10 [ Σ_ω |H_ii(ω) S_i(ω)|² / Σ_ω |H_ij(ω) S_j(ω)|² ],   (7)

where A(ω) = W(ω)H(ω) and i ≠ j. These values were averaged over all six combinations of speakers, and NRR1 and NRR2 were averaged for the sake of convenience.
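The NRR of (6)-(7) can be sketched as follows (illustrative numpy code; the random H(ω), the near-inverse W(ω), and the source spectra are assumptions, not the experimental data):

```python
import numpy as np

rng = np.random.default_rng(2)
n_bins = 257

# Illustrative random 2x2 mixing H(w) per bin and a slightly perturbed inverse W(w)
H = rng.standard_normal((n_bins, 2, 2)) + 1j * rng.standard_normal((n_bins, 2, 2))
W = np.linalg.inv(H) + 0.05 * (rng.standard_normal((n_bins, 2, 2))
                               + 1j * rng.standard_normal((n_bins, 2, 2)))
S = rng.standard_normal((n_bins, 2)) + 1j * rng.standard_normal((n_bins, 2))

A = W @ H                                   # A(w) = W(w) H(w), per bin

def snr_pair(G, S, i):
    # SNR of source i through system G, eqs. (6)-(7): target power over cross-talk power
    j = 1 - i
    num = np.sum(np.abs(G[:, i, i] * S[:, i]) ** 2)
    den = np.sum(np.abs(G[:, i, j] * S[:, j]) ** 2)
    return 10 * np.log10(num / den)

# NRR_i = output SNR (through A = WH) minus input SNR (through H)
NRR = [snr_pair(A, S, i) - snr_pair(H, S, i) for i in (0, 1)]
print(np.mean(NRR))
```

Since W is close to H^{-1}, A is close to the identity and the averaged NRR comes out positive, i.e., the unmixing system reduces cross-talk relative to the microphones.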

[Figure 4: NRR for different frame sizes T at (a) T_R = 0 ms, (b) T_R = 150 ms, and (c) T_R = 300 ms. The solid lines are for 8 seconds of learning data, and the dotted lines are for 3 seconds. Separation was executed on the 8-second data.]

3.2. Experimental results

The experimental results are shown in Fig. 4. The mixed speech signals were each about eight seconds long. We used the beginning three seconds (dotted lines) or the entire eight seconds (solid lines) of the mixed data for learning according to (4), and the entire eight seconds of data for separation. In the case of three seconds of learning, the maximum NRR was obtained at T = 128 in the non-reverberant tests [Fig. 4(a)]. In the reverberant tests, the maximum NRR was obtained at T = 512 both when T_R = 150 ms [Fig. 4(b)] and when T_R = 300 ms [Fig. 4(c)]. A short frame was found to function far better than a long frame, even for long room reverberation. For the longer learning data, i.e., the eight-second data, the results were slightly different. In this case, we obtained better separation performance than in the three-second learning case. Furthermore, in comparison with the three-second learning case, the peak performance appeared at a longer frame size T. With an overly long frame size, however, the performance became poor even with the longer learning data. Even for long room reverberation, the condition P <= T is useless, and a shorter frame size is better.

[Figure 5: Relationship between the frame size T and the correlation coefficient J(T) for Y1 and Y2, S1 and S2, and X1 and X2. The solid lines are for 8 seconds of data, and the dotted lines are for 3 seconds. T_R = 300 ms.]

4. DISCUSSION

4.1. Optimum frame size for frequency domain BSS

In the previous section, we showed that a longer frame size fails. In this section, we discuss the reason why both a short frame and a long frame fail.
In the frequency domain BSS framework, the signal we can use is not x(n) but X(ω, m). If the frame size T is long, the number of data samples in each frequency bin becomes small. This causes assumptions to collapse, such as the zero-mean assumption. As an example of such a collapse, we observed that the independence of the two original signals decreased when the frame size became longer. Figure 5 shows the relationship between the frame size and the correlation coefficient

J(T) = (1/T) Σ_ω |r_ω|,   (8)

where

r_ω = | Σ_m (U1(ω, m) − Ū1(ω)) (U2(ω, m) − Ū2(ω))* | / √[ Σ_m |U1(ω, m) − Ū1(ω)|² · Σ_m |U2(ω, m) − Ū2(ω)|² ],   (9)

Ū(ω) denotes a mean value, and U is the original signal S, the observed signal X, or the separated signal Y. Although a correlation coefficient does not show independence directly, we use this measure as an index of independence. As Fig. 5 shows, the independence decreases as the frame size becomes longer. In these cases, because the number of frames per bin becomes smaller, the assumption of independence does not hold for the two original sources. This is a reason why a long frame fails. On the other hand, if we use a short frame, the frame cannot cover the reverberation; therefore, the separation performance is limited. In frequency domain BSS, the optimum frame size is decided by the trade-off between maintaining the assumption of independence and covering the whole reverberation.

4.2. Length of learning data and separation performance

In 3.2, we obtained better performance using eight seconds of data than using three seconds of data. The reason for this result is also explained by Fig. 5. With eight seconds of data, independence is better maintained than with three seconds of data; therefore, we can obtain better performance with the former. Furthermore, the optimum frame size changes when we use learning data of different lengths, because the optimum frame size is decided by the trade-off mentioned in 4.1.

5. EQUIVALENCE BETWEEN FREQUENCY DOMAIN BSS AND FREQUENCY DOMAIN ABF

In this section, in order to discuss the fundamental limitation of frequency domain BSS, we show that frequency domain BSS is equivalent to two sets of frequency domain adaptive beamformers (ABF). From the equivalence between BSS and ABF, we can conclude that the performance of BSS is upper bounded by that of ABF [11].

5.1. Frequency domain BSS of convolutive mixtures using Second Order Statistics (SOS)

In this section, we use a second order statistics (SOS) BSS algorithm for convenience. It is well known that a decorrelation criterion alone is insufficient to solve the problem. In [8], however, it is pointed out that nonstationary signals can provide enough additional information to estimate all W_ij. Some authors have utilized SOS for convolutively mixed speech signals [5, 6]. The source signals S1(ω, m) and S2(ω, m) are assumed to be zero mean and mutually uncorrelated:

R_S(ω, k) = (1/M) Σ_{m=1}^{M} S(ω, Mk + m) S^H(ω, Mk + m) = Λ_s(ω, k),   (10)

where ^H denotes the conjugate transpose, and Λ_s(ω, k) is a different diagonal matrix for each block k.
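The correlation measure of (8)-(9) can be reproduced in outline (an illustrative sketch with i.i.d. Laplacian "sources" in place of speech; the data length and frame sizes are assumptions): for a fixed data length, longer frames leave fewer observations m per bin, so the spurious per-bin correlation between two independent signals grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8 * 8000                  # 8 s of data at 8 kHz (illustrative)
s1 = rng.laplace(size=n)      # stand-ins for two independent sources
s2 = rng.laplace(size=n)

def J(T):
    """Mean magnitude of the per-bin sample correlation coefficient,
    as in eqs. (8)-(9), using non-overlapping T-point frames."""
    m = n // T                                           # frames per bin
    S1 = np.fft.rfft(s1[:m * T].reshape(m, T), axis=1)   # frames x bins
    S2 = np.fft.rfft(s2[:m * T].reshape(m, T), axis=1)
    U1 = S1 - S1.mean(axis=0)                            # remove per-bin mean
    U2 = S2 - S2.mean(axis=0)
    num = np.abs(np.sum(U1 * U2.conj(), axis=0))
    den = np.sqrt(np.sum(np.abs(U1) ** 2, axis=0) * np.sum(np.abs(U2) ** 2, axis=0))
    return float(np.mean(num / den))

for T in (64, 256, 1024, 4096):
    print(T, J(T))
```

Even for truly independent inputs, the printed correlation rises with T, mirroring the upward trend of Fig. 5 and the trade-off discussed in 4.1.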
In order to determine W(ω) so that Y1(ω, m) and Y2(ω, m) become mutually uncorrelated, we seek a W(ω) that diagonalizes the covariance matrices R_Y(ω, k) simultaneously for all k:

R_Y(ω, k) = W(ω) R_X(ω, k) W^H(ω) = W(ω) H(ω) Λ_s(ω, k) H^H(ω) W^H(ω) = Λ_c(ω, k),   (11)

where R_X is the covariance matrix of X(ω),

R_X(ω, k) = (1/M) Σ_{m=1}^{M} X(ω, Mk + m) X^H(ω, Mk + m),   (12)

and Λ_c(ω, k) is an arbitrary diagonal matrix.

[Figure 6: Two sets of ABF system configurations: (a) ABF for a target S1 and a jammer S2; (b) ABF for a target S2 and a jammer S1.]

The diagonalization of R_Y(ω, k) can be written as an overdetermined least-squares problem:

arg min_{W(ω)} Σ_k ‖ off-diag[ W(ω) R_X(ω, k) W^H(ω) ] ‖²
s.t. diag[ W(ω) R_X(ω, k) W^H(ω) ] ≠ 0, for all k,   (13)

where ‖x‖² is the squared Frobenius norm.

5.2. Frequency domain adaptive beamformer

Here, we consider frequency domain adaptive beamformers (ABF), which can remove jammer signals. Since our aim is to separate the two signals S1 and S2 with two microphones, two sets of ABF are used (Fig. 6). Note that an ABF can be adapted only when a jammer exists but a target does not exist.

5.2.1. ABF null towards S2

First, we consider the case of target S1 and jammer S2 [Fig. 6(a)]. When target S1 = 0, the output Y1(ω, m) is expressed as

Y1(ω, m) = W(ω) X(ω, m),   (14)

where W(ω) = [W11(ω), W12(ω)] and X(ω, m) = [X1(ω, m), X2(ω, m)]^T. To minimize jammer S2(ω, m) in output Y1(ω, m) when target S1 = 0, the mean square error J(ω) is introduced as

J(ω) = E[|Y1(ω, m)|²] = W(ω) E[X(ω, m) X^H(ω, m)] W^H(ω) = W(ω) R(ω) W^H(ω),   (15)

where E[·] is the expectation and

R(ω) = E [ X1(ω, m)X1*(ω, m)   X2(ω, m)X1*(ω, m)
           X1(ω, m)X2*(ω, m)   X2(ω, m)X2*(ω, m) ].   (16)

By differentiating cost function J(ω) with respect to W and setting the gradient equal to zero,

∂J(ω)/∂W = W R = 0,   (17)

we obtain the equation to solve as follows [(ω, m), etc., are omitted for convenience]:

E [ X1X1*   X2X1*  ] [ W11 ]   [ 0 ]
  [ X1X2*   X2X2*  ] [ W12 ] = [ 0 ].   (18)

Using X1 = H12 S2 and X2 = H22 S2, we get

W11 H12 + W12 H22 = 0.   (19)

With (19) only, we have the trivial solution W11 = W12 = 0. Therefore, an additional constraint should be added to ensure target signal S1 in output Y1. With this constraint, output Y1 is expressed as

Y1 = W11 X1 + W12 X2 = (W11 H11 + W12 H21) S1 = c1 S1,   (20)

which leads to

W11 H11 + W12 H21 = c1,   (21)

where c1 is an arbitrary complex constant. Since H11 and H21 are unknown, the minimization of (15) with adaptive filters W11 and W12 is used to derive (19) with constraint (21). This means that the ABF solution is derived from simultaneous equations (19) and (21).

5.2.2. ABF null towards S1

Similarly, for target S2, jammer S1, and output Y2 [Fig. 6(b)], we obtain

W21 H11 + W22 H21 = 0,   (22)
W21 H12 + W22 H22 = c2.   (23)

5.2.3. Two sets of ABF

By combining (19), (21), (22), and (23), the simultaneous equations for the two sets of ABF are summarized as

[ W11  W12 ] [ H11  H12 ]   [ c1   0 ]
[ W21  W22 ] [ H21  H22 ] = [ 0   c2 ].   (24)

5.3. Equivalence between BSS and ABF

As we showed in (13), the SOS BSS algorithm works to minimize the off-diagonal components in

E [ Y1Y1*   Y1Y2* ]
  [ Y2Y1*   Y2Y2* ]   (25)

[see (11)]. Using H and W, outputs Y1 and Y2 are expressed in each frequency bin as

Y1 = a S1 + b S2,   (26)
Y2 = c S1 + d S2,   (27)

where

[ a  b ]   [ W11  W12 ] [ H11  H12 ]
[ c  d ] = [ W21  W22 ] [ H21  H22 ].   (28)

5.3.1. When S1 ≠ 0 and S2 ≠ 0

We now analyze what is going on in the BSS framework. After convergence, the expectation of the off-diagonal component E[Y1Y2*] is expressed as

E[Y1Y2*] = ad* E[S1S2*] + bc* E[S2S1*] + ( ac* E[|S1|²] + bd* E[|S2|²] ).   (29)

Since S1 and S2 are assumed to be uncorrelated, the first term and the second term become zero. Then, the BSS adaptation should drive the third term of (29) to zero. By squaring the third term and setting it equal to zero,

| ac* E[|S1|²] + bd* E[|S2|²] |² = |ac*|² (E[|S1|²])² + 2 Re(a b* c* d) E[|S1|²] E[|S2|²] + |bd*|² (E[|S2|²])² = 0.   (30)

Since the sources are nonstationary and (30) must hold for every block, (30) is equivalent to

ac* = 0,  bd* = 0.   (31)

CASE 1: a = c1, c = 0, b = 0, d = c2:

[ W11  W12 ] [ H11  H12 ]   [ c1   0 ]
[ W21  W22 ] [ H21  H22 ] = [ 0   c2 ].   (32)

This equation is exactly the same as that of the ABF (24).

CASE 2: a = 0, c = c2, b = c1, d = 0:

[ W11  W12 ] [ H11  H12 ]   [ 0   c1 ]
[ W21  W22 ] [ H21  H22 ] = [ c2   0 ].   (33)

This equation leads to the permutation solution, i.e., Y1 = c1 S2 and Y2 = c2 S1. Note that the undesirable solutions (i.e., a = 0, c = c2, b = 0, d = c2 and a = c1, c = 0, b = c1, d = 0) do not appear, since we assume that H(ω) is invertible and H_ji(ω) ≠ 0. If the uncorrelatedness assumption between S1(ω) and S2(ω) collapses, the first and second terms of (29) become bias noise for obtaining the correct coefficients a, b, c, d.

5.3.2. When S1 ≠ 0 and S2 = 0

BSS can adapt even if there is only one active source. In this case, only one set of ABF is achieved.
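The CASE 1 solution above (W H diagonal) can be checked numerically: any such W satisfies the two ABF null constraints and the two target-preserving constraints exactly. A minimal sketch (the random H and the constants c1, c2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
# One frequency bin: a random invertible 2x2 mixing matrix H(omega)
H = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))

# Build a CASE 1 unmixing matrix: W H = diag(c1, c2)
c = np.array([0.7 + 0.1j, -0.3 + 0.9j])
W = np.diag(c) @ np.linalg.inv(H)

G = W @ H
null_1 = abs(W[0, 0] * H[0, 1] + W[0, 1] * H[1, 1])   # null towards S2 in Y1
null_2 = abs(W[1, 0] * H[0, 0] + W[1, 1] * H[1, 0])   # null towards S1 in Y2
keep_1 = abs(G[0, 0] - c[0])                          # target S1 kept in Y1
keep_2 = abs(G[1, 1] - c[1])                          # target S2 kept in Y2
print(null_1, null_2, keep_1, keep_2)
```

All four quantities are zero up to floating-point error, confirming that the diagonalizing BSS solution and the pair of ABF solutions coincide.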

[Figure 7: Directivity patterns (a) obtained by an NBF, (b) obtained by BSS (T_R = 0 ms), (c) obtained by BSS (T_R = 150 ms), and (d) obtained by BSS (T_R = 300 ms). DOA means direction of arrival. T = 256, three seconds of learning.]

5.4. Fundamental limitation of frequency domain BSS

Frequency domain BSS and frequency domain ABF are shown to be equivalent [see (24) and (32)] if the independence assumption ideally holds [see (29)]. With two microphones, we can form only one null, towards the jammer. Figure 7 shows directivity patterns obtained by a null beamformer (NBF) and by BSS; Fig. 7(a) shows a directivity pattern obtained by an NBF that forms a steep null towards the jammer under the assumption that the jammer's direction is known. Figures 7(b), (c), and (d) are drawn from the W obtained by BSS: (b) for T_R = 0 ms, (c) for T_R = 150 ms, and (d) for T_R = 300 ms. When T_R = 0 ms, a sharp null is obtained, as with an NBF. When T_R is long, the directivity pattern is comparatively duller; however, a directivity pattern can still be drawn. Although BSS and ABF can reduce reverberant sounds to some extent [12], they mainly remove the sound from the jammer direction. This understanding clearly explains the poor performance of BSS in a real room with long reverberation. Moreover, as we showed in Section 3, a long frame size works poorly in frequency domain BSS for speech data of a few seconds. This is because when we use a long frame, the assumption of independence between S1(ω) and S2(ω) does not hold in each frequency bin, owing to the small number of data in each bin. Therefore, the performance of BSS is upper bounded by that of ABF. Our discussion here is essentially also true for BSS with Higher Order Statistics (HOS), and will shortly be extended to that case.

6. CONCLUSIONS

In this paper, we discussed why the separation performance of BSS is poor when there is long reverberation. First, we showed that it is useless to be constrained by the condition P <= T, where T is the frame size of the FFT and P is the length of the room impulse response. This is because the lack of data causes the collapse of the assumption of independence between the two original signals in each frequency bin when the data length is short or a long frame size is used. Next, we showed that frequency domain BSS is equivalent to two sets of frequency domain ABF. Because ABF (and BSS) mainly deals with the direct sound by making a null towards the jammer direction, the separation performance is fundamentally limited. This understanding clearly explains the poor performance of BSS in a real room with long reverberation.

ACKNOWLEDGEMENTS

We would like to thank Dr. Shigeru Katagiri for his continuous encouragement.

REFERENCES

[1] A. J. Bell and T. J. Sejnowski, An information-maximization approach to blind separation and blind deconvolution, Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.
[2] S. Haykin, Unsupervised Adaptive Filtering. John Wiley & Sons, 2000.
[3] S. Ikeda and N. Murata, A method of ICA in time-frequency domain, Proc. ICA99, pp. 365-371, Jan. 1999.
[4] P. Smaragdis, Blind separation of convolved mixtures in the frequency domain, Neurocomputing, vol. 22, pp. 21-34, 1998.
[5] L. Parra and C. Spence, Convolutive blind separation of non-stationary sources, IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 320-327, May 2000.
[6] M. Z. Ikram and D. R. Morgan, Exploring permutation inconsistency in blind separation of speech signals in a reverberant environment, Proc. ICASSP 2000, June 2000.
[7] S. Van Gerven and D. Van Compernolle, Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness, IEEE Trans. Signal Processing, vol. 43, no. 7, pp. 1602-1612, July 1995.
[8] E. Weinstein, M. Feder, and A. V. Oppenheim, Multi-channel signal separation by decorrelation, IEEE Trans. Speech Audio Processing, vol. 1, no. 4, pp. 405-413, Oct. 1993.
[9] S. Araki, S. Makino, T. Nishikawa, and H. Saruwatari, Fundamental limitation of frequency domain blind source separation for convolutive mixture of speech, Proc. ICASSP 2001, May 2001.
[10] S. Kurita, H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, Evaluation of blind signal separation method using directivity pattern under reverberant conditions, Proc. ICASSP 2000, June 2000.
[11] S. Araki, S. Makino, R. Mukai, and H. Saruwatari, Equivalence between frequency domain blind source separation and frequency domain adaptive null beamformers, Proc. Eurospeech 2001, Sept. 2001.
[12] R. Mukai, S. Araki, and S. Makino, Separation and dereverberation performance of frequency domain blind source separation for speech in a reverberant environment, Proc. Eurospeech 2001, Sept. 2001.