REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca

Size: px

Start display at page:

Download "REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca"

Ashlie Miller
5 years ago
Views:

1 REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research rinceton, NJ ABSTRACT e present a real-time version of the DUET alorithm for the blind separation of any number of sources usin only two mixtures. The method applies when sources are - disjoint orthoonal, that is, when the supports of the windowed Fourier transform of any two sinals in the mixture are disjoint sets, an assumption which is justified in the Appendix. The online alorithm is a Maximum Likelihood (ML) based radient search method that is used to track the mixin parameters. The estimates of the mixin parameters are then used to partition the time-frequency representation of the mixtures to recover the oriinal sources. The technique is valid even in the case when the number of sources is larer than the number of mixtures. The method was tested on mixtures enerated from different voices and noises recorded from varyin anles in both anechoic and echoic rooms. In total, over mixtures were tested. The averae SNR ain of the demixin was db for anechoic room mixtures and db for echoic office mixtures. The alorithm runs times faster than real time on a 7MHz laptop computer. Sample sound files can be found here: srickard/bss.html 1. INTRODUCTION In [1] a blind source separation technique was introduced that allows the separation of an arbitrary number of sources from just two mixtures provided the time-frequency representations of sources do not overlap. The key observation in the technique is that, for mixtures of such sources, each time-frequency point depends on at most one source and its associated mixin parameters. In anechoic environments, it is possible to extract the estimates of the mixin parameters from the ratio of the time-frequency representations of the mixtures. These estimates cluster around the true mixin parameters and, identifyin the clusters, one can partition the time-frequency representation of the mixtures to produce the time-frequency representations of the oriinal sources. The oriinal DUET alorithm involved creatin a twodimensional (weihted) historam of the relative amplitude and delay estimates, findin the peaks in the historam, and then associatin each time-frequency point in the mixture with one peak. The oriinal implementation of the method was offline and passed throuh the data twice; one time to create the historam and a second time to demix. In this paper, we present an online version of the DUET alorithm which avoids the need for the creation of the historam, which in turn avoids the computational load of updatin the historam and the tricky issue of findin and trackin peaks. The online DUET advantaes are, online ( times faster than real time), db averae separation for anechoic mixtures, db averae separation for echoic mixtures, and can demix 2 sources from 2 mixtures. In Section 2 we define the time delay mixin model, explain the concept of -disjoint orthoonality, and describe the mixin parameter trackin procedure. In Section 3 we describe the demixin method used for each alorithm. Section describes the mixin tests and ives detailed demixin results. Justification for the -disjoint orthoonality assumption can be found in the Appendix A and Appendix B contains the ML objective function derivation. 2. MIXING ARAMETER ESTIMATION 2.1. Source mixin Consider measurements of a pair of sensors where only the direct path is present. In this case, without loss of enerality, we can absorb the attenuation and delay parameters of the first mixture,, into the definition of the sources. The two mixtures can thus be expressed as, (1) (2) 61

2 O Q J N : r q " h where is the number of sources, is the arrival delay between the sensors resultin from the anle of arrival, and is a relative attenuation factor correspondin to the ratio of the attenuations of the paths between sources and sensors. e use to denote the maximal possible delay between sensors, and thus, Source Assumptions e call two functions and -disjoint orthoonal if, for a iven windowin function, the supports of the windowed Fourier transforms of and are disjoint. The windowed Fourier transform of is defined, (3) which we will refer to as where appropriate. The -disjoint orthoonalityassumption can be stated concisely,! #" () In Appendix A, we introduce the notion of approximate - disjoint orthoonality. hen!$&, we use the followin property of the Fourier transform, ' (" () e will assume that () holds for all,, even when has finite support[2] Amplitude-Delay Estimation Usin the above assumptions, we can write the model from (1) and (2) for the case with two array elements as, +-/.12 3 )!*,+-/.12 3 ) 9 ::(: 9 *6-/ ; AB ::(: ; C AEDF7GHIKJ. C -/.12 3!L M For -disjoint orthoonal sources, we note that at most one of the sources will be non-zero for a iven, thus, O,,Q 'R for some. (7) The oriinal DUET alorithm estimated the mixin parameters by analyzin the ratio of and. In liht of (7), it is clear that mixin parameter estimates can be obtained via, -S ; -/.12 3 TS *6-/ / *,+-/.12 3 VUX *,+-/ Im YZ [ U *6-/.12 36\]X\ (6) () The oriinal DUET alorithm constructed a 2-D historam of amplitude-delay estimates and looked at the number and location of the peaks in the historam to determine the number of sources and their mixin parameters. See [1, 3] for details. 2.. ML Mixin arameter Gradient Search For the online alorithm, we take a different approach. First, note that, 'R 6 (9) if source is the active source at time-frequency. Moreover, definin, ^ we can see that, a ^!_ 6"6"6" ^ 'R 6 (1) (11) because at least one ^ term will be zero at each frequency. In the Appendix, it is shown that the maximum likelihood estimates of the mixin parameters satisfy, d Be ' Be f f fe d D e ' D a where we have used ^ as shorthand for ^ ^ 6"6"6" ^ (12). e perform radient descent with (12) as the objective function to learn the mixin parameters. In order to avoid the discontinuous nature of the minimum function, we approximate it smoothly as follows, where, i ^ ^ jp Xm6!_F Xm6 ^ 1_ ^ 1_ kal c _ ^ ^ ^ (13) ^ ji ^ ^ (1) E Xm>n B k l c _!_F Xm>no Xm () (" (16) Generalizin (), the smooth ML objective function is, E d Be ' Be f f fe d D e ' D ksl c E Xm>n B _66(_ Xm>n D (17) 62

3 r G R r r S k which has partials, Xm>nR Xm>n E and, -E2 3!_ Im + o R o --- ; 9 3 Re " ; +-# *,+-/.12 3# " # *6-/.12 3# *,+-/.12 3 *>-/.12 A R! 33 : 'R (1) (19) e assume we know the number of sources we are searchin for and initialize an amplitude and delay estimate pair to random values for each source. The estimates $ & $ & for the current time (' (where (' is the time separatin adjacent time windows) are updated based on the previous estimate and the current radient as follows, )$ & $ F & +*-, $ & $ & $ &.*-, )$ & E E (2) (21) where * is a learnin rate constant and, $ & is a time and mixin parameter dependent learnin rate for time index for estimate. In practice, we have found it helpful to adjust the learnin rate dependin on the amount of mixture enery recently explained by the current estimate. e define, Xm>n( d R e 'R e e 1 (2 / )$ & 3 Xm>n( d( e ' e e 1 ( (22) and update the parameter dependent learnin rate as follows,, $ & where 7 is a forettin factor. / )$ & 6 q7 6 / )$ 9:& (23) 3. DEMIXING In order to demix the th source, we construct a time-frequency mask based on the ML parameter estimator (see (B) in the Appendix), ; -/.12 3 Y 9=< - ; T.12 3?> < - ;@ 3BACED +F otherwise (2) The estimate for the time-frequency representation of the th source is, H JI (" (2) e then reconstruct the source usin the appropriate dual window function[]. In this way, we demix all the sources by partitionin the time-frequency representation of one of the mixtures. Note that because the method does not invert the mixin matrix, it can demix all sources even when the number of sources is reater than the number of mixtures ( K ).. TESTS AND SUMMARY e tested the method on mixtures created in both an anechoic room and an echoic office environment. The alorithm used parameters * " 7 " LM 6 and a Hammin window of size 12 samples (with adjacent windows separated by 12 samples) in all the tests. For all tests, the method ran more than times faster than real time. For the anechoic test, the setup is pictured in Fiure 1. Separate recordins at 16kHz were made of six speech files ( female, 2 male) taken from the TIMIT database played from a loudspeaker placed at the X marks in the fiure. airwise mixtures were then created from all possible voice/anle combinations, excludin same voice and same anle combinations, yieldin a total of 63 mixtures (NO N:Q+MRQ QTNU ). 16 o 19 o o 13 o 12 1 o 9 o 1 o x o 1 x 2 Fi. 1. Experimental setup. Microphones are separated by V 1.7 cm centered alon the YX to X line. The X s show the source locations used in the anechoic tests. The O s show the locations of the sources in the echoic tests. The SNR ains of the demixtures were calculated as follows. Denote the contribution of source on microphone as. Thus we have, 7 o X_F (26) X_F (27) As we do not know the permutation of the demixin, we o 1 o 63

4 ` ` _ calculate the SNR ain conservatively, SNR SNR 61l 61l 61l 61l I ( I ( I I 661l 661l 661l 661l I ( I ( I 6 I 6 In order to ive the method time to learn the mixin parameters, the SNR results do not include the first half second of data. Fiure 2 shows the averae SNR ain results for each anle difference. For example, the 6 deree difference results averae all the 1-7, -1, 7-13, 1-16, and results. Each bar shows the maximum SNR ain, one standard deviation above the mean, the mean (which is labeled), one standard deviation below the mean, and the minimum SNR ain over all the tests (both SNR and SNR are included in the averaes). The separation results improve as the anle difference increases. Fiure 3 details the 3 deree difference results by anle comparison, averain 3 tests per anle comparison. The performance is a function of the delay. That is, the worst performance is achieved for the smallest delay (correspondin to the mixtures), the second worst performance is achieved for the second smallest delay (correspondin to the 1- mixtures), and so forth anle combination Fi. 3. Overall separation SNR ain by 3 deree anle pairin. Anechoic data. (see the O s in Fiure 1). Separation results for pairwise mixtures of voices ( female, male) are shown in Fiure. Separation results for pairwise mixtures of voices ( female, male) and noises (line printer, copy machine, and vacuum cleaner) are shown in Fiure. The results are considerably worse in the echoic case, which is not surprisin as the method assumes anechoic mixin. However, the method does achieve db SNR ain on averae and is real-time anle difference anle difference Fi.. Comparison of overall separation SNR ain by anle difference. Echoic office data. Voice vs. Noise. Fi. 2. Comparison of overall separation SNR ain by anle difference. Anechoic data. Recordins were also made in an echoic office with reverberation time of V ms, that is, the impulse response of the room fell to -6 db after ms. For the echoic tests, the sources were placed at, 9, 12,, and 1 derees Summary results for all three testin roups (anechoic, echoic voice vs. voice, and echoic voice vs. noise) are shown in the table. e have presented a real-time version of the DUET alorithm which uses radient descent to learn the anechoic mixin parameters and then demixes by partitionin the time-frequency representations of the mixtures. e have also introduced a measure of -disjoint orthoo- 6

5 C C J J raph that O &" L for all three, and thus say that the sinals used in the tests were L -disjoint orthoonal at 3 db. If we can correctly map time-frequency points with 3 db or more sinle source dominance to the correct correspondin output partition, we can recover the L of the enery of the oriinal sources. The fiure also demonstrates the -disjoint orthoonality of six speech sinals taken as a roup and the fact that independent Gaussian white noise processes are less than M -disjoint orthoonal at all levels anle difference Fi.. Comparison of overall separation SNR ain by anle difference. Echoic office data. Two Voices. AVV EVV EVN number of tests 63 6 mean SNR ain (db) std SNR ain (db) max SNR ain (db) min SNR ain (db) Table 1. Results Summary. AVV = Anechoic Voice vs. Voice. EVV = Echoic Voice vs. Voice. EVN = Echoic Voice vs. Noise nality and provided empirical evidence for the approximate -disjoint orthoonality of speech sinals. A. -DISJOINT ORTHOGONALITY OF SEECH Clearly, the -disjoint orthoonality assumption is not exactly satisfied for our sinals of interest. e introduce here a measure of -disjoint orthoonality for a roup sources and show that speech sinals are indeed nearly -disjoint orthoonal to each other. Consider the time-frequency mask, p 1l 6 U 6 otherwise (2) and the resultin enery ratio, p U (29) which measures the percentae of enery of source 1 for time-frequency points where it dominates source 2 by db. e propose as a measure of -disjoint orthoonality. For example, Fiure 6 shows averaed for pairs of sources used in the demixin tests. e can see from the power remainin Anchoic Voice vs. Voice Echoic Voice vs. Voice Echoic Voice vs. Noise Six Anechoic Voices Gaussian hite Noise db threshold Fi. 6. -Disjoint Orthoonality. The sinals used in the tests were L -disjoint orthoonal at 3 db and more than -disjoint orthoonal at 1 db. Comparin one source to the sum of five others, we still have more than S M - disjoint orthoonality at 3 db. Independent Gaussian white noise processes are less than M -disjoint orthoonal at all levels. B. ML ESTIMATION FOR DUET MODEL Assume a mixin model of type (1)-(2) to which we add measurement noise: *,+-/.12 3 *6-/ / ; < A R -/ /.12 3 " +-/ /.12 3 " 6-/.12 3 The ideal model (1)-(2) is obtained in the limit (3) (31). In practice, we make the computations assumin the existence of such a noise, and then we pass to the limit. e assume the noise and source sinals are Gaussian distributed and independent from one another, with zero mean and known variances: O sq V E 6

6 C 9 9 " V E ^ The Bernoulli random variables / s are NOT independent. To accommodate the -disjoint orthoonality assumption, we require that for each at most one of the / s can be unity, and all others must be zero. Thus the -tuple / 6"6"6" / takes values only in the set - G G ::: G 3-9 G ::(: G 3 ::: - G G ::(: 9 3! of cardinalitya_. e assume uniform priors for these R.V. s. The short-time stationarity implies different frequencies are decorrelated (and hence independent) from one another. e use this property in constructin the likelihood. The likelihood of parameters 6"6"6" iven the data and spectral powers, ^ at a iven, is iven by conditionin with respect to / s by: - ; + T +( ::: ; C T C 2 3 -/*,+- 3*6-3# ; + T + ::: ; C T C - " < -/. 3 -/ /.12 3 where: *,+-/.12 3 *>-/ " < -/. 3 -/. 33 = and ) 9 ; A R 7 9 ; A R"! < (32) + ) *,+-/.12 3 *6-/ ) 9 ; A R ; A R ; 7 and we have defined / q /, ^ q, and# q for notational simplicity in (32) in dealin with the case when no source is active at a iven. Next the Matrix Inversion Lemma (or an explicit computation) ives: " and$&(' " < -/. 3-9 " ; 3 - < -/. 3# ; < A R *,+-/.12 3 *6-/.12 3# -# *,+-/.12 3# " # *6-/.12 3#!_ ^ # _ ^ _ 33 Now we pass to the limit. The dominant terms from the previous two equations are: 'R 6!_ and ^!_ )+*, > 6 6"6"6" is the selection map defined by: Of thea_& terms in each sum of (32), only one term is dominant, namely the one of the larest exponent. Assume ) if ^ where: ^ and for /. # ^ q q V! ^ 6"6"6" : 'R Then the likelihood becomes: 6"6"6" 3 q / B 6 _ with K, the number of frequencies and: <; = o!_ ^ n 2 > d o 2 /. #, "- 6 6 : (33) 6"6"6" The dominant term in lo-likelihood remains the exponent. Thus: l h q 67 B ^ (3) and maximizin the lo-likelihood is equivalent to the followin (which is (12)): d Be ' Be f f fe d D e ' D a ^ 666 ^ 6. REFERENCES [1] A. Jourjine, S. Rickard, and O. Yilmaz, Blind Separation of Disjoint Orthoonal Sinals: Demixin N Sources from 2 Mixtures, in roceedins of the 2 IEEE International Conference on Acoustics, Speech, and Sinal rocessin, Istanbul, Turkey, June 2, vol., pp. 29. [2] R. Balan, J. Rosca, S. Rickard, and J. O Ruanaidh, The influence of windowin on time delay estimates, in roceedins of the 2 CISS, rinceton, NJ, March [3] A. Jourjine, S. Rickard, and O. Yilmaz, Blind Separation of Disjoint Orthoonal Sinals, IEEE Trans. Sinal roc., 2, Submitted. [] I. Daubechies, Ten Lectures on avelets, chapter 3, SIAM, hiladelphia, A,

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION. Scott Rickard, Radu Balan, Justinian Rosca. Siemens Corporate Research Princeton, NJ 08540

REAL-TIME TIME-FREQUENCY BASED BLIND SOURCE SEPARATION Scott Rickard, Radu Balan, Justinian Rosca Siemens Corporate Research Princeton, NJ 84 fscott.rickard,radu.balan,justinian.roscag@scr.siemens.com