SEPARATION OF ACOUSTIC SIGNALS USING SELF-ORGANIZING NEURAL NETWORKS
Temujin Gautama & Marc M. Van Hulle


K.U.Leuven, Laboratorium voor Neuro- en Psychofysiologie
Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, BELGIUM
{temu,marc}@neuro.kuleuven.ac.be

Abstract

Spectral modeling is an essential component in many signal processing applications, such as speech enhancement and sound monitoring. This paper demonstrates its use in the separation of acoustic sources from a compound signal that is registered by a single sensor. Our technique distinguishes itself from the popular blind source separation procedure by its much higher noise insensitivity and its ability to cope with varying as well as non-square mixing conditions.

INTRODUCTION

In the neural network literature, the separation of acoustic sources is mostly described in the context of blind source separation (BSS) (for references, see [4]). The source signals are assumed to be statistically independent and they are separated from the recorded mixture signals without knowing the mixture process. This process is generally assumed to be linear, with as many sensors as there are sources, and with a mixing matrix that is both invertible and constant over time. Under these conditions, the mixing process can be modeled and the source signals can be recovered. However, since the BSS algorithms are essentially based on amplitude information, they tend to be quite sensitive to noise [1]. If only a single sensor is present, BSS of two sources becomes a non-square problem, and the current BSS algorithms cannot be used. One way to tackle this problem is to model the spectral characteristics of the acoustic sources.
Single-microphone speech enhancement [7] and sound monitoring [6] are well-known applications in this field, where acoustic sources are first modeled using spectral information, after which these models are applied either to cancel out noise, to detect changes in the acoustic sources, or to identify the presence of an unknown source. Modeling the sources using their spectral characteristics is expected to yield an increased resistance to noise and to variabilities in the mixing process. These reasons form the main motivations to adopt a spectral modeling approach to source separation. The models considered in this paper will be developed using self-organizing learning rules trained on spectral information of music signals.

PROCEDURE

Basic Strategy

Figure 1 shows the overall system setup. The source signals $\mathcal{S}_1$ and $\mathcal{S}_2$ are linearly mixed so as to produce the compound signal $\mathcal{C}$ (note that all signals are represented by their complex short-time Fourier transforms (STFT) $\mathcal{S}$, instead of by their amplitude spectra $S$). The compound signal $\mathcal{C}$ is applied to the two branches in the system, one for each source signal. In every branch, the estimated spectrum of the other source is subtracted from $\mathcal{C}$ and the result, $\mathcal{I}_n$, is quantized by the model $M_n$ ($n \in \{1, 2\}$). The quantized signal is then phase aligned ($\varphi$) to the model's input signal $\mathcal{I}_n$ (dashed line), so as to produce the source signal estimate $\mathcal{O}_n$ after a number of iterations. The exact nature of both the quantization and the phase alignment will be elaborated upon in sections 2.2 and 2.3. The quantization models will be trained using self-organizing learning rules, as will be explained in section 2.3. As an example, we will consider signals taken from two musical instruments (oboe and piano).

Figure 1: Diagram of the system. All signals are represented by their complex Fourier spectrum.

Quantization

The spectral model of each instrument, $M_n$, consists of a set of codebook vectors $\{Q_{n,i}\}$, each of which represents a "prototypical" Fourier spectrum. These vectors should be invariant to translation in time so as to avoid redundancy in the representation. Hence, $M_n$ will consist of amplitude spectra $Q_{n,i}$, rather than complex Fourier spectra $\mathcal{Q}_{n,i}$. At time $t$, the winning codebook vector $Q_{n,c}$ for the input (amplitude) spectrum $I_n$ will be selected using the dot product criterion, i.e.

$$|Q_{n,c} \cdot I_n| \geq |Q_{n,i} \cdot I_n|, \quad \forall\, Q_{n,i} \in M_n. \qquad (1)$$

The reconstructed amplitude spectrum $O_n$ is defined as the projection of $I_n$ onto the winner $Q_{n,c}$ and represents the estimated amplitude spectrum of one source signal. Next, $O_n$ has to be supplemented with a phase spectrum. If not, transformation to the time domain would be impossible, since the phase spectrum contains information about the harmonic structure of the signal. The matching process (eq. 1), however, should not be performed on complex codebook vectors, since these are not translation invariant. Hence, we have opted for a two-step procedure: first the matching is performed on the amplitude spectrum only (Amplitude Quantization), after which the reconstruction is combined with the corresponding phase spectrum (Complex Quantization and Phase Alignment). This two-step procedure and the training of the models will be explained in section 2.3.

Building and Applying the Quantization Models

Amplitude Quantization Model. The codebook for model $M_n$, $\{Q_{n,i}\}$, is determined, for every instrument separately, using self-organizing learning rules. We will compare three such rules: kMER [5], Kohonen's SOM [2] and the k-means algorithm [3]. In the kMER (kernel-based Maximum Entropy Rule, [5]), every neuron $i$ is determined by its traditional weight vector $w_i$, supplemented by a radius $\sigma_i$. A given neuron is active if the Euclidean distance between its weight vector and the input pattern is smaller than the neuron's radius. The rule is aimed at building a topographic map in which all neurons have the same activation probability, the exact level of which is determined by the parameter $\rho$. This is achieved by adjusting both the weight vectors and their radii in an iterative manner (note that the activation regions are allowed to overlap). The update rule incorporates a neighborhood function (in lattice space) whose width decreases over time in order to yield topographic maps.
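The activation test that kMER relies on can be illustrated with a minimal sketch. It assumes NumPy and covers only the activation criterion stated above (a neuron is active when the input lies within its radius), not the kMER update rule itself; the function names `active_neurons` and `activation_probabilities` are hypothetical.

```python
import numpy as np

def active_neurons(v, weights, radii):
    """Boolean mask: neuron i is active when ||v - w_i|| < sigma_i."""
    dist = np.linalg.norm(weights - v, axis=1)
    return dist < radii

def activation_probabilities(inputs, weights, radii):
    """Fraction of the input patterns for which each neuron is active."""
    # Pairwise distances: one row per input pattern, one column per neuron.
    dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
    return (dists < radii).mean(axis=0)
```

Because the activation regions may overlap, several entries of the mask can be true for the same input. The empirical probabilities returned by the second function are the quantities that the rule tries to equalize across neurons.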
Training patterns for the learning rules consist of amplitude spectra coming from a single instrument. To avoid training on irrelevant spectra, they are thresholded in energy: an input pattern $\mu$ is considered for training when

$$E_\mu \geq \frac{\langle E \rangle_{\text{training set}}}{4}, \qquad (2)$$

where $E_\mu$ is the signal energy of input pattern $\mu$. Since the actual quantization will be performed by means of projections, all patterns are normalized prior to training. The result of this stage is a quantized amplitude spectrum for each instrument (computed as the projection of $I_n$ onto the winner $Q_{n,c}$). We now have to extend this representation with a phase spectrum in order to restore the harmonic structure of the time signal.
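The amplitude quantization stage described above (energy thresholding, normalization, winner selection by the dot product criterion of eq. (1), and projection onto the winner) can be sketched as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical, and the threshold factor is exposed as a hypothetical parameter `fraction` whose default value is an assumption.

```python
import numpy as np

def select_training_patterns(spectra, fraction=0.25):
    """Keep spectra whose energy reaches `fraction` times the mean energy,
    then scale every kept spectrum to unit length."""
    energy = np.sum(spectra ** 2, axis=1)
    kept = spectra[energy >= fraction * energy.mean()]
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

def quantize_amplitude(spectrum, codebook):
    """Return the winner index (largest |dot product|) and the projection
    of `spectrum` onto that unit-norm winner."""
    scores = codebook @ spectrum          # one dot product per codebook vector
    c = int(np.argmax(np.abs(scores)))
    return c, scores[c] * codebook[c]     # projection length times winner
```

With unit-norm codebook vectors, the dot product with the winner is both the matching score and the gain that scales the prototype, which is why normalization precedes training.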

Complex Quantization Model. The prototype amplitude spectra $Q_{n,i}$ will be extended with representative phase spectra so as to preserve the phase relationships between frequency components. This will be done only once, after training, by combining every prototype spectrum $Q_{n,i}$ with the phase spectrum of its closest matching input spectrum $I_\mu$:

$$\frac{|Q_{n,i} \cdot I_\mu|}{\|I_\mu\|} \geq \frac{|Q_{n,i} \cdot I_\nu|}{\|I_\nu\|}, \quad \forall \nu. \qquad (3)$$

The result of this stage is a reconstruction of the complex spectrum. However, the exact positioning in the time slice still has to be determined.

Phase Compensation. The reconstructed, complex spectrum $\mathcal{O}_n$ is the Fourier transform of a time slice $o_n(t)$. If we assume the phase relations between the frequency components to be correct, then $o_n(t)$ represents a time slice that corresponds to $i_n(t)$, except for a time shift within the time slice, $\Delta t$ (the phase spectrum determines both the phase relations between components and the exact positioning in time). This $\Delta t$ can be computed as the period of time over which $o_n(t)$ has to be circularly shifted over the time slice, so as to optimally correlate with the input signal $i_n(t)$ (circularly shifted, since the discrete Fourier transform assumes the signals to be periodic over the given time slice):

$$\Delta t = \arg\max_{\tau} \left| (i_n \star o_n)(\tau) \right| = \arg\max_{\tau} \left| F^{-1}\{\mathcal{I}_n \mathcal{O}_n^{*}\}(\tau) \right|, \qquad (4)$$

where $\star$ denotes the correlation function, $F^{-1}$ the inverse Fourier transform, $\mathcal{I}_n$ is the Fourier transform of $i_n(t)$ and $\mathcal{O}_n^{*}$ is the complex conjugate of the Fourier transform of $o_n(t)$. As a result, these three stages yield time signals that approximate the original source signals in frequency content, phase relations and positioning in time.

Source Separation

Once the spectral models have been trained, the separation of the acoustic sources can be performed by means of a process that iteratively refines the spectral estimates of the sources at every time step $t$.
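Since every iteration of the separation relies on the phase compensation of eq. (4), a minimal sketch of that step may be useful at this point. It assumes NumPy, real-valued time slices, and full-length complex spectra as returned by `np.fft.fft`; the function name `phase_align` is hypothetical.

```python
import numpy as np

def phase_align(I, O):
    """Circularly shift the time slice behind O so that it best correlates
    with the time slice behind I, following eq. (4).

    Returns the shifted complex spectrum and the shift in samples."""
    n = len(O)
    # Circular cross-correlation computed in the frequency domain.
    corr = np.fft.ifft(I * np.conj(O))
    shift = int(np.argmax(np.abs(corr)))
    # A circular shift in time is a linear phase ramp in frequency.
    ramp = np.exp(-2j * np.pi * np.arange(n) * shift / n)
    return O * ramp, shift
```

Applying the shift as a phase ramp keeps the whole computation in the frequency domain, so the amplitude spectrum of the estimate is left untouched and only its positioning within the time slice changes.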
Figure 1 shows that the input to quantization model $M_1$ is formed by subtracting the estimated spectrum of source 2 at iteration $(k-1)$ from the compound spectrum $\mathcal{C}(t)$, and vice versa. To improve performance, a competitive element has been included: at iteration $k$, the model $M_m$ ($m \neq n$) that finds the worse match to its input will quantize $(\mathcal{C}(t) - \mathcal{O}_n^{k}(t))$, rather than $(\mathcal{C}(t) - \mathcal{O}_n^{(k-1)}(t))$. The separation will now be explained at iteration $k$.

1. The inputs to the quantization models, $\mathcal{I}_n(t)$, are set to $(\mathcal{C} - \mathcal{O}_m^{(k-1)})$, with $m \neq n$.
2. Amplitude quantization. The closest amplitude matches $Q_{n,c}$ are found for every model $M_n$ using eq. (1). The cosines of the angles between $Q_{n,c}$ and the input amplitude spectrum $I_n(t)$ are determined. $M_{win}$ denotes the model that yields the highest cosine. $I_{win}(t)$ is projected onto the model's best matching vector $Q_{win,c}$. The length of this projection corresponds to a gain factor that is applied to $Q_{win,c}$.
3. Complex quantization. $Q_{win,c}$ is combined with its corresponding phase spectrum.
4. Phase alignment on this complex spectrum yields $\mathcal{O}_{win}^{k}$, using eq. (4).
5. The complex residue spectrum $(\mathcal{C}(t) - \mathcal{O}_{win}^{k}(t))$ is quantized by the remaining model $M_m$ and phase aligned, resulting in $\mathcal{O}_m^{k}(t)$ ($m \neq win$).

At iteration 0, the inputs to the quantization models $M_n$ are all set to $\mathcal{C}$ (i.e. the initial spectral estimates of the models are set to the null vector). The number of iterations is set to 3 in all simulations described in this paper. When the competitive element was not included, the performance decreased by 0.2 dB, and more than 3 iterations did not lead to better results.

SIMULATIONS

Signal Generation and Evaluation

The signals that were used in this section have been generated on a Crystal 4232 audio controller (Yamaha OPL3 FM synthesizer) at a sampling rate of Hz, using the simulated sounds of an oboe and a piano. The compound signal was created by linearly mixing two separate time signals (9 seconds of music): $c(t) = 0.5\, s_1(t) + 0.5\, s_2(t)$. The STFT is computed every 256 samples, resulting in a 3/4 overlap; the length of one time window is 1024 samples. At this stage, no windowing is applied so as not to interfere with the correlation strategy of the phase alignment (eq. 4). For the inverse transform, the consecutive complex spectra are transformed to the time domain.
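Steps 1 to 5 of the separation procedure, together with the null-vector initialization, can be sketched for a single time slice as follows. This is a minimal illustration under stated assumptions, not the implementation used for the simulations: it assumes NumPy, full-length complex spectra, and models stored as a pair of unit-norm amplitude prototypes and their complex counterparts. All function names, and the choice to count the first pass through the loop as one of the `n_iter` iterations, are hypothetical.

```python
import numpy as np

def build_model(time_prototypes):
    """Return (amplitude codebook, complex prototypes) with unit-norm rows."""
    P = np.fft.fft(np.asarray(time_prototypes, dtype=float), axis=1)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    return np.abs(P), P

def phase_align(I, O):
    """Eq. (4): circular shift of O's time slice that best matches I."""
    n = len(O)
    corr = np.fft.ifft(I * np.conj(O))
    shift = int(np.argmax(np.abs(corr)))
    return O * np.exp(-2j * np.pi * np.arange(n) * shift / n)

def quantize(I, model):
    """Steps 2 to 4 for one model: (cosine of the match, complex estimate)."""
    Q_amp, P_cplx = model
    A = np.abs(I)
    scores = Q_amp @ A                      # dot products of eq. (1)
    c = int(np.argmax(np.abs(scores)))
    norm = np.linalg.norm(A)
    cosine = scores[c] / norm if norm > 0 else 0.0
    # scores[c] is the projection length, used as gain on the prototype.
    return cosine, phase_align(I, scores[c] * P_cplx[c])

def separate_frame(C, models, n_iter=3):
    """Iteratively refine the two source estimates for compound spectrum C."""
    O = [np.zeros_like(C), np.zeros_like(C)]    # null initial estimates
    for _ in range(n_iter):
        inputs = [C - O[1], C - O[0]]            # step 1
        matches = [quantize(inputs[n], models[n]) for n in (0, 1)]
        win = 0 if matches[0][0] >= matches[1][0] else 1
        lose = 1 - win
        O[win] = matches[win][1]                 # steps 2 to 4 (winner)
        O[lose] = quantize(C - O[win], models[lose])[1]   # step 5
    return O
```

In this sketch the losing model always quantizes the residue left by the current winner, which is the competitive element of the procedure; dropping that line and using `matches[lose][1]` instead would give the non-competitive variant.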
Every time window is then Hamming windowed and offset in time (spectrum $i$ will have an offset of $(256\, i)$ samples), after which they are summed so as to yield the reconstructed time signal, $o_n(t)$.

The two training sets consist of the short-time Fourier spectra of time signals of two octaves (15 notes, 10 seconds) played on the respective instruments (the same octaves for both oboe and piano). The resulting spectra are then thresholded using eq. 2 and normalized, resulting in training sets of 364 and 387 complex spectra for oboe and piano respectively. The compound signal consists only of notes included in the training sets.

In all simulations, the signal-to-noise ratio (SNR) has been used to evaluate performances. It is defined for the original time signal $s_n(t)$ and its reconstruction $o_n(t)$ as follows:

$$\mathrm{SNR}(s_n, o_n)\,[\mathrm{dB}] = -10 \log_{10} \left( \frac{1}{T} \sum_{t=0}^{T-1} \left( \frac{s_n(t)}{\sigma_{s_n}} - \frac{o_n(t)}{\sigma_{o_n}} \right)^{2} \right), \qquad (5)$$

where $\sigma_{s_n}$ and $\sigma_{o_n}$ are the standard deviations of $s_n$ and $o_n$, and $T$ is the length of the shorter signal. The SNR of a round transformation (STFT, followed by its inverse) is 14.88 dB for the oboe and dB for the piano training signal.

Dimensionality Reduction

In order to facilitate and speed up the training process of the quantization models, a dimensionality reduction has been performed on the amplitude spectra of both training signals separately, by means of the Karhunen-Loève transform. First, the principal components of the amplitude spectra (512 dimensions) are computed, after which every data point is projected onto the first $N$ ($N \leq 512$) principal components. To monitor the quality of the reconstruction of the training signals as a function of $N$, the amplitude spectra coming from the STFT were first reduced in dimensionality, after which they were reconstructed, extended with their original phase spectra and, finally, transformed back to the time domain. Figure 2A shows the quality of the reconstructed time signals as a function of $N$. The piano model reaches its maximum for $N \approx 300$ and the oboe model for $N \approx 100$. The difference in performance between instruments corresponds to the quality of the above-mentioned round transformation (14.88 dB and dB for oboe and piano) and hence is not due to the Karhunen-Loève transform. To monitor the influence of the dimensionality reduction on the overall separation performance, all learning rules were trained for five different values of $N$ (20, 40, 100, 200 and 300).

Training the Models

The goal of training is the generation of a codebook of 49 vectors (kMER and SOM are configured as 7x7 lattices).
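The SNR measure of eq. (5) and the Karhunen-Loève reduction described above can be sketched as follows. This is a minimal illustration assuming NumPy; the function names are hypothetical, and subtracting the mean spectrum before computing the principal components is an assumption, since the text does not say whether the data were centered.

```python
import numpy as np

def snr_db(s, o):
    """Eq. (5): both signals are scaled by their standard deviation and
    compared over the length of the shorter one."""
    T = min(len(s), len(o))
    s = np.asarray(s[:T], dtype=float)
    o = np.asarray(o[:T], dtype=float)
    diff = s / s.std() - o / o.std()
    return -10.0 * np.log10(np.mean(diff ** 2))

def kl_fit(X, n_components):
    """Mean spectrum and first principal directions (rows) of the data X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def kl_reduce(X, mean, basis):
    """Project the spectra onto the retained principal components."""
    return (X - mean) @ basis.T

def kl_reconstruct(Y, mean, basis):
    """Map reduced coordinates back to full-length amplitude spectra."""
    return Y @ basis + mean
```

Because eq. (5) divides each signal by its own standard deviation, the measure is insensitive to a global gain on the reconstruction, so a rescaled but otherwise perfect reconstruction is not penalized.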
During all training sessions, the mean squared error on the training set is monitored, so as to find good parameter sets for the different learning rules (learning rates, length of training and, for the kMER and the SOM algorithm, the neighborhood cooling schemes). Both the kMER and the SOM algorithm are initialized with the same random configurations, while k-means is initialized with vectors randomly taken from the training set.

Figure 2B shows the SNRs of the different models' reconstructed piano training signals, evaluated as a function of the dimensionality reduction (oboe results not shown). The solid lines show the quality of the reconstruction when the original phase spectrum is used to reconstruct the complex Fourier spectrum (i.e. Amplitude Quantization only). Dashed lines represent the SNRs of the reconstruction where the Complex Quantization and Phase Alignment of section 2.3 have been used. Figure 2B also demonstrates the robustness of the different learning rules with respect to the dimensionality of the training data. The k-means performance suffers from the sparseness that is due to a higher dimensionality. Thus, it cannot take advantage of an increasing number of eigenvectors, which allows for better spectral reconstructions, whereas the kMER and the SOM can (cf. Fig. 2A).

Figure 2: Performance (SNR in dB) as a function of the number of principal components of A) the reconstruction of the training signals after dimensionality reduction (the phase is untouched); B) the reconstruction of the piano training signal after amplitude quantization (solid line) and complex quantization (dashed line) for the different learning rules.

Separation

The complete system is tested and evaluated for the different models (five different sets for all three learning rules: N = 20, 40, 100, 200, 300). Figure 3A shows the performances for the different models, evaluated as a function of the number of principal components. Again, the performances of kMER and SOM do not degrade with increasing dimensionality. In order to better evaluate the performances, we considered two more models: $M_{tr}$ and $M_{sep}$. $M_{tr}$ corresponds to a "perfect" quantization of the training set: the models consist of all normalized spectra of the training sets. $M_{sep}$ corresponds to optimal codebooks with respect to the separation: they consist of all normalized spectra of the separate instruments in the compound signal. The results are $M_{tr}$ (7.97 dB and dB) and $M_{sep}$ (9.07 dB and dB).

Figure 3: A) Performances of the system for the different models, evaluated as a function of the dimensionality reduction. Solid lines represent the performances for the piano, dashed lines the ones for the oboe. B) Performance of the system in a noisy environment.

CONCLUDING REMARKS

In this paper, spectral models have been applied in the context of the separation of acoustic sources. The models themselves have been developed using self-organizing learning rules. The recognition/separation is based on the projection of the compound signal onto the models' prototypical spectra. It is due to this projection that our models are robust to noise: when uniform noise is added to the compound signal (Fig. 3B), the performance is still reasonable (5.59 dB) for an SNR of 4.7 dB of the compound signal (this corresponds to a noise amplitude of 30%). In applications in the field of sound monitoring, these models would be able to recognize the expected acoustic events, even in a noisy environment. The complete system could also be used to detect and separate unknown events from the modeled ones and even perform on-line modeling.

Contrary to blind source separation, our system does not assume a fixed mixing. Simulations have been performed where the mixing was variable. The mixing coefficients for the two instruments varied as a sine and a cosine, both with an offset of 1.1, with one cycle over the total signal length. Such a variable mixing could be caused by moving acoustic sources. When using the SOM models trained on 100 principal components, the separation performance hardly degraded and stayed at 5.83 dB for the oboe and 9.78 dB for the piano signal, compared to 6.17 dB and 9.75 dB when the mixing is constant (Fig. 3A). If two sensors were present, the sources could easily be located and tracked through time, since the mixing coefficients can be determined. All these are topics for further investigation.

ACKNOWLEDGMENTS

T.G. is supported by a scholarship from the Flemish Institute for the promotion of Scientific-Technological Research in Industry (I.W.T.). M.M.V.H. is a research associate of the Fund for Scientific Research - Flanders (Belgium) and is supported by research grants received from the Fund for Scientific Research (G ), the National Lottery (Belgium) ( ), the Flemish Ministry of Education (GOA 95/99-06), and the Flemish Ministry for Science and Technology (VIS/98/012).

REFERENCES

[1] Herrmann, M., and Yang, H.H. (1996). Perspectives and limitations of self-organizing maps in blind separation of source signals. In S.-I. Amari, L. Xu, L.-W. Chan, I. King, and K.-S. Leung (Eds.), Progress in Neural Information Processing (Vol. 2, pp ). Proceedings of the International Conference on Neural Information Processing (ICONIP'96), London: Springer-Verlag.
[2] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, pp
[3] Krishnaiah, P.R., and Kanal, L.N. (1982). Classification, Pattern Recognition, and Reduction of Dimensionality, Handbook of Statistics, vol. 2, Amsterdam: North Holland.
[4] Van Hulle, M.M. (1998). Kernel-based equiprobabilistic topographic map formation. Neural Computation, 10, pp
[5] Van Hulle, M.M. (1998). Clustering with kernel-based equiprobabilistic topographic maps. Proc. IEEE NNSP98 (Cambridge, U.K.), pp
[6] Watanabe, H., Matsumoto, Y., Tanaka, S., and Katagiri, S. (1998). Sound monitoring based on the generalized probabilistic descent method. Proc. IEEE NNSP98 (Cambridge, U.K.), pp
[7] Xie, F., and Van Compernolle, D. (1994). A family of MLP based nonlinear spectral estimators for noise reduction. Proc. ICASSP'94, vol. II, pp
