17th European Signal Processing Conference (EUSIPCO 2009), Glasgow, Scotland, August 24-28, 2009

DIRECTION ESTIMATION BASED ON SOUND INTENSITY VECTORS

Sakari Tervo
Helsinki University of Technology, Department of Media Technology
P.O. Box 5400, FI-02015 TKK, Finland
sakari.tervo@tkk.fi

ABSTRACT

The direction of a sound source in an enclosure can be estimated with a microphone array and suitable signal processing. Traditionally, both applications and research have relied on time delay estimation methods, such as the cross correlation. Recently, techniques for direction estimation that involve sound intensity vectors have been developed and used in applications, e.g. in teleconferencing. Unlike time delay estimation methods, these methods have not been compared widely. In this article, five methods for direction estimation based on sound intensity vectors are compared with real data from a concert hall. The results of the comparison indicate that the methods based on convolutive mixture models perform slightly better than the simple averaging methods. The convolutive mixture model based methods are also more robust against additive noise.

1. INTRODUCTION

The direction or location of a sound source is of interest in several applications that aim to capture or to reproduce sound [1, 2, 3]. This is relevant, for example, in teleconferencing [3]. Moreover, when exploring concert hall impulse responses it is of interest from where and when the direct sound and the so-called early reflections arrive [4]. Direction estimation with traditional methods, such as the cross correlation, has been researched widely over several decades (see e.g. [5] and the references therein). Nowadays, sound intensity vectors are used for direction estimation in increasingly many applications [1, 2, 3, 6]. Little research, if any, has been published comparing the different methods used for direction estimation from sound intensity vectors.
In this article, five direction estimation methods are compared in a real concert hall environment. A sound intensity vector has a length and a direction, and it is a function of time and frequency. In this work, the focus is on direction estimation over a set of frequencies during a certain time frame. Direction estimation methods can be broadly classified into two classes: 1) direct estimation and 2) mixture estimation. The first class includes simple methods such as averaging. The second class contains convolutive mixture models, where the direction of the sound source is found by fitting two or more probability distributions of a certain shape to the directional data, as for example in [2] and [7]. Naturally, the methods in the second class are much more complex than those in the first, since an optimization method has to be used to obtain the mixtures.

Generally, sound intensity vectors are obtained either from microphone pair measurements or from B-format signals [4]. In practice, both of these measurement techniques introduce a bias to the direction of the intensity vector. For example, in Soundfield microphones the bias is caused by non-idealities in the directivity patterns of the microphones [4]. In microphone pair measurements the direction estimate is biased because the gradient of the sound pressure is not constant within the sensor array [6]. In the case of microphone pair measurements, the non-idealities of the pressure microphones also affect the result.

The article is organized as follows. In Sec. 2 the calculation of sound intensity vectors as well as the bias compensation of the vectors are formulated. Section 3 introduces the direction estimation methods. In Sec. 4 the experimental setup is described, and in Sec. 5 the results of the experiments are presented and discussed. In Sec. 6 the final conclusions of this study are given.
2. THEORY

In a room environment, the sound s(t) traveling from the sound source to receiver n is affected by the path h_n(t):

    p_n(t) = h_n(t) * s(t) + w(t),    (1)

where * denotes convolution and w(t) is measurement noise, independent and identically distributed for each receiver.

2.1 Sound Intensity

On a certain axis a, the sound intensity is given in the frequency domain as

    I_a(ω) = Re{P(ω) U_a^*(ω)},    (2)

where P(ω) and U_a(ω) are the frequency presentations of the sound pressure and of the particle velocity at angular frequency ω. In addition, Re{·} denotes the real part of a complex number and (·)^* denotes the complex conjugate [4]. Here, a Cartesian coordinate system with x- and y-coordinate axes, shown in Fig. 1, is used. The corresponding polar coordinate presentation of the Cartesian coordinates is denoted with azimuth angle θ and radius r.

The procedure for obtaining the sound intensity is as follows, and is similar for both axes. The sound pressure at the center point between the four microphones, shown in Fig. 1, can be approximated as
the average pressure of the microphones [6]:

    P(ω) ≈ (1/4) Σ_{n=1}^{4} P_n(ω).    (3)

For the x-axis, the particle velocity is estimated in the frequency domain as

    U_x(ω) ≈ [j / (ω ρ_0 d)] [P_1(ω) − P_2(ω)],    (4)

where d is the distance between the two receivers, ρ_0 = 1.2 kg/m^3 is the mean density of air, and j is the imaginary unit. Now, the sound intensity in (2) can be estimated with the approximations in (3) and (4). For the y-component of the sound intensity, microphones 1 and 2 in (4) are replaced with microphones 3 and 4.

2.2 Bias Compensation for the Square Grid

Since the pressure signals are subtracted from each other in the approximation made in (4), the azimuth angle θ of the intensity suffers from a systematic bias [6]. This bias can be formulated for a four-microphone square grid as [6]

    θ_biased = arctan[ sin((ωd/c) sin(θ)) / sin((ωd/c) cos(θ)) ],    (5)

where c = 343 m/s is the speed of sound. The bias can be compensated by finding the inverse of (5). However, this equation does not have a closed-form solution. Kallinger et al. [6] estimate the inverse function with linear interpolation. In addition, the inverse function is known [6] to have an upper frequency limit at f_max = c/(2d), where ω = 2πf, i.e. 0 < ωd/c < π. In this work, linear interpolation is used for finding the solution to the inverse function of (5) and for compensating the azimuth angles. The unbiased estimate, i.e. the compensated angle at the i:th frequency bin, is denoted with θ_i.

3. METHODS FOR DIRECTION ESTIMATION

In a certain time window, sound intensity vectors are estimated over a set of frequencies. Each intensity vector at frequency bin i consists of a radial component r_i and an angular component θ_i. In this article, the focus is on finding the direction of a sound source from a set of radial and angular components. Next, five methods for direction estimation are formulated.

3.1 Circular Mean and Median

In direction estimation one has to take into account the fact that the data is circular.
Therefore, for example, the mean of a set of angles θ = [θ_1, θ_2, ..., θ_N], θ_i ∈ (−π, π], is defined as the circular mean (CME) [8]:

    θ̂_CME = arg{ Σ_{i=1}^{N} w_i e^{jθ_i} },    (6)

where arg{·} is the argument of a complex number and w_i = 1/N is the weighting function.

Figure 1: The square grid with four microphones and the coordinate system.

In principle, the weighting function can be chosen freely. If the weighting function is chosen as w_i = r_i, with Σ_i r_i = 1, where r_i is the corresponding radial component of an intensity vector, then CME is equal to the mean of the Cartesian presentation (MCA) of the sound intensity, i.e.:

    arg{ Σ_{i=1}^{N} r_i e^{jθ_i} } = arctan( E{I_y} / E{I_x} ) =: θ̂_MCA,    (7)

where E{·} is the expected value, and I_x = [x_1, x_2, ..., x_N] and I_y = [y_1, y_2, ..., y_N] are the Cartesian presentations of the vectors. Here, the circular median (CMD) is defined analogously to CME:

    θ̂_CMD = arg{ Me{Re{w_i e^{jθ_i}}} + j Me{Im{w_i e^{jθ_i}}} },    (8)

where Me{·} is the median. The mode could also be used instead of the median, but this is not considered in this article. Again, the weighting function can be selected freely. Here, w_i = 1/N is selected, as earlier with CME.

3.2 Mixture Models

When inspecting the distribution of the azimuth angles of the intensity vectors, for example in Fig. 2, one can notice that the azimuth angle is a mixture of two distributions rather than one. In the example in Fig. 2, the distribution with the smaller variance (or higher concentration) is caused by the sound source. The second distribution models the noise floor. The shape of these distributions is determined by the impulse response, i.e. the room, and by the frequency content of the source signal. A new sound source in the room always introduces a new distribution to the total azimuth distribution [2, 7]. Since the azimuth angle is circular, wrapped distributions have to be used for the fitting. Here, two distributions are tested.
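Before turning to the mixture models, the simple circular statistics of (6)-(8) can be illustrated in code. The following is a minimal NumPy sketch, not code from the paper; the function names and the default uniform weighting w_i = 1/N are our own choices.

```python
import numpy as np

def circular_mean(theta, w=None):
    """Circular mean (CME), eq. (6): the argument of a weighted sum of
    unit phasors. With w_i = r_i (radial components normalized to sum
    to one) this coincides with the MCA estimate of eq. (7)."""
    theta = np.asarray(theta, dtype=float)
    if w is None:
        w = np.full(theta.shape, 1.0 / theta.size)  # w_i = 1/N
    return np.angle(np.sum(w * np.exp(1j * theta)))

def circular_median(theta, w=None):
    """Circular median (CMD), eq. (8): medians of the real and imaginary
    parts of the weighted phasors, recombined into one angle."""
    theta = np.asarray(theta, dtype=float)
    if w is None:
        w = np.full(theta.shape, 1.0 / theta.size)
    z = w * np.exp(1j * theta)
    return np.angle(np.median(z.real) + 1j * np.median(z.imag))
```

Unlike the arithmetic mean, the circular mean handles the wrap at ±π correctly: for two angles just on either side of ±180° it returns an angle near ±180° rather than 0°.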
The first one is the von Mises probability distribution (VM) [2, 8]:

    f_VM(θ | µ, κ) = e^{κ cos(θ − µ)} / (2π I_0(κ)),    (9)
Figure 2: Examples of the normalized histogram of the estimated azimuth angles (top), the von Mises mixture model (middle), and the wrapped Gaussian mixture model (bottom), with the mixture components shown separately. The vertical lines illustrate the true angle θ_s (top) and the estimated source directions θ̂_VMM (middle) and θ̂_WGM (bottom). Source position S1 and receiver position R1 were used (see Sec. 4).

where κ is a measure of concentration, µ is the mean, and I_0(κ) is the modified Bessel function of order 0. The second tested distribution is the wrapped Gaussian probability distribution (WG) [9]:

    f_WG(θ | µ, σ) = [1 / (σ √(2π))] Σ_{k=−K}^{K} e^{−(θ − µ − 2πk)^2 / (2σ^2)},    (10)

where σ^2 is the variance, µ is the mean, and 2K + 1 is the number of Gaussians to be wrapped. The mixture model is formulated as the sum of the distributions:

    p(θ | µ, ρ) = Σ_{m=1}^{M} a_m f^{(m)}(θ | µ_m, ρ_m),    (11)

where m is the index of a mixture component, f = f_WG for the wrapped Gaussian mixture model (WGM), and f = f_VM for the von Mises mixture model (VMM). The parameters µ = [µ_1, µ_2, ..., µ_M] and ρ = [ρ_1, ρ_2, ..., ρ_M] also depend on the model. The weighting factor is here selected as a_m = 1/M and the number of mixture components is M = 2, since there is only a single sound source. Thus, one of the mixture components models the noise floor and the other one the sound source. The direction of the source is then estimated as the mean parameter µ_m of the mixture component that has the smaller variance (WGM) or the higher concentration (VMM). These estimated directions are denoted here with θ̂_WGM and θ̂_VMM according to the model used.

In order to find the parameters of the mixture model, an optimization algorithm has to be used. The search criterion is the same as in maximum-likelihood estimation, where the goal is to maximize the likelihood

    L(µ, ρ) = Σ_{i=1}^{N} log p(θ_i | µ, ρ).    (12)

Here, the parameters are sought with MATLAB's fminsearch, which uses the Nelder-Mead method [10].
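The von Mises mixture fit can be sketched as follows, using scipy.optimize.minimize with method='Nelder-Mead' as a Python analogue of fminsearch. This is a minimal sketch, not the paper's code: the function names, the log-scale parameterization of the concentrations, and the initialization are our own assumptions; the fixed weights a_m = 1/2 and M = 2 follow the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import i0  # modified Bessel function of order 0

def vm_pdf(theta, mu, kappa):
    """von Mises density, eq. (9)."""
    return np.exp(kappa * np.cos(theta - mu)) / (2.0 * np.pi * i0(kappa))

def neg_log_likelihood(params, theta):
    """Negative of the log-likelihood (12) for an M = 2 mixture with
    fixed weights a_m = 1/2. params = [mu1, log_k1, mu2, log_k2];
    the concentrations live on a log scale (clipped to avoid overflow
    of i0) so that they stay positive during the search."""
    mu1, lk1, mu2, lk2 = params
    k1 = np.exp(np.clip(lk1, -10.0, 6.0))
    k2 = np.exp(np.clip(lk2, -10.0, 6.0))
    p = 0.5 * vm_pdf(theta, mu1, k1) + 0.5 * vm_pdf(theta, mu2, k2)
    return -np.sum(np.log(p + 1e-300))

def fit_vmm(theta):
    """Fit the two-component von Mises mixture with Nelder-Mead and
    return the mean of the more concentrated component, i.e. the VMM
    direction estimate."""
    mu0 = np.angle(np.mean(np.exp(1j * theta)))  # circular mean as init
    x0 = np.array([mu0, np.log(5.0), mu0 + np.pi, np.log(0.5)])
    res = minimize(neg_log_likelihood, x0, args=(theta,),
                   method='Nelder-Mead', options={'maxiter': 2000})
    mu1, lk1, mu2, lk2 = res.x
    mu = mu1 if lk1 > lk2 else mu2      # source = higher concentration
    return np.angle(np.exp(1j * mu))    # wrap to (-pi, pi]
```

Initializing one component at the circular mean of the data and the other opposite to it with a low concentration mimics the source-plus-noise-floor structure the model assumes.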
Other optimization algorithms that are better suited to this problem, e.g. the expectation-maximization algorithm, could be used as well. Figure 2 shows examples of distributions estimated with both models, together with the original probability density function (PDF), i.e. the normalized histogram of the estimated directions.

4. EXPERIMENTAL SETUP

A microphone array consisting of two four-microphone grids (see Fig. 1) was used. The four-microphone grids share the same center point, with grid spacings of d = 10 mm and d = 100 mm. The smaller grid is used for the frequencies from 1 kHz to 5.2 kHz and the larger one for the frequencies from 100 Hz to 1 kHz. The bias compensation is done separately for both grids.

The methods were tested with real concert hall data. The impulse response measurement setup in the concert hall (in Pori, Finland) is shown in Fig. 3. This concert hall has 700 seats and a reverberation time of approximately 1.3 seconds. The signals were recorded at 48 kHz with three sound source locations (S) and five receiver locations (R) for the array. Here, the first three receiver positions are used (R1-R3 in Fig. 3). The sound source was an omnidirectional loudspeaker consisting of 12 driver elements.

Two source signals, a violin playing a single tone and white noise, were convolved with the measured impulse responses. The signal-to-noise ratio (SNR) was then varied by adding white noise (w(t) in (1)) to the signals. Next, the sound intensity vectors were calculated and compensated as stated in Sec. 2. The direction was estimated from the intensity vectors with the methods introduced in Sec. 3 on a frame-by-frame basis. Frames of 1024 samples with 50 % overlap were used. An 8192-bin fast Fourier transform was used, so that the direction was estimated from 870 frequency bins covering the band from 100 Hz to 5.2 kHz.
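The per-frame processing of Sec. 2 — average pressure (3), finite-difference particle velocity (4), intensity (2), and the interpolated inverse of the bias curve (5) — can be sketched as follows. This is a minimal illustration under an assumed geometry (microphones 1-2 on the x-axis and 3-4 on the y-axis, each pair separated by d; function names are ours), not the code used in the paper; sign conventions for the intensity direction vary between formulations.

```python
import numpy as np

RHO0 = 1.2   # mean density of air [kg/m^3]
C = 343.0    # speed of sound [m/s]

def intensity_azimuth(frame, fs, d):
    """Azimuth and length of the sound intensity vector per frequency
    bin for one time frame of a four-microphone square grid.
    frame: (N, 4) array; columns 0-1 are the x-axis pair, 2-3 the
    y-axis pair, separated by d meters. Returns (theta, r, freqs)."""
    P = np.fft.rfft(frame, axis=0)                # per-microphone spectra
    freqs = np.fft.rfftfreq(frame.shape[0], 1.0 / fs)
    omega = 2.0 * np.pi * freqs
    omega[0] = np.inf                             # avoid division by zero at DC
    Pc = P.mean(axis=1)                           # eq. (3): center-point pressure
    Ux = 1j / (omega * RHO0 * d) * (P[:, 0] - P[:, 1])  # eq. (4)
    Uy = 1j / (omega * RHO0 * d) * (P[:, 2] - P[:, 3])
    Ix = np.real(Pc * np.conj(Ux))                # eq. (2) per axis
    Iy = np.real(Pc * np.conj(Uy))
    return np.arctan2(Iy, Ix), np.hypot(Ix, Iy), freqs

def compensate_bias(theta_biased, f, d, n_grid=3600):
    """Invert eq. (5) by linear interpolation, in the spirit of
    Kallinger et al. [6]: tabulate theta -> theta_biased on a dense
    grid and interpolate back. Valid for 0 < (omega d / c) < pi,
    i.e. f < f_max = c / (2 d), where the mapping is monotone."""
    k = 2.0 * np.pi * f * d / C
    grid = np.linspace(-np.pi, np.pi, n_grid)
    biased = np.arctan2(np.sin(k * np.sin(grid)), np.sin(k * np.cos(grid)))
    return np.interp(theta_biased, biased, grid)
```

For a plane wave travelling along the +x axis, the estimated intensity azimuth at the wave's frequency bin is 0, i.e. the net energy flow direction; whether the source direction is taken as this angle or its opposite is a matter of convention.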
5. RESULTS AND DISCUSSION

In order to compare the methods, the estimated directions are processed as follows. Firstly, anomalies are
removed from the direction estimates. Here, an estimate θ̂_i is considered an anomaly if |θ̂_i − θ_s| exceeds a fixed threshold, where θ_s is the true angle of the sound source.

Figure 3: Receiver (R) and source (S) positions in the concert hall of Pori. Receiver positions R1, R2, and R3 and source positions S1, S2, and S3 were used in the experiments.

The goodness of a method is evaluated with three measures. The first measure is the percentage of anomalies P_AN. The second is the circular bias of the non-anomalous estimates θ̃_i:

    µ_θ̃ = arg{ (1/Ñ) Σ_{i=1}^{Ñ} e^{j(θ̃_i − θ_s)} },

and the third is the circular standard deviation:

    σ_θ̃ = [ 2 ( 1 − | (1/Ñ) Σ_{i=1}^{Ñ} e^{j(θ̃_i − θ_s)} | ) ]^{1/2},

where Ñ is the number of non-anomalous estimates and |·| denotes the absolute value of a real number or the length of a complex number.

The results of the experiments for the white noise source signal are presented in Fig. 4 and for the violin source signal in Fig. 5. As expected, since white noise has energy at all frequencies, it provides more robustness against additive noise and yields more accurate direction estimates than the violin source signal. As one can see from Figs. 4 and 5, the mixture model based estimation methods, VMM and WGM, perform best in all conditions, with VMM having the lowest total number of anomalies and therefore performing best of all the methods. The best estimator from the first class (introduced in Sec. 3.1) is CME. However, the differences in performance between the mixture model based methods and the two simple averaging methods, CME and CMD, are not drastic. The worst estimator in all conditions is clearly MCA: even at the highest SNR its percentage of anomalies remains high.

Figure 4: Bias (top) and standard deviation (middle) of the non-anomalous estimates, as well as the percentage of anomalies (bottom) against SNR.
White noise was used as the source signal and the reverberation time is approximately 1.3 seconds.

VMM is the most robust method against additive noise. In Fig. 5, VMM has less than 5 % anomalies in all the test conditions. Thus, even at high noise levels, VMM is able to find the distribution caused by the sound source. One reason for this might be that, although the total energy of the noise is higher than that of the sound source, the distribution caused by the noise has a much smaller concentration than the distribution caused by the sound source.

The bias of all methods is less than 4° in all cases. With the noise source signal, the bias generally increases as the SNR increases; that is, the error distribution is more biased at high SNR. The reasons behind this unexpected behaviour are not clear. At high SNR the bias is introduced by the reverberation; perhaps it is caused by non-idealities in the microphones. With the violin source signal, however, the trend of the bias is the opposite. The general trend of the standard deviation is that it decreases as the SNR increases, as expected.

As the results indicate, MCA is not a good method for estimating the direction of arrival from a continuous signal, as it very often gives anomalous estimates. This is caused by the weighting with the radial components, since when the weighting is not used, as in CME, the
estimation is close to the actual direction. So, at least in the case of the circular mean, it is not a good idea to use the radial components as the weighting function. Perhaps, if one selected only a certain subset of the azimuth and radial components, one would arrive at a better result. One could also use a maximum-likelihood weighting (with respect to the spectra of the signal and the noise) for the cross-spectral components, as in time delay estimation [5].

Figure 5: Bias (top) and standard deviation (middle) of the non-anomalous estimates, as well as the percentage of anomalies (bottom) against SNR. The source signal was violin and the reverberation time is approximately 1.3 seconds.

6. CONCLUSIONS

Five methods for estimating the direction of a sound source from a set of sound intensity vectors were considered. Two classes of methods were tested: the first class includes simple methods such as the circular mean, and the second class contains methods based on convolutive mixture models. The methods were tested in a real concert hall environment. The results indicate that the methods from the second class perform better and provide more robustness against additive noise than the methods from the first class. In particular, the von Mises distribution was found to suit the problem well.

ACKNOWLEDGEMENTS

The author wishes to thank Dr. Tapio Lokki for the supervision of this work. The research leading to these results has received funding from the Academy of Finland, the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013), and the Helsinki Graduate School in Computer Science and Engineering.

References

[1] J. Merimaa and V. Pulkki, "Spatial Impulse Response Rendering I: Analysis and Synthesis," J. Audio Eng. Soc., vol. 53, no. 12, pp. 1115-1127, 2005.
[2] B. Günel, H. Hacıhabiboğlu, and A.M. Kondoz, "Acoustic Source Separation of Convolutive Mixtures Based on Intensity Vector Statistics," IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 748-756, 2008.
[3] V. Pulkki, "Spatial Sound Reproduction with Directional Audio Coding," J. Audio Eng. Soc., vol. 55, no. 6, pp. 503-516, 2007.
[4] J. Merimaa, Analysis, Synthesis and Perception of Spatial Sound - Binaural Auditory Modeling and Multichannel Loudspeaker Reproduction, Ph.D. thesis, Helsinki University of Technology, 2006.
[5] J. Chen, J. Benesty, and Y. Huang, "Time delay estimation in room acoustic environments: an overview," EURASIP J. Applied Signal Processing, vol. 2006, 2006.
[6] M. Kallinger, F. Kuech, R. Schultz-Amling, G. del Galdo, J. Ahonen, and V. Pulkki, "Enhanced Direction Estimation Using Microphone Arrays for Directional Audio Coding," in Proc. Hands-Free Speech Communication and Microphone Arrays, 2008, pp. 45-48.
[7] N. Madhu and R. Martin, "A Scalable Framework for Multiple Speaker Localization and Tracking," in Proc. Int. Workshop on Acoustic Echo and Noise Control, 2008.
[8] N.I. Fisher, Statistical Analysis of Circular Data, New York: Cambridge University Press, 1993.
[9] Y. Agiomyrgiannakis and Y. Stylianou, "Stochastic Modeling and Quantization of Harmonic Phases in Speech using Wrapped Gaussian Mixture Models," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 2007, vol. 4.
[10] J.A. Nelder and R. Mead, "A Simplex Method for Function Minimization," The Computer Journal, vol. 7, no. 4, pp. 308-313, 1965.