An Inverse-Gamma Source Variance Prior with Factorized Parameterization for Audio Source Separation



HAL Id: hal-025369

To cite this version: Dionyssos Kounades-Bastian, Laurent Girin, Xavier Alameda-Pineda, Sharon Gannot, Radu Horaud. An Inverse-Gamma Source Variance Prior with Factorized Parameterization for Audio Source Separation. 41st IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2016), Mar 2016, Shanghai, China. ICASSP Proceedings, pp. 36-40. Submitted on 8 Jan 2016.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Dionyssos Kounades-Bastian (1), Laurent Girin (1,2), Xavier Alameda-Pineda (3), Sharon Gannot (4), Radu Horaud (1)

(1) INRIA Grenoble Rhône-Alpes, (2) GIPSA-Lab, Univ. Grenoble Alpes, (3) University of Trento, (4) Faculty of Engineering, Bar-Ilan University

ABSTRACT

In this paper we present a new statistical model for the power spectral density (PSD) of an audio signal and its application to multichannel audio source separation (MASS). The source signal is modeled with the local Gaussian model (LGM) and we propose to model its variance with an inverse-gamma distribution, whose scale parameter is factorized as a rank-1 model. We discuss the interest of this approach and evaluate it in a MASS task with underdetermined convolutive mixtures. For this aim, we derive a variational EM algorithm for parameter estimation and source inference. The proposed model shows a benefit in source separation performance compared to a state-of-the-art LGM NMF-based technique.

Index Terms: Audio modeling, local Gaussian model, PSD model, audio source separation.

1. INTRODUCTION

For the past decade, the statistical modeling of audio signals in the time-frequency (TF) domain has been thoroughly investigated. Among the proposed models, the local Gaussian model (LGM) [1] has become very popular because, among other reasons, it can be naturally coupled with models of the signal power spectral density (PSD), which identifies with the signal variance at each TF bin for a zero-mean signal. An important example is the use of non-negative matrix factorization (NMF), which imposes a low-rank structure on the PSD matrix, namely the product of a spectral pattern matrix and a temporal activation matrix [2]. Intuitively, NMF is meant to efficiently represent the structure of the audio signal power in the TF domain with a reduced number of parameters.
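The low-rank structure that NMF imposes on a PSD matrix can be illustrated with a few lines of NumPy. This is a minimal sketch under assumed values: the sizes `F`, `L`, `K` and the random non-negative factors are hypothetical and are not taken from the paper.

```python
import numpy as np

def nmf_psd(W, H):
    """Conventional NMF PSD model: spectral patterns (F x K) times
    temporal activations (K x L) gives an F x L PSD matrix."""
    return W @ H

rng = np.random.default_rng(0)
F, L, K = 64, 100, 4                      # hypothetical sizes, K << min(F, L)
W = rng.uniform(0.1, 1.0, size=(F, K))    # spectral pattern matrix (assumed random)
H = rng.uniform(0.1, 1.0, size=(K, L))    # temporal activation matrix (assumed random)

V = nmf_psd(W, H)                         # modeled PSD, non-negative by construction
rank_V = np.linalg.matrix_rank(V)         # cannot exceed K
n_params = W.size + H.size                # K * (F + L) parameters instead of F * L
```

Here the PSD matrix has `F * L` entries but is described by only `K * (F + L)` parameters, at the price of a rank that is bounded by `K`.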
The NMF factors were first treated as parameters [3, 4, 5, 2], and then as latent variables within a Bayesian framework [6, 2, 7, 8, 9]. These models have been successfully applied to audio source separation. In MASS configurations, the source signal models are combined with a mixing model, accounting for the source-to-sensor channels, e.g. [10, 11, 12, 13, 14]. In general, an EM algorithm is derived to estimate the source and channel parameters, which are then used to construct demixing Wiener filters. Besides, a general framework for inserting prior information about the sources in TF-domain MASS has been proposed in [15].

In current Bayesian NMF PSD models, the source PSD is first modeled with NMF and, second, the NMF factors are assigned a prior distribution. In this paper, we propose to reverse this order: the source is still seen as a sum of components, but we first assign a prior distribution to the component PSD and then assume a factorized model, reminiscent of NMF, on the parameters of this distribution. More precisely, we model the component PSD with an inverse-gamma (IG) distribution and assume that the scale parameter of this IG follows a rank-1 NMF. As explained in Section 2, this adds flexibility to the modeling of the source PSD matrix, compared to conventional NMF, while preserving the ability to model structured source PSDs. We apply the proposed model to MASS from underdetermined convolutive mixtures.

The proposed model is presented in Section 2. The associated variational EM (VEM) algorithm that we derived to estimate the model parameters and infer the source signals is described in Section 3. Experimental evaluation reported in Section 4 shows competitive performance in comparison to the state-of-the-art LGM-NMF MASS method of [10].

This research has received funding from the EU-FP7 STREP project EARS # and ERC Advanced Grant VHIA #340113.

2. MODELS

2.1. The mixing model

As usually done in the MASS literature, we work under the narrow-band assumption, which allows us to write a time-invariant convolutive mixture in the short-term Fourier transform (STFT) domain, at frequency bin $f \in [1, F]$ and frame $\ell \in [1, L]$, as:

$$x_{f\ell} = A_f s_{f\ell} + b_{f\ell}, \qquad (1)$$

where $x_{f\ell} = [x_{1,f\ell}, \dots, x_{I,f\ell}]^\top \in \mathbb{C}^I$ is the $I$-channel observation vector, $s_{f\ell} = [s_{1,f\ell}, \dots, s_{J,f\ell}]^\top \in \mathbb{C}^J$ is the source vector to be inferred, $A_f \in \mathbb{C}^{I \times J}$ is a mixing matrix (parameter to be estimated), and $b_{f\ell} \in \mathbb{C}^I$ is the sensor noise. We assume $p(b_{f\ell}) = \mathcal{N}_c(b_{f\ell}; 0, v_f I_I)$, where $\mathcal{N}_c(x; \mu, \Sigma) = |\pi\Sigma|^{-1} \exp\left(-[x-\mu]^H \Sigma^{-1} [x-\mu]\right)$ is the proper complex Gaussian distribution with $x \in \mathbb{C}^I$, $\mu \in \mathbb{C}^I$ and $\Sigma \in \mathbb{C}^{I \times I}$, and with $v_f \in \mathbb{R}_+$

being a variance parameter to be estimated ($I_I$ is the identity matrix of dimension $I$). The above assumption implies $p(x_{f\ell} | s_{f\ell}) = \mathcal{N}_c(x_{f\ell}; A_f s_{f\ell}, v_f I_I)$. Note that the above mixture can be underdetermined, i.e. we can have $I < J$.

2.2. The source model

We embrace the LGM framework [1, 10] where $s_{j,f\ell} \in \mathbb{C}$ is assumed to follow a zero-mean proper complex Gaussian distribution. Moreover, $s_{j,f\ell}$ is assumed to be the sum of elementary components $c_{k,f\ell} \in \mathbb{C}$, also zero-mean proper complex Gaussian:

$$s_{j,f\ell} = \sum_{k \in \mathcal{K}_j} c_{k,f\ell} \quad \Leftrightarrow \quad s_{f\ell} = G c_{f\ell}, \qquad (2)$$

where $\mathcal{K}_j$ is a subset of a nontrivial partition $\mathcal{K} = \{\mathcal{K}_j\}_{j=1}^{J}$ (known in advance) of the $K$ components into the $J$ sources, $c_{f\ell} = [c_{1,f\ell}, \dots, c_{K,f\ell}]^\top \in \mathbb{C}^K$ is the vector of component coefficients, and $G \in \mathbb{N}^{J \times K}$ is a binary matrix with entries $G_{jk} = 1$ if $k \in \mathcal{K}_j$ and $G_{jk} = 0$ otherwise. Finally, as in [10], we assume that all $\{c_{k,f\ell}\}_{f,\ell,k=1}^{F,L,K}$ are independent, with:

$$p(c_{k,f\ell} | u_{k,f\ell}) = \mathcal{N}_c(c_{k,f\ell}; 0, u_{k,f\ell}). \qquad (3)$$

In particular, the source PSD at each TF bin is the sum of the individual component PSDs at that bin.

2.3. The component PSD model

Traditionally, the component PSD (or component variance) $u_{k,f\ell}$ is assumed to factorise over $f$ and $\ell$, i.e. an NMF model is applied on the source PSD directly. In a Bayesian framework, the NMF factors are assigned a prior distribution. The main contribution of this paper is to reverse the traditional order: we first assume a prior distribution for the component variance and then impose a nonnegative factorized structure on its parameters. More precisely, we assume that each entry $u_{k,f\ell}$ of the component PSD matrix follows an inverse-gamma (IG) distribution (see footnote 2):

$$p(u_{k,f\ell}) = \mathcal{IG}(u_{k,f\ell}; \gamma_k, \delta_{k,f\ell}), \quad \text{with } \delta_{k,f\ell} = w_{kf} h_{k\ell}, \qquad (4)$$

where $\gamma_k, w_{kf}, h_{k\ell} \in \mathbb{R}_+$. The choice of the IG distribution emerges naturally, as it is the conjugate prior of the variance of a Gaussian. The factorization of the scale parameter $\delta_{k,f\ell}$ into a rank-1 model is a key point of our model. Indeed, modeling the parameters of the component PSD prior (for instance $\delta_{k,f\ell}$) with a rank-1 model, instead of the component PSD $u_{k,f\ell}$ itself, allows the latter not to be constrained to have a low-rank structure.
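The difference between factorizing the scale parameter and factorizing the PSD itself can be checked numerically. The sketch below draws one component PSD matrix from the prior (4) for a single component; the sizes, the shape value `gamma_k` and the random factors `w`, `h` are assumptions chosen for illustration. It relies on the standard fact that if $y$ follows a Gamma distribution with shape $\gamma$ and unit scale, then $\delta / y$ follows $\mathcal{IG}(\gamma, \delta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
F, L = 16, 24                       # hypothetical numbers of frequency bins and frames
gamma_k = 3.0                       # IG shape parameter (assumed value)
w = rng.uniform(0.5, 2.0, size=F)   # spectral factor w_{kf} for one component k
h = rng.uniform(0.5, 2.0, size=L)   # temporal factor h_{kl} for the same component

# Rank-1 scale parameter: delta_{k,fl} = w_{kf} * h_{kl}
delta = np.outer(w, h)

# Draw u_{k,fl} ~ IG(gamma_k, delta_{k,fl}) as delta / Gamma(shape=gamma_k, scale=1)
u = delta / rng.gamma(shape=gamma_k, scale=1.0, size=(F, L))

rank_delta = np.linalg.matrix_rank(delta)   # the scale matrix is rank-1 by construction
rank_u = np.linalg.matrix_rank(u)           # the sampled PSD is not constrained to low rank
```

The scale matrix `delta` carries the structured, NMF-like part of the model with only `F + L` parameters, while the sampled PSD `u` fluctuates freely around it.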
Therefore, with the proposed model, both the component PSD and the source PSD can be full-rank, as opposed to conventional NMF. In the meantime, the proposed model keeps a limited number of parameters and an ability to represent structured signals, in the spirit of conventional NMF. Finally, we postulate that the proposed model has the potential to better represent natural audio signals, such as speech. As for the IG shape parameter $\gamma_k$, it intuitively acts as a measure of the relevance of the $k$-th component: high (resp. low) values of $\gamma_k$ decrease (resp. increase) the contribution of the $k$-th component.

Footnote 2: The inverse-gamma distribution is defined as $\mathcal{IG}(u; \gamma, \delta) = \frac{\delta^{\gamma}}{\Gamma(\gamma)} u^{-(\gamma+1)} \exp\left(-\frac{\delta}{u}\right)$, with support $u \in \mathbb{R}_+$, shape parameter $\gamma \in \mathbb{R}_+$, scale parameter $\delta \in \mathbb{R}_+$, and $\Gamma$ being the Gamma function.

Fig. 1. Graphical representation of the probabilistic model (nodes: $u_{k,f\ell}$, $\gamma_k$, $w_{kf}$, $h_{k\ell}$, $c_{k,f\ell}$, $x_{f\ell}$, $v_f$, $A_f$). Latent variables are represented with circles, observations with double circles, deterministic parameters with rectangles.

3. VARIATIONAL INFERENCE

We propose an EM algorithm to perform inference of the hidden variables $\mathcal{H} = \{c_{f\ell}, u_{k,f\ell}\}_{f,\ell,k=1}^{F,L,K}$ and estimation of the parameters $\theta = \{A_f, v_f, \gamma_k, w_{kf}, h_{k\ell}\}_{f,\ell,k=1}^{F,L,K}$. As the E-step does not admit a closed-form solution, we use variational inference. Let $q(\mathcal{H}_0) = p(\mathcal{H}_0 | \{x_{f\ell}\}_{f,\ell=1}^{F,L}; \theta)$ denote the posterior distribution of a variable $\mathcal{H}_0 \subseteq \mathcal{H}$. First, $q(\mathcal{H})$ is imposed to factorise as

$$q(\mathcal{H}) \approx \prod_{f,\ell=1}^{F,L} q(c_{f\ell}) \prod_{f,\ell,k=1}^{F,L,K} q(u_{k,f\ell}).$$

Then $q(\mathcal{H}_0)$ is inferred with:

$$q(\mathcal{H}_0) \propto \exp\left( E_{q(\mathcal{H}/\mathcal{H}_0)} \left[ \log p\left(\mathcal{H}, \{x_{f\ell}\}_{f,\ell=1}^{F,L}; \theta\right) \right] \right), \qquad (5)$$

where $E_{q(z)}[f(z)]$ is the expectation of the functional $f(z)$ w.r.t. the distribution $q(z)$ over the support of the variable $z$, and where $q(\mathcal{H}/\mathcal{H}_0)$ is the joint posterior distribution of all hidden variables except $\mathcal{H}_0$. The resulting E-step is the alternating inference of $q(c_{f\ell})$ (E-C, for all $f, \ell$) and $q(u_{k,f\ell})$ (E-U, for all $f, \ell, k$).

3.1. E-step

Let the superscript $(r)$ denote the VEM iteration index, i.e. $\theta^{(r)}$ are the parameters computed at the $r$-th iteration.

E-U-step: First we consider the inference of $q(u_{k,f\ell})$.
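Using (5), one can easily identify $q(u_{k,f\ell})$ to be also an IG:

$$q(u_{k,f\ell}) \propto p(u_{k,f\ell}) \exp\left( E_{q(c_{f\ell})} \left[ \log p(c_{k,f\ell} | u_{k,f\ell}) \right] \right) = \mathcal{IG}\left(u_{k,f\ell}; g_k^{(r)}, d_{k,f\ell}^{(r)}\right), \qquad (6)$$

with posterior parameters $g_k^{(r)}, d_{k,f\ell}^{(r)} \in \mathbb{R}_+$ calculated as:

$$g_k^{(r)} = \gamma_k^{(r-1)} + 1, \qquad d_{k,f\ell}^{(r)} = \delta_{k,f\ell}^{(r-1)} + \Sigma_{kk,f\ell}^{c(r)} + \left|\hat{c}_{k,f\ell}^{(r)}\right|^2, \qquad (7)$$

where $\Sigma_{kk,f\ell}^{c(r)} \in \mathbb{R}_+$ is the $k$-th diagonal entry of $\Sigma_{f\ell}^{c(r)}$, and $\hat{c}_{k,f\ell}^{(r)} \in \mathbb{C}$ is the $k$-th entry of $\hat{c}_{f\ell}^{(r)}$, both being calculated below.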
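The identification in (6)-(7) follows from conjugacy. A short derivation, keeping only the terms that depend on $u_{k,f\ell}$, is sketched below. Iteration superscripts are dropped for readability, and the second-order moment of $c_{k,f\ell}$ under the Gaussian posterior $q(c_{f\ell})$ is written $E[|c_{k,f\ell}|^2] = \Sigma_{kk,f\ell}^{c} + |\hat{c}_{k,f\ell}|^2$, which is a standard Gaussian identity rather than a statement taken from the text.

```latex
\begin{align*}
\log q(u_{k,f\ell})
  &= \log p(u_{k,f\ell})
     + E_{q(c_{f\ell})}\!\left[\log p(c_{k,f\ell} \mid u_{k,f\ell})\right]
     + \text{const} \\
  &= \underbrace{-(\gamma_k + 1)\log u_{k,f\ell} - \frac{\delta_{k,f\ell}}{u_{k,f\ell}}}_{\text{IG prior (4)}}
     \;\underbrace{-\,\log u_{k,f\ell} - \frac{E\!\left[|c_{k,f\ell}|^2\right]}{u_{k,f\ell}}}_{\text{Gaussian likelihood (3)}}
     + \text{const} \\
  &= -\big((\gamma_k + 1) + 1\big)\log u_{k,f\ell}
     - \frac{\delta_{k,f\ell} + \Sigma_{kk,f\ell}^{c} + |\hat{c}_{k,f\ell}|^2}{u_{k,f\ell}}
     + \text{const}.
\end{align*}
```

The last line is the log-density of an inverse-gamma distribution with shape $\gamma_k + 1$ and scale $\delta_{k,f\ell} + \Sigma_{kk,f\ell}^{c} + |\hat{c}_{k,f\ell}|^2$, which matches (7).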

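E-C-step: Using (5), $q(c_{f\ell})$ can be identified to be complex Gaussian:

$$q(c_{f\ell}) \propto p(x_{f\ell} | G c_{f\ell}) \prod_{k=1}^{K} \exp\left( E_{q(u_{k,f\ell})} \left[ \log p(c_{k,f\ell} | u_{k,f\ell}) \right] \right) = \mathcal{N}_c\left(c_{f\ell}; \hat{c}_{f\ell}^{(r)}, \Sigma_{f\ell}^{c(r)}\right). \qquad (8)$$

The posterior covariance matrix $\Sigma_{f\ell}^{c(r)} \in \mathbb{C}^{K \times K}$ and component vector estimate $\hat{c}_{f\ell}^{(r)} \in \mathbb{C}^{K}$ are given by:

$$\Sigma_{f\ell}^{c(r)} = \left[ \mathrm{diag}_K\!\left( \frac{g_k^{(r-1)}}{d_{k,f\ell}^{(r-1)}} \right) + \frac{G^\top A_f^{(r-1)H} A_f^{(r-1)} G}{v_f^{(r-1)}} \right]^{-1}, \qquad \hat{c}_{f\ell}^{(r)} = \Sigma_{f\ell}^{c(r)} \frac{G^\top A_f^{(r-1)H} x_{f\ell}}{v_f^{(r-1)}}, \qquad (9)$$

where $\mathrm{diag}_K(x_k)$ is the $K \times K$ diagonal matrix with entries $x_1, \dots, x_K$. Eq. (9) corresponds to the Wiener filtering of the components, thus a similar result as in [10] except for the construction of the component posterior covariance matrix.

Estimating the source coefficients: Now, using (2), it is easy to calculate the source posterior distribution, which, as one expects, is a complex Gaussian with mean $\hat{s}_{f\ell}^{(r)} \in \mathbb{C}^{J}$ and 2nd-order moment $R_{f\ell}^{s(r)} \in \mathbb{C}^{J \times J}$ calculated as:

$$\hat{s}_{f\ell}^{(r)} = G \hat{c}_{f\ell}^{(r)}, \qquad R_{f\ell}^{s(r)} = G \Sigma_{f\ell}^{c(r)} G^\top + \hat{s}_{f\ell}^{(r)} \hat{s}_{f\ell}^{(r)H}. \qquad (10)$$

3.2. M-step

As for the M-step, the parameters maximizing the expected complete-data log-likelihood are computed.

M-A step: The optimal value for the filters is:

$$A_f^{(r)} = \left( \sum_{\ell=1}^{L} x_{f\ell} \hat{s}_{f\ell}^{(r)H} \right) \left( \sum_{\ell=1}^{L} R_{f\ell}^{s(r)} \right)^{-1}, \qquad (11)$$

which is a standard form of least-squares estimator [10].

M-v step: The optimal noise variance is:

$$v_f^{(r)} = \frac{1}{LI} \sum_{\ell=1}^{L} \left( x_{f\ell}^H x_{f\ell} - 2\,\mathrm{Re}\left\{ x_{f\ell}^H A_f^{(r)} \hat{s}_{f\ell}^{(r)} \right\} + \mathrm{tr}\left\{ A_f^{(r)} R_{f\ell}^{s(r)} A_f^{(r)H} \right\} \right), \qquad (12)$$

where $\mathrm{tr}\{\cdot\}$ is the trace operator.

M-IG step: The IG parameters $\gamma_k, w_{kf}, h_{k\ell}$ are coupled in the objective function and thus an alternating optimization strategy is required, i.e. fixing two parameters to estimate the third. The updates for $w_{kf}, h_{k\ell}$ are:

$$w_{kf}^{(r)} = \frac{L \gamma_k^{(r-1)}}{g_k^{(r)} \sum_{\ell=1}^{L} \dfrac{h_{k\ell}^{(r-1)}}{d_{k,f\ell}^{(r)}}}, \qquad h_{k\ell}^{(r)} = \frac{F \gamma_k^{(r-1)}}{g_k^{(r)} \sum_{f=1}^{F} \dfrac{w_{kf}^{(r)}}{d_{k,f\ell}^{(r)}}}. \qquad (13)$$

Algorithm 1: Separation of J static sound sources
  input: $\{x_{f\ell}\}_{f,\ell=1}^{F,L}$, binary matrix $G$, initial parameters $\theta^{(0)}$.
  initialise IG parameters $\{g_k^{(0)}, d_{k,f\ell}^{(0)}\}_{f,\ell,k=1}^{F,L,K}$, set $r = 1$.
  repeat
    E-C step: compute $\Sigma_{f\ell}^{c(r)}$ and $\hat{c}_{f\ell}^{(r)}$ with (9); compute $\hat{s}_{f\ell}^{(r)}$ and $R_{f\ell}^{s(r)}$ with (10).
    E-U step: calculate $g_k^{(r)}$ and $d_{k,f\ell}^{(r)}$ with (7).
    M-A step: update $A_f^{(r)}$ with (11).
    M-v step: update $v_f^{(r)}$ with (12).
    M-IG step: update $w_{kf}^{(r)}, h_{k\ell}^{(r)}$ with (13); calculate $\delta_{k,f\ell}^{(r)} = w_{kf}^{(r)} h_{k\ell}^{(r)}$; update $\gamma_k^{(r)}$ with (15).
    set $r = r + 1$.
  until convergence
  return the estimated source images.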
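One pass of the loop in Algorithm 1 can be sketched in NumPy as follows. This is a minimal sketch and not the authors' implementation: the function name `vem_iteration`, the array layout, and the toy data in the usage lines are assumptions made for illustration. The updates follow (7) and (9)-(13), and the shape update is the closed form referred to as (15) in Algorithm 1, i.e. $\gamma_k = FL / \sum_{f,\ell} \log(d_{k,f\ell}/\delta_{k,f\ell})$.

```python
import numpy as np

def vem_iteration(X, G, A, v, gamma, W, H, g, d):
    """One VEM iteration (E-C, E-U, M-A, M-v, M-IG). Assumed array layout:
    X (F,L,I) complex, G (J,K) binary, A (F,I,J) complex, v (F,),
    gamma (K,), W (K,F), H (K,L), g (K,), d (K,F,L)."""
    F, L, I = X.shape
    K = G.shape[1]

    # E-C step, eq. (9): posterior covariance and mean of the components.
    B = A @ G                                          # A_f G, shape (F,I,K)
    BH = np.conj(np.swapaxes(B, 1, 2))                 # (A_f G)^H, shape (F,K,I)
    inv_u = np.moveaxis(g[:, None, None] / d, 0, -1)   # E[1/u] = g/d, shape (F,L,K)
    prec = (BH @ B / v[:, None, None])[:, None] + np.eye(K) * inv_u[..., None, :]
    Sigma = np.linalg.inv(prec)                        # shape (F,L,K,K)
    rhs = np.einsum('fki,fli->flk', BH, X) / v[:, None, None]
    c_hat = np.einsum('flkj,flj->flk', Sigma, rhs)     # shape (F,L,K)

    # Source posterior moments, eq. (10).
    s_hat = c_hat @ G.T                                # shape (F,L,J)
    R_s = G @ Sigma @ G.T + s_hat[..., :, None] * np.conj(s_hat[..., None, :])

    # E-U step, eq. (7): inverse-gamma posterior parameters.
    delta = W[:, :, None] * H[:, None, :]              # shape (K,F,L)
    Sigma_kk = np.real(np.einsum('flkk->kfl', Sigma))
    g_new = gamma + 1.0
    d_new = delta + Sigma_kk + np.abs(np.moveaxis(c_hat, -1, 0)) ** 2

    # M-A step, eq. (11).
    num = np.einsum('fli,flj->fij', X, np.conj(s_hat))
    A_new = num @ np.linalg.inv(R_s.sum(axis=1))

    # M-v step, eq. (12).
    As = np.einsum('fij,flj->fli', A_new, s_hat)
    resid = (np.sum(np.abs(X) ** 2, axis=-1)
             - 2.0 * np.real(np.sum(np.conj(X) * As, axis=-1))
             + np.real(np.einsum('fij,fljk,fik->fl', A_new, R_s, np.conj(A_new))))
    v_new = resid.sum(axis=1) / (L * I)

    # M-IG step, eq. (13), then the closed-form shape update (15).
    W_new = L * gamma[:, None] / (g_new[:, None] * np.sum(H[:, None, :] / d_new, axis=2))
    H_new = F * gamma[:, None] / (g_new[:, None] * np.sum(W_new[:, :, None] / d_new, axis=1))
    delta_new = W_new[:, :, None] * H_new[:, None, :]
    gamma_new = F * L / np.sum(np.log(d_new / delta_new), axis=(1, 2))
    return A_new, v_new, gamma_new, W_new, H_new, g_new, d_new, s_hat

# Toy usage with hypothetical sizes and random data (not the paper's setup).
rng = np.random.default_rng(0)
F, L, I, J, K_per = 8, 20, 2, 3, 2
K = J * K_per
G = np.kron(np.eye(J, dtype=int), np.ones((1, K_per), dtype=int))   # (J,K) partition matrix
X = rng.standard_normal((F, L, I)) + 1j * rng.standard_normal((F, L, I))
A0 = np.ones((F, I, J), dtype=complex)       # filters initialised to ones
v0 = np.full(F, 0.1)                         # arbitrary initial noise variance
gamma0 = np.ones(K)
W0 = rng.uniform(0.5, 1.5, size=(K, F))
H0 = rng.uniform(0.5, 1.5, size=(K, L))
d0 = W0[:, :, None] * H0[:, None, :]         # d^0 = delta^0
g0 = gamma0 + 1.0                            # g^0 = gamma + 1
A1, v1, gamma1, W1, H1, g1, d1, s_hat = vem_iteration(X, G, A0, v0, gamma0, W0, H0, g0, d0)
```

In a full run the returned values would be fed back into `vem_iteration` until convergence. The batched matrix inverse keeps the sketch short; for large `K` a per-bin linear solve would be a more economical choice.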
Then we set $\delta_{k,f\ell}^{(r)} = w_{kf}^{(r)} h_{k\ell}^{(r)}$, and the update for $\gamma_k^{(r)}$ is the solution w.r.t. $\gamma_k^{(r)}$ to:

$$\psi\left(g_k^{(r)}\right) - \psi\left(\gamma_k^{(r)}\right) = \frac{1}{FL} \sum_{f,\ell=1}^{F,L} \log \frac{d_{k,f\ell}^{(r)}}{\delta_{k,f\ell}^{(r)}}, \qquad (14)$$

where $\psi(\cdot)$ is the digamma function. Since (14) has no closed-form solution, we propose to approximate $g_k^{(r)}$ with $\gamma_k^{(r)} + 1$, relying on (7), and use the recurrence relation of the digamma function $\psi(x + 1) = \psi(x) + \frac{1}{x}$. This leads to the following update rule:

$$\gamma_k^{(r)} = \frac{FL}{\sum_{f,\ell=1}^{F,L} \log \dfrac{d_{k,f\ell}^{(r)}}{\delta_{k,f\ell}^{(r)}}}. \qquad (15)$$

3.3. Estimation of source images

Considering the inherent scale indeterminacy of the source separation problem, we rather measure the separation performance using the time-domain source images, i.e. the estimates of the source signals as recorded at the microphones [..., 16]. These are calculated by applying the inverse STFT with overlap-add on $\{a_{j,f} \hat{s}_{j,f\ell}\}_{f,\ell=1}^{F,L}$, where $a_{j,f}$ is the $j$-th column of $A_f$. The complete VEM procedure can be found in Algorithm 1.

4. EXPERIMENTS

To assess the performance of the proposed algorithm, we simulated the challenging task of separating $J = 3$ sources from a convolutive stereo mixture ($I = 2$). Source signals were 2 s speech signals randomly chosen from the TIMIT database [17]. As mixing filters, we used binaural room

impulse responses (BRIRs) from [18] truncated to 52 taps with reverberation time of RT [...] s. Two sets of BRIRs were used, corresponding respectively to azimuths 85, 20, 60 (Mix-1), and azimuths 45, 75, 0 (Mix-2). Standard sound separation measures, namely signal-to-distortion ratio (SDR) and signal-to-interference ratio (SIR) [19], were computed. All reported results are average measures over 8 sets of utterances for each mix.

To ensure a fair comparison with the baseline method [10], we provided both algorithms with the same initial information. The mixing filters were blindly initialized to $A_f^{(0)}$ = a matrix filled with ones, and we set $v_f^{(0)}$ proportional to the average observed power, $v_f^{(0)} \propto \frac{1}{FLI} \sum_{f,\ell=1}^{F,L} x_{f\ell}^H x_{f\ell}$. As for the NMF parameters, each source was corrupted with the sum of the two other sources at two different SNRs ($R$ = 0 dB or 10 dB). An initial NMF decomposition $\{w_{kf}^{\mathrm{init}}, h_{k\ell}^{\mathrm{init}}\}_{f,\ell,k}$ was then computed for each corrupted source PSD using the KL-NMF algorithm [2], with $|\mathcal{K}_j| = 20$ components per source. $\{w_{kf}^{\mathrm{init}}, h_{k\ell}^{\mathrm{init}}\}_{f,\ell,k}$ were also used to initialize the NMF parameters of the baseline method. We set $\delta_{k,f\ell}^{(0)} = w_{kf}^{\mathrm{init}} h_{k\ell}^{\mathrm{init}}$, $\gamma_k = 1$ and $d_{k,f\ell}^{(0)} = \delta_{k,f\ell}^{(0)}$, thus $g_k^{(0)} = 2$ and $E_{\mathcal{IG}(u_{k,f\ell}; g_k^{(0)}, d_{k,f\ell}^{(0)})}[u_{k,f\ell}] = w_{kf}^{\mathrm{init}} h_{k\ell}^{\mathrm{init}}$. We run 100 iterations.

Table 1. Average SDR and SIR scores (dB) for Mix-1 and Mix-2. Columns: R (dB), algorithm (Prop., Base), SDR for s1, s2, s3, SIR for s1, s2, s3 (numerical entries not recoverable).

Fig. 2. Average SDR score as a function of VEM iterations (Mix-1, R = 0 dB).

Fig. 2 shows the average SDR obtained at each iteration for Mix-1 with $R$ = 0 dB. We observe a fairly regular evolution of the SDR scores, which have largely stabilized after 100 iterations (after some possible decrease, since the VEM does not guarantee a monotonic evolution of the separation scores, as opposed to the likelihood). For this mix, the proposed method shows a notable improvement over the baseline: up to 3.1 dB for $s_2$ (recall that these scores are averaged over 8 experiments with the same filters but different sources). Final performance (at iteration 100) for the other configurations, and for SIR scores, is reported in Table 1.
There we see that for Mix-1 and $R$ = 0 dB (the configuration of Fig. 2), the SIR improvement is in line with the SDR improvement: the proposed model outperforms the baseline by 5.8 dB for $s_2$, while the results for the two other sources are less impressive. Such a substantial improvement of SDR and SIR may be due to the added flexibility of the proposed PSD model compared with NMF (see Section 2.3). As for Mix-1 with $R$ = 10 dB, all scores are higher because the NMF initialization is closer to the true source PSDs. Here, the proposed method also notably and systematically outperforms the baseline method. The results are more mixed with the Mix-2 configuration: here both the SDR and SIR scores of the two methods are more intertwined. Note that the scores of the proposed method are remarkably similar across the two mixes, as opposed to the scores of the baseline method. This seems to indicate that the proposed method is robust to the mixing configuration, but further investigation must be conducted before concluding on this. Globally, the overall results encourage us to further investigate the potential of this full-rank PSD modeling, for source separation and beyond.

5. CONCLUSION

MASS experiments have shown the potential of the proposed model for TF-domain statistical signal modeling. Future research will concern an in-depth analysis of the proposed PSD model per se, i.e. to model audio signals independently of the MASS context. This should include a comparative study with conventional parametric and Bayesian NMF. Also, we will investigate the characterization of component relevance from the estimated shape parameter, and its use within a model selection task when the number of source components is unknown, see, e.g. [20]. As for the MASS task, we will explore more realistic initialization techniques, e.g. using the output of existing source separation techniques, leading to a more realistic sound separation algorithm and, again, a more systematic comparison with other source PSD models plugged into the LGM-based MASS framework.

6. REFERENCES

[1] A. Liutkus, R. Badeau, and G. Richard, "Gaussian processes for underdetermined source separation," IEEE Transactions on Signal Processing, vol. 59, no. 7, 2011.
[2] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis," Neural Computation, vol. 21, no. 3, 2009.
[3] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, 1999.
[4] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001.
[5] P. Smaragdis and J. Brown, "Non-negative matrix factorization for polyphonic music transcription," in IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, 2003.
[6] T. Virtanen, S. Godsill et al., "Bayesian extensions to non-negative matrix factorisation for audio signal modelling," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008.
[7] N. Bertin, R. Badeau, and E. Vincent, "Enforcing harmonicity and smoothness in Bayesian non-negative matrix factorization applied to polyphonic music transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, 2010.
[8] M. Hoffman, D. Blei, and P. Cook, "Bayesian nonparametric matrix factorization for recorded music," in International Conference on Machine Learning, 2010.
[9] N. Mohammadiha, J. Taghia, and A. Leijon, "Single channel speech enhancement using Bayesian NMF with recursive temporal updates of prior distributions," in IEEE Int. Conf. Acoustics, Speech, Signal Processing, Kyoto, Japan, 2012.
[10] A. Ozerov and C. Févotte, "Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, 2010.
[11] N. Duong, E. Vincent, and R. Gribonval, "Underdetermined reverberant audio source separation using a full-rank spatial covariance model," IEEE Trans. on Audio, Speech, and Language Proc., vol. 18, no. 7, 2010.
[12] S. Arberet, A. Ozerov, N. Q. K. Duong, E. Vincent, R. Gribonval, F. Bimbot, and P. Vandergheynst, "Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation," in International Conference on Information Sciences, Signal Processing, and their Applications, 2010.
[13] T. Higuchi, N. Takamune, N. Tomohiko, and H. Kameoka, "Underdetermined blind separation and tracking of moving sources based on DOA-HMM," in IEEE International Conference on Audio, Speech and Signal Processing, 2014.
[14] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, and R. Horaud, "A variational EM algorithm for the separation of moving sound sources," in IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics, 2015.
[15] A. Ozerov, E. Vincent, and F. Bimbot, "A general flexible framework for the handling of prior information in audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 4, pp. 1118-1133, 2012.
[16] N. Sturmel, A. Liutkus, J. Pinel, L. Girin, S. Marchand, G. Richard, R. Badeau, and L. Daudet, "Linear mixing models for active listening of music productions in realistic studio conditions," in Convention of the Audio Engineering Society (AES), Budapest, Hungary, 2012.
[17] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, N. L. Dahlgren, and V. Zue, "TIMIT acoustic-phonetic continuous speech corpus," 1993, Linguistic Data Consortium, Philadelphia.
[18] C. Hummersone, R. Mason, and T. Brookes, "A comparison of computational precedence models for source separation in reverberant environments," Journal of the Audio Engineering Society, vol. 61, no. 7/8, 2013.
[19] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, 2006.
[20] V. Y. Tan and C. Févotte, "Automatic relevance determination in nonnegative matrix factorization with the β-divergence," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, 2013.
