A Source Localization/Separation/Respatialization System Based on Unsupervised Classification of Interaural Cues


1 A Source Localization/Separation/Respatialization System Based on Unsupervised Classification of Interaural Cues
Joan Mouba and Sylvain Marchand
SCRIME LaBRI, University of Bordeaux 1
firstname.name@labri.fr

2 Outline
1 Overview
2 Background
3 CASA-EM Methods
4 Results
5 Summary and Future Work


4 Overview
Given binaural audio mixtures, the system:
- detects more than 4 sources;
- localizes each source (azimuth);
- reconstructs each source.
Given a mono audio source, the system:
- generates a stereo source;
- positions the source at any location.
Based on interaural cues (ILD, ITD) and an Expectation-Maximization approach.


6 Motivation
Why?
- Binaural manipulation of sources in a mix
- Underdetermined (degenerate) case
Applications
- Virtual reality, hearing aids, live music...
CASA-EM
- Subject independent
- Automatic processing
- Time-frequency processing

7 Problem Statement I
Hypothesis: sources do not overlap in the t-f plane.
Windowed Disjoint Orthogonality:
$S_i(l, f)\, S_j(l, f) = 0, \quad \forall i, j = 1, \dots, K, \; i \neq j$
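The disjointness hypothesis can be checked numerically on two known source spectrograms. The sketch below is only an illustration: the nested-list layout (frames by frequency bins) and the function name `wdo_overlap` are assumptions, not part of the presented system.

```python
def wdo_overlap(S_i, S_j):
    """Sum of |S_i(l, f) * S_j(l, f)| over all t-f bins.

    The result is zero exactly when the two sources never share a bin,
    which is the Windowed Disjoint Orthogonality condition.
    """
    return sum(
        abs(a * b)
        for frame_i, frame_j in zip(S_i, S_j)
        for a, b in zip(frame_i, frame_j)
    )


# Toy spectrograms: 2 frames x 2 frequency bins (hypothetical values).
S1 = [[1 + 0j, 0j], [0j, 2j]]
S2 = [[0j, 3 + 0j], [0.5 + 0j, 0j]]   # never active where S1 is active
S3 = [[1 + 0j, 1 + 0j], [0j, 0j]]     # shares bin (0, 0) with S1

disjoint = wdo_overlap(S1, S2)   # 0.0: hypothesis holds
colliding = wdo_overlap(S1, S3)  # 1.0: one overlapping bin
```

For real recordings the overlap is rarely exactly zero, so a value that is small relative to the total energy is the practical reading of the hypothesis.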

8 Problem Statement II
Consequences
- Detection/localization of phantom sources
- Spreading of the cumulated energy
- Interferences and distortions

9 Related Works
DUET [Rickard (2002)]
- Computes ILD(l, f), ITD(l, f)
- 2-dimensional power histogram (ITD × ILD)
[Viste (2003, 2004)]
- Estimates azimuth θ given interaural cues
- 1-dimensional power histogram (θ)
[Avendano (2003)]
- Interchannel metric: panning index
- Separation based on Gaussian window
[Kameoka (2004)]
- Spectrum density with tied Gaussian mixture
- Separation of harmonic structures

10 Head Model
ILD with shadow cast [Viste & Evangelista (2003)]:
$L(\theta, f) = \alpha_f \, \frac{\sin\theta}{c}$
ITD with shadow cast [Viste & Evangelista (2003)]:
$T(\theta, f) = \beta_f \, \frac{r(\sin\theta + \theta)}{c}$
r: head radius; c: speed of sound

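11 Source Localization
Computes interaural cues:
$\mathrm{ILD}(t, f) = 20 \log_{10} \left| \frac{X_R(t, f)}{X_L(t, f)} \right|$ ; $\mathrm{ITD}_p(t, f) = \frac{1}{2\pi f} \left( \angle \frac{X_R(t, f)}{X_L(t, f)} + 2\pi p \right)$
Computes azimuth based on ILD and ITD:
$\theta_L(t, f) = \arcsin\left( \frac{c \, \mathrm{ILD}(t, f)}{\alpha_f} \right)$ ; $\theta_{T,p}(t, f) = \Pi\left( \frac{c \, \mathrm{ITD}_p(t, f)}{r \beta_f} \right)$
with $\Pi(x)$ a truncated power series in $x$, $x^3$, $x^5$ that inverts $\theta \mapsto \sin\theta + \theta$
Finds the p that minimizes the distance between the two estimates:
$\theta(t, f) = \theta_{T,m}(t, f)$ with $m = \arg\min_p \left| \theta_L(t, f) - \theta_{T,p}(t, f) \right|$
Cumulates the power in a histogram using a binary mask:
$h(\theta) = \sum_f M_\theta(t, f) \, \left| X_L(t, f) \, X_R(t, f) \right|$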
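The localization steps can be sketched per t-f bin as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the values of `C` and `R`, the candidate range for `p`, the 181-bin histogram, and the use of constant `alpha_f` and `beta_f` (the model makes them frequency dependent) are all hypothetical. The series coefficients 1/2, 1/96, 1/1920 in `pi_inverse` are obtained here by inverting the Taylor series of sin θ + θ and are likewise an assumption about Π.

```python
import cmath
import math

C = 343.0    # speed of sound in m/s (assumed value)
R = 0.0875   # head radius in m (assumed value)


def pi_inverse(x):
    """Truncated series inverse of theta -> sin(theta) + theta (derived here)."""
    return x / 2 + x**3 / 96 + x**5 / 1920


def cues(XL, XR, f, p=0):
    """ILD in dB and the p-th ITD candidate in seconds for one t-f bin."""
    ratio = XR / XL
    ild = 20 * math.log10(abs(ratio))
    itd = (cmath.phase(ratio) + 2 * math.pi * p) / (2 * math.pi * f)
    return ild, itd


def azimuth(XL, XR, f, alpha_f, beta_f, p_values=range(-3, 4)):
    """Pick the ITD-based azimuth candidate closest to the ILD-based one."""
    ild, _ = cues(XL, XR, f)
    # Clamp the arcsin argument so noisy ILDs do not raise a domain error.
    theta_L = math.asin(max(-1.0, min(1.0, C * ild / alpha_f)))
    candidates = [
        pi_inverse(C * cues(XL, XR, f, p)[1] / (R * beta_f)) for p in p_values
    ]
    return min(candidates, key=lambda th: abs(th - theta_L))


def power_histogram(bins, alpha_f, beta_f, n_bins=181):
    """Accumulate |X_L X_R| into azimuth bins spanning [-pi/2, pi/2]."""
    h = [0.0] * n_bins
    for XL, XR, f in bins:
        if XL == 0 or XR == 0:
            continue  # cues are undefined for empty bins
        theta = azimuth(XL, XR, f, alpha_f, beta_f)
        theta = max(-math.pi / 2, min(math.pi / 2, theta))
        idx = round((theta + math.pi / 2) / math.pi * (n_bins - 1))
        h[idx] += abs(XL * XR)  # binary mask: each bin feeds one azimuth
    return h
```

The ILD estimate is unambiguous but coarse, while the ITD estimate is precise but ambiguous in phase, which is why the ILD is used only to select among the ITD candidates.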


13 Source Localization/Separation Method
- Build histogram h(θ)
- Binomial smoothing and thresholding
- Local maxima search
Outputs
- Mixture order estimate (K)
- Locations of detected sources ($\theta_1, \theta_2, \dots, \theta_K$)
Example (2-source mixture): K = 6 before threshold, K = 2 after threshold
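The smoothing, thresholding, and peak search can be illustrated with a short sketch. The [1, 2, 1]/4 kernel, the edge handling, and the relative threshold `rel_threshold` are assumptions chosen for the example; the slides do not specify them.

```python
def binomial_smooth(h, passes=1):
    """Smooth with the binomial kernel [1, 2, 1] / 4, replicating the edges."""
    for _ in range(passes):
        padded = [h[0]] + list(h) + [h[-1]]
        h = [
            (padded[i] + 2 * padded[i + 1] + padded[i + 2]) / 4
            for i in range(len(h))
        ]
    return list(h)


def detect_sources(h, rel_threshold=0.1, passes=1):
    """Return indices of local maxima above rel_threshold * max(smoothed h)."""
    s = binomial_smooth(h, passes)
    floor = rel_threshold * max(s)
    return [
        i
        for i in range(1, len(s) - 1)
        if s[i] >= floor and s[i] > s[i - 1] and s[i] >= s[i + 1]
    ]


# Toy histogram: two strong sources and one weak spurious bump.
h = [0, 0, 1, 8, 1, 0, 0, 0.3, 0, 0, 2, 9, 2, 0, 0]
before = detect_sources(h, rel_threshold=0.0)  # [3, 7, 11]
after = detect_sources(h)                      # [3, 11]
```

The number of returned indices gives the mixture order estimate K, and the indices map back to azimuths through the histogram's bin spacing.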

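14 Gaussian Mixture Model (GMM)
Data: $\Theta = \{\theta_1, \dots, \theta_N\}$
Each source is associated with a Gaussian.
Gaussian mixture: $\Gamma = \{\mu_j, \sigma_j, \pi_j \mid j = 1, \dots, K\}$ (mean, standard deviation, and weight of source $j$)
$f_K(\theta \mid \Gamma) = \sum_{j=1}^{K} \pi_j \, \phi_j(\theta \mid \gamma_j)$, fitted to $h(\theta)$, with $\sum_{j=1}^{K} \pi_j = 1$
Find $\Gamma$ that best matches the data (Maximum Likelihood, Expectation-Maximization). Objective:
$\Gamma^{(t+1)} = \arg\max_{\Gamma} L(\Gamma \mid \Theta)$, with $L(\Gamma^{(t+1)} \mid \Theta) \ge L(\Gamma^{(t)} \mid \Theta)$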

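15 EM Updates
[Table: 2-source mix; original, estimated, and error azimuths ($\theta_{ori}$, $\theta_{est}$, $\theta_{err}$) for each source s]
EM updates:
$P_K(k \mid \theta, \Gamma) = \frac{P_K(\theta, k \mid \Gamma)}{P_K(\theta \mid \Gamma)}$
$\pi_k = \frac{\sum_\theta h(\theta) \, P_K(k \mid \theta, \Gamma)}{\sum_\theta h(\theta)}$
$\mu_k = \frac{\sum_\theta h(\theta) \, \theta \, P_K(k \mid \theta, \Gamma)}{\sum_\theta h(\theta) \, P_K(k \mid \theta, \Gamma)}$
$\sigma_k^2 = \frac{\sum_\theta h(\theta) \, (\theta - \mu_k)^2 \, P_K(k \mid \theta, \Gamma)}{\sum_\theta h(\theta) \, P_K(k \mid \theta, \Gamma)}$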
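One iteration of these histogram-weighted updates might look like the sketch below. It is an illustration only: the parameter layout (a list of `(pi_k, mu_k, sigma_k)` tuples), the use of the freshly updated mean inside the variance update, and the floor on sigma are assumptions, not details taken from the slides.

```python
import math


def gauss(theta, mu, sigma):
    """Gaussian density phi(theta | mu, sigma)."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def em_step(thetas, h, params):
    """One EM iteration; params is a list of (pi_k, mu_k, sigma_k)."""
    # E-step: posterior P_K(k | theta, Gamma) for every histogram bin.
    post = []
    for th in thetas:
        joint = [pi_k * gauss(th, mu_k, sg_k) for pi_k, mu_k, sg_k in params]
        total = sum(joint) or 1e-300  # guard against underflow to zero
        post.append([j / total for j in joint])

    # M-step: every sum over theta is weighted by the histogram h(theta).
    h_total = sum(h)
    updated = []
    for k, old in enumerate(params):
        w = [hv * p[k] for hv, p in zip(h, post)]
        w_total = sum(w)
        if w_total == 0.0:
            updated.append(old)  # component received no energy; keep it
            continue
        mu = sum(wi * th for wi, th in zip(w, thetas)) / w_total
        var = sum(wi * (th - mu) ** 2 for wi, th in zip(w, thetas)) / w_total
        updated.append((w_total / h_total, mu, max(math.sqrt(var), 1e-3)))
    return updated


# Usage: azimuths in degrees, two synthetic sources at -30 and 40.
thetas = list(range(-90, 91))
h = [0.6 * gauss(t, -30, 5) + 0.4 * gauss(t, 40, 5) for t in thetas]
params = [(0.5, -20.0, 10.0), (0.5, 30.0, 10.0)]  # e.g. seeded from peaks
for _ in range(50):
    params = em_step(thetas, h, params)
```

Seeding the means from the detected histogram peaks keeps the number of components equal to the estimated mixture order K.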


19 Unmixing with Probabilistic t-f Mask
Philosophy: each t-f bin belongs to all K sources.
Build a probabilistic mask for each source k:
$M_k(t, f) = P_K(k \mid \theta(t, f), \Gamma)$
Energy allocation according to posterior probability:
$S_L(t, f) = M_k(t, f) \, X_L(t, f)$
$S_R(t, f) = M_k(t, f) \, X_R(t, f)$
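For a single t-f bin, the probabilistic mask and the energy allocation can be sketched as follows. The function names and the `(pi_k, mu_k, sigma_k)` parameter layout are hypothetical, and the uniform fallback for a vanishing likelihood is an added safeguard rather than part of the presented method.

```python
import math


def gauss(theta, mu, sigma):
    """Gaussian density phi(theta | mu, sigma)."""
    z = (theta - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def posterior_masks(theta, params):
    """M_k = P_K(k | theta, Gamma) for each source k; the masks sum to 1."""
    joint = [pi_k * gauss(theta, mu_k, sg_k) for pi_k, mu_k, sg_k in params]
    total = sum(joint)
    if total == 0.0:
        return [1.0 / len(params)] * len(params)  # fallback (assumption)
    return [j / total for j in joint]


def unmix_bin(XL, XR, theta, params):
    """Split one stereo bin into K (S_L, S_R) pairs, one per source."""
    return [(m * XL, m * XR) for m in posterior_masks(theta, params)]


# Usage: a bin whose azimuth (degrees) lies close to the first source.
params = [(0.5, -30.0, 5.0), (0.5, 40.0, 5.0)]
sources = unmix_bin(1 + 2j, 0.5 - 1j, theta=-28.0, params=params)
```

Because the masks sum to one, adding the K separated bins gives back the original mixture bin, unlike a binary mask that assigns all the energy to a single source.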

20 Binaural Spatialization Method 1
$\mathrm{HRTF}_{\text{subject}}(\rho, \theta, \phi, f)$ depends on: subject, position, frequency
CIPIC HRTF database (45 subjects) [Algazi et al. (2001)]
Spatialization:
$x_L = s * \text{mean-HRTF}_L(\theta)$
$x_R = s * \text{mean-HRTF}_R(\theta)$
Disk space
- Table of reals
- Interpolation not trivial...

21 Binaural Spatialization Method 2
Processing chain: x(t) → windowing w(t) → FFT → X(t, f) → spatialization with spatial cues ILD(θ, f), ITD(θ, f) → X_L(t, f), X_R(t, f) → IFFT + overlap-add → x_L(t), x_R(t)
$X_L(t, f) = X(t, f) \cdot 10^{-a/2} \, e^{-j\phi/2}$
$X_R(t, f) = X(t, f) \cdot 10^{+a/2} \, e^{+j\phi/2}$
with $a = \mathrm{ILD}(\theta, f) / (20\,\mathrm{dB})$ and $\phi = \mathrm{ITD}(\theta, f) \cdot 2\pi f$
Disk space
- Array of 202 reals
- Geometrical interpolation
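The per-bin spatialization can be sketched as below, with the cues taken from the head model $L(\theta, f) = \alpha_f \sin\theta / c$ and $T(\theta, f) = \beta_f r (\sin\theta + \theta) / c$. The constants `C` and `R` and the scalar arguments `alpha_f` and `beta_f` are assumed values for illustration; the windowing, FFT, and overlap-add stages are left out.

```python
import cmath
import math

C = 343.0    # speed of sound in m/s (assumed value)
R = 0.0875   # head radius in m (assumed value)


def spatialize_bin(X, theta, f, alpha_f, beta_f):
    """Turn one mono spectral bin X(t, f) into a stereo pair (X_L, X_R)."""
    ild_db = alpha_f * math.sin(theta) / C                   # model ILD in dB
    itd_s = beta_f * R * (math.sin(theta) + theta) / C       # model ITD in s
    a = ild_db / 20.0
    phi = itd_s * 2 * math.pi * f
    XL = X * 10 ** (-a / 2) * cmath.exp(-1j * phi / 2)
    XR = X * 10 ** (a / 2) * cmath.exp(1j * phi / 2)
    return XL, XR


# Usage: place a bin at 0.5 rad with hypothetical model coefficients.
XL, XR = spatialize_bin(1 + 1j, theta=0.5, f=500.0, alpha_f=3430.0, beta_f=1.2)
```

Splitting the level and phase differences symmetrically between the two channels means the ratio X_R / X_L carries exactly the target ILD and ITD, while the product X_L X_R stays equal to X squared.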


23 Source Separation Results: Signals
[Figure: waveforms (amplitude vs. samples) of xylophone (55) (top) and horn (30)]
- Rhythm respected
- Shape preserved
- Unmix similar to original

24 Source Separation Results: Listening Tests
2-source mix: mix; original eguitar (-80), unmix eguitar; original saxo (80), unmix saxo
3-source mix: mix; original piano (-30), unmix piano; original xylo (0), unmix xylo; original trumpet (30), unmix trumpet
Mean Opinion Score: 3 on a 5-level scale

25 Source Spatialization Results
ReSPA: xylo -45, fhorn 80, saxo -30, tuba 0, eguitar -80
Mean HRTF: xylo -45, fhorn 80, saxo -30, tuba 0, eguitar -80
- MHRTF: better lateralization
- SSPA: good enough
- MHRTF sounds more natural


27 Summary
Summary
- Source localization (azimuth)
- Source separation
- Source spatialization
Future Work
- Study the localization of moving sources
- Implement the system in a real-time environment
- Improve source separation with processing inside each bin
- Study the brightness of spectra to weight distance
- Conduct further MOS listening tests for spatialization

28 References
- J. Blauert: Spatial Hearing, MIT Press.
- H. Viste, G. Evangelista: Binaural Source Localization, PhD Thesis.
- O. Yilmaz and S. Rickard: Blind Separation of Speech Mixtures via Time-Frequency Masking, IEEE Transactions on Signal Processing, vol. 52, no. 7, July.
- V.R. Algazi, R.O. Duda, D.P. Thompson: The CIPIC HRTF Database, Proc. IEEE WASPAA01, NY.
- A. Dempster, N. Laird and D. Rubin: Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.
