Improved noise power spectral density tracking by a MAP-based postprocessor Aleksej Chinaev, Alexander Krueger, Dang Hai Tran Vu, Reinhold Haeb-Umbach University of Paderborn, Germany March 8th, 01 Computer Science, Electrical Engineering and Mathematics Communications Engineering Prof. Dr.-Ing. Reinhold Häb-Umbach
Table of Contents 1 Introduction and problem formulation MAP-based noise PSD estimation 3 Experimental framework 4 Performance evaluation 5 Summary and outlook A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 1 / 13
Introduction Motivation Noise PSD estimation is a key component to speech enhancement y t to robust automatic speech recognition Basic assumptions Many sophisticated algorithms rely on two assumptions: Noise more stationary than speech Noise-only time-frequency bins at regular intervals Here MAP-based () postprocessor Estimate of noise power even if speech is dominant in time-frequency bin Initial estimate of the current speech power required A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach / 13
as a postprocessor x t - clean speech signal ˆx (I) t ˆx (II) t n t - noise signal y t STDFT ISTDFT ISTDFT Y k,l frequency bin index frame index ˆX (I) k,l ˆX (II) k,l Noise σ N,k,l a priori ˆσ X,k,l estimation SNR first speech enhancement stage Gain Gain ˆσ N,k,l a priori SNR second speech enhancement stage A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 3 / 13
Problem formulation Observed Y l = N l + X l We consider N l as target process corrupted by clean speech X l of known variance σ X,l Our goal Estimation of noise variance σn,l+1 at frame l+1, from the noisy observation y l+1 given the current speech power σx,l+1 a priori PDF p σ (σ ) of variance σ N N,l Approach Maximum A Posteriori (MAP) estimation A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 4 / 13
MAP-based noise PSD estimation Bayesian variance estimation: Y l = N l If σx,l+1 = 0 (uncorrupted observ.) and σ N,l+1 = σ N (stationary) then the textbook problem: Scaled inverse chi-square distribution p σ (σ ) (σ ) ν l + e ν l λ l σ N with the degrees of freedom ν l and the scale factor λ l is conjugate prior to normal observation PDF p Yl+1 σ N (y l+1 σ ) = 1 y l+1 πσ e σ Parameter update ν l+1 = ν l + and λ l+1 = ν l + y l+1 + ν l ν l + λ l MAP-estimate of variance [ ] ˆσ N,l+1 = argmax p σ σ N Y l+1 (σ y l+1 ) = ν l+1 ν l+1 + λ l+1 A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 5 / 13
Extension to non-stationary noise Still uncorrupted noise: Y l = N l If σn,l+1 is time-variant then: The parameter ν l is kept at a constant value ν l+1 = ν l = ν 0 This results in recursive smoothing of variance estimate ˆσ N,l+1 = (1 α) ˆσ N,l +α y l+1, where α = ν 0 +4 Choice of ν 0 Trade-off between tracking ability and estimation error in stationary noise A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 6 / 13
General case Observation corrupted by speech: Y l = N l + X l If σx,l+1 0 and is known then the posterior PDF ( p σ N Y l+1 (σ y l+1 ) (σx,l+1 +σ ) 1 (σ ) ν l + e is no longer a conjugate prior for the observation PDF. y l+1 σ X,l+1 +σ + ν l λ l σ ), In order to maintain an efficient MAP estimation procedure we approximate the posterior PDF by a scaled inverse chi-squared distribution, and match its maximum (ˆσ N,l+1 ) with the maximum of the posterior PDF, which we calculate efficiently using a bisection and Newton approach. A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 7 / 13
Experimental framework as postprocessor First noise PSD estimator: Improved Minima Controlled Recursive Averaging () algorithm [Cohen, 003] Gain function: Optimally-Modified Log-Spectral Amplitude (OM-LSA) estimator [Cohen, Berdugo, 001] Y k,l+1 ˆX (I) k,l+1 ˆX (II) k,l+1 σ N,k,l+1 ˆσ a priori X,k,l+1 SNR first speech enhancement stage OM-LSA OM-LSA ˆσ N,k,l+1 a priori SNR second speech enhancement stage A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 8 / 13
Performance evaluation Setup Clean speech: TIMIT database, sentences concatenated to 3 minutes length Artificially added noise from Noisex9 database: Noise types: Stationary WGN, Triangular WGN, Babble and Factory-1 SNR values: 5,0,5,10,15dB estimator: we set ν 0 = 40 corresponding to a time constant of 0.164s Reference noise PSD Recursive temporal smoothing σ N,k,l = 0.95 σ N,k,l 1 +0.05 N k,l, with known noise periodogram N k,l A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 9 / 13
Sample trajectories of noise variance estimates Babbble noise at frequency bin k = 97 (3kHz) Triangular WGN (averaged over all frequency bins) 6 68 [db] 56 50 [db] 66 64 reference 4 5 6 7 8 Time [s] : continious update of noise variance estimate 4 5 6 7 8 Time [s] : faster response to rising noise power A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 10 / 13
Quantitative evaluation Performance measures adopted from [Taghia et al., 011] Fig.(a): minimum averaged log distance LE m between the reference and estimated PSD obtains lower error LE m for all noise types and SNRs less than or equal to 5dB or 10dB Fig.(b): variance of the logarithmic difference LE v yields lower variance LE v for all noise types and SNRs than the LEm LEv 3 1 0 4 3 1 0-5dB 0dB 5dB 10dB 15dB Stationary Triangular Babble Factory-1 WGN WGN (a) (b) A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 11 / 13
Speech enhancement Gain in perceptual speech quality (PESQ) scores PESQ Gain = PESQ PESQ PESQGain 0.1-5dB 0dB 5dB 10dB 15dB Stationary Triangular Babble Factory-1 WGN WGN 0.0 Female Male Female Male Female Male Female Male has a favourable effect on speech quality for non-stationary noise types A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 1 / 13
Summary and outlook Summary Proposed estimator is able to track the noise statistics even if the speech is dominant Low computational complexity Single parameter ν 0 Experimental evaluation: obtains lower estimation error under low SNR conditions lower fluctuation of the estimated values under all tested environments slightly improved speech quality for non-stationary noise types Outlook Investigations about dependence of algorithm performance: on the first noise PSD estimator on the degrees of freedom ν 0 A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 13 / 13
Thank you for your attention! Questions? Computer Science, Electrical Engineering and Mathematics Communications Engineering Prof. Dr.-Ing. Reinhold Häb-Umbach
Periodograms of cleen, noisy and enhanced signals File: /home/chinaev/desktop/icassp01/matlabcode/codecohen/datasforpesq/fem/factory1and5db/s_or57.wav Page: 1 of 1 Printed: Mo Mär 19 18:44:19 khz 7 6 5 4 3 Periodogramms of the spoken sentence Biblical scholars argue history for factory-1 noise at an SNR of 5 db (a) cleen speech signal (b) noisy speech signal (c) enhanced speech signal based on estimates (d) enhanced speech signal based on estimates 1 time (a) 1.0 1.1 1. (b) 1.0 1.1 1. 1.0 1.1 1. 1.0 1.1 1. 0.1 0. 0.3 0.4 0.5 0.6 0.7 0.8 0.9 File: /home/chinaev/desktop/icassp01/matlabcode/codecohen/datasforpesq/fem/factory1and5db/s_noisy57.wav Page: 1 of Printed: Mo Mär 19 18:46:06 khz 7 6 5 4 3 1 time 0.1 0. 0.3 0.4 0.5 0.6 0.7 0.8 0.9 File: /home/chinaev/desktop/icassp01/matlabcode/codecohen/datasforpesq/fem/factory1and5db/s_imcra57.wav Page: 1 of Printed: Mo Mär 19 18:47:00 khz 7 6 5 4 3 has a positive influence on periodogramm of enhanced signal particularly clearly seen in the highlighted window 1 time (c) 0.1 0. 0.3 0.4 0.5 0.6 0.7 0.8 0.9 File: /home/chinaev/desktop/icassp01/matlabcode/codecohen/datasforpesq/fem/factory1and5db/s_mapb57.wav Page: 1 of 1 Printed: Mo Mär 19 18:47:36 khz 7 6 5 4 3 1 time 0.1 0. 0.3 0.4 0.5 (d) 0.6 0.7 0.8 0.9 A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 13 / 13
Minimum Statistics (MS) instead of Performance measures LE m and LE v for female speaker signals -5dB 0dB 5dB 10dB 15dB LEm 9 6 3 0 Stationary Triangular Babble Factory-1 WGN WGN (a) Fig.(a): obtains lower estimation error LE m than the MS for all noise types and SNRs LEv 4 Fig.(b): yields lower variance LE v than the MS for all tested setups 0 MS MS (b) MS MS A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 13 / 13
Minimum Statistics (MS) instead of PESQ Gain = PESQ PESQ NoiseEst with NoiseEst [ ; MS ] for female speaker signals PESQGain 0.1-5dB 0dB 5dB 10dB 15dB Stationary Triangular Babble Factory-1 WGN WGN 0.0 MS MS MS MS A favourable effect of estimator on speech quality for non-stationary noise types is for MS smaller than for A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach 13 / 13