Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann
Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

ICASSP 2015
Deep Neural Networks for Acoustic Modeling

Trend: explicit feature processing → implicit learning
- MFCCs → simple filterbank features [Mohamed et al. 2013]
- Filterbanks → raw time-domain signals [Jaitly, Hinton 2011]
- Denoising → noise-aware training [Seltzer et al. 2013]

What about spatial information (microphone arrays)?
- Stacked feature vectors from multiple channels [Swietojanski et al. 2013]
  - Phase information is not exploited
- Raw multi-channel waveforms [Hoshen et al. 2015]
  - Hard to generalize to arbitrary acoustic scenarios
- Spatial diffuseness features (this work)
  - Represent spatial information independently of source position and microphone array

[Figure: mh acoustics Eigenmike microphone array]
Outline

- Signal Model
- Coherence-based Dereverberation in the STFT Domain
- Extraction of Spatial Diffuseness Features
Signal Model

- The desired signal is fully coherent (only delayed between the microphones)
- Noise and reverberation are diffuse and uncorrelated with the desired signal
- The coherence of the mixed sound field can be modeled as

  Gamma_x(f) = (CDR(f) * Gamma_s(f) + Gamma_n(f)) / (CDR(f) + 1),

  where Gamma_s(f) = e^{j 2*pi*f*Delta_t} is the coherence of the fully coherent (direct) component and Gamma_n(f) = sinc(2*pi*f*d/c) is the coherence of an ideal diffuse field for microphone spacing d
- The coherent-to-diffuse ratio (CDR) can be estimated from the complex spatial coherence of the mixture
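The coherence model above can be sketched numerically. This is a minimal illustration (not the paper's code); the function names are hypothetical, and a speed of sound of 343 m/s and an ideal spherically isotropic diffuse field are assumed:

```python
import numpy as np

def coherence_diffuse(f, d, c=343.0):
    """Spatial coherence of an ideal diffuse field between two
    microphones at spacing d [m]: sinc(2*pi*f*d/c), unnormalized sinc.
    np.sinc is normalized (sin(pi x)/(pi x)), hence the division by pi."""
    x = 2.0 * np.pi * f * d / c
    return np.sinc(x / np.pi)

def coherence_mixture(f, cdr, d, tdoa=0.0, c=343.0):
    """Coherence of a coherent-plus-diffuse mixture, weighted by the CDR:
    Gamma_x = (CDR * Gamma_s + Gamma_n) / (CDR + 1)."""
    gamma_s = np.exp(1j * 2.0 * np.pi * f * tdoa)  # fully coherent: delay only
    gamma_n = coherence_diffuse(f, d, c)           # diffuse component
    return (cdr * gamma_s + gamma_n) / (cdr + 1.0)
```

For CDR → ∞ the mixture coherence approaches the fully coherent phase term; for CDR = 0 it reduces to the diffuse-field sinc model.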
Coherence-based STFT-Domain Dereverberation

1. Estimate the short-time spatial coherence (quasi-instantaneous)
2. Estimate the coherent-to-diffuse ratio (CDR)
3. Perform spectral subtraction to suppress diffuse components

[Schwarz, Kellermann: "Coherent-to-Diffuse Power Ratio Estimation for Dereverberation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015]

- Only instantaneous signal properties are exploited
- No knowledge or estimation of the source DOA is required
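The three steps can be sketched as follows. This is a simplified illustration, not the paper's estimator: for step 2 it assumes a broadside source (TDOA = 0, so Gamma_s = 1), whereas the cited paper derives DOA-independent CDR estimators; the smoothing constant, gain floor, and function name are assumptions:

```python
import numpy as np

def estimate_cdr_and_gain(X1, X2, f, d, alpha=0.68, c=343.0, g_min=0.1):
    """X1, X2: STFT frames (frames x bins) of the two microphones;
    f: frequency per bin [Hz]; d: microphone spacing [m].
    Returns a spectral-subtraction gain per frame and bin."""
    n_bins = X1.shape[1]
    P11 = np.zeros(n_bins)
    P22 = np.zeros(n_bins)
    P12 = np.zeros(n_bins, dtype=complex)
    gamma_n = np.sinc(2.0 * f * d / c)  # diffuse-field coherence model
    gains = []
    for x1, x2 in zip(X1, X2):
        # 1. short-time coherence via recursive averaging of power spectra
        P11 = alpha * P11 + (1 - alpha) * np.abs(x1) ** 2
        P22 = alpha * P22 + (1 - alpha) * np.abs(x2) ** 2
        P12 = alpha * P12 + (1 - alpha) * x1 * np.conj(x2)
        gamma_x = P12 / np.sqrt(P11 * P22 + 1e-12)
        # 2. naive CDR from Gamma_x = (CDR*Gamma_s + Gamma_n)/(CDR+1), Gamma_s = 1
        den = np.minimum(gamma_x.real - 1.0, -1e-6)
        cdr = np.clip((gamma_n - gamma_x.real) / den, 0.0, None)
        # 3. spectral subtraction of the diffuse power fraction 1/(CDR+1)
        gain = np.maximum(1.0 - np.sqrt(1.0 / (cdr + 1.0)), g_min)
        gains.append(gain)
    return np.array(gains)
```

On identical (fully coherent) channels the gain stays near 1; on independent (diffuse-like) channels it drops toward the floor, which is the intended suppression behavior.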
Evaluation

Word Error Rate [%] for the REVERB challenge evaluation set:

                     Clean-speech-trained DNN    Multi-condition-trained DNN
                     SimData    RealData         SimData    RealData
  logmelspec         44.4       85.7             9.5        28.8
  enh. logmelspec    30.3       69.3             9.4        28.8

- Dereverberation improves the WER for the clean-trained DNN, but the improvement disappears with multi-condition training
- Multi-condition training neutralizes the effect of dereverberation
Spatial Feature Extraction

Instead of STFT-domain enhancement, extract spatial features.

- Naive approach: magnitude squared coherence (melmsc)
  - Depends not only on the diffuse noise content, but also on the microphone spacing and the DOA
- Proposed: meldiffuseness
  - 0 for purely directional sound, 1 for purely diffuse sound
  - Computed from the coherent-to-diffuse ratio: D(k,f) = 1 / (CDR(k,f) + 1)
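A minimal sketch of the diffuseness mapping and mel-band averaging. The function name and the row-normalized filterbank convention are assumptions; any standard mel filterbank matrix (n_mel x n_bins) can be plugged in:

```python
import numpy as np

def mel_diffuseness(cdr, mel_fb):
    """Map per-bin CDR estimates to diffuseness D = 1/(CDR+1) in [0, 1]
    (0 = purely directional, 1 = purely diffuse), then average over mel
    bands using a mel filterbank matrix mel_fb of shape (n_mel, n_bins),
    normalized here so each band is a weighted average."""
    d = 1.0 / (cdr + 1.0)  # per-frequency-bin diffuseness
    w = mel_fb / np.maximum(mel_fb.sum(axis=1, keepdims=True), 1e-12)
    return w @ d           # weighted average per mel band
```

Since D is bounded in [0, 1] regardless of spacing or DOA, the resulting feature is device-independent, in contrast to the raw magnitude squared coherence.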
Visualization of Features

[Figure: spectrogram-style visualizations of the logmelspec, enhanced logmelspec, and meldiffuseness features]
Evaluation Setup

REVERB challenge two-microphone task [Kinoshita et al. 2013]
- Noisy and reverberant signals created from the WSJCAM0 corpus
- Varying direction of arrival
- 2 microphones, 8 cm spacing

DNN-based speech recognizer
- Kaldi toolkit
- Hybrid DNN-HMM acoustic model
- Maxout network (4 hidden layers, 2000 inputs, 400 outputs per layer)
- 5-frame splicing
- Training on multi-condition noisy and reverberant data (17.5 hours)

Feature vectors (overall dimension: 72)
- Noisy logmelspec features: logmelspec + Δ + ΔΔ
- Enhanced logmelspec features: enh. logmelspec + Δ + ΔΔ
- Augmented with melmsc: logmelspec + Δ + melmsc
- Augmented with meldiffuseness: logmelspec + Δ + meldiffuseness
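The feature vector assembly can be sketched as follows. Assumptions: 24 mel bands per block (consistent with the stated overall dimension of 72), a simple two-frame-difference delta (ASR toolkits typically use a regression window), and hypothetical function names:

```python
import numpy as np

def deltas(x):
    """Simple delta features along the time axis (frames x dims),
    computed as a symmetric two-frame difference."""
    d = np.zeros_like(x)
    d[1:-1] = (x[2:] - x[:-2]) / 2.0
    return d

def stack_features(logmel, meldiff):
    """Assemble the augmented 72-dim vector per frame:
    24 logmelspec + 24 delta + 24 meldiffuseness.
    (The baseline replaces the last block with delta-delta.)"""
    return np.concatenate([logmel, deltas(logmel), meldiff], axis=1)
```

Replacing the ΔΔ block with meldiffuseness keeps the input dimension, and thus the network architecture, unchanged between the baseline and augmented setups.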
Evaluation Results

SimData: measured impulse responses, additive noise
RealData: real recordings in a noisy environment

  Feature vector                 WER [%] SimData    WER [%] RealData
  logmelspec                     9.5                28.8
  enh. logmelspec                9.4                28.8
  logmelspec + melmsc            9.0                27.7
  logmelspec + meldiffuseness    8.5                27.0

6% to 11% relative WER reduction by using spatial features
Summary

Motivation
- STFT-domain dereverberation has little effect on WER
- Idea: exploit spatial information in the DNN

Spatial diffuseness features
- Can be extracted instantaneously
- Blind: no knowledge or estimation of the source DOA required
- Device-independent features
- 6% to 11% relative WER reduction on the REVERB challenge 2-channel task
- MATLAB code available (see paper)

Can we use a similar approach to deal with directional interferers?

Thank you for your attention!
Results (Details)

WER [%] per recognizer, feature set, room, and microphone distance (near/far):

                                           Evaluation Set                                        Dev. Set
                                           SimData                               RealData        SimData  RealData
  Recognizer  Feature                      Room 1     Room 2     Room 3     Avg  Room 1     Avg  Avg      Avg
                                           near  far  near  far  near  far      near  far
  GMM-HMM     MFCC-LDA-MLLT-fMLLR          6.6   7.5  9.4  16.6  11.1 20.7 12.0 31.2 30.2  30.7 12.1     31.6
  DNN-HMM     logmelspec+Δ+ΔΔ              5.7   6.7  7.7  13.9   8.7 14.6  9.5 28.5 29.1  28.8  9.7     24.9
  DNN-HMM     enh. logmelspec+Δ+ΔΔ         6.6   7.1  7.7  12.2   8.3 14.6  9.4 28.5 29.1  28.8  9.1     25.3
  DNN-HMM     logmelspec+Δ+melmsc          6.2   6.3  7.0  12.3   8.2 13.9  9.0 27.3 28.0  27.7  8.7     24.7
  DNN-HMM     logmelspec+Δ+meldiffuseness  5.9   6.1  6.9  11.0   8.2 12.9  8.5 27.8 26.3  27.0  7.9     24.2