Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments


1 Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments
Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann
Lehrstuhl für Multimediakommunikation und Signalverarbeitung, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
ICASSP 2015


2 Deep Neural Networks for Acoustic Modeling
Trend: explicit feature processing → implicit learning
- MFCCs → simple filterbank features [Mohamed et al. 2013]
- Filterbanks → raw time-domain signals [Jaitly, Hinton 2011]
- Denoising → noise-aware training [Seltzer et al. 2013]
What about spatial information (microphone arrays)?
- Stacked feature vectors from multiple channels [Swietojanski et al. 2013]
  - Phase information is not exploited
- Raw multi-channel waveforms [Hoshen et al. 2015]
  - Hard to generalize for arbitrary acoustic scenarios
- Spatial diffuseness features
  - Represent spatial information independently of source position and microphone array
[Image: mh acoustics Eigenmike]

3 Outline
- Signal Model
- Coherence-based Dereverberation in the STFT Domain
- Extraction of Spatial Diffuseness Features

4 Signal Model
- The desired signal is fully coherent (only delayed between microphones)
- Noise and reverberation are diffuse and uncorrelated with the desired signal
- The coherence of the mixed sound field can be modeled as: [equation not preserved in this transcript]
The coherent-to-diffuse ratio (CDR) can be estimated from the complex spatial coherence of the mixture
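A sketch of how such a model follows from the two assumptions on this slide. The notation is assumed rather than taken from the slide: $\Phi_s$ and $\Phi_n$ denote the power spectral densities of the desired and diffuse components (taken as equal at both microphones), and $\Gamma_s$, $\Gamma_n$, $\Gamma_x$ denote the spatial coherences of the desired signal, the diffuse noise, and the mixture. Because the two components are uncorrelated, their cross terms vanish:

```latex
\begin{align}
\Phi_{x_1 x_2}(k,f) &= \Gamma_s(f)\,\Phi_s(k,f) + \Gamma_n(f)\,\Phi_n(k,f) \\
\Gamma_x(k,f) &= \frac{\Phi_{x_1 x_2}(k,f)}{\Phi_s(k,f) + \Phi_n(k,f)}
  = \frac{\mathrm{CDR}(k,f)\,\Gamma_s(f) + \Gamma_n(f)}{\mathrm{CDR}(k,f) + 1},
\qquad \mathrm{CDR}(k,f) = \frac{\Phi_s(k,f)}{\Phi_n(k,f)}
\end{align}
```

In this form the mixture coherence approaches $\Gamma_s$ for a large CDR and $\Gamma_n$ for a CDR near zero, which is what makes the CDR recoverable from the observed coherence.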

5 Coherence-based STFT-Domain Dereverberation
1. Estimate short-time spatial coherence (quasi-instantaneous)
2. Estimate coherent-to-diffuse ratio (CDR)
3. Perform spectral subtraction to suppress diffuse components
[Schwarz/Kellermann, "Coherent-to-Diffuse Power Ratio Estimation for Dereverberation", IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015]
- Only instantaneous signal properties are exploited
- No knowledge or estimation of source DOA required
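Steps 1 and 3 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the forgetting factor `lam`, the oversubtraction factor `mu`, and the gain floor `g_min` are all hypothetical. Step 2, the blind CDR estimator, is described in the cited paper and is not reproduced here; `suppression_gain` simply takes a CDR value as input.

```python
import math


def short_time_coherence(x1, x2, lam=0.7):
    """Per-frame complex coherence estimate for one frequency bin.

    x1, x2 -- complex STFT coefficients of the two microphones, one per frame.
    lam    -- forgetting factor of the recursive average (illustrative value).
    """
    phi11 = phi22 = 0.0   # auto-PSD estimates
    phi12 = 0j            # cross-PSD estimate
    eps = 1e-12           # guards against division by zero in silent frames
    coherence = []
    for a, b in zip(x1, x2):
        phi11 = lam * phi11 + (1.0 - lam) * abs(a) ** 2
        phi22 = lam * phi22 + (1.0 - lam) * abs(b) ** 2
        phi12 = lam * phi12 + (1.0 - lam) * a * b.conjugate()
        coherence.append(phi12 / (math.sqrt(phi11 * phi22) + eps))
    return coherence


def suppression_gain(cdr, mu=1.0, g_min=0.1):
    """Spectral-subtraction amplitude gain from a CDR estimate (linear scale).

    Subtracts mu times the diffuse power fraction 1 / (CDR + 1) and floors
    the resulting gain at g_min. mu and g_min are hypothetical tuning values.
    """
    power_gain = max(0.0, 1.0 - mu / (cdr + 1.0))
    return max(g_min, math.sqrt(power_gain))
```

A short forgetting factor keeps the coherence estimate close to instantaneous, matching the slide's point that only instantaneous signal properties are used. In practice these loops would run over every frequency bin, typically with array operations.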

6 Evaluation
Word Error Rate for REVERB challenge evaluation set
[Figure: two bar charts of WER [%] on SimData and RealData, comparing logmelspec and enh. logmelspec features, one for a clean-speech-trained DNN and one for a multi-condition-trained DNN; the only bar value preserved is 69.3, in the clean-speech-trained panel]
- Improvement for clean-trained DNN
- Disappears with multi-condition training
→ Multi-condition training neutralizes the effect of dereverberation


7 Spatial Feature Extraction
Instead of STFT-domain enhancement, extract spatial features
- meldiffuseness:
  - 0 for purely directional sound, 1 for purely diffuse sound
  - computed from the coherent-to-diffuse ratio: D(k,f) = 1/(CDR(k,f) + 1)
- Naive approach: magnitude squared coherence (melmsc)
  - Depends not only on diffuse noise content, but also on microphone spacing and DOA
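The mapping D(k,f) = 1/(CDR(k,f) + 1) can be sketched as below. The band-pooling helper is an assumption: the slide does not say how per-bin diffuseness is mapped onto mel bands, so `pool_bands` and its weight matrix are hypothetical stand-ins for a mel filterbank, shown only to illustrate one plausible way of arriving at a per-band feature.

```python
def diffuseness(cdr):
    """Map CDR (linear power ratio, >= 0) to diffuseness D = 1 / (CDR + 1)."""
    return 1.0 / (cdr + 1.0)


def pool_bands(per_bin, band_weights):
    """Weighted average of per-bin values inside each band.

    band_weights -- one row of nonnegative filter weights per band, each row
                    with a positive sum (hypothetical stand-in for a mel
                    filterbank; not taken from the slides).
    """
    pooled = []
    for weights in band_weights:
        total = sum(weights)
        pooled.append(sum(w * v for w, v in zip(weights, per_bin)) / total)
    return pooled


cdr_bins = [0.0, 1.0, 3.0]                     # purely diffuse ... mostly directional
d_bins = [diffuseness(c) for c in cdr_bins]    # [1.0, 0.5, 0.25]
bands = pool_bands(d_bins, [[1, 1, 0], [0, 1, 1]])   # [0.75, 0.375]
```

Because D depends only on the power ratio between coherent and diffuse components, the feature stays in [0, 1] regardless of microphone spacing or source direction, unlike the raw magnitude squared coherence.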

8 Visualization of Features
[Figure: panels showing logmelspec, enhanced logmelspec, and meldiffuseness]


9 Evaluation Setup
REVERB challenge two-microphone task [Kinoshita et al. 2013]
- noisy and reverberant signals created from WSJCAM0 corpus
- varying direction of arrival
- 2 microphones, 8 cm spacing
DNN-based Speech Recognizer
- Kaldi toolkit
- hybrid DNN-HMM acoustic model
- maxout network (4 hidden layers, 2000 inputs, 400 outputs per layer)
- 5-frame splicing
- training on multi-condition noisy and reverberant data (17.5 hours)
Feature vectors (overall dimension: 72)
- noisy logmelspec features: logmelspec, Δ, ΔΔ
- enhanced logmelspec features: enh. logmel, Δ, ΔΔ
- augmented with melmsc: logmelspec, Δ, melmsc
- augmented with meldiffuseness: logmelspec, Δ, meldiffuseness
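The maxout nonlinearity mentioned above can be sketched as follows. Reading "2000 inputs, 400 outputs per layer" as 2000 linear unit outputs pooled in non-overlapping groups of 5 is an interpretation, not something the slide states; the function name and the default `group_size` are hypothetical.

```python
def maxout(pre_activations, group_size=5):
    """Max over consecutive, non-overlapping groups of linear unit outputs."""
    if len(pre_activations) % group_size != 0:
        raise ValueError("length must be a multiple of group_size")
    return [
        max(pre_activations[i:i + group_size])
        for i in range(0, len(pre_activations), group_size)
    ]


# Ten pre-activations pooled in groups of five give two outputs.
outputs = maxout([1, 5, 2, 0, -1, 3, 3, 9, 1, 0])   # [5, 9]
```

Under this reading, each hidden layer computes 2000 affine outputs and passes the 400 group maxima to the next layer.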

10 Evaluation Results
SimData: measured impulse responses, additive noise
RealData: real recordings in noisy environment
[Figure: bar chart of WER [%] on SimData and RealData for logmelspec, enh. logmelspec, logmelspec + melmsc, and logmelspec + meldiffuseness; bar values not preserved]
6% to 11% relative WER reduction by using spatial features

11 Summary
Motivation
- STFT-domain dereverberation has little effect on WER
- Idea: exploit spatial information in the DNN
Spatial Diffuseness Features
- Can be extracted instantaneously
- Blind, no knowledge or estimation of the source DOA required
- Device-independent features
- 6% to 11% relative WER reduction for REVERB challenge 2-channel task
- MATLAB code available (see paper)
Can we use a similar approach to deal with directional interferers?
Thank you for your attention!

12 Results (Details)
[Table: WER by recognizer and feature; numeric values not preserved.
Rows: GMM-HMM with MFCC-LDA-MLLT-fMLLR; DNN-HMM with logmelspec, enhanced logmelspec, logmelspec + melmsc, and logmelspec + meldiffuseness.
Columns: Evaluation Set (SimData Room 1, Room 2, Room 3 and RealData Room 1, each near and far, plus averages) and Development Set (SimData and RealData averages).]
