Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann
Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

ICASSP 2015

Deep Neural Networks for Acoustic Modeling

Trend: explicit feature processing → implicit learning
- MFCCs → simple filterbank features [Mohamed et al. 2013]
- Filterbanks → raw time-domain signals [Jaitly, Hinton 2011]
- Denoising → noise-aware training [Seltzer et al. 2013]

What about spatial information (microphone arrays)?
- Stacked feature vectors from multiple channels [Swietojanski et al. 2013]: phase information is not exploited
- Raw multi-channel waveforms [Hoshen et al. 2015]: hard to generalize to arbitrary acoustic scenarios

Proposal: spatial diffuseness features
- Represent spatial information independently of source position and microphone array
[Image: mh acoustics Eigenmike microphone array]

Outline
- Signal Model
- Coherence-based Dereverberation in the STFT Domain
- Extraction of Spatial Diffuseness Features

Signal Model
- Desired signal is fully coherent (only delayed between microphones)
- Noise and reverberation are diffuse and uncorrelated with the desired signal
- Coherence of the mixed sound field can be modeled as

  Γ_x(k,f) = (CDR(k,f) · Γ_s(k,f) + Γ_n(k,f)) / (CDR(k,f) + 1)

  where Γ_s is the coherence of the coherent (direct) component and Γ_n that of the diffuse field
- The coherent-to-diffuse ratio (CDR) can be estimated from the complex spatial coherence of the mixture (sketched below)
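A minimal NumPy sketch of these quantities follows. This is an illustration of the mixing model above, not the authors' released code: the smoothing constant, the epsilon guards, and the clipping to nonnegative real values are assumptions, and the closed-form CDR solution below needs the direct-sound coherence Γ_s (i.e., the source TDOA), whereas the estimators in the referenced paper are derived to work without DOA knowledge.

```python
import numpy as np

def spatial_coherence(X1, X2, alpha=0.95, eps=1e-10):
    """Recursively smoothed complex spatial coherence between two STFTs.

    X1, X2: complex STFTs of the two microphone signals, shape (frames, freqs).
    alpha:  smoothing constant for the quasi-instantaneous PSD estimates
            (value is an assumption).
    """
    P11 = np.zeros(X1.shape[1])
    P22 = np.zeros(X1.shape[1])
    P12 = np.zeros(X1.shape[1], dtype=complex)
    gamma = np.zeros(X1.shape, dtype=complex)
    for t in range(X1.shape[0]):
        P11 = alpha * P11 + (1 - alpha) * np.abs(X1[t]) ** 2
        P22 = alpha * P22 + (1 - alpha) * np.abs(X2[t]) ** 2
        P12 = alpha * P12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        gamma[t] = P12 / (np.sqrt(P11 * P22) + eps)
    return gamma

def diffuse_coherence(freqs, d, c=343.0):
    """Coherence sin(x)/x of an ideal spherically isotropic diffuse field
    for two omnidirectional mics at spacing d (in meters)."""
    x = 2 * np.pi * freqs * d / c
    return np.sinc(x / np.pi)  # np.sinc is normalized: sinc(x) = sin(pi x)/(pi x)

def cdr_from_coherence(gamma_x, gamma_n, gamma_s, eps=1e-10):
    """Solve the model gamma_x = (CDR*gamma_s + gamma_n) / (CDR + 1) for CDR.

    gamma_s requires the source DOA/TDOA; the paper's estimators avoid this.
    """
    cdr = (gamma_n - gamma_x) / (gamma_x - gamma_s + eps)
    return np.maximum(np.real(cdr), 0.0)  # keep a physically valid, real CDR
```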

Coherence-based STFT-Domain Dereverberation
1. Estimate the short-time spatial coherence (quasi-instantaneous)
2. Estimate the coherent-to-diffuse ratio (CDR)
3. Perform spectral subtraction to suppress diffuse components (see the sketch below)

[Schwarz, Kellermann: "Coherent-to-Diffuse Power Ratio Estimation for Dereverberation", IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015]

- Only instantaneous signal properties are exploited
- No knowledge or estimation of the source DOA is required
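A minimal sketch of step 3, assuming a magnitude spectral-subtraction rule driven by the diffuse power fraction 1/(CDR+1); the overestimation factor mu and spectral floor g_min are assumed tuning parameters, not values from the paper.

```python
import numpy as np

def diffuse_suppression_gain(cdr, mu=1.3, g_min=0.1):
    """Magnitude spectral-subtraction gain from the estimated CDR.

    Under the coherence model, the diffuse (reverberation + noise) power
    fraction in each time-frequency bin is 1 / (CDR + 1).
    """
    diffuse_frac = 1.0 / (cdr + 1.0)
    gain = 1.0 - np.sqrt(mu * diffuse_frac)
    return np.maximum(gain, g_min)  # spectral floor avoids musical noise

# Usage sketch: enhance a reference channel, e.g. the mean of both mics
# X_ref = 0.5 * (X1 + X2)
# S_hat = diffuse_suppression_gain(cdr) * X_ref
```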

Evaluation
Word Error Rate for the REVERB challenge evaluation set:

WER [%]                        SimData  RealData
Clean-speech-trained DNN
  logmelspec                     44.4     85.7
  enh. logmelspec                30.3     69.3
Multi-condition-trained DNN
  logmelspec                      9.5     28.8
  enh. logmelspec                 9.4     28.8

- Enhancement improves WER for the clean-trained DNN, but the improvement disappears with multi-condition training
- Multi-condition training neutralizes the effect of dereverberation

Spatial Feature Extraction
Instead of STFT-domain enhancement, extract spatial features:
- Naive approach: magnitude-squared coherence (melmsc); depends not only on the diffuse noise content, but also on microphone spacing and DOA
- Proposed instead: meldiffuseness, computed from the coherent-to-diffuse ratio as D(k,f) = 1/(CDR(k,f)+1); equals 0 for purely directional sound and 1 for purely diffuse sound (sketched below)
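As a sketch, mapping the CDR to diffuseness and pooling it over mel bands might look as follows; the row-normalized filterbank (so each band is a weighted average of D) is an assumption, since the slide does not specify the pooling.

```python
import numpy as np

def mel_diffuseness(cdr, mel_fb):
    """Map CDR to diffuseness D = 1/(CDR+1) and pool it over mel bands.

    cdr:    (frames, freqs) nonnegative CDR estimates.
    mel_fb: (n_mels, freqs) mel filterbank; rows assumed normalized to sum
            to 1, so each band is a weighted average of D.
    """
    D = 1.0 / (cdr + 1.0)  # 0 = purely directional, 1 = purely diffuse
    return D @ mel_fb.T    # (frames, n_mels)
```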

Visualization of Features
[Figure: example time-frequency maps of the logmelspec, enhanced logmelspec, and meldiffuseness features]

Evaluation Setup

REVERB challenge two-microphone task [Kinoshita et al. 2013]
- Noisy and reverberant signals created from the WSJCAM0 corpus
- Varying direction of arrival
- 2 microphones, 8 cm spacing

DNN-based speech recognizer
- Kaldi toolkit
- Hybrid DNN-HMM acoustic model
- Maxout network (4 hidden layers, 2000 inputs, 400 outputs per layer)
- 5-frame splicing
- Training on multi-condition noisy and reverberant data (17.5 hours)

Feature vectors (overall dimension: 72; a sketch of the assembly follows below)
- Noisy logmelspec features: logmelspec + Δ + ΔΔ
- Enhanced logmelspec features: enh. logmelspec + Δ + ΔΔ
- Augmented with melmsc: logmelspec + Δ + melmsc
- Augmented with meldiffuseness: logmelspec + Δ + meldiffuseness
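A sketch of how such a 72-dimensional vector (24 mel bands × 3 streams) and the frame splicing could be assembled; using np.gradient as the delta operator and reading "5-frame splicing" as ±5 context frames are assumptions.

```python
import numpy as np

def assemble_features(logmel, meldiff):
    """Stack static, delta, and diffuseness streams as on the slide.

    logmel, meldiff: (frames, 24) feature streams. np.gradient is a simple
    stand-in for the regression-based deltas typically used in ASR.
    """
    delta = np.gradient(logmel, axis=0)
    return np.concatenate([logmel, delta, meldiff], axis=1)  # (frames, 72)

def splice(feats, context=5):
    """Concatenate +/- `context` neighboring frames for the DNN input,
    padding at the edges by repeating the boundary frames."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode='edge')
    return np.concatenate(
        [padded[i:i + feats.shape[0]] for i in range(2 * context + 1)],
        axis=1)
```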

Evaluation Results
- SimData: measured impulse responses, additive noise
- RealData: real recordings in a noisy environment

WER [%]                        SimData  RealData
logmelspec                       9.5      28.8
enh. logmelspec                  9.4      28.8
logmelspec + melmsc              9.0      27.7
logmelspec + meldiffuseness      8.5      27.0

6% to 11% relative WER reduction by using spatial features
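The quoted range follows directly from the averages above:
- SimData: (9.5 − 8.5) / 9.5 ≈ 10.5% relative reduction
- RealData: (28.8 − 27.0) / 28.8 ≈ 6.25% relative reduction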

Summary

Motivation
- STFT-domain dereverberation has little effect on WER
- Idea: exploit spatial information in the DNN

Spatial diffuseness features
- Can be extracted instantaneously
- Blind: no knowledge or estimation of the source DOA required
- Device-independent features
- 6% to 11% relative WER reduction on the REVERB challenge 2-channel task
- MATLAB code available (see paper)

Can we use a similar approach to deal with directional interferers?

Thank you for your attention!

Results (Details)

WER [%]; Evaluation Set columns give SimData Rooms 1-3 (near/far microphone) with average, and RealData Room 1 (near/far) with average; Development Set columns give SimData and RealData averages.

                                          Evaluation Set                                 Dev. Set
                                          SimData                      RealData          SimData RealData
                                          Room 1   Room 2    Room 3        Room 1
Recognizer  Feature                       near far near far  near far  Avg near far  Avg  Avg    Avg
GMM-HMM     MFCC-LDA-MLLT-fMLLR           6.6  7.5 9.4  16.6 11.1 20.7 12.0 31.2 30.2 30.7 12.1   31.6
DNN-HMM     logmelspec+Δ+ΔΔ               5.7  6.7 7.7  13.9  8.7 14.6  9.5 28.5 29.1 28.8  9.7   24.9
DNN-HMM     enh. logmelspec+Δ+ΔΔ          6.6  7.1 7.7  12.2  8.3 14.6  9.4 28.5 29.1 28.8  9.1   25.3
DNN-HMM     logmelspec+Δ+melmsc           6.2  6.3 7.0  12.3  8.2 13.9  9.0 27.3 28.0 27.7  8.7   24.7
DNN-HMM     logmelspec+Δ+meldiffuseness   5.9  6.1 6.9  11.0  8.2 12.9  8.5 27.8 26.3 27.0  7.9   24.2