Integrating neural network based beamforming and weighted prediction error dereverberation
Interspeech 2018, September 2018, Hyderabad

Integrating neural network based beamforming and weighted prediction error dereverberation

Lukas Drude 1, Christoph Boeddeker 1, Jahn Heymann 1, Reinhold Haeb-Umbach 1, Keisuke Kinoshita 2, Marc Delcroix 2, Tomohiro Nakatani 2

1 Paderborn University, Department of Communications Engineering, Paderborn, Germany
2 NTT Communication Science Laboratories, NTT Corporation, Kyoto, Japan
{drude, boeddeker, heymann, haeb}@nt.upb.de, {kinoshita.k, marc.delcroix, nakatani.tomohiro}@lab.ntt.co.jp

Abstract

The weighted prediction error (WPE) algorithm has proven to be a very successful dereverberation method for the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated very good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, up to now a comparison of a neural beamformer and WPE is still missing, as is an investigation into a combination of the two. Therefore, we here provide an extensive evaluation of both and consequently propose variants to integrate deep neural network based beamforming with WPE. For these integrated variants we identify a consistent word error rate (WER) reduction on two distinct databases. In particular, our study shows that deep learning based beamforming benefits from a model-based dereverberation technique (i.e. WPE) and vice versa. Our key findings are: (a) Neural beamforming yields lower WERs than WPE the more channels and noise are present. (b) Integration of WPE and a neural beamformer consistently outperforms all stand-alone systems.

Index Terms: speech recognition, speech enhancement, denoising, dereverberation

1. Introduction

Automatic speech recognition (ASR) has found wide-spread acceptance and works astonishingly well in close-talking scenarios.
Although a lot of research is devoted to far-field technologies and products are already commercially available, ASR in reverberant and noisy environments remains challenging. These impairments can be addressed with a dereverberating and/or denoising front-end. WPE is a compelling algorithm to blindly dereverberate acoustic signals based on long-term linear prediction. Proposed as early as 2008, it gained constant attention in subsequent years [1, 2]. Possibly most prominent is its use in the Google Home speech assistant hardware in online conditions [3, 4]. Signal distortions due to noise are addressed quite differently. Besides denoising auto-encoders [5, 6] and dictionary based approaches [7], beamforming is very successful [8, 9]. As of recently, deep neural networks (DNNs) are employed to more robustly estimate masks which are then used to calculate speech and noise covariance matrices for beamforming. Most notably, all top performing systems of the CHiME 4 challenge employed some form of neural beamforming [10, 11, 12, 13]. Interestingly, when the mask estimator is trained to distinguish between early arriving speech vs. late arriving speech and noise, the beamformer dereverberates and denoises the observation [14]. However, there is limited research on integrating dereverberation and beamforming. Delcroix et al. compare consecutive execution of either minimum variance distortionless response (MVDR) beamforming followed by WPE or vice versa [15] on REVERB challenge data [16] but do not consider any further integration. They estimate noise covariance matrices on the initial and final 10 frames of each utterance but do not use any kind of DNN or mask estimation. Kinoshita et al. evaluate how WPE profits from side information provided by a DNN [17]. They conclude that the DNN does not improve overall dereverberation performance over WPE but allows WPE to operate on much shorter block sizes without performance degradation. Cohen et al.
use WPE and MVDR beamforming in an interesting two-step system: WPE dereverberates the signal first and provides a distortion covariance matrix based on the estimate of the reverberation tail. Then, an MVDR beamformer uses the covariance matrix of the reverberation tail instead of a noise covariance matrix [18]. Ito et al. elegantly integrate WPE and model based blind source separation but again omit any discriminatively trained model for mask estimation [19]. To the best of our knowledge an integration of WPE and neural network based beamforming is still missing. Therefore we propose to combine neural network based generalized eigenvalue (GEV) beamforming and WPE dereverberation and assess consecutive and integrated combinations. To provide an in-depth analysis we evaluate all systems on the REVERB challenge data [16], which generally favors dereverberation algorithms, and on a database composed of Wall Street Journal (WSJ) utterances [20] with VoiceHome room impulse responses (RIRs) and noise [21, 22], which tends to favor beamforming approaches.

2. Scenario and signal model

Let an observed signal vector y_{t,f} in the short time Fourier transformation (STFT) domain cover D microphone channels, where t and f are the time frame index and frequency bin index, respectively. The observation may be impaired by (convolutive) reverberation and additive noise n_{t,f}. We here assume that the early part of the RIR is beneficial whereas the reverberation tail hinders understanding. Therefore, we consider the first 50 ms after the main peak of the RIR as h^(early) and the remaining part as h^(tail). Consequently, the mixing process in the STFT domain is modeled as follows:

    y_{t,f} = x_{t,f}^(early) + x_{t,f}^(tail) + n_{t,f},    (1)

where x^(early) and x^(tail) are the STFTs of the source signal convolved with the early RIR and with the late reflections, respectively. In essence, we explicitly allow RIRs longer than the length of a DFT window.
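As a concrete illustration, the time-domain analogue of Eq. (1) can be sketched in NumPy by splitting an RIR at 50 ms after its main peak. This is a minimal sketch with synthetic toy signals; all names, lengths and the exponential-decay RIR are illustrative, not the data used in the paper:

```python
import numpy as np

def split_rir(h, fs=16000, early_ms=50):
    """Split an RIR into the early part (main peak + 50 ms) and the tail."""
    peak = int(np.argmax(np.abs(h)))
    cut = peak + int(early_ms * fs / 1000)
    h_early, h_tail = h.copy(), h.copy()
    h_early[cut:] = 0.0   # keep only the first 50 ms after the main peak
    h_tail[:cut] = 0.0    # keep only the late reflections
    return h_early, h_tail

# toy example: a 1 s source, an exponentially decaying random RIR, white noise
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(fs)
h = np.exp(-np.arange(fs // 2) / 800.0) * rng.standard_normal(fs // 2)
h_early, h_tail = split_rir(h, fs)
x_early = np.convolve(s, h_early)[: len(s)]
x_tail = np.convolve(s, h_tail)[: len(s)]
n = 0.01 * rng.standard_normal(len(s))
y = x_early + x_tail + n   # Eq. (1), here in the time domain for one channel
```

Applying an STFT to y, x_early, x_tail and n per channel yields the quantities of Eq. (1).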
3. Baseline: WPE

The underlying idea of WPE is to estimate the reverberation tail of the signal and subtract it from the observation to obtain a maximum likelihood estimate of early arriving speech. Let us start by defining how to obtain a single channel estimate with given filter weights g_{τ,f,d,d'}:

    x̂^(early)_{t,f,d} = y_{t,f,d} − Σ_{τ=Δ}^{Δ+K−1} Σ_{d'} g*_{τ,f,d,d'} y_{t−τ,f,d'},    i.e.    x̂^(early)_{t,f} = y_{t,f} − G_f^H ỹ_{t−Δ,f},    (2)

where Δ > 0 is a minimum delay to avoid removing correlations caused by the speech source, K is the number of taps used for estimation and d is the sensor index. To simplify notation, G_f ∈ C^{DK×D} and ỹ_{t−Δ,f} ∈ C^{DK×1} are stacked representations of the filter weights and the observations. WPE maximizes the likelihood of a model under the assumption that each direct signal is a realization of a zero-mean complex (proper) Gaussian with an unknown time-varying variance λ_{t,f}:

    p(x^(early)_{t,f,d}; λ_{t,f}) = CN(x^(early)_{t,f,d}; 0, λ_{t,f}).    (3)

The maximum likelihood optimization does not lead to a closed form solution. However, an iterative procedure alternates between estimating the filter coefficients G_f and the time-varying variance λ_{t,f}:

    Step 1)  λ_{t,f} = 1 / ((1 + 2δ) D) · Σ_{τ=t−δ}^{t+δ} Σ_d |x̂^(early)_{τ,f,d}|²,    (4)

    Step 2)  R_f = Σ_t ỹ_{t−Δ,f} ỹ^H_{t−Δ,f} / λ_{t,f} ∈ C^{DK×DK},    (5)
             P_f = Σ_t ỹ_{t−Δ,f} y^H_{t,f} / λ_{t,f} ∈ C^{DK×D},    (6)
             G_f = R_f^{−1} P_f ∈ C^{DK×D}.    (7)

Here, we use a context of ±δ frames to improve the variance estimate as proposed by Nakatani et al. [23].

4. Baseline: Neural beamforming

Neural beamforming combines a DNN-based mask estimator with an analytic formulation to obtain a beamforming vector for speech enhancement [14]. Firstly, a mask estimation network is trained on single channel magnitude spectra to guarantee independence from the microphone array configuration. The training minimizes a cross entropy loss with an oracle speech and distortion mask. For denoising, the mask estimator is trained to differentiate between speech and noise. As demonstrated in an earlier study [14], the neural beamformer given the right training targets is also able to perform denoising and dereverberation simultaneously to some degree.
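The alternating updates of Eqs. (4)-(7) can be sketched per frequency bin as follows. This is a minimal NumPy sketch under illustrative assumptions (function name, default parameters, simple edge handling at the utterance boundaries), not the authors' implementation:

```python
import numpy as np

def wpe_per_frequency(Y, taps=7, delay=3, delta=1, iterations=3, eps=1e-10):
    """Multi-channel WPE for one frequency bin, following Eqs. (2)-(7).
    Y: (D, T) complex STFT of one frequency bin."""
    D, T = Y.shape
    X = Y.copy()                                 # start from the observation
    # stacked delayed observations y~_{t-Delta,f}, shape (D*taps, T)
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        shift = delay + k
        Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]
    for _ in range(iterations):
        # Step 1: time-varying variance with +-delta context, Eq. (4)
        power = np.abs(X) ** 2
        lam = np.empty(T)
        for t in range(T):
            sl = slice(max(t - delta, 0), min(t + delta + 1, T))
            lam[t] = power[:, sl].mean()         # mean over channels and context
        lam = np.maximum(lam, eps)
        # Step 2: correlation matrices and filter update, Eqs. (5)-(7)
        Yw = Y_tilde / lam                       # divide each frame by lambda_t
        R = Yw @ Y_tilde.conj().T                # (D*taps, D*taps)
        P = Yw @ Y.conj().T                      # (D*taps, D)
        G = np.linalg.solve(R, P)                # R^{-1} P
        X = Y - G.conj().T @ Y_tilde             # Eq. (2)
    return X
```

Running the function over all frequency bins of a multi-channel STFT yields the dereverberated estimate x̂^(early).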
Therefore, we here opt to train it to distinguish between the early arriving speech on the one hand and the late arriving speech and noise on the other hand:

    M^(s/n)_{t,f,d} = 1 if |x^(early)_{t,f,d}|² / |x^(tail)_{t,f,d} + n_{t,f,d}|² ≷ th^(s/n), and 0 else,    (8)

where the comparison is > for the speech mask (s) and < for the noise mask (n), and th^(s/n) are threshold values to improve robustness (for details regarding the thresholds see [14]). The mask estimator consists of a bidirectional long short-term memory (BLSTM) layer with 512 forward and 512 backward units, two linear layers with 1024 units and ELU activation functions [24] and a final linear layer with a sigmoid activation function [14]. Thus, the final layer yields the two masks for each channel. The masks for each channel are pooled with a median operation to obtain M̂^(s) and M̂^(n). The GEV beamformer has proven to be robust with respect to numerical instabilities and yields great improvements in terms of both signal to noise ratio (SNR) gain and WER reduction, while often outperforming the frequently used MVDR beamformer [8, 14]. It optimizes the expected output SNR:

    w_f = argmax_w (w^H Φ^(s)_f w) / (w^H Φ^(n)_f w),    (9)

where the spatial covariance matrices are obtained by a weighted mean of dyadic products:

    Φ^(s/n)_f = Σ_t M̂^(s/n)_{t,f} y_{t,f} y^H_{t,f}.    (10)

The enhanced signal is then given by x̂^(early)_{t,f} = w_f^H y_{t,f}. It is naturally constrained to linear effects and leaves all further nonlinear enhancements to the acoustic model (AM). To reduce distortions caused by arbitrary scaling of each beamforming vector, we opt to apply blind analytic normalization (BAN) [8].

5. Proposed systems

We propose and analyze three speech enhancement front-ends which can then be used with any single-channel back-end. The first two combine WPE with neural beamforming but avoid any interaction between the two algorithms. Both can be seen as the logical extension of [15] with a neural network mask estimator for the beamforming step.
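Given estimated masks, the maximizer of Eq. (9) with the covariances of Eq. (10) is a largest generalized eigenvector, which for a full-rank Φ^(n) reduces to an ordinary eigenproblem of Φ^(n)⁻¹ Φ^(s). A hedged NumPy sketch for a single frequency bin (illustrative names; a small diagonal loading is added for stability; BAN is omitted):

```python
import numpy as np

def gev_beamformer(Y, mask_s, mask_n, eps=1e-10):
    """GEV beamformer for one frequency bin.
    Y: (D, T) complex STFT bin; mask_s, mask_n: (T,) real masks in [0, 1]."""
    D = Y.shape[0]
    # mask-weighted spatial covariance matrices, Eq. (10)
    Phi_s = (mask_s * Y) @ Y.conj().T
    Phi_n = (mask_n * Y) @ Y.conj().T + eps * np.eye(D)
    # the maximizer of Eq. (9) is the eigenvector of Phi_n^{-1} Phi_s
    # belonging to the largest eigenvalue
    vals, vecs = np.linalg.eig(np.linalg.solve(Phi_n, Phi_s))
    w = vecs[:, np.argmax(vals.real)]
    return w, w.conj() @ Y        # beamforming vector, enhanced bin w^H y
```

By construction w maximizes the Rayleigh quotient in Eq. (9), so its output SNR criterion is at least as large as that of any fixed reference vector.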
The third is a real integration of both algorithms with a feedback loop.

5.1. Neural beamforming followed by WPE

One option is to perform neural beamforming first as depicted in Fig. 1. A DNN estimates masks which are then used to obtain speech and distortion covariance matrices with Eq. (10). GEV filter coefficients are then obtained with Eq. (9) and applied to the observation to obtain an intermediate enhanced single-channel signal x̂_{t,f} = w_f^H y_{t,f}. Subsequently, three iterations of single channel WPE according to Eqs. (4)-(7) (omitting the summation over d) are performed to obtain an estimate x̂^(early)_{t,f}. Using single channel WPE is significantly faster, but cross-channel information cannot be exploited for further dereverberation. In this constellation a context of δ > 0 is particularly important for a robust power estimate.

Figure 1: Neural beamforming followed by WPE. Power in each time-frequency bin is estimated in the λ block. WPE can just operate on a single channel but is exposed to less noise.
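In the scheme above, the WPE stage degenerates to the single-channel case (D = 1, no summation over d) applied to the beamformed signal. A minimal self-contained sketch for one frequency bin; function name and defaults are illustrative:

```python
import numpy as np

def single_channel_wpe(x, taps=7, delay=3, delta=1, iterations=3, eps=1e-10):
    """Single-channel WPE, Eqs. (4)-(7) with the channel sum omitted.
    x: (T,) complex STFT bin of the beamformed signal w^H y."""
    T = len(x)
    X = x.copy()
    # stacked delayed observations, shape (taps, T)
    x_tilde = np.zeros((taps, T), dtype=x.dtype)
    for k in range(taps):
        x_tilde[k, delay + k:] = x[:T - delay - k]
    for _ in range(iterations):
        power = np.abs(X) ** 2
        # Eq. (4) with +-delta context; with one channel the robustness of
        # the power estimate hinges on delta > 0
        lam = np.array([power[max(t - delta, 0):t + delta + 1].mean()
                        for t in range(T)])
        lam = np.maximum(lam, eps)
        xw = x_tilde / lam
        R = xw @ x_tilde.conj().T     # Eq. (5), now (taps, taps)
        p = xw @ x.conj()             # Eq. (6), now (taps,)
        g = np.linalg.solve(R, p)     # Eq. (7)
        X = x - g.conj() @ x_tilde    # Eq. (2)
    return X
```

Note the reduced parameter count: only K filter coefficients per bin instead of DK × D in the multi-channel case.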
5.2. WPE followed by neural beamforming

When applying WPE first as in Fig. 2, the power estimation is initialized with the observed signal plugged into Eq. (4). Consecutively, three WPE iterations are performed. The algorithm has to estimate more parameters on each utterance but can potentially use cross-channel information for dereverberation. Then, the end result is obtained by applying the beamforming vector w_f to the dereverberated signal, although the mask estimator trained to select early arriving speech just saw the observation y_{t,f}. It is possible to perform mask estimation on the dereverberated signal instead, which yields further improvements.

Figure 2: WPE followed by neural beamforming. The beamformer receives a dereverberated version of the observation. Power in each time-frequency bin is estimated in the λ block.

5.3. Integration of neural beamforming and WPE

Alternatively, the beamforming step can be integrated into the WPE loop as shown in Fig. 3. This way, the power estimates λ_{t,f} for WPE are computed from the beamforming result and are already more precise. Assuming a number of three WPE iterations, the mask estimator has to run only once whereas beamforming is performed four times. Since the GEV beamforming is much faster than the neural network, additional beamforming steps are cheap. It is possible to run the mask estimator on the dereverberated signal in each iteration, which improves the WERs but is computationally more expensive.

Figure 3: Proposed way to integrate neural beamforming and WPE dereverberation. Masks for beamforming are provided by a DNN. Power in each time-frequency bin is estimated in the λ block. After the first iteration, beamforming is applied to the dereverberated estimate instead of the input y_{t,f}.

6. Acoustic model

To train an acoustic model for each database, we first extracted alignments by processing the early arriving speech x^(early) using a triphone GMM-HMM recognizer. We then extracted MFCC features from the reverberated and noisy observations with deltas and delta-deltas to train a 7+1 layer DNN acoustic model with a context of neighboring frames.
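The feedback loop of Sec. 5.3 (Fig. 3) can be sketched by alternating a GEV step, a power estimate taken from the beamformed signal, and a multi-channel WPE filter update. A hedged, self-contained NumPy sketch with illustrative names; the masks are assumed given (one DNN run), and the ±δ context smoothing of Eq. (4) is omitted for brevity:

```python
import numpy as np

def integrated_wpe_gev(Y, mask_s, mask_n, taps=7, delay=3, iterations=3,
                       eps=1e-10):
    """Integrated WPE + GEV for one frequency bin (Sec. 5.3 sketch).
    Y: (D, T) complex STFT bin; mask_s, mask_n: (T,) masks. Runs
    `iterations` WPE updates and `iterations` + 1 beamforming steps."""
    D, T = Y.shape
    # stacked delayed observations for the WPE filter, cf. Eq. (2)
    Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
    for k in range(taps):
        Y_tilde[k * D:(k + 1) * D, delay + k:] = Y[:, :T - delay - k]

    def gev(X):
        # mask-weighted covariances and the GEV vector, Eqs. (9)-(10)
        Phi_s = (mask_s * X) @ X.conj().T
        Phi_n = (mask_n * X) @ X.conj().T + eps * np.eye(D)
        vals, vecs = np.linalg.eig(np.linalg.solve(Phi_n, Phi_s))
        return vecs[:, np.argmax(vals.real)]

    X = Y.copy()                                  # current dereverberated estimate
    for _ in range(iterations):
        w = gev(X)
        x_bf = w.conj() @ X                       # beamform the current estimate
        lam = np.maximum(np.abs(x_bf) ** 2, eps)  # power from the beamformed signal
        # WPE filter update on the raw observation, Eqs. (5)-(7)
        Yw = Y_tilde / lam
        G = np.linalg.solve(Yw @ Y_tilde.conj().T, Yw @ Y.conj().T)
        X = Y - G.conj().T @ Y_tilde              # Eq. (2)
    w = gev(X)
    return w.conj() @ X                           # fourth and final beamforming step
```

After the first pass, beamforming operates on the dereverberated estimate rather than on the raw input, mirroring Fig. 3.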
The training recipe adopts the mechanics of the CHiME 3 baseline training recipe (master/egs/chime3/s5/local/run_dnn.sh).

7. Evaluation

To assess the advantages and disadvantages of each approach we evaluate the proposed combinations and the stand-alone systems in terms of WERs on two distinct databases: (a) The REVERB challenge database is very reverberated with very little noise. However, the reverberation is realistic, since the database contains real recordings in reverberant rooms. (b) The WSJ+VoiceHome database is fairly noisy. The room impulse responses are recorded and convolved with the utterances, granting more control over the simulation setup. All presented WERs are obtained using the standard trigram language model available with the WSJ corpus. In all presented results the language model weight ∈ {4, ..., 15}, the number of filter taps K ∈ {1, ..., 20}, the delay Δ ∈ {1, 2, 3} and the context δ ∈ {0, 1} for power estimation were optimized on the development set of each database. The DFT window size was set to 1024 (64 ms) while the shift was set to 256 (16 ms) for all reported results.

7.1. Results on REVERB challenge data

The REVERB challenge dataset [16] contains simulated and real utterances. The training data only consists of simulated recordings while we used only the real development and test recordings for cross-validation and test, respectively. For simulated data WSJCAM0 utterances [25] are convolved with measured RIRs. Noise is added with approximately 20 dB SNR. Reverberation times (T60) are in the range of several hundred ms. The real dataset consists of utterances from the MC-WSJ-AV corpus [26] which are recorded in a noisy reverberant room with a reverberation time of approximately 700 ms. First, we analyzed the effect of different numbers of filter taps K on the WER. Previous studies found that longer utterances allow higher numbers of filter taps since the parameters can be more precisely estimated.
For the given test dataset with an average utterance duration of 6.5 s, around 7 taps (160 ms field of view) turned out to be optimal as visualized in Fig. 4. The integration as well as WPE show a similar trend for higher K. However, the steep WER increase for K < 4 is gone. Already a very small number of taps yields an improvement over the dereverberating beamformer itself. One possible interpretation is that reduction of very late reverberation is already done by the dereverberating beamformer and WPE just needs to cover shorter time differences. The combination of beamforming followed by WPE is fairly constant in the range of 0 to 20 taps since this combination just needs to estimate K instead of DK × D coefficients.

Figure 4: WER depending on the number of taps K for different systems (unprocessed, WPE, integration) on the REVERB real evaluation set with 8 channels. Integrated methods tend to work better for a lower number of taps.

Table 1: WERs in % for all systems evaluated on the REVERB real evaluation dataset with different numbers of channels. Many systems coincide for the single channel track.

Table 2: WERs in % for all systems evaluated on the WSJ+VoiceHome dataset with different numbers of channels. Many systems coincide for the single channel track.

Tbl. 1 summarizes WERs for all systems. The WER for unprocessed signals coincides with GEV beamforming when just one channel is available since a BAN filter is used. In single-channel combinations, the WER reduction is achieved due to WPE. For up to two channels WPE outperforms a dereverberating beamformer. The best performance with 8 channels is achieved when combining WPE with the neural beamformer, with a relative WER reduction of 51 % over the unprocessed system and 25 % over WPE alone. Nevertheless, WPE causes a relative WER reduction of 13 % over a dereverberating GEV beamformer and is therefore crucial when striving for best performance.

7.2. Results on WSJ with VoiceHome RIRs and noise

Similar to the simulation setup proposed by Bertin et al. [22], WSJ utterances (test eval92 5k) are convolved with VoiceHome RIRs and VoiceHome background noise [21] with reverberation times (T60) in the range of several hundred ms. Worth noting, the RIRs are recorded in three different houses, such that training, cross-validation and test can use disjunct RIRs to ensure generalization.

Figure 5: Comparison of WERs for different SNR conditions with 8 channels. The WSJ+VoiceHome test dataset was recreated for each SNR condition. All parameters for each datapoint are obtained on the development dataset.

The VoiceHome background noise is very dynamic and contains, e.g.,
vacuum cleaner, dish washing or interviews on television, as typically found in households. On the WSJ+VoiceHome dataset the best WERs were obtained without BAN. However, to ease comparison with the REVERB results, we opted to present results with BAN. Since [15] states that WPE shows some noise robustness although the original derivation does not explicitly model additive noise, we first analyze the effect of different SNR conditions. To do so, we sweep the SNR of every utterance of the test data in Fig. 5. It turns out that WPE is able to improve WERs in all noise conditions. However, beamforming alone is the significant driver on this noisy database. Particularly in noisy conditions the integration of neural network based beamforming into the WPE processing loop yields the best performance. Tbl. 2 summarizes WERs for each system with a fixed random SNR in the range of 0 to 10 dB. The single-channel GEV result does not coincide with the unprocessed signal, since our implementation may introduce phase changes when performing an eigenvalue decomposition on a scalar. The integrated system works best for any number of channels D > 1.

8. Future directions

We are interested in analyzing how a single neural network can provide a power estimate for WPE similar to [17] as well as a mask estimate for neural beamforming, thus allowing to get rid of the WPE iteration in a noise robust integrated system.

9. Conclusions

Across both databases and a wide range of parameter combinations, a combination of WPE and neural beamforming consistently improves WERs over (a) a dereverberating neural beamformer and (b) WPE dereverberation. Neural beamforming is particularly important when many channels are available and the observations are very noisy. However, WPE is crucial to obtain the best WERs when combined with neural beamforming.

10. Acknowledgements

Calculations leading to the results presented here were performed on resources provided by the Paderborn Center for Parallel Computing.
11. References

[1] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[2] T. Yoshioka and T. Nakatani, "Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening," IEEE Transactions on Audio, Speech, and Language Processing.
[3] B. Li, T. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin et al., "Acoustic modeling for Google Home," in Interspeech.
[4] J. Caroselli, I. Shafran, A. Narayanan, and R. Rose, "Adaptive multichannel dereverberation for automatic speech recognition," in Interspeech.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM.
[6] A. Maas, Q. Le, T. O'Neil, O. Vinyals, P. Nguyen, and A. Ng, "Recurrent neural networks for noise reduction in robust ASR," in Thirteenth Annual Conference of the International Speech Communication Association.
[7] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature.
[8] E. Warsitz and R. Haeb-Umbach, "Blind acoustic beamforming based on generalized eigenvalue decomposition," IEEE Transactions on Audio, Speech, and Language Processing.
[9] M. Souden, J. Benesty, and S. Affes, "A study of the LCMV and MVDR noise reduction filters," IEEE Transactions on Signal Processing.
[10] J. Du, Y.-H. Tu, L. Sun, F. Ma, H.-K. Wang, J. Pan, C. Liu, J.-D. Chen, and C.-H. Lee, "The USTC-iFlytek system for CHiME-4 challenge," in CHiME-4 workshop.
[11] T. Menne, J. Heymann, A. Alexandridis, K. Irie, A. Zeyer, M. Kitza, P. Golik, I. Kulikov, L. Drude, R. Schlüter et al., "The RWTH/UPB/FORTH system combination for the 4th CHiME challenge evaluation," in CHiME-4 workshop.
[12] H. Erdogan, T. Hayashi, J. Hershey, T. Hori, C. Hori, W.-N. Hsu, S. Kim, J. Le Roux, Z. Meng, and S. Watanabe, "Multi-channel speech recognition: LSTMs all the way through," in CHiME-4 workshop.
[13] J. Heymann, L. Drude, and R. Haeb-Umbach, "Wide residual BLSTM network with discriminative speaker adaptation for robust speech recognition," in CHiME-4 workshop.
[14] ——, "A generic neural acoustic beamforming architecture for robust multi-channel speech processing," Computer Speech & Language.
[15] M. Delcroix, T. Yoshioka, A. Ogawa, Y. Kubo, M. Fujimoto, N. Ito, K. Kinoshita, M. Espi, S. Araki, T. Hori et al., "Strategies for distant speech recognition in reverberant environments," EURASIP Journal on Advances in Signal Processing.
[16] K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, A. Sehr, W. Kellermann, and R. Maas, "The REVERB challenge: A common evaluation framework for dereverberation and recognition of reverberant speech," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA).
[17] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani, "Neural network-based spectrum estimation for online WPE dereverberation," in Interspeech.
[18] A. Cohen, G. Stemmer, S. Ingalsuo, and S. Markovich-Golan, "Combined weighted prediction error and minimum variance distortionless response for dereverberation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[19] N. Ito, S. Araki, T. Yoshioka, and T. Nakatani, "Relaxed disjointness based clustering for joint blind source separation and dereverberation," in International Workshop on Acoustic Signal Enhancement (IWAENC).
[20] J. Garofolo, D. Graff, D. Paul, and D. Pallett, "CSR-I (WSJ0) complete LDC93S6A," Web Download. Philadelphia: Linguistic Data Consortium. [Online]. Available: https://catalog.ldc.upenn.edu/ldc93s6a
[21] N. Bertin, E. Camberlein, E. Vincent, R. Lebarbenchon, S. Peillon, E. Lamand, S. Sivasankaran, F. Bimbot, I. Illina, A. Tom, S. Fleury, and E. Jamet, "A French corpus for distant-microphone speech processing in real homes," in Interspeech.
[22] S. Sivasankaran, E. Vincent, and I. Illina, "A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions," Computer Speech & Language.
[23] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, "Speech dereverberation based on variance-normalized delayed linear prediction," IEEE Transactions on Audio, Speech, and Language Processing.
[24] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint.
[25] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, "WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
[26] M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, "The multichannel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments," in IEEE Workshop on Automatic Speech Recognition and Understanding.
Diffuse noise suppression with asynchronous microphone array based on amplitude additivity model Yoshikazu Murase, Hironobu Chiba, Nobutaka Ono, Shigeki Miyabe, Takeshi Yamada, and Shoji Makino University
More informationEstimation and detection of a periodic signal
Estimation and detection o a periodic signal Daniel Aronsson, Erik Björnemo, Mathias Johansson Signals and Systems Group, Uppsala University, Sweden, e-mail: Daniel.Aronsson,Erik.Bjornemo,Mathias.Johansson}@Angstrom.uu.se
More informationExperiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition
Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden
More informationA Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement
A Variance Modeling Framework Based on Variational Autoencoders for Speech Enhancement Simon Leglaive 1 Laurent Girin 1,2 Radu Horaud 1 1: Inria Grenoble Rhône-Alpes 2: Univ. Grenoble Alpes, Grenoble INP,
More informationarxiv: v2 [cs.sd] 15 May 2017
FREQUENCY DOMAIN SINGULAR VALUE DECOMPOSITION FOR EFFICIENT SPATIAL AUDIO CODING Sina Zamani, Tejaswi Nanjundaswamy, Kenneth Rose Department o Electrical and Computer Engineering, University o Caliornia
More informationBLACK BOX OPTIMIZATION FOR AUTOMATIC SPEECH RECOGNITION. Shinji Watanabe and Jonathan Le Roux
2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) BLACK BOX OPTIMIZATION FOR AUTOMATIC SPEECH RECOGNITION Shinji Watanabe and Jonathan Le Roux Mitsubishi Electric Research
More informationSeq2Tree: A Tree-Structured Extension of LSTM Network
Seq2Tree: A Tree-Structured Extension of LSTM Network Weicheng Ma Computer Science Department, Boston University 111 Cummington Mall, Boston, MA wcma@bu.edu Kai Cao Cambia Health Solutions kai.cao@cambiahealth.com
More informationEstimating Correlation Coefficient Between Two Complex Signals Without Phase Observation
Estimating Correlation Coefficient Between Two Complex Signals Without Phase Observation Shigeki Miyabe 1B, Notubaka Ono 2, and Shoji Makino 1 1 University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki
More informationTwo-step self-tuning phase-shifting interferometry
Two-step sel-tuning phase-shiting intererometry J. Vargas, 1,* J. Antonio Quiroga, T. Belenguer, 1 M. Servín, 3 J. C. Estrada 3 1 Laboratorio de Instrumentación Espacial, Instituto Nacional de Técnica
More informationCHAPTER 8 ANALYSIS OF AVERAGE SQUARED DIFFERENCE SURFACES
CAPTER 8 ANALYSS O AVERAGE SQUARED DERENCE SURACES n Chapters 5, 6, and 7, the Spectral it algorithm was used to estimate both scatterer size and total attenuation rom the backscattered waveorms by minimizing
More informationTLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall TLT-5200/5206 COMMUNICATION THEORY, Exercise 3, Fall Problem 1.
TLT-5/56 COMMUNICATION THEORY, Exercise 3, Fall Problem. The "random walk" was modelled as a random sequence [ n] where W[i] are binary i.i.d. random variables with P[W[i] = s] = p (orward step with probability
More informationAn Ensemble Kalman Smoother for Nonlinear Dynamics
1852 MONTHLY WEATHER REVIEW VOLUME 128 An Ensemble Kalman Smoother or Nonlinear Dynamics GEIR EVENSEN Nansen Environmental and Remote Sensing Center, Bergen, Norway PETER JAN VAN LEEUWEN Institute or Marine
More informationALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING
ALTERNATIVE OBJECTIVE FUNCTIONS FOR DEEP CLUSTERING Zhong-Qiu Wang,2, Jonathan Le Roux, John R. Hershey Mitsubishi Electric Research Laboratories (MERL), USA 2 Department of Computer Science and Engineering,
More informationSINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS
204 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) SINGLE-CHANNEL SPEECH PRESENCE PROBABILITY ESTIMATION USING INTER-FRAME AND INTER-BAND CORRELATIONS Hajar Momeni,2,,
More informationAdditional exercises in Stationary Stochastic Processes
Mathematical Statistics, Centre or Mathematical Sciences Lund University Additional exercises 8 * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
More information"Robust Automatic Speech Recognition through on-line Semi Blind Source Extraction"
"Robust Automatic Speech Recognition through on-line Semi Blind Source Extraction" Francesco Nesta, Marco Matassoni {nesta, matassoni}@fbk.eu Fondazione Bruno Kessler-Irst, Trento (ITALY) For contacts:
More informationAcoustic MIMO Signal Processing
Yiteng Huang Jacob Benesty Jingdong Chen Acoustic MIMO Signal Processing With 71 Figures Ö Springer Contents 1 Introduction 1 1.1 Acoustic MIMO Signal Processing 1 1.2 Organization of the Book 4 Part I
More informationBLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION
BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya
More informationSINGLE-CHANNEL BLIND ESTIMATION OF REVERBERATION PARAMETERS
SINGLE-CHANNEL BLIND ESTIMATION OF REVERBERATION PARAMETERS Clement S. J. Doire, Mike Brookes, Patrick A. Naylor, Dave Betts, Christopher M. Hicks, Mohammad A. Dmour, Søren Holdt Jensen Electrical and
More informationShinji Watanabe Mitsubishi Electric Research Laboratories (MERL) Xiong Xiao Nanyang Technological University. Marc Delcroix
Shinji Watanabe Mitsubishi Electric Research Laboratories (MERL) Xiong Xiao Nanyang Technological University Marc Delcroix NTT Communication Science Laboratories 2 1. 2. 3. 4. 3 List of abbreviations ASR
More informationAlternative Learning Algorithm for Stereophonic Acoustic Echo Canceller without Pre-Processing
1958 PAPER Special Section on Digital Signal Processing Alternative Learning Algorithm or Stereophonic Acoustic Echo Canceller without Pre-Processing Akihiro HIRANO a),kenjinakayama, Memers, Daisuke SOMEDA,
More information2. ETA EVALUATIONS USING WEBER FUNCTIONS. Introduction
. ETA EVALUATIONS USING WEBER FUNCTIONS Introduction So ar we have seen some o the methods or providing eta evaluations that appear in the literature and we have seen some o the interesting properties
More informationRobust Residual Selection for Fault Detection
Robust Residual Selection or Fault Detection Hamed Khorasgani*, Daniel E Jung**, Gautam Biswas*, Erik Frisk**, and Mattias Krysander** Abstract A number o residual generation methods have been developed
More informationThe achievable limits of operational modal analysis. * Siu-Kui Au 1)
The achievable limits o operational modal analysis * Siu-Kui Au 1) 1) Center or Engineering Dynamics and Institute or Risk and Uncertainty, University o Liverpool, Liverpool L69 3GH, United Kingdom 1)
More informationRecent advances in distant speech recognition
Interspeech 2016 tutorial: Recent advances in distant speech recognition Marc Delcroix NTT Communication Science Laboratories Shinji Watanabe Mitsubishi Electric Research Laboratories (MERL) Table of contents
More informationUltra Fast Calculation of Temperature Profiles of VLSI ICs in Thermal Packages Considering Parameter Variations
Ultra Fast Calculation o Temperature Proiles o VLSI ICs in Thermal Packages Considering Parameter Variations Je-Hyoung Park, Virginia Martín Hériz, Ali Shakouri, and Sung-Mo Kang Dept. o Electrical Engineering,
More informationHighway-LSTM and Recurrent Highway Networks for Speech Recognition
Highway-LSTM and Recurrent Highway Networks for Speech Recognition Golan Pundak, Tara N. Sainath Google Inc., New York, NY, USA {golan, tsainath}@google.com Abstract Recently, very deep networks, with
More informationIMPROVED NOISE CANCELLATION IN DISCRETE COSINE TRANSFORM DOMAIN USING ADAPTIVE BLOCK LMS FILTER
SANJAY KUMAR GUPTA* et al. ISSN: 50 3676 [IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY Volume-, Issue-3, 498 50 IMPROVED NOISE CANCELLATION IN DISCRETE COSINE TRANSFORM DOMAIN
More informationCHAPTER 1: INTRODUCTION. 1.1 Inverse Theory: What It Is and What It Does
Geosciences 567: CHAPTER (RR/GZ) CHAPTER : INTRODUCTION Inverse Theory: What It Is and What It Does Inverse theory, at least as I choose to deine it, is the ine art o estimating model parameters rom data
More informationGaussian Process Regression Models for Predicting Stock Trends
Gaussian Process Regression Models or Predicting Stock Trends M. Todd Farrell Andrew Correa December 5, 7 Introduction Historical stock price data is a massive amount o time-series data with little-to-no
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationSpeaker recognition by means of Deep Belief Networks
Speaker recognition by means of Deep Belief Networks Vasileios Vasilakakis, Sandro Cumani, Pietro Laface, Politecnico di Torino, Italy {first.lastname}@polito.it 1. Abstract Most state of the art speaker
More informationUnfolded Recurrent Neural Networks for Speech Recognition
INTERSPEECH 2014 Unfolded Recurrent Neural Networks for Speech Recognition George Saon, Hagen Soltau, Ahmad Emami and Michael Picheny IBM T. J. Watson Research Center, Yorktown Heights, NY, 10598 gsaon@us.ibm.com
More informationITERATED SRINKAGE ALGORITHM FOR BASIS PURSUIT MINIMIZATION
ITERATED SRINKAGE ALGORITHM FOR BASIS PURSUIT MINIMIZATION Michael Elad The Computer Science Department The Technion Israel Institute o technology Haia 3000, Israel * SIAM Conerence on Imaging Science
More informationOPTIMAL PLACEMENT AND UTILIZATION OF PHASOR MEASUREMENTS FOR STATE ESTIMATION
OPTIMAL PLACEMENT AND UTILIZATION OF PHASOR MEASUREMENTS FOR STATE ESTIMATION Xu Bei, Yeo Jun Yoon and Ali Abur Teas A&M University College Station, Teas, U.S.A. abur@ee.tamu.edu Abstract This paper presents
More informationResidual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition
INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition Jaeyoung Kim 1, Mostafa El-Khamy 1, Jungwon Lee 1 1 Samsung Semiconductor,
More informationA Priori SNR Estimation Using Weibull Mixture Model
A Priori SNR Estimation Using Weibull Mixture Model 12. ITG Fachtagung Sprachkommunikation Aleksej Chinaev, Jens Heitkaemper, Reinhold Haeb-Umbach Department of Communications Engineering Paderborn University
More informationAcoustic Vector Sensor based Speech Source Separation with Mixed Gaussian-Laplacian Distributions
Acoustic Vector Sensor based Speech Source Separation with Mixed Gaussian-Laplacian Distributions Xiaoyi Chen, Atiyeh Alinaghi, Xionghu Zhong and Wenwu Wang Department of Acoustic Engineering, School of
More informationEXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION
EXPLOITING LSTM STRUCTURE IN DEEP NEURAL NETWORKS FOR SPEECH RECOGNITION Tianxing He Jasha Droppo Key Lab. o Shanghai Education Commission or Intelligent Interaction and Cognitive Engineering SpeechLab,
More informationModeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks
Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks Tara N Sainath, Bo Li Google, Inc New York, NY, USA {tsainath, boboli}@googlecom Abstract Various neural network
More informationSupplementary material for Continuous-action planning for discounted infinite-horizon nonlinear optimal control with Lipschitz values
Supplementary material or Continuous-action planning or discounted ininite-horizon nonlinear optimal control with Lipschitz values List o main notations x, X, u, U state, state space, action, action space,
More informationEstimation of Sample Reactivity Worth with Differential Operator Sampling Method
Progress in NUCLEAR SCIENCE and TECHNOLOGY, Vol. 2, pp.842-850 (2011) ARTICLE Estimation o Sample Reactivity Worth with Dierential Operator Sampling Method Yasunobu NAGAYA and Takamasa MORI Japan Atomic
More informationMULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES. Mingzi Li,Israel Cohen and Saman Mousazadeh
MULTISENSORY SPEECH ENHANCEMENT IN NOISY ENVIRONMENTS USING BONE-CONDUCTED AND AIR-CONDUCTED MICROPHONES Mingzi Li,Israel Cohen and Saman Mousazadeh Department of Electrical Engineering, Technion - Israel
More informationarxiv: v1 [cs.it] 12 Mar 2014
COMPRESSIVE SIGNAL PROCESSING WITH CIRCULANT SENSING MATRICES Diego Valsesia Enrico Magli Politecnico di Torino (Italy) Dipartimento di Elettronica e Telecomunicazioni arxiv:403.2835v [cs.it] 2 Mar 204
More informationDept. Electronics and Electrical Engineering, Keio University, Japan. NTT Communication Science Laboratories, NTT Corporation, Japan.
JOINT SEPARATION AND DEREVERBERATION OF REVERBERANT MIXTURES WITH DETERMINED MULTICHANNEL NON-NEGATIVE MATRIX FACTORIZATION Hideaki Kagami, Hirokazu Kameoka, Masahiro Yukawa Dept. Electronics and Electrical
More informationFeature-Space Structural MAPLR with Regression Tree-based Multiple Transformation Matrices for DNN
MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Feature-Space Structural MAPLR with Regression Tree-based Multiple Transformation Matrices for DNN Kanagawa, H.; Tachioka, Y.; Watanabe, S.;
More informationOBSERVER/KALMAN AND SUBSPACE IDENTIFICATION OF THE UBC BENCHMARK STRUCTURAL MODEL
OBSERVER/KALMAN AND SUBSPACE IDENTIFICATION OF THE UBC BENCHMARK STRUCTURAL MODEL Dionisio Bernal, Burcu Gunes Associate Proessor, Graduate Student Department o Civil and Environmental Engineering, 7 Snell
More informationIndependent Component Analysis and Unsupervised Learning
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien National Cheng Kung University TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent
More informationIndependent Component Analysis and Unsupervised Learning. Jen-Tzung Chien
Independent Component Analysis and Unsupervised Learning Jen-Tzung Chien TABLE OF CONTENTS 1. Independent Component Analysis 2. Case Study I: Speech Recognition Independent voices Nonparametric likelihood
More informationGMM-based classification from noisy features
GMM-based classification from noisy features Alexey Ozerov, Mathieu Lagrange and Emmanuel Vincent INRIA, Centre de Rennes - Bretagne Atlantique STMS Lab IRCAM - CNRS - UPMC alexey.ozerov@inria.fr, mathieu.lagrange@ircam.fr,
More informationTinySR. Peter Schmidt-Nielsen. August 27, 2014
TinySR Peter Schmidt-Nielsen August 27, 2014 Abstract TinySR is a light weight real-time small vocabulary speech recognizer written entirely in portable C. The library fits in a single file (plus header),
More informationSOUND-source enhancement has been studied for many. DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score
DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score Yuma Koizumi Member, IEEE,, Kenta Niwa Member, IEEE,, Yusuke Hioka Senior Member, IEEE, Kazunori Kobayashi and Yoichi Haneda
More informationEstimation of Human Emotions Using Thermal Facial Information
Estimation o Human Emotions Using Thermal acial Inormation Hung Nguyen 1, Kazunori Kotani 1, an Chen 1, Bac Le 2 1 Japan Advanced Institute o Science and Technology, 1-1 Asahidai, Nomi, Ishikawa, Japan
More informationON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT. Ashutosh Pandey 1 and Deliang Wang 1,2. {pandey.99, wang.5664,
ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT Ashutosh Pandey and Deliang Wang,2 Department of Computer Science and Engineering, The Ohio State University, USA 2 Center for Cognitive
More informationTemporal Modeling and Basic Speech Recognition
UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing
More informationSegmental Recurrent Neural Networks for End-to-end Speech Recognition
Segmental Recurrent Neural Networks for End-to-end Speech Recognition Liang Lu, Lingpeng Kong, Chris Dyer, Noah Smith and Steve Renals TTI-Chicago, UoE, CMU and UW 9 September 2016 Background A new wave
More informationIn many diverse fields physical data is collected or analysed as Fourier components.
1. Fourier Methods In many diverse ields physical data is collected or analysed as Fourier components. In this section we briely discuss the mathematics o Fourier series and Fourier transorms. 1. Fourier
More informationExploring the Relationship between Conic Affinity of NMF Dictionaries and Speech Enhancement Metrics
Interspeech 2018 2-6 September 2018, Hyderabad Exploring the Relationship between Conic Affinity of NMF Dictionaries and Speech Enhancement Metrics Pavlos Papadopoulos, Colin Vaz, Shrikanth Narayanan Signal
More informationThe Deutsch-Jozsa Problem: De-quantization and entanglement
The Deutsch-Jozsa Problem: De-quantization and entanglement Alastair A. Abbott Department o Computer Science University o Auckland, New Zealand May 31, 009 Abstract The Deustch-Jozsa problem is one o the
More informationConvolutive Transfer Function Generalized Sidelobe Canceler
IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL. XX, NO. Y, MONTH 8 Convolutive Transfer Function Generalized Sidelobe Canceler Ronen Talmon, Israel Cohen, Senior Member, IEEE, and Sharon
More informationImproving Reverberant VTS for Hands-free Robust Speech Recognition
Improving Reverberant VTS for Hands-free Robust Speech Recognition Y.-Q. Wang, M. J. F. Gales Cambridge University Engineering Department Trumpington St., Cambridge CB2 1PZ, U.K. {yw293, mjfg}@eng.cam.ac.uk
More informationPower Spectral Analysis of Elementary Cellular Automata
Power Spectral Analysis o Elementary Cellular Automata Shigeru Ninagawa Division o Inormation and Computer Science, Kanazawa Institute o Technology, 7- Ohgigaoka, Nonoichi, Ishikawa 92-850, Japan Spectral
More information(3) where the mixing vector is the Fourier transform of are the STFT coefficients of the sources I. INTRODUCTION
1830 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 7, SEPTEMBER 2010 Under-Determined Reverberant Audio Source Separation Using a Full-Rank Spatial Covariance Model Ngoc Q.
More informationMAXIMUM LIKELIHOOD BASED MULTI-CHANNEL ISOTROPIC REVERBERATION REDUCTION FOR HEARING AIDS
MAXIMUM LIKELIHOOD BASED MULTI-CHANNEL ISOTROPIC REVERBERATION REDUCTION FOR HEARING AIDS Adam Kuklasiński, Simon Doclo, Søren Holdt Jensen, Jesper Jensen, Oticon A/S, 765 Smørum, Denmark University of
More informationANALYSIS OF p-norm REGULARIZED SUBPROBLEM MINIMIZATION FOR SPARSE PHOTON-LIMITED IMAGE RECOVERY
ANALYSIS OF p-norm REGULARIZED SUBPROBLEM MINIMIZATION FOR SPARSE PHOTON-LIMITED IMAGE RECOVERY Aramayis Orkusyan, Lasith Adhikari, Joanna Valenzuela, and Roummel F. Marcia Department o Mathematics, Caliornia
More informationPhysics 5153 Classical Mechanics. Solution by Quadrature-1
October 14, 003 11:47:49 1 Introduction Physics 5153 Classical Mechanics Solution by Quadrature In the previous lectures, we have reduced the number o eective degrees o reedom that are needed to solve
More informationAn exploration of dropout with LSTMs
An exploration of out with LSTMs Gaofeng Cheng 1,3, Vijayaditya Peddinti 4,5, Daniel Povey 4,5, Vimal Manohar 4,5, Sanjeev Khudanpur 4,5,Yonghong Yan 1,2,3 1 Key Laboratory of Speech Acoustics and Content
More information3. Several Random Variables
. Several Random Variables. Two Random Variables. Conditional Probabilit--Revisited. Statistical Independence.4 Correlation between Random Variables. Densit unction o the Sum o Two Random Variables. Probabilit
More information