Efficient Target Activity Detection Based on Recurrent Neural Networks
1 Efficient Target Activity Detection Based on Recurrent Neural Networks D. Gerber, S. Meier, and W. Kellermann Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)
2-6 Motivation: Target source at direction φ_tar, one or more interferers, and background noise. Goal: Detect time frames m with a dominant target source, i.e., Target Activity Detection (TAD). 1 / 15
7-8 Motivation: Proposed method: Artificial neural networks (ANNs) [Meier and Kellermann (2016)] mapping a feature vector f through hidden layers to a decision. Questions: How to define the feature vector? What network topology? How to incorporate memory? 2 / 15
9 Outline: Motivation, Features for TAD, ANN-based feature combination, Experiments. 3 / 15
10-14 Features for TAD (1): Feature 1: Beamforming-based SNR estimate. Target source components are equalized accounting for measured HRTFs. A beamformer towards the target (add up the equalized mic signals) yields σ̂²_s, a nullsteering beamformer towards the target (subtract the equalized frontal mic signals) yields σ̂²_n. SNR estimate as feature: f_SNR(t) = σ̂²_s(t) / σ̂²_n(t). Feature vector so far: f_t = [f_SNR(t)]^T. 4 / 15
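Below is a minimal numpy sketch of how such a frame-wise SNR feature could be computed, assuming the HRTF-equalized microphone signals are already available as x_eq_1 and x_eq_2 (hypothetical names); the frame length, hop size, and the simple sum/difference beamformer and nullformer are illustrative assumptions, not the exact processing used in the talk.

```python
import numpy as np

def snr_feature(x_eq_1, x_eq_2, frame_len=512, hop=256, eps=1e-12):
    """Frame-wise beamforming-based SNR feature f_SNR(t).

    x_eq_1, x_eq_2: HRTF-equalized microphone signals (1-D arrays),
    so that the target components are aligned across channels.
    """
    beam = x_eq_1 + x_eq_2   # beamformer towards the target: coherent sum
    null = x_eq_1 - x_eq_2   # nullformer: cancels the (aligned) target
    n_frames = 1 + (len(beam) - frame_len) // hop
    f_snr = np.empty(n_frames)
    for t in range(n_frames):
        s = slice(t * hop, t * hop + frame_len)
        sigma2_s = np.var(beam[s])   # target-dominated power estimate
        sigma2_n = np.var(null[s])   # noise/interference power estimate
        f_snr[t] = sigma2_s / (sigma2_n + eps)
    return f_snr
```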
15-20 Features for TAD (2): Feature 2: Crosscorrelation ratio. [Figure: crosscorrelation r_13(k, t) over lags k ∈ [−K, +K] with a peak at the target TDOA k_T(t).] The target source creates a peak at the TDOA k_T(t). Ratio with the strongest peak at k ≠ k_T(t) as feature: f_corr(t) = r_13(k_T(t), t) / max_{k ≠ k_T(t)} r_13(k, t). Interpretation: power ratio between the target and the strongest interferer. Feature vector so far: f_t = [f_SNR(t), f_corr(t)]^T. 5 / 15
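A rough sketch of the crosscorrelation-ratio feature is shown below, assuming the target TDOA k_target (in samples, derived from the known target direction) is given; the lag range, frame parameters, and the exclusion window around the target lag are assumptions made for illustration.

```python
import numpy as np

def corr_ratio_feature(x1, x3, k_target, max_lag=32, frame_len=512, hop=256,
                       exclude=2, eps=1e-12):
    """Crosscorrelation ratio f_corr(t): peak of r_13 at the target TDOA
    divided by the strongest peak at other lags."""
    n_frames = 1 + (len(x1) - frame_len) // hop
    lags = np.arange(-max_lag, max_lag + 1)
    f_corr = np.empty(n_frames)
    for t in range(n_frames):
        s = slice(t * hop, t * hop + frame_len)
        a, b = x1[s], x3[s]
        r_full = np.correlate(a, b, mode="full")        # full crosscorrelation
        center = len(a) - 1                             # index of lag 0
        r = r_full[center - max_lag: center + max_lag + 1]
        peak_target = r[lags == k_target][0]
        mask = np.abs(lags - k_target) > exclude        # exclude region around k_T
        peak_other = np.max(r[mask])
        f_corr[t] = peak_target / (peak_other + eps)
    return f_corr
```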
21-24 Features for TAD (3): Feature 3: Adaptive differential beamformer. [Figure: beampattern with a null steered towards direction φ_diff.] The adaptive differential beamformer [Elko and Pong (1995)] steers a null towards the dominant sources. Direction φ_diff as feature: f_diff(t) = [cos(φ_diff(t)), sin(φ_diff(t))]^T. Feature vector so far: f_t = [f_SNR(t), f_corr(t), f_diff(t)^T]^T. 6 / 15
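The adaptive beamformer itself is not detailed on the slide; the sketch below uses a non-adaptive null scan over candidate directions as a stand-in for the adaptive differential beamformer of Elko and Pong (1995): for each candidate angle a delay-and-subtract null is formed, and the angle minimizing the output power is taken as φ_diff. The microphone spacing d, the scan grid, and the far-field delay model are assumptions.

```python
import numpy as np

def dominant_source_direction(x1, x2, fs=16000, d=0.15, c=343.0):
    """Null-scan stand-in for the adaptive differential beamformer: return the
    direction phi_diff of the currently dominant source (the angle whose
    delay-and-subtract null minimizes the output power) together with the
    feature encoding [cos(phi_diff), sin(phi_diff)]."""
    angles_deg = np.arange(0, 181, 5)                  # candidate null directions
    X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
    freqs = np.fft.rfftfreq(len(x1), d=1.0 / fs)
    powers = []
    for phi in np.deg2rad(angles_deg):
        tau = d * np.cos(phi) / c                      # far-field inter-mic delay
        # advance channel 2 by tau so a source from direction phi cancels
        Y = X1 - X2 * np.exp(2j * np.pi * freqs * tau)
        powers.append(np.sum(np.abs(Y) ** 2))
    phi_diff = np.deg2rad(angles_deg[int(np.argmin(powers))])
    return phi_diff, np.array([np.cos(phi_diff), np.sin(phi_diff)])
```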
25-26 Features for TAD (4): Feature 4: Microphone signal variances, to detect overall powers and unilateral scenarios: f_σ²(t) = [σ²_v1(t), σ²_v3(t)]^T. Feature 5: Target source DoA, complementing f_diff(t): f_φ(t) = [cos(φ_tar(t)), sin(φ_tar(t))]^T. Complete feature vector: f_t = [f_SNR(t), f_corr(t), f_diff(t)^T, f_σ²(t)^T, f_φ(t)^T]^T. 7 / 15
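Assembling the complete per-frame feature vector is then straightforward; the following sketch stacks the five features into the 8-dimensional f_t (function and argument names are hypothetical).

```python
import numpy as np

def build_feature_vector(f_snr, f_corr, phi_diff, var_v1, var_v3, phi_tar):
    """Assemble the per-frame TAD feature vector
    f_t = [f_SNR, f_corr, f_diff^T, f_sigma2^T, f_phi^T]^T (8-dimensional)."""
    f_diff = [np.cos(phi_diff), np.sin(phi_diff)]   # null direction of the
                                                    # adaptive differential beamformer
    f_sigma2 = [var_v1, var_v3]                     # left/right mic signal variances
    f_phi = [np.cos(phi_tar), np.sin(phi_tar)]      # known target DoA
    return np.array([f_snr, f_corr, *f_diff, *f_sigma2, *f_phi])
```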
27-32 ANN-based feature combination: Mapping of the feature vector f_t to a decision y_t. Baseline: Feedforward Neural Network (FNN) [Meier and Kellermann (2016)]. Subsequent decisions are dependent; how to incorporate memory? Options: sequential FNNs (stacked inputs f_t, f_{t-1}, f_{t-2}), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks (longer memory), Gated Recurrent Unit (GRU) networks (less complex). 8 / 15
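As a concrete illustration, a recurrent feature combination could look like the following PyTorch sketch (the original implementation used Theano/Lasagne, so this is only a stand-in, and the layer sizes are placeholders); swapping nn.GRU for nn.LSTM or nn.RNN gives the other recurrent variants from the slide.

```python
import torch
import torch.nn as nn

class GRUTad(nn.Module):
    """Recurrent feature combination for TAD: maps the feature sequence
    f_1..f_T (8-dim per frame) to per-frame target-activity probabilities y_t."""
    def __init__(self, n_features=8, n_hidden=16, n_layers=2):
        super().__init__()
        self.rnn = nn.GRU(n_features, n_hidden, n_layers, batch_first=True)
        self.out = nn.Linear(n_hidden, 1)

    def forward(self, f):                 # f: (batch, T, n_features)
        h, _ = self.rnn(f)                # h: (batch, T, n_hidden)
        return torch.sigmoid(self.out(h)).squeeze(-1)   # (batch, T)

# training sketch: binary cross-entropy against the oracle frame labels
model = GRUTad()
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```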
33 Experiments. Table: Investigated network types.
Feed-forward:
  FNN_α=0: non-smoothed features
  FNN_α=0.7: recursively smoothed features (α = 0.7)
  FNN_seq: sequential features
Recurrent:
  RNN: vanilla RNNs
  LSTM: Long Short-Term Memory
  GRU: Gated Recurrent Unit
9 / 15
34-36 Experiments. Data set consisting of 38 recordings: 1 target speaker, 1-4 interferers (same level as the target speaker), babble noise (SNR: 10 dB), various source positions, living room-like environment (T ms), recordings with hearing aids on a KEMAR head. Training set: 29 scenarios (20 s each), test set: 9 scenarios (10 s each). Sampling rate: 16 kHz. Ground truth: oracle SINR > 10 dB. Implementation in Python (Theano/Lasagne). Regularization: dropout layers for FNNs, synaptic noise for RNNs. Hardware: Intel i7-920 CPU, GeForce GTX 970 GPU. Number of layers L ∈ [1, 6], number of nodes per layer N ∈ [1, 32]; the best network topology is chosen for each network type individually. 10 / 15
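The oracle labels can be derived frame-wise from the separately available target and interference-plus-noise components; the sketch below illustrates this thresholding, with the frame parameters as assumptions.

```python
import numpy as np

def oracle_labels(target, interference_plus_noise, frame_len=512, hop=256,
                  threshold_db=10.0, eps=1e-12):
    """Frame-wise oracle TAD labels: 1 where the oracle SINR exceeds the
    threshold, 0 otherwise (requires the separated signal components)."""
    n_frames = 1 + (len(target) - frame_len) // hop
    labels = np.zeros(n_frames, dtype=int)
    for t in range(n_frames):
        s = slice(t * hop, t * hop + frame_len)
        sinr_db = 10 * np.log10((np.sum(target[s] ** 2) + eps) /
                                (np.sum(interference_plus_noise[s] ** 2) + eps))
        labels[t] = int(sinr_db > threshold_db)
    return labels
```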
37-45 Evaluation measures. Matthews Correlation Coefficient:
MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)),
with TP, TN, FP, FN the numbers of true positives, true negatives, false positives, and false negatives. Area Under Curve (AUC): area under the receiver operating characteristic (ROC) curve, i.e., TP rate over FP rate; a high TP rate at a low FP rate yields AUC close to 1. Perfect detection: MCC = 1, AUC = 1. Random detection: MCC = 0, AUC = 0.5. Total disagreement: MCC = −1, AUC = 0. 11 / 15
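Both measures are available off the shelf; a small sketch using scikit-learn (with made-up label and score arrays) illustrates how ACC, MCC, and AUC would be computed from the frame-wise decisions.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef, roc_auc_score, accuracy_score

# y_true: oracle frame labels, y_score: network outputs y_t in [0, 1]
y_true = np.array([0, 0, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6])
y_pred = (y_score > 0.5).astype(int)            # hard decision at threshold 0.5

print("ACC:", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))   # area under the ROC curve
```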
46-50 Results. Table: performance (ACC, AUC, MCC) and complexity (N, L, RTT) per network type: FNN_α=0, FNN_α=0.7, FNN_seq, RNN, LSTM, GRU. Legend: ACC: accuracy, (TP + TN) / (TP + TN + FP + FN); N: number of nodes per layer; L: number of layers; RTT: relative testing time. 12 / 15
51 Summary: ANN-based feature combination leads to good detection of target dominance intervals. Exploiting memory is beneficial for TAD. Vanilla RNNs outperform sequential FNNs even with smaller network depth. LSTMs and GRUs do not lead to significant improvements, i.e., no benefit from long-term memory. 13 / 15
52 Thank you for your attention! 14 / 15
53 References. Elko, G. W. and Pong, A.-T. N. (1995). A simple adaptive first-order differential microphone. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Meier, S. and Kellermann, W. (2016). Artificial neural network-based feature combination for spatial voice activity detection. In Proc. Annual Conf. of the Int. Speech Communication Assoc. (Interspeech), San Francisco, USA. 15 / 15
54 Appendix: Complete results. Performance (ACC, AUC, MCC) and complexity (N, L, P, RTT) per network type: FNN (nos), FNN (smo), FNN (seq), RNN, LSTM, GRU. B 1