
1 Shinji Watanabe, Mitsubishi Electric Research Laboratories (MERL); Xiong Xiao, Nanyang Technological University; Marc Delcroix, NTT Communication Science Laboratories


4 List of abbreviations
AM: Acoustic Model
ASR: Automatic Speech Recognition
BF: Beamformer
BLSTM: Bidirectional LSTM
CE: Cross Entropy
CMLLR: Constrained MLLR (equivalent to fMLLR)
CNN: Convolutional Neural Network
DAE: Denoising Autoencoder
DNN: Deep Neural Network
DOC: Damped Oscillator Coefficients
D&S: Delay and Sum (beamformer)
DSR: Distant Speech Recognition
EM: Expectation-Maximization
fDLR: Feature-space Discriminative Linear Regression
fMLLR: Feature-space MLLR (equivalent to CMLLR)
GCC-PHAT: Generalized Cross Correlation with Phase Transform
GMM: Gaussian Mixture Model
GSC: Generalized Sidelobe Canceller
HMM: Hidden Markov Model
IRM: Ideal Ratio Mask
KL: Kullback-Leibler (divergence/distance)
LCMV: Linearly Constrained Minimum Variance
LDA: Linear Discriminant Analysis
LHN: Linear Hidden Network
LHUC: Learning Hidden Unit Contribution
LIN: Linear Input Network
LM: Language Model
LP: Linear Prediction
LSTM: Long Short-Term Memory (network)
LVCSR: Large Vocabulary Continuous Speech Recognition
MAP: Maximum A Posteriori
MBR: Minimum Bayes Risk
MCWF: Multi-Channel Wiener Filter
ML: Maximum Likelihood
MLLR: Maximum Likelihood Linear Regression
MLLT: Maximum Likelihood Linear Transformation
MMeDuSA: Modulation of Medium Duration Speech Amplitudes
MMSE: Minimum Mean Square Error
MSE: Mean Square Error
MVDR: Minimum Variance Distortionless Response (beamformer)
NMF: Non-negative Matrix Factorization
PNCC: Power-Normalized Cepstral Coefficients
RNN: Recurrent Neural Network
SE: Speech Enhancement
sMBR: state-level Minimum Bayes Risk
SNR: Signal-to-Noise Ratio
SRP-PHAT: Steered Response Power with the PHAse Transform
STFT: Short Time Fourier Transform
TDNN: Time Delay Neural Network
TDOA: Time Difference Of Arrival
TF: Time-Frequency
VTLN: Vocal Tract Length Normalization
VTS: Vector Taylor Series
WER: Word Error Rate
WPE: Weighted Prediction Error (dereverberation)

5 Notations - Basic notation: a (scalar); a (vector); A (matrix); A (sequence). Signal processing: x_j[n], time-domain signal at microphone j and sample n; X_j(t, f), STFT coefficient of microphone j at frame t and frequency bin f; j ∈ {1, ..., J}, channel index; t ∈ {1, ..., T}, frame index; f ∈ {1, ..., F}, frequency bin index; h_j[n], time-domain room impulse response; H_j(t, f), STFT-domain room impulse response.

6 Notations - ASR: o_t ∈ R^D, D-dimensional speech feature vector at frame t; O = {o_t | t = 1, ..., T}, T-length sequence of speech features; w_n ∈ V, word at the n-th position in a vocabulary V; W = {w_n | n = 1, ..., N}, N-length word sequence; s_t, HMM state index at frame t; S = {s_t | t = 1, ..., T}, T-length sequence of HMM states. Model: θ, model parameters. PDF: N(μ, Σ), (real-valued) Gaussian distribution; N_c(μ, Σ), complex Gaussian distribution.

7 Notations - Operations: a*, complex conjugate; A^T, transpose; A^H, Hermitian transpose; trace(·), trace; P(·), principal eigenvector; a ⊙ b or A ⊙ B, elementwise multiplication; σ(·), sigmoid function; softmax(·), softmax function; tanh(·), tanh function.


9 1.1 Toward multi-microphone speech recognition

10 From pattern matching to probabilistic approaches: 50s-60s: initial attempts with template matching; recognition of digits or a few phonemes. 70s: recognition of 1000 words; first national projects (DARPA); introduction of beam search. 80s: introduction of probabilistic approaches (n-gram language models, GMM-HMM acoustic models); first attempts with neural networks; launch of initial dictation systems (Dragon Speech). (Juang 04)

11 From research labs to the outside world (Juang 04): 90s: discriminative training for acoustic models, MLLR adaptation, VTS; development of common toolkits (HTK). 2000s: fewer breakthrough technologies; new popular toolkits such as Kaldi; launch of large-scale applications (Google voice search). 2010s: introduction of DNNs and RNN-LMs; ASR used in more and more products (e.g., Siri).

12 Evolution of ASR performance (Pallett 03, Saon 15, Xiong 16): [chart of word error rate (WER) over time on the Switchboard task (telephone conversation speech), log scale from 100% down to 1%; deep learning brought the WER down to 6.3% (Microsoft)]

13 Impact of deep learning - Great performance improvement: DNNs are more robust to input variations and bring improvements for all tasks, including Large Vocabulary Continuous Speech Recognition (LVCSR) and Distant Speech Recognition (DSR). Robustness is still an issue (Seltzer 14, Delcroix 13): speech enhancement/adaptation (microphone arrays, fMLLR, ...) still improve performance. Reshuffling the cards: some technologies relying on GMMs became obsolete (VTS, MLLR, ...); some technologies became less effective (VTLN, single-channel speech enhancement, ...); new opportunities: exploring long-context information for recognition/enhancement, front-end/back-end joint optimization, ...

14 Towards distant ASR (DSR) Close-talking microphone e.g., voice search Distant microphone e.g., Human-human comm., Human-robot comm

15 Interest for DSR - Industry: home assistants, game consoles, robots, voice-controlled appliances

16 Interest for DSR - Academia ICSI meeting RATS A lot of community-driven research projects 17 17

17 CHiME 1, 2 (Barker 13, Vincent 13): distant speech recognition in a living room. Acoustic conditions: simulated distant speech; SNR: -6 dB to 9 dB; # mics: 2 (stereo). CHiME 1: commands (Grid corpus) + noise (living room). CHiME 2 (WSJ): WSJ (5k) + noise (living room).

18 CHiME 3, 4 (Barker 15): noisy speech recognition using a tablet. Recording conditions: noise types: bus, cafe, street, pedestrian; # mics: 6 (CHiME 3); 1, 2, 6 (CHiME 4); simulated and real recordings. Speech: read speech (WSJ, 5k).

19 REVERB (Kinoshita 13, Lincoln 05) Reverberant speech recognition Recording conditions Reverberation (RT 0.2 to 0.7 s.) Noise type: stationary noise (SNR ~20dB) # mics: 1, 2, 8 Simulated and real recordings (MC-WSJ-AV) Speech Read speech (WSJ CAM0 (5k))

20 AMI (Carletta 05) Meeting recognition corpus Recording conditions Multi-speaker conversations Reverberant rooms # mics: 8 Real recordings Speech Spontaneous meetings (8k)

21 ASpIRE (Harper 15) Large vocabulary reverberant speech Recording conditions Reverberant speech 7 different rooms (classrooms and office rooms) with various shapes, sizes, surface properties, and noise sources # mics: 1 or 6 Speech Training data: Fisher corpus (2000 h of telephone speech)

22 Summary of tasks (vocab size; amount of training data; real/simu; type of distortions; # mics; mic-speaker distance; ground truth):
- ASpIRE: 100K; ~2000 h; Real; reverberation; 8/1 mics; distance N/A; ground truth N/A
- AMI: 11K; 75 h; Real; multi-speaker conversations, reverberation and noise; 8 mics; distance N/A; headset
- CHiME 1: command vocabulary (Grid); …,000 utt.; Simu; non-stationary noise recorded in a living room (SNR -6 dB to 9 dB), reverberation from recorded impulse responses; 2 mics; 2 m; clean
- CHiME 2 (WSJ): 5K; 7138 utt. (~15 h); Simu; same as CHiME 1; 2 mics; 2 m; clean
- CHiME 3: 5K; 8738 utt. (~18 h); Simu + Real; non-stationary noise in 4 environments; 6 mics; 0.5 m; close-talk mic.
- CHiME 4: 5K; 8738 utt. (~18 h); Simu + Real; non-stationary noise in 4 environments; 6/2/1 mics; 0.5 m; close-talk mic.
- REVERB: 5K; 7861 utt. (~15 h); Simu + Real; reverberation in different living rooms (RT60 from 0.25 to 0.7 s) + stationary noise (SNR ~20 dB); 8/2/1 mics; 0.5 m-2 m; clean/headset


24 Content of the tutorial - Single → multi-microphone speech recognition: X(t, f) → feature extraction → o_t, O → recognizer (acoustic model, lexicon, language model, model adaptation) → W ("My name is ..."). Multichannel speech enhancement front-end (Section 2): dereverberation, noise reduction. (*We say "multichannel" rather than "multi-microphone", following signal processing convention.)

25 Content of the tutorial Single Multi-microphone speech recognition Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multichannel speech enhancement front-end (Section 2): - Dereverberation - Noise reduction * We call multichannel instead of multi-microphone by following a convention of signal processing 27 27

26 Content of the tutorial Multi-microphone speech recognition Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Joint training paradigm (Section 3): - Unify speech enhancement and speech recognition 28 28

27 Single-microphone speech recognition (Conventional) X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model 29 29

28 1.2 Introduction to single-microphone speech recognition

29 Towards distant ASR (DSR) 31 31

30 Single-microphone speech recognition (Conventional) X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model 32 32

31 Speech recognition: feature extraction → recognizer → "That's another story". X = {x[n] | n = 1, ..., N}; O = {o_t | t = 1, ..., T}; W = {w_u ∈ V | u = 1, ..., U}. Feature extraction (deterministic): O = f(X). Bayes decision theory (MAP): Ŵ = argmax_W p(W|O), where p(W|O) is the posterior distribution. Find the most probable word sequence W given the feature sequence O.

32 Acoustic, lexicon, and language models: feature extraction → O → recognizer (acoustic model, lexicon, language model) → W ("That's another story"). Use a subword sequence S = {s_t | t = 1, ..., T} and factorize p(W|O): Ŵ = argmax_W p(W|O) = argmax_W Σ_S p(O|S) p(S|W) p(W). p(O|S): acoustic model; p(S|W): lexicon model (obtained from a hand-crafted pronunciation dictionary); p(W): language model. s_t: subword unit, composed of phonemes (/a/, /k/, ...) or HMM states (/a_1/, ...).

33 Feature extraction - Log mel filterbank extraction: converts a speech signal into a sequence of speech features more suited to ASR, typically log mel filterbank coefficients. x[n] (waveform) → feature extraction → O = {o_t | t = 1, ..., T}. Log mel feature pipeline: STFT → |·|² → mel filtering → log(·).
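To make the pipeline concrete, here is a minimal NumPy sketch of log mel filterbank extraction (frame length, hop, FFT size, and number of mel bins are illustrative assumptions, not values from the tutorial):

```python
import numpy as np

def log_mel_features(x, fs=16000, frame_len=400, hop=160, n_fft=512, n_mels=40):
    """STFT -> |.|^2 -> mel filtering -> log, as in the slide above."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[t*hop:t*hop+frame_len] * win for t in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2            # (T, n_fft//2+1), phase discarded
    # Triangular mel filterbank
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fbank.T + 1e-10)                      # (T, n_mels) log mel features

o = log_mel_features(np.random.randn(16000))   # 1 s of dummy audio -> (98, 40)
```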

34 Acoustic model - Hidden Markov model (per phoneme): factorize the probability distribution into a product over t with conditional independence assumptions (1st-order Markov assumption and p(O|s_t) ≈ p(o_t|s_t)): p(O|S) ≈ Π_{t=1}^{T} p(o_t|s_t) p(s_t|s_{t-1}). p(o_t|s_t): likelihood function; p(s_t|s_{t-1}): state transition probability. The likelihood p(o_t|s_t = k) is computed by a Gaussian mixture model: p(o_t|s_t = k) = Σ_{l=1}^{L} ω_kl N(o_t; μ_kl, Σ_kl), where N(o; μ, Σ) is a Gaussian distribution with mean vector μ and covariance matrix Σ, and ω_kl is a mixture weight. Parameters are estimated with the Expectation-Maximization (EM) algorithm. (A deep neural network can be used instead of the GMM, as discussed later.)
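A small NumPy sketch of the GMM emission likelihood p(o_t | s_t = k) with diagonal covariances (the dimensions and parameter values below are toy assumptions):

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log p(o_t | s_t = k) = log sum_l w_kl N(o_t; mu_kl, Sigma_kl), diagonal covariances."""
    log_det = np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (L,)
    maha = np.sum((o - means) ** 2 / variances, axis=1)          # (L,) Mahalanobis terms
    log_comp = np.log(weights) - 0.5 * (log_det + maha)          # per-component log w_l N(.)
    m = log_comp.max()
    return m + np.log(np.exp(log_comp - m).sum())                # log-sum-exp for stability

# Toy state with a 2-component GMM over 3-dimensional features
print(gmm_log_likelihood(np.zeros(3), np.array([0.4, 0.6]),
                         np.zeros((2, 3)), np.ones((2, 3))))
```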

35 Expectation-Maximization (EM) algorithm - Maximum likelihood criterion: Θ̂ = argmax_Θ p(O; Θ) = argmax_Θ Σ_S p(O, S; Θ); hard to optimize directly due to the summation over all possible S. Auxiliary function: Q(Θ, Θ̂) = Σ_S p(S|O; Θ̂) log p(O, S; Θ), with the update Θ̂ ← argmax_Θ Q(Θ, Θ̂). p(S|O; Θ̂): posterior distribution of the latent variable S obtained with the previously estimated parameters Θ̂. The update rule is applied iteratively, i.e., Θ̂^(τ=1) → ... → Θ̂^(τ) → Θ̂^(τ+1), and always satisfies p(O; Θ̂^(τ)) ≤ p(O; Θ̂^(τ+1)); it reaches a locally optimal parameter set. Viterbi approximation: p(S|O; Θ̂) ≈ δ(S, Ŝ), where Ŝ = argmax_S p(S|O; Θ̂). Examples: 1. GMM: Q(Θ, Θ̂) = Σ_{l,t} γ_{l,t} log ω_l N(o_t; μ_l, Σ_l), where γ_{l,t} = p(v_t = l | o_t; Θ̂). 2. HMM: Q(Θ, Θ̂) = Σ_{k,l,t} γ_{k,l,t} log ω_kl N(o_t; μ_kl, Σ_kl) + Σ_{k,k',t} γ_{k,k',t} log a_{k,k'}, where γ_{k,l,t} = p(s_t = k, v_t = l | o_t; Θ̂) and γ_{k,k',t} = p(s_{t-1} = k, s_t = k' | O; Θ̂). Note that these expressions avoid the explicit summation over all possible S.

36 Deep neural network for the acoustic model - Hidden Markov model: p(O|S) ≈ Π_{t=1}^{T} p(o_t|s_t) p(s_t|s_{t-1}). The likelihood function is computed with the pseudo-likelihood trick: p(o_t|s_t) = p(s_t|o_t) p(o_t) / p(s_t) ∝ p(s_t|o_t) / p(s_t), where p(s_t|o_t) is the posterior distribution of HMM state s_t given feature o_t, and p(s_t) is the prior distribution of s_t. We can easily build a frame-level classifier for p(s_t|o_t) when parallel data {o_t, s_t} are available, e.g., a log-linear model or a deep neural network (DNN). DNN: a very powerful classifier that exploits correlations and non-linearities of the input feature vectors (discussed later). With the pseudo-likelihood trick, we can combine a frame-level DNN classifier with a hidden Markov model to obtain the sequence likelihood p(O|S).
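A minimal sketch of the pseudo-likelihood trick, converting DNN posteriors into scaled emission likelihoods (estimating the priors from alignment counts is an assumption of typical practice):

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors, floor=1e-8):
    """p(o_t|s_t) ∝ p(s_t|o_t) / p(s_t): subtract log priors from log posteriors.

    log_posteriors: (T, K) log p(s_t = k | o_t) from the DNN softmax
    state_priors:   (K,)  p(s_t = k), e.g. relative state frequencies in the training alignments
    """
    return log_posteriors - np.log(np.maximum(state_priors, floor))

post = np.log(np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]))   # T=2 frames, K=3 states
print(scaled_log_likelihoods(post, np.array([0.5, 0.3, 0.2])))
```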

37 Basics of deep neural networks - Output layer (l = L): p(s_t = k|o_t) = h^L_{t,k} = [softmax(a^L_t)]_k. Hidden layers (l): a^l_t = W^l h^{l-1}_t + b^l, h^l_t = σ(a^l_t). Activation functions σ: sigmoid, ReLU, ... Trained using error back-propagation. Training criteria: cross entropy, MMSE, state-level MBR, ...
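A minimal NumPy sketch of the forward pass described above (layer sizes and random weights are toy assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def dnn_forward(o_t, weights, biases):
    """h^l = sigma(W^l h^{l-1} + b^l); softmax at the output layer -> p(s_t = k | o_t)."""
    activations = [o_t]
    for l, (W, b) in enumerate(zip(weights, biases)):
        a = W @ activations[-1] + b
        activations.append(softmax(a) if l == len(weights) - 1 else sigmoid(a))
    return activations

# Toy network: 40-dim input, two 64-unit hidden layers, 10 output states
rng = np.random.default_rng(0)
dims = [40, 64, 64, 10]
Ws = [0.1 * rng.standard_normal((dims[i + 1], dims[i])) for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
posteriors = dnn_forward(rng.standard_normal(40), Ws, bs)[-1]   # sums to ~1
```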

38 DNN-based acoustic model (Hinton 12, Mohamed 12): ~7 hidden layers with 2048 units; output layer: HMM states (1,000-10,000 units); input: speech features (log mel filterbank + 11 context frames). Minimize the cross entropy J(θ) = -Σ_t Σ_k τ_{t,k} log h^L_{t,k}(θ), where τ_{t,k} is the target label, h^L_{t,k} is the network output, and θ = {W^l, b^l}_{l=1}^{L} are the network parameters. Use a large amount of speech training data with the associated HMM state alignments. Optimization by gradient descent: θ̂ = argmin_θ J(θ), θ ← θ - η ∂J(θ)/∂θ, with learning rate η. The gradient is computed with error back-propagation.

39-41 DNN training with error back-propagation (input: speech features; output: HMM states; label τ; output h^(L)): Forward: propagate the features through the network to obtain h^(L). Compute the error signal at the output (softmax + cross entropy): δ^L = h^L - τ. Back-propagate the error through the sigmoid layers: δ^(l-1) = ((W^l)^T δ^l) ⊙ h^(l-1) ⊙ (1 - h^(l-1)). Compute the gradients: ∂J(θ)/∂W^l = δ^l (h^(l-1))^T, ∂J(θ)/∂b^l = δ^l. Update the parameters: W^l ← W^l - η ∂J(θ)/∂W^l, b^l ← b^l - η ∂J(θ)/∂b^l.
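A minimal sketch of the gradient computation just described, matching the forward-pass sketch above (the sign convention assumes the negated cross entropy, so parameters are updated by subtracting the gradients):

```python
import numpy as np

def backprop_gradients(activations, weights, target):
    """Cross-entropy gradients for a sigmoid-hidden / softmax-output DNN.

    activations: [h^0 (input), h^1, ..., h^L] from a forward pass
    weights:     [W^1, ..., W^L]
    target:      one-hot label tau_t
    """
    L = len(weights)
    grads_W, grads_b = [None] * L, [None] * L
    delta = activations[-1] - target                  # output error signal: delta^L = h^L - tau
    for l in reversed(range(L)):
        grads_W[l] = np.outer(delta, activations[l])  # dJ/dW^l = delta^l (h^{l-1})^T
        grads_b[l] = delta                            # dJ/db^l = delta^l
        if l > 0:
            h = activations[l]                        # back-propagate through the sigmoid below
            delta = (weights[l].T @ delta) * h * (1.0 - h)
    return grads_W, grads_b

# Usage with the toy forward-pass sketch above (one gradient-descent step):
# gW, gb = backprop_gradients(acts, Ws, one_hot)
# Ws = [W - 0.1 * g for W, g in zip(Ws, gW)]; bs = [b - 0.1 * g for b, g in zip(bs, gb)]
```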

42 Language model - N-gram language model: p(W = w_1, ..., w_U) = Π_{u=1}^{U} p(w_u|w_{1:u-1}) ≈ Π_{u=1}^{U} p(w_u|w_{u-N+1:u-1}) ((N-1)-order Markov assumption). Often encounters the zero-occupancy problem → Kneser-Ney or other smoothing techniques. Easily combined with a decoder. Recurrent neural network language model (Mikolov 10): p(W) = Π_{u=1}^{U} p(w_u = w|w_{1:u-1}) = Π_{u=1}^{U} [softmax(a^L_u(w_{1:u-1}))]_{w_u}; no Markov assumption; high performance (RNN or LSTM); difficult to combine with the decoder, so N-best rescoring is used instead.

43 Acoustic, lexicon, and language models: feature extraction → O → recognizer → W ("That's another story"). Use a subword sequence S = {s_t | t = 1, ..., T} and factorize p(W|O): Ŵ = argmax_W p(W|O) = argmax_W Σ_S p(O|S) p(S|W) p(W). p(O|S): HMM-GMM or HMM-DNN hybrid; p(S|W): lexicon model; p(W): N-gram or recurrent neural network language model. Search for the most probable word sequence Ŵ with a decoder.

44 1.3 Review of state-of-the-art single channel ASR 46 46

45 Challenges of DSR Interfering speaker Reverberation Distant mic Background noise 47 47

46 Recent achievements - REVERB 2014 (WER): baseline (GMM) 48.9%; robust back-end 22.2%; multi-mic front-end + robust back-end 9.0%. CHiME (WER): baseline (DNN) 11.5%; robust back-end 7.5%; multi-mic front-end + robust back-end 3.0%.

47 Robust back-end (Single channel speech recognition) Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model - Feature extraction - Acoustic model - Model adaptation 49 49

48 Y j (t, f) SE front-end X(t, f) Feature extraction o t O Feature extraction Recognizer W My name is Model adaptation Acoustic model Lexicon Language model 50 50

49 Feature extraction - Log mel filterbank: y[n] → STFT → |·|² → mel filtering → log(·) → o_t → CMVN → ō_t. Spectrum analysis; power extraction (disregard phase); emphasize low-frequency power following perceptual knowledge (mel scale); dynamic range control; Cepstral Mean and Variance Normalization (CMVN). ETSI advanced front-end (ETSI 07): y[n] → STFT → PSD estimation → Wiener filter design → mel filtering → mel IDCT → apply filter → ŷ[n], with VAD; developed in the Aurora project; time-domain Wiener-filtering (WF) based noise reduction.
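A short sketch of utterance-level CMVN as used at the end of the log mel pipeline:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Mean and variance normalization of a (T, D) feature matrix, per dimension."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True)
    return (features - mu) / (sigma + eps)
```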

50 Gammatone filtering based features - Human auditory system motivated filters. Power-Normalized Cepstral Coefficients (PNCC) (Kim 12): STFT → |·|² → gammatone filtering (frequency domain) → power bias subtraction → (·)^0.1; replace the log with a power nonlinearity (exponent 0.1), frequency-domain gammatone filtering, medium-duration power bias subtraction. Time-domain gammatone filtering (e.g., Schulter 09, Mitra 14) can be combined with amplitude-modulation based features. Gammatone filtering and amplitude modulation based features (Damped Oscillator Coefficients (DOC), Modulation of Medium Duration Speech Amplitudes (MMeDuSA)) showed significant improvements on the CHiME 3 task. [Table: CHiME 3 real eval WER (MVDR enhanced signal) for MFCC, DOC, MMeDuSA (Hori 15).]

51 (Linear) feature transformation - Linear Discriminant Analysis (LDA): concatenate contiguous features, x_t = [o_{t-L}^T, ..., o_t^T, ..., o_{t+L}^T]^T, and o_t^LDA = A^LDA x_t; estimate a transformation that reduces the dimension by discriminant analysis; captures long-term dependencies. Semi-Tied Covariance (STC) / Maximum Likelihood Linear Transformation (MLLT): N(o_t; μ_kl, Σ_kl^diag) → N(o_t; μ_kl, Σ_kl^full), with the relationship Σ_kl^full = A^STC Σ_kl^diag (A^STC)^T; estimate A^STC by maximum likelihood. During recognition we can evaluate the likelihood with a diagonal covariance and a feature transformation: N(o_t^STC; A^STC μ_kl, Σ_kl^diag), where o_t^STC = A^STC o_t.

52 (Linear) feature transformation, cont'd - Feature-space Maximum Likelihood Linear Regression (fMLLR): affine transformation ô_t = A^fm o_t + b^fm; estimate the transformation parameters A^fm and b^fm by maximum likelihood: Q(A^fm, b^fm) = Σ_{k,t,l} γ_{t,k,l} (log|A^fm| + log N(A^fm o_t + b^fm; μ_kl, Σ_kl)). LDA, STC, and fMLLR are combined in cascade, i.e., ô_t = A^fm(A^STC(A^LDA [o_{t-L}^T, ..., o_t^T, ..., o_{t+L}^T]^T)) + b^fm. Effect of feature transformation in distant-ASR scenarios (GMM; MFCC + Δ + ΔΔ vs. LDA, STC, fMLLR; CHiME, REVERB) (Tachioka 13, 14): LDA, STC, and fMLLR used in cascade yield significant improvements. All are based on GMM-HMM, but are still applicable to DNNs as feature extraction; MFCCs are more appropriate than filterbank features here, as MFCCs match the GMM.
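A minimal sketch of how the cascaded LDA / STC / fMLLR transforms are applied at test time (splicing width, dimensions, and the identity matrices are toy assumptions; in practice the matrices come from the corresponding estimation procedures):

```python
import numpy as np

def splice(o, L=4):
    """Concatenate +/- L context frames: (T, D) -> (T, (2L+1)D), edges padded by repetition."""
    T, D = o.shape
    padded = np.vstack([np.repeat(o[:1], L, axis=0), o, np.repeat(o[-1:], L, axis=0)])
    return np.hstack([padded[t:t + T] for t in range(2 * L + 1)])

def cascaded_transforms(o, A_lda, A_stc, A_fm, b_fm, L=4):
    """o'_t = A_fm (A_stc (A_lda [o_{t-L}, ..., o_{t+L}])) + b_fm."""
    x = splice(o, L) @ A_lda.T        # LDA: long context -> reduced dimension
    x = x @ A_stc.T                   # STC/MLLT: decorrelation for diagonal-covariance GMMs
    return x @ A_fm.T + b_fm          # fMLLR: per-speaker affine transform

T, D, L, Dout = 100, 13, 4, 40
o = np.random.randn(T, D)
A_lda = np.random.randn(Dout, (2 * L + 1) * D)
print(cascaded_transforms(o, A_lda, np.eye(Dout), np.eye(Dout), np.zeros(Dout), L).shape)
```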

53 1.3.2 Robust acoustic models Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model 55 55

54 DNN acoustic model: non-linear transformation of (long) context features obtained by concatenating contiguous frames, o_t = [x_{t-L}, ..., x_t, ..., x_{t+L}]; very powerful for noise-robust ASR; long context (usually 11 frames). Cross entropy criterion: J_CE(θ) = -Σ_t Σ_k τ_{t,k} log h^L_{t,k}(θ). There are several other criteria.

55 Sequence discriminative criterion: compute sequence-level errors between the reference W_u = "My name was ..." and the hypothesis W = "My name is ..." produced by an LVCSR decoder. J_seq(θ) = Σ_u Σ_W E(W, W_u) p(W|O_u), where E(W, W_u) is a sequence error between reference W_u and hypothesis W; state-level Minimum Bayes Risk (sMBR). Input: o_t = [x_{t-L}, ..., x_t, ..., x_{t+L}]. [Table: CHiME 3 baseline v2 WER for GMM, DNN CE, DNN sMBR.]

56 Multi-task objectives (Giri 15): use both MMSE and CE criteria, with the clean speech X as the enhancement target and the transcription T as the recognition target; the network has two output layers, h^{CE,L}_{t,k} and h^{MSE,L}_{t,k}. The network tries to solve both enhancement and recognition; ρ controls the balance between the two criteria. This will be used to regularize a particular layer during joint training in Section 3. [Table: REVERB RealData WER for CE vs. multi-task training.]

57 Toward further long context Time Delayed Neural Network (TDNN) Convolutional Neural Network (CNN) Recurrent Neural Network (RNN) Long Short-Term Memory (LSTM) 59 59

58 Time delayed neural network (TDNN) (Waibel 89, Peddinti 15): deals with very long context (e.g., 17 frames, x_{t-8}, ..., x_t, ..., x_{t+8}). Difficult to train the first-layer matrix directly on such long context due to vanishing gradients.

59 Time delayed neural network (TDNN) (Waibel 89, Peddinti 15) - Original TDNN: consider a short context (e.g., [-2, 2]) at each layer, but expand the context layer by layer: h_t^1 = σ(A^1 [x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2}] + b^1); h_t^2 = σ(A^2 [h^1_{t-2}, h^1_{t-1}, h^1_t, h^1_{t+1}, h^1_{t+2}] + b^2); h_t^3 = ... Very large computational cost.

60 Time delayed neural network (TDNN) (Waibel 89, Peddinti 15) - Original TDNN: consider a short context (e.g., [-2, 2]), but expand the context at each layer: h_t^1 = σ(A^1 [x_{t-2}, x_{t-1}, x_t, x_{t+1}, x_{t+2}] + b^1); h_t^2 = σ(A^2 [h^1_{t-2}, h^1_{t-1}, h^1_t, h^1_{t+1}, h^1_{t+2}] + b^2); h_t^3 = ...; very large computational cost. Subsampled TDNN (Peddinti 15): subsample frames in the context expansion, e.g., h_t^2 = σ(A^2 [h^1_{t-2}, h^1_{t+2}] + b^2); efficiently computes a long-context network. [Table: WER for DNN vs. TDNN on ASpIRE and AMI.]
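A minimal NumPy sketch of a subsampled TDNN stack, where each layer splices its input at a few time offsets only (layer sizes and offsets are illustrative assumptions):

```python
import numpy as np

def tdnn_layer(h, offsets, W, b):
    """Splice frames at the given offsets, then affine transform + ReLU.

    h: (T, D_in); offsets: e.g. [-2,-1,0,1,2] for the first layer, [-2,2] when subsampled;
    W: (D_out, len(offsets)*D_in).
    """
    T, D = h.shape
    idx = np.clip(np.arange(T)[:, None] + np.array(offsets)[None, :], 0, T - 1)
    spliced = h[idx].reshape(T, -1)
    return np.maximum(spliced @ W.T + b, 0.0)

rng = np.random.default_rng(0)
h, D = rng.standard_normal((200, 40)), 40
for ctx in [[-2, -1, 0, 1, 2], [-2, 2], [-3, 3]]:     # effective input context grows to ~[-7, +7]
    W = 0.05 * rng.standard_normal((256, len(ctx) * D))
    h, D = tdnn_layer(h, ctx, W, np.zeros(256)), 256
print(h.shape)   # (200, 256)
```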

61 Convolutional Neural Network (CNN): represents the input as a time-frequency feature map o_{t,p,q} (we can also use multiple maps, e.g., one each for static, delta and delta-delta features), where p and q are indexes along the time and frequency axes of the feature map. Input map o_t → filters (w^(m)_{p,q}, b^(m)) → output map → pooling map: a^(m)_{t,p,q} = Σ_{p',q'} w^(m)_{p',q'} o_{t,p+p',q+q'} + b^(m); h^(m)_{t,p,q} = σ(a^(m)_{t,p,q}); h^(m)_{t,q} = max_p h^(m)_{t,p,q} (max pooling). Time-dimensional feature maps can capture long-context information. REVERB: 23.5 (DNN) → 22.4 (CNN-DNN) (Yoshioka 15a).
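A small sketch of one convolution + max-pooling stage on a time-frequency map (filter sizes are toy assumptions; pooling is done along the frequency axis here, a common choice for ASR CNNs):

```python
import numpy as np

def conv_maxpool(o, filters, biases, pool=3):
    """o: (T, Q) time-frequency map; filters: (M, P, Qf). Returns pooled maps (M, T', Q'')."""
    M, P, Qf = filters.shape
    T, Q = o.shape
    Tp, Qp = T - P + 1, Q - Qf + 1
    out = np.zeros((M, Tp, Qp))
    for m in range(M):                                   # valid 2-D convolution (correlation form)
        for t in range(Tp):
            for q in range(Qp):
                out[m, t, q] = np.sum(o[t:t+P, q:q+Qf] * filters[m]) + biases[m]
    out = np.maximum(out, 0.0)                           # ReLU
    Qpool = Qp // pool                                   # max pooling along frequency
    return out[:, :, :Qpool * pool].reshape(M, Tp, Qpool, pool).max(axis=-1)

maps = conv_maxpool(np.random.randn(11, 40), 0.1 * np.random.randn(8, 3, 8), np.zeros(8))
print(maps.shape)   # (8, 9, 11)
```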

62 RNN/LSTM acoustic model: an RNN can also capture the long-term distortion effects due to reverberation and noise; RNNs/LSTMs can be applied as acoustic models for noise-robust ASR (Weng 14, Weninger 14). Input: noisy speech features o_{t-1}, o_t, ... processed by an RNN or LSTM; output: HMM states. [Table: WER for DNN vs. RNN on Aurora and CHiME.]

63 Practical issues 65 65

64 The importance of the alignments - DNN CE training needs frame-level labels τ_{t,k} obtained by the Viterbi algorithm: J_CE(θ) = -Σ_t Σ_k τ_{t,k} log h^L_{t,k}. However, it is very difficult to obtain precise labels τ_{t,k} for noisy speech (e.g., silence frames are often mislabeled). How to deal with this issue? Re-align after retraining the DNN several times; sequence discriminative training can mitigate the issue (however, since the CE model is used as the initial model, it is difficult to fully recover from the degradation); use parallel clean data for alignment if available. [Table: CHiME WER with noisy vs. clean alignments (Weng 14).]

65-66 Degradation due to enhanced features: Y_j(t, f) → SE front-end → X̂(t, f) → feature extraction → o_t^enh; Y_j(t, f) → feature extraction → o_t^noisy. Which features should we use for training the acoustic model? Noisy features: o_t^noisy = FE(Y); enhanced features: o_t^enh = FE(X̂). [Table: CHiME 3 real eval WER for the training/testing combinations noisy/noisy, noisy/enhanced, enhanced/enhanced.] Re-training with enhanced features degrades the ASR performance!! Training on noisy data appears robust to distorted speech (?)

67 Remarks: noise-robust features and linear feature transformations are effective, for both GMM and DNN acoustic modeling. Deep learning is effective for noise-robust ASR; DNNs with sequence discriminative training are still powerful; RNNs, TDNNs, and CNNs can capture the long-term dependencies of speech and are more effective when dealing with reverberation and complex noise. We can basically use standard acoustic modeling techniques even in distant-ASR scenarios; however, special care is needed for the alignments and for re-training with enhanced features.

68 1.3.3 Acoustic model adaptation Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model 70 70

69 Importance of acoustic model adaptation: [chart of WER (%) without and with adaptation on REVERB (Delcroix 15a) and CHiME 3 (Yoshioka 15b)]

70 Acoustic model adaptation DNN is very powerful so why do we need adaptation? Acoustic model Adapt Acoustic model Training data Various speakers & environments Testing data Specific speaker & environment Unseen test condition due to limited amount of training data Model trained on large amount of data may be good on average but not optimal for a specific condition 72 72

71 Supervised/Unsupervised adaptation Supervised adaptation We know what was spoken There are transcriptions associated with adaptation data Unsupervised adaptation We do not know what was spoken There are no transcriptions 73 73


73 DNN adaptation techniques Model adaptation Retraining Linear transformation of input or hidden layers (fdlr, LIN, LHN, LHUC) Adaptive training (Cluster/Speaker adaptive training) Auxiliary features Auxiliary features Noise aware training Speaker aware training Context adaptive DNN 75 75


75 Unsupervised label estimation (1st pass): decode the adaptation data with an existing ASR system and obtain estimated labels τ̂_{t,k}. Adaptation speech data y_j[n] → speech enhancement → x̂[n] → feature extraction → o_t, O → recognizer (acoustic model, lexicon, language model) → τ̂_{t,k}.

76 Retraining (Liao 13): retrain/adapt the acoustic model parameters given the estimated labels, using error back-propagation. Prevent the model from changing too much: small learning rate, small number of epochs (early stopping), regularization (e.g., L2 prior norm (Liao 13), KL divergence (Yu 13)). For large amounts of adaptation data, retrain all or part of the DNN (e.g., the lower layers).
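A short sketch of KL-regularized adaptation targets in the spirit of (Yu 13): the adaptation labels are interpolated with the posteriors of the unadapted model so that retraining cannot move the model too far (the interpolation weight below is an assumption):

```python
import numpy as np

def kl_regularized_targets(hard_labels, si_posteriors, rho=0.5):
    """hard_labels, si_posteriors: (T, K). Returns soft targets for CE retraining:
    (1 - rho) * adaptation labels + rho * posteriors of the original (unadapted) DNN."""
    return (1.0 - rho) * hard_labels + rho * si_posteriors
```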

77 Linear input network (LIN) (Neto 95): add a linear layer that transforms the input features, ô_t = A o_t + b (no activation); learn the transform with error back-propagation.

78 Linear hidden network (LHN) (Gemello 06): insert a linear transformation layer inside the network, ĥ_t^l = A h_t^l + b (no activation).

79 Learning hidden unit contribution (LHUC) (Swietojanski 14b): similar to LHN but with a diagonal matrix, so fewer parameters: ĥ_t^l = A h_t^l with A = diag(a_1, ..., a_N).
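A minimal sketch of applying LHUC scaling to one hidden layer (the sigmoid parameterization with maximum amplitude 2 is a common choice, assumed here):

```python
import numpy as np

def lhuc_scale(h, r, max_amp=2.0):
    """h: (T, D) hidden activations; r: (D,) speaker-dependent parameters learned by back-propagation.
    Each unit is rescaled by an amplitude constrained to (0, max_amp)."""
    a = max_amp / (1.0 + np.exp(-r))
    return h * a

print(lhuc_scale(np.ones((2, 4)), np.zeros(4)))   # amplitudes of 1.0 when r = 0
```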

80 Room adaptation for REVERB (RealData) Results from (Delcroix 15a) Adap WER (%) st 21.7 All 22.1 LIN 22.1 Speech processed with WPE (1ch) Amount of adaptation data ~9 min Back-end: - DNN with 7 hidden layers - Trigram LM 82 82

81 Model adaptation: can adapt to conditions unseen during training; computationally expensive + processing delay (requires 2 decoding steps); data demanding (a relatively large amount of adaptation data is needed).

82 DNN adaptation techniques Model adaptation Retraining Linear transformation of input or hidden layers (fdlr, LIN, LHN, LHUC) Adaptive training (Cluster/Speaker adaptive training) Auxiliary features Auxiliary features Noise aware training Speaker aware training Context adaptive DNN 84 84

83 Auxiliary feature based adaptation: exploit auxiliary information about the speaker or the noise. Simple way: concatenate auxiliary features to the input features; the weights for the auxiliary features are learned during training. Auxiliary features represent, e.g., speaker awareness (i-vector, bottleneck features) (Saon 13), noise awareness (noise estimate) (Seltzer 13), room awareness (RT60, distance, ...) (Giri 15).

84 Speaker adaptation - Results from (Kundu 15), WER on AURORA 4 / REVERB: baseline: — / 20.1%; i-vector: 9.0% / 18.2%; speaker-ID bottleneck: 9.3% / 17.4%. Speaker i-vectors or bottleneck features have been shown to improve performance for many tasks; other features such as noise or room parameters have also been shown to improve performance.

85 Auxiliary features-based adaptation Rapid adaptation Auxiliary features can be computed per utterance (~10 sec. or less) Computationally friendly No need for the extra decoding step (Single pass unsupervised adaptation) Does not extend to unseen conditions Requires training data covering all test cases 87 87

86 Summary of single-channel distant speech recognition: many effective techniques have been developed for general speech recognition (feature transformation, DNN acoustic models, (DNN-based) model adaptation). DNNs are very powerful at handling spectral information through deep non-linear processing, but performance is still not sufficient.

87 Recent achievements - REVERB 2014 (WER): baseline (GMM) 48.9%; robust back-end 22.2%; multi-mic front-end + robust back-end 9.0%. CHiME (WER): baseline (DNN) 11.5%; robust back-end 7.5%; multi-mic front-end + robust back-end 3.0%.

88 Summary of single-channel distant speech recognition: many effective techniques have been developed for general speech recognition (feature transformation, DNN acoustic models, (DNN-based) model adaptation). DNNs are very powerful at handling spectral information through deep non-linear processing, but performance is still not sufficient → multi-microphone speech recognition.

89 Content of the tutorial Multi-microphone speech recognition Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multichannel speech enhancement front-end (Section 2): - Dereverberation - Noise reduction 91 91

90 Content of the tutorial Multi-microphone speech recognition Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Joint training paradigm (Section 3): - Unify speech enhancement and speech recognition 92 92

91 Topics not covered in this tutorial Voice activity detection Keyword spotting Multi-speaker / Speaker diarization Online processing Data simulation 93 93

92 References (Introduction) (Barker 13) Barker, J. et al. The PASCAL CHiME speech separation and recognition challenge, Computer Speech&Language (2013). (Barker 15) Barker, J. et al. The third CHiME speech separation and recognition challenge: Dataset, task and baselines, Proc. ASRU (2015). (Carletta 05) Carletta, J. et al. The AMI meeting corpus: A pre-announcement, Springer (2005). (Chen 15) Chen, Z., et al, Integration of Speech Enhancement and Recognition using Long-Short Term Memory Recurrent Neural Network, Proc. Interspeech (2015). (Chunyang 15) Chunyang, W., et al. Multi-basis adaptive neural network for rapid adaptation in speech recognition, Proc. ICASSP (2015). (Delcroix 13) Delcroix, M. et al. Is speech enhancement pre-processing still relevant when using deep neural networks for acoustic modeling? in Proc. Interspeech (2013). (Delcroix 15a) Delcroix, M., et al. Strategies for distant speech recognition in reverberant environments, Proc. CSL (2015). (Delcroix 15b) Delcroix, M., et al. Context adaptive deep neural networks for fast acoustic model adaptation, Proc. ICASSP (2015). (Delcroix 16a) (Delcroix 16b) (ETSI 07) (Gemmello 06) (Giri 15) Delcroix, M., et al. Context adaptive deep neural networks for fast acoustic model adaptation in noise conditions, Proc. ICASSP (2016). Delcroix, M., et al. Context adaptive neural network for rapid adaptation of deep CNN based acoustic models, Proc. Interspeech (2016). Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Adv. Front-end Feature Extraction Algorithm; Compression Algorithms, ETSI ES Ver (2007). Gemello, R., et al. Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training, Proc. ICASSP (2006). Giri, R., et al. Improving speech recognition in reverberation using a room-aware deep neural network and multi-task learning, Proc. ICASSP (2015). (Harper 15) Haper, M. The Automatic Speech recognition In Reverberant Environments (ASpIRE) challenge, Proc. ASRU (2015). (Hori 15) Hori, T., et al, The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, Proc. ASRU (2015)

93 References (Introduction) (Hinton 12) Hinton, G., et al. Deep neural networks for acoustic modeling in speech recognition, IEEE Sig. Proc. Mag. 29, (2012). (Juang 04) Juang, B. H. et al. Automatic Speech Recognition A Brief History of the Technology Development, (2004). (Kim 12) Kim, C., et al. "Power-normalized cepstral coefficients (PNCC) for robust speech recognition." Proc. ICASSP (2012). (Kinoshita 13) Kinoshita, K. et al. The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech, Proc. WASPAA (2013). (Kuttruff 09) Kuttruff, H. Room Acoustics, 5th ed. Taylor & Francis (2009). (Liao 13) Liao, H. Speaker adaptation of context dependent deep neural networks, Proc. ICASSP (2013). (Lincoln 05) (Matassoni 14) Lincoln, M. et al., The multichannel wall street journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments, Proc. ASRU (2005). Matassoni, M. et al. The DIRHA-GRID corpus: baseline and tools for multi-room distant speech recognition using distributed microphones, Proc. Interspeech (2014). (Mitra 14) Mitra, V., et al. Damped oscillator cepstral coefficients for robust speech recognition, Proc. Interspeech (2013). (Mohamed 12) Mohamed, A. etal. Acoustic modeling using deep belief networks, IEEE Trans. ASLP (2012). (Neto 95) Neto, J., et al. Speaker adaptation for hybrid HMM-ANN continuous speech recognition system, Proc. Interspeech (1995). (Ochiai 14) Ochiai, T., et al. Speaker adaptive training using deep neural networks, Proc. ICASSP (2014). (Pallett 03) Pallett, D. S. A look at NIST'S benchmark ASR tests: past, present, and future, ASRU (2003). (Parihar 02) Parihar, N. et al. DSR front-end large vocabulary continuous speech recognition evaluation, (2002). (Peddinti 15) Peddinti, V., et al, A time delay neural network architecture for efficient modeling of long temporal contexts." Proc. Interspeech (2015). (Saon 13) Saon, G., et al. Speaker adaptation of neural network acoustic models using i-vectors, Proc. ASRU (2013)

94 References (Introduction) (Saon 15) Saon, G. et al. The IBM 2015 English Conversational Telephone Speech Recognition System, arxiv: (2015). (Seltzer 13) Seltzer, M.L., et al. An investigation of deep neural networks for noise robust speech recognition, Proc. ICASSP (2013). (Seltzer 14) Seltzer, M. Robustness is dead! Long live Robustness! Proc. REVERB (2014). (Shluter 07) Schluter, R., et al. "Gammatone features and feature combination for large vocabulary speech recognition." Proc. ICASSP (2007). (Swietojanski 14b) Swietojanski, P., et al. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models, Proc. SLT (2014). (Tachioka 13) (Tachioka 14) Tachioka, Y., et al. Discriminative methods for noise robust speech recognition: A CHiME Challenge Benchmark, Proc. CHiME, (2013). Tachioka, Y., et al. "Dual System Combination Approach for Various Reverberant Environments with Dereverberation Techniques," Proc. REVERB Workshop (2014). (Tan 15) Tan, T., et al. Cluster adaptive training for deep neural network, Proc. ICASSP (2015). (Waibel 89) Waibel, A., et al. "Phoneme recognition using time-delay neural networks." IEEE transactions on acoustics, speech, and signal processing (1989). (Weng 14) Weng, C., et al. "Recurrent Deep Neural Networks for Robust Speech Recognition," Proc. ICASSP (2014). (Vincent 13) (Vincent 16) Vincent, E. et al. The second CHiME speech separation and recognition challenge: Datasets, tasks and baselines, Proc. ICASSP (2013). Vincent, E., et al. An analysis of environment, microphone and data simulation mismatches in robust speech recognition, Computer Speech & Language, to appear. (Xiong 16) Xiong, W., et al. The Microsoft 2016 Conversational Speech Recognition System, arxiv: (2016) (Yoshioka 15a) Yoshioka, T., et al. "Far-field speech recognition using CNN-DNN-HMM with convolution in time," Proc. ICASSP (2015). (Yoshioka 15b) (Yu 13) Yoshioka, T., et al. The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multimicrophone devices, Proc. ASRU (2015). Yu, D., et al. KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition, Proc. ICASSP (2013)


96 Challenges of DSR Interfering speaker Reverberation Distant mic Background noise 98 98

97 Signal model - Time domain: speech captured with a distant microphone array. Microphone signal at the j-th microphone: y_j[n] = Σ_l h_j[l] x[n-l] + u_j[n] = h_j[n] * x[n] + u_j[n], where x[n] is the target clean speech, h_j[n] the room impulse response, u_j[n] the additive noise (background noise, ...), and n the time index.

98 Signal model - STFT domain: microphone signal at the j-th microphone: Y_j(t, f) ≈ Σ_m H_j(m, f) X(t-m, f) + U_j(t, f) = H_j(t, f) * X(t, f) + U_j(t, f), where X(t, f) is the target clean speech, H_j(t, f) the room impulse response, U_j(t, f) the additive noise, and (t, f) the time frame and frequency bin indexes. A long-term convolution in the time domain is approximated as a convolution across frames in the STFT domain, because h_j[n] is longer than the STFT analysis window.

99 SE Front-end Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multi-channel dereverberation Multi-channel noise reduction Single-channel noise reduction

100 Speech enhancement (SE) Reduce mismatch between observed speech and ASR back-end due to noise/reverberation Single-channel Multi-channel (multiple microphones)

101 Type of processing - Linear processing: a linear filter, constant over long segments. Non-linear processing: a linear filter that changes at every time frame, or a non-linear transformation X̂ = F(Y) with F(·) a non-linear function.

102-103 Categorization of SE front-ends:
- Linear processing - single-channel: WPE dereverberation (Nakatani 10); multi-channel: beamforming (Van Trees 02), WPE dereverberation (Nakatani 10), neural network-based enhancement (Heymann 15).
- Non-linear processing - single-channel: spectral subtraction (Boll 79), Wiener filter (Lim 79), time-frequency masking (Wang 06), NMF (Virtanen 07), neural network-based enhancement (Xu 15, Narayanan 13, Weninger 15); multi-channel: time-frequency masking (Sawada 04), NMF (Ozerov 10), neural network-based enhancement (Xiao 16).
Focus of this tutorial: multi-channel (multi-microphone) linear processing and neural network-based enhancement, which have been shown to interconnect well with the ASR back-end.

104 Single-channel speech enhancement - Classical single-channel approaches: spectral subtraction (or Wiener filtering) (Boll 79, Lim 79): |X̂(t, f)|² = |Y(t, f)|² - |Û(t, f)|²; difficult to estimate the noise term |Û(t, f)|² in non-stationary environments; noise reduction at the cost of introducing distortions; not targeted for ASR. Vector Taylor Series (VTS): estimates noise in the feature domain; targeted for ASR (GMM-HMM); limited effect with DNN back-ends (Seltzer 13, Li 13): ~0% relative improvement with VTS or T/F masking. Neural network-based enhancement: can learn to reduce non-stationary noise; can potentially be merged into the DNN acoustic model (covered in a later section).

105 2.1 Dereverberation Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multi-channel dereverberation Multi-channel noise reduction Single-channel noise reduction

106 Room impulse response Models the multi-path propagation of sound caused by reflections on walls and objects (Kuttruff 09) Length ms in typical living rooms Assumed to be time-invariant (no change in room or microphone/speaker position) h j n Direct path time Early reflections (duration: ms) Late reverberation (duration: 100 ms-1000ms)

107 Reverberant speech (Yoshioka 12b): Y(t, f) = H(t, f) * X(t, f) + U(t, f) = Σ_{τ=0}^{D} H(τ, f) X(t-τ, f) + Σ_{τ=D+1}^{T} H(τ, f) X(t-τ, f) + U(t, f), where the first term is the direct path + early reflections D(t, f) and the second term is the late reverberation L(t, f). (Noise is neglected in the derivations.) Dereverberation aims at suppressing the late reverberation.

108 Dereverberation Linear filtering Weighted prediction error Non-linear filtering Spectral subtraction using a statistical model of late reverberation (Lebart 01, Tachioka 14) Neural network-based dereverberation (Weninger 14)

109 Linear prediction (LP) (Haykin 96): reverberation is a linear filter, so it can be predicted from past observations using linear prediction (under some conditions). Current signal: Y(t, f) = D(t, f) + L(t, f), with L(t, f) predictable from past signals. Prediction: Σ_{τ=1} G(τ, f) Y(t-τ, f). Dereverberation: D̂(t, f) = Y(t, f) - Σ_τ G(τ, f) Y(t-τ, f). Problem: both D(t, f) and L(t, f) are reduced.

110 Estimation of prediction coefficients - Model the prediction residual (speech STFT coefficients) as a stationary complex Gaussian: p(D(t, f)) = N_c(D(t, f); 0, φ_D) = (1/(π φ_D)) exp(-|D(t, f)|²/φ_D), where φ_D is the signal power (assumed constant). Probability of the observed signal given the past observations (likelihood): p(Y(t, f)|Y(t-1, f), ..., Y(t-T, f); G(τ, f)) = (1/(π φ_D)) exp(-|Y(t, f) - Σ_{τ=1} G(τ, f) Y(t-τ, f)|²/φ_D). The prediction filters are obtained by maximizing the log-likelihood L(G(τ, f)) = Σ_t log p(Y(t, f)|Y(t-1, f), ..., Y(t-T, f)) = -Σ_t |Y(t, f) - Σ_{τ=1} G(τ, f) Y(t-τ, f)|²/φ_D + constant terms, i.e., G(τ, f) = argmin_{G(τ, f)} Σ_t |Y(t, f) - Σ_{τ=1} G(τ, f) Y(t-τ, f)|²/φ_D.

111 Estimation of prediction coefficients - The prediction filter can thus be obtained as G(τ, f) = argmin_{G(τ, f)} Σ_t |Y(t, f) - Σ_{τ=1} G(τ, f) Y(t-τ, f)|², which leads to the standard least-squares solution of the corresponding normal equations for the prediction filters.

112 Problems of LP-based speech dereverberation: LP predicts both early reflections and late reverberation; the speech signal exhibits short-term correlation (30-50 ms), so LP also suppresses the short-term correlation of speech. LP assumes the target signal follows a stationary Gaussian distribution, but speech is not stationary Gaussian, so LP destroys the time structure of speech. Solutions: introduce a prediction delay (Kinoshita 07); introduce better modeling of the speech signal (Nakatani 10, Yoshioka 12, Jukic 14).

113 Delayed linear prediction (LP) (Kinoshita 07): predict only from frames at least d frames in the past (delay d = 30-50 ms): prediction Σ_{τ=d} G(τ, f) Y(t-τ, f). Current signal: Y(t, f) = D(t, f) + L(t, f); D(t, f) is unpredictable from the delayed past signals, while L(t, f) remains predictable, so delayed LP only reduces L(t, f).

114 Estimation of prediction coefficients (Nakatani 10, Yoshioka 12) - Delayed LP: D̂(t, f) = Y(t, f) - Σ_{τ=d} G(τ, f) Y(t-τ, f). For a non-stationary signal with time-varying power φ_D(t, f): p(D(t, f)) = N_c(D(t, f); 0, φ_D(t, f)) = (1/(π φ_D(t, f))) exp(-|D(t, f)|²/φ_D(t, f)), where φ_D(t, f) is the time-varying signal power. Prediction filters: G(τ, f) = argmin_{G(τ, f)} Σ_t |Y(t, f) - Σ_{τ=d} G(τ, f) Y(t-τ, f)|²/φ_D(t, f) → weighted prediction error (WPE). Power estimate: φ_D(t, f) = argmin_{φ_D(t, f)} Σ_t [log(π φ_D(t, f)) + |Y(t, f) - Σ_{τ=d} G(τ, f) Y(t-τ, f)|²/φ_D(t, f)].

115 Multi-channel extension: exploit the past signals from all microphones to predict the current signal at one microphone: D̂(t, f) = Y_1(t, f) - Σ_{j=1}^{J} Σ_{τ=d} G_j(τ, f) Y_j(t-τ, f) = Y_1(t, f) - g_f^H ỹ_{t-d,f}. The prediction filter is obtained as g_f = argmin_{g_f} Σ_t |Y_1(t, f) - g_f^H ỹ_{t-d,f}|²/φ_D(t, f). Can output multi-channel signals; it preserves the early reflections, so it can be used as a front-end to a beamformer.

116 Processing flow of WPE: iterate between dereverberation, power estimation, and prediction filter estimation.
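A minimal single-frequency-bin sketch of this iteration (tap count, delay, and the number of iterations are illustrative assumptions; a full implementation loops over all frequency bins):

```python
import numpy as np

def wpe_one_bin(Y, taps=10, delay=3, iterations=3, eps=1e-6):
    """Y: (J, T) complex STFT coefficients of J mics at one frequency bin.
    Alternates power estimation and weighted prediction-filter estimation (WPE)."""
    J, T = Y.shape
    ytilde = np.zeros((J * taps, T), dtype=complex)        # stacked delayed multi-channel context
    for k in range(taps):
        s = delay + k
        ytilde[k * J:(k + 1) * J, s:] = Y[:, :T - s]
    D = Y[0].copy()                                        # start from the observed reference channel
    for _ in range(iterations):
        phi = np.maximum(np.abs(D) ** 2, eps)              # time-varying power of the desired signal
        R = (ytilde / phi) @ ytilde.conj().T               # weighted correlation matrix
        p = (ytilde / phi) @ Y[0].conj()                   # weighted cross-correlation vector
        g = np.linalg.solve(R + eps * np.eye(J * taps), p) # prediction filter
        D = Y[0] - g.conj() @ ytilde                       # subtract the predicted late reverberation
    return D

D = wpe_one_bin(np.random.randn(2, 200) + 1j * np.random.randn(2, 200))
```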

117 Sound demo from the REVERB challenge (Delcroix 14): [spectrograms (frequency vs. time) for headset, distant (RealData), WPE (8ch), and WPE + beamformer (8ch)]

118 Results for REVERB and CHiME 3 - WER by front-end (REVERB, 8 ch / CHiME 3, 6 ch): no processing: — / 15.6%; WPE: 12.9% / 14.7%; WPE + MVDR beamformer: 9.3% / 7.6%. REVERB task (Real Data, eval set) (Delcroix 15): DNN-based acoustic model trained with augmented training data, environment adaptation, decoding with an RNN-LM. CHiME 3 task (Real Data, eval set) (Yoshioka 15): deep CNN-based acoustic model trained with 6-channel training data, no speaker adaptation, decoding with an RNN-LM.

119 Remarks Precise speech dereverberation with linear processing Can be shown to cause no distortion to the target speech Particularly efficient as an ASR front-end Can output multi-channel signals Suited for beamformer pre-processing Relatively robust to noise Efficient implementation in STFT domain A few seconds of observation are sufficient to estimate the prediction filters Matlab p-code available at:

120 2.2 Beamforming Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multi-channel dereverberation Multi-channel noise reduction Single-channel noise reduction

121 Principle Pickup signals in the direction of the target speaker Attenuate signals in the direction of the noise sources Beam pattern microphone array gain as a function of the direction of arrival of the signal

122 Beampattern Beampattern shows the response/gain of the beamformer as a function of the angle for a given frequency Provides a way to check the design of a beamformer Main lobe: Directed in the direction of the speaker Capture the target signal Null In the direction of the point noise sources Cancel point noise source Sidelobe: Inevitable when designing beamformer Causes residual noise Delay-and-sum beampattern of circular array (8 elements, diameter = 20cm) pointing at 120 degrees for frequency 2500Hz

123 Microphone signal model: consider only the part of the room impulse response that falls within the STFT analysis window; late reverberation is treated as diffuse noise and included in the noise term. Y_j(t, f) ≈ Σ_m H_j(m, f) X(t-m, f) + U_j(t, f) ≈ H_j(f) X(t, f) + U_j(t, f), where H_j(f) X(t, f) = O_j(t, f) is the source image at microphone j. Using vector notation: y_{t,f} = [Y_1(t, f), ..., Y_J(t, f)]^T = h_f X(t, f) + u_{t,f} = o_{t,f} + u_{t,f}, where o_{t,f} is the source image at the microphones and h_f = [H_1(f), ..., H_J(f)]^T is the steering vector.

124 Steering vector: represents the propagation from the source to the microphones, including the propagation delays (information about the source direction), early reflections (reverberation within the analysis window), and propagation attenuation (microphone gains). Example under the plane-wave, free-field assumption (no reverberation, speaker far enough from the microphones): Y_j(t, f) = g_j(f) e^{-i2πfΔτ_j} X(t, f) + U_j(t, f), and the steering vector is h_f = [g_1(f) e^{-i2πfΔτ_1}, ..., g_J(f) e^{-i2πfΔτ_J}]^T, where Δτ_j is the time difference of arrival (TDOA, frequency independent), g_j(f) the signal gain (frequency dependent), and U_j(t, f) the reverberation and noise.

125 Beamformer: the output of the beamformer is X̂(t, f) = Σ_j W_j^*(f) Y_j(t, f), or in vector notation X̂(t, f) = w_f^H y_{t,f}, with w_f = [W_1(f), ..., W_J(f)]^T and y_{t,f} = [Y_1(t, f), ..., Y_J(t, f)]^T. The filters w_f are designed to remove noise.

126 Processing flow W 1 (f) Y j t, f W 2 (f) W j (f) + X t, f W J (f) Filter computation Delay and Sum (DS) Minimum variance distortionless response (MVDR) Spatial information extraction TDOAs Steering vectors Spatial correlation matrix

127 2.2.1 Delay and Sum beamformer

128 Delay and sum (DS) beamformer (Van Veen 88): align the microphone signals in time and average them, X̂(t, f) = (1/J) Σ_j e^{2πifΔτ_j} Y_j(t, f). This emphasizes signals coming from the target direction and sums destructively for signals coming from other directions. Requires estimation of the TDOAs Δτ_j.

129 TDOA estimation: the signal cross-correlation ψ_{y_i y_j}(τ) = E[y_i(t) y_j(t+τ)] peaks when the signals are aligned in time: Δτ_ij = argmax_τ ψ_{y_i y_j}(τ). The cross-correlation is sensitive to noise and reverberation, so in practice GCC-PHAT* coefficients, which are more robust to reverberation, are used (Knapp 76, Brutti 08): ψ^PHAT_{y_i y_j}(τ) = IFFT[ Y_i(f) Y_j^*(f) / |Y_i(f) Y_j^*(f)| ]. *Generalized Cross Correlation with Phase Transform (GCC-PHAT).
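A small sketch of GCC-PHAT TDOA estimation between two microphone signals (sampling rate, search range, and the sign convention are assumptions of this sketch; with the convention below, a positive value means the first signal arrives later):

```python
import numpy as np

def gcc_phat_tdoa(yi, yj, fs=16000, max_tau=0.001):
    """Whiten the cross-power spectrum (phase transform), then pick the correlation peak."""
    n = len(yi) + len(yj)
    cross = np.fft.rfft(yi, n) * np.conj(np.fft.rfft(yj, n))
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)        # PHAT-weighted cross-correlation
    max_shift = int(fs * max_tau)
    cc = np.concatenate([cc[-max_shift:], cc[:max_shift + 1]])   # lags -max_shift ... +max_shift
    return (np.argmax(np.abs(cc)) - max_shift) / fs

x = np.random.randn(16000)
delayed = np.concatenate([np.zeros(5), x[:-5]])                  # first mic lags by 5 samples
print(gcc_phat_tdoa(delayed, x))                                 # ~ +5/16000 s
```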

130 BeamformIt, a robust implementation of a weighted DS beamformer* (Anguera 07): w_f = [(α_1/J) e^{2πifΔτ_1}, ..., (α_J/J) e^{2πifΔτ_J}]^T. Processing: GCC-PHAT coefficient computation → TDOA tracking with Viterbi decoding (noisy TDOAs Δτ_1, ..., Δτ_J) → reference channel selection and channel weight computation (α_1, ..., α_J) → filter computation → single-channel noise reduction and beamforming (filtering). Used in the baseline systems of several tasks (AMI, CHiME 3/4); toolkit available. *Also sometimes called a filter-and-sum beamformer.

131 2.2.2 MVDR beamformer

132 Minimum variance distortionless response (MVDR*) beamformer - Beamformer output: X̂(t, f) = w_f^H y_{t,f} = w_f^H h_f X(t, f) + w_f^H u_{t,f}. Constrain the speech X(t, f) to be unchanged (distortionless), w_f^H h_f = 1, so that X̂(t, f) = X(t, f) + w_f^H u_{t,f}, and minimize the noise at the output of the beamformer. The filter is obtained by solving w_f = argmin_{w_f} E[|w_f^H u_{t,f}|²] subject to w_f^H h_f = 1. *The MVDR beamformer is a special case of the more general linearly constrained minimum variance (LCMV) beamformer (Van Veen 88).

133 Expression of the MVDR filter: w_f^MVDR = (R_f^noise)^{-1} h_f / (h_f^H (R_f^noise)^{-1} h_f), where R_f^noise is the spatial correlation matrix* of the noise, which measures the correlation among the noise signals at the different microphones: R_f^noise = (1/T) Σ_t u_{t,f} u_{t,f}^H, i.e., a J x J matrix whose (i, j) element is (1/T) Σ_t U_i(t, f) U_j(t, f)^*. *The spatial correlation matrix is also called the cross spectral density matrix.

134 Alternative expressions - Other expressions for the MVDR filter can be derived: 1. As a function of the microphone signal correlation matrix: w_f^MVDR = (R_f^obs)^{-1} h_f / (h_f^H (R_f^obs)^{-1} h_f). 2. Without explicit steering vector estimation (Souden 10): w_f^MVDR = ((R_f^noise)^{-1} R_f^speech / trace((R_f^noise)^{-1} R_f^speech)) u, where u = [0, ..., 0, 1, 0, ..., 0]^T has a 1 at the reference microphone. As it does not require steering vector estimation, this form may be easier to use for joint optimization.

135 Steering vector estimation: the steering vector h_f can be obtained as the principal eigenvector of the spatial correlation matrix of the source image signals, h_f = P(R_f^speech). Observation (speech + noise): R_f^obs = (1/T) Σ_t y_{t,f} y_{t,f}^H. Noise estimate: R_f^noise = Σ_t M(t, f) y_{t,f} y_{t,f}^H / Σ_t M(t, f), with spectral masks M(t, f) = 1 if noise > speech, 0 otherwise. Source image spatial correlation matrix: R_f^speech = R_f^obs - R_f^noise. (Souden 13, Higuchi 16, Yoshioka 15, Heymann 15)

136 Spectral mask estimation - Clustering of spatial features for mask estimation with source models: Watson mixture model (Souden 13), complex Gaussian mixture model (Higuchi 16). E-step: update the masks, M(t, f) = p(noise | y_{t,f}, R_f^noise, R_f^speech). M-step: update the spatial correlation matrices, R_f^noise = Σ_t M(t, f) y_{t,f} y_{t,f}^H / Σ_t M(t, f). Neural network-based approaches (Hori 15, Heymann 15) are covered later in the tutorial.

137 Processing flow of the MVDR beamformer: time-frequency mask estimation, M(t, f) = 1 if |u_{t,f}| > |o_{t,f}|, 0 otherwise → spatial correlation matrix estimation, R_f^noise and R_f^speech = R_f^obs - R_f^noise → steering vector estimation, h_f = P(R_f^speech) → filter estimation, w_f^MVDR = (R_f^noise)^{-1} h_f / (h_f^H (R_f^noise)^{-1} h_f) → beamforming, X̂(t, f) = w_f^H y_{t,f}.
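A compact NumPy sketch of this flow at one frequency bin, from a (given) noise mask to the beamformed output (the mask, the reference-microphone normalization, and the toy data are assumptions; mask estimation itself is covered by the clustering or neural approaches above):

```python
import numpy as np

def mask_based_mvdr(Y, noise_mask, ref_mic=0, eps=1e-10):
    """Y: (J, T) complex STFT at one frequency bin; noise_mask: (T,) in [0, 1], ~1 where noise dominates."""
    J, T = Y.shape
    m = noise_mask[None, :]
    R_obs = (Y @ Y.conj().T) / T                                   # observation spatial correlation
    R_noise = (Y * m) @ Y.conj().T / max(m.sum(), eps)             # mask-weighted noise correlation
    R_speech = R_obs - R_noise                                     # speech (source image) correlation
    h = np.linalg.eigh(R_speech)[1][:, -1]                         # steering vector: principal eigenvector
    h = h / h[ref_mic]                                             # normalize to the reference microphone
    Rinv_h = np.linalg.solve(R_noise + eps * np.eye(J), h)
    w = Rinv_h / (h.conj() @ Rinv_h)                               # w = R_n^-1 h / (h^H R_n^-1 h)
    return w.conj() @ Y                                            # beamformer output for all frames

Y = np.random.randn(4, 300) + 1j * np.random.randn(4, 300)
mask = (np.random.rand(300) > 0.5).astype(float)
print(mask_based_mvdr(Y, mask).shape)
```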

138 Other beamformers - Max-SNR beamformer* (Van Veen 88, Araki 07, Warsitz 07): optimizes the output SNR without the distortionless constraint, w_f^maxSNR = P((R_f^noise)^{-1} R_f^obs). Multi-channel Wiener filter (MCWF) (Doclo 02): preserves the spatial information at the output (multi-channel output), W_f^MCWF = (R_f^obs)^{-1} R_f^speech. The max-SNR beamformer and the MCWF can also be derived from the spatial correlation matrices. Generalized sidelobe canceller (GSC): a linearly constrained minimum variance (LCMV) beamformer such as MVDR can be expressed as a GSC (distortionless filter + blocking matrix + noise estimation). *The max-SNR beamformer is also called the generalized eigenvalue beamformer.

139 2.2.3 Experiments

140 CHiME 3 results: [chart of WER for no processing, TF masking, BeamformIt, max-SNR, MCWF, and MVDR front-ends]. Results for the CHiME 3 task (Real Data, eval set): deep CNN-based acoustic model trained with 6-channel training data, no speaker adaptation, decoding with an RNN-LM.

141 Sound demo: [spectrograms (frequency vs. time): clean, observed (SimuData), MVDR, mask-based enhancement]

142 Remarks - Delay-and-sum beamformer: simple approach; relies on correct TDOA estimation (errors in TDOA estimation may result in amplifying noise); not optimal for noise reduction in general. Weighted DS beamformer (BeamformIt): includes weights to compensate for amplitude differences among the microphone signals; uses a TDOA estimation more robust than plain GCC-PHAT; still potentially affected by noise and reverberation; not optimal for noise reduction. MVDR beamformer: optimized for noise reduction while preserving speech (distortionless); extracting spatial information is key to its success - from TDOAs: poor performance with noise and reverberation; from signal statistics: more robust to noise and reverberation; computationally more involved than the DS beamformer.

143 Conclusions for section 2 Multi-microphone speech enhancement can greatly improve ASR performance Dereverberation Beamformer Distortion-less processing is important Beamforming outperforms TF masking for ASR TF masking removes more noise Linear filtering causes less distortion (especially with the distortionless constraint) This leads to better ASR performance Future directions Online extension (source tracking) Multiple speakers Tighter interconnection of SE front-end and ASR back-end (See section 3.)

144 References (SE-Front-end 1/4) (Anguera 07) (Araki 07) (Boll 79) (Brutti 08) (Delcroix 14) (Delcroix 15) (Doclo 02) (Erdogan 15) Anguera, X., et al. Acoustic beamforming for speaker diarization of meetings, IEEE Trans. ASLP (2007). Araki, S., et al. Blind speech separation in a meeting situation with maximum snr beamformers, Proc. ICASSP (2007). Boll, S. Suppression of acoustic noise in speech using spectral subtraction, IEEE Trans. ASSP (1979). Brutti, A., et al. Comparison between different sound source localization techniques based on a real data collection, Proc. HSCMA (2008). Delcroix, M., et al. Linear prediction-based dereverberation with advanced speech enhancement and recognition technologies for the REVERB challenge, Proc. REVERB (2014). Delcroix, M., et al. Strategies for distant speech recognition in reverberant environments, EURASIP Journal ASP (2015). Doclo, S., et al. GSVD-based optimal filtering for single and multi-microphone speech enhancement,. IEEE Trans. SP (2002). Erdogan, H., et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, Proc. ICASSP (2015). (Haykin 96) Haykin, S. Adaptive Filter Theory (3rd Ed.), Prentice-Hall, Inc., Upper Saddle River, NJ, USA (1996). (Heymann 15) (Heymann 16) (Higuchi 16) Heymann, J., et al. BLSTM supported GEV Beamformer front-end for the 3rd CHiME challenge, Proc. ASRU (2015). Heymann, J., et al. Neural network based spectral mask estimation for acoustic beamforming, Proc. ICASSP (2016). Higuchi, T., et al. Robust MVDR beamforming using time frequency masks for online/offline ASR in noise, Proc. ICASSP (2016)

145 References (SE-Front-end 2/4) (Hori 15) (Jukic 14) (Kinoshita 07) (Knapp 76) (Lebart 01) (Li 13) Hori, T., et al. The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, Proc. of ASRU (2015). Jukic, A., et al. Speech dereverberation using weighted prediction error with Laplacian model of the desired signal, Proc. ICASSP (2014). Kinoshita, K., et al., Multi-step linear prediction based speech dereverberation in noisy reverberant environment, Proc. Interspeech (2007). Knapp, C.H., et al. The generalized correlation method for estimation of time delay, IEEE Trans. ASSP (1976). Lebart, K., et al. A new method based on spectral subtraction for speech Dereverberation, Acta Acoust (2001). Li, B. and Sim, K.C., Improving robustness of deep neural networks via spectral masking for automatic speech recognition, Proc. ASRU, (Lim 79) Lim, J.S., et al. Enhancement and bandwidth compression of noisy speech, Proc. IEEE (1979). (Mandel 10) (Nakatani 10) (Narayanan 13) (Ozerov 10) (Sawada 04) Mandel, M.I., et al. Model-based expectation maximization source separation and localization, IEEE Trans. ASLP (2010). Nakatani, T., et al. Speech Dereverberation based on variance-normalized delayed linear prediction, IEEE Trans. ASLP (2010). Narayanan, A., et al. Ideal ratio mask estimation using deep neural networks for robust speech recognition, Proc. ICASSP (2013). Ozerov, A., et al. Multichannel Nonnegative Matrix Factorization in Convolutive Mixtures for Audio Source Separation, in IEEE Trans. ASLP (2010). Sawada, H., et al. A robust and precise method for solving the permutation problem of frequencydomain blind source separation, IEEE Trans. SAP (2004)

146 References (SE-Front-end 3/4)
(Seltzer 13) Seltzer, M.L., et al. An investigation of noise robustness of deep neural networks, Proc. ICASSP (2013).
(Souden 10) Souden, M., et al. On optimal frequency-domain multichannel linear filtering for noise reduction, IEEE Trans. ASLP (2010).
(Souden 13) Souden, M., et al. A multichannel MMSE-based framework for speech source separation and noise reduction, IEEE Trans. ASLP (2013).
(Tachioka 14) Tachioka, Y., et al. Dual system combination approach for various reverberant environments with dereverberation techniques, Proc. REVERB (2014).
(Van Trees 02) Van Trees, H.L. Detection, Estimation, and Modulation Theory, Part IV: Optimum Array Processing, Wiley-Interscience, New York (2002).
(Van Veen 88) Van Veen, B.D., et al. Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine (1988).
(Virtanen 07) Virtanen, T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE Trans. ASLP (2007).
(Wang 06) Wang, D.L., and Brown, G.J. (Eds.). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley/IEEE Press, Hoboken, NJ (2006).
(Wang 16) Wang, Z.-Q., et al. A joint training framework for robust automatic speech recognition, IEEE/ACM Trans. ASLP (2016).
(Warsitz 07) Warsitz, E., et al. Blind acoustic beamforming based on generalized eigenvalue decomposition, IEEE Trans. ASLP (2007).
(Weninger 14) Weninger, F., et al. The MERL/MELCO/TUM system for the REVERB challenge using deep recurrent neural network feature enhancement, Proc. REVERB (2014).
(Weninger 15) Weninger, F., et al. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, Proc. LVA/ICA (2015).

147 References (SE-Front-end 4/4)
(Xiao 16) Xiao, X., et al. Deep beamforming networks for multi-channel speech recognition, Proc. ICASSP (2016).
(Xu 15) Xu, Y., et al. A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Trans. ASLP (2015).
(Yoshioka 12) Yoshioka, T., et al. Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening, IEEE Trans. ASLP (2012).
(Yoshioka 12b) Yoshioka, T., et al. Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition, IEEE Signal Process. Mag. (2012).
(Yoshioka 15) Yoshioka, T., et al. The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices, Proc. ASRU (2015).




151 Integration of front-end and back-end with deep network Techniques discussed so far are designed separately. SE front-end, feature extraction, and acoustic model Next, we examine techniques that perform global optimization of these tasks. Towards end-to-end acoustic modeling: integrated SE, FE, and AM. Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model

152 Integrated beamforming and acoustic modeling We will focus on the integration of beamforming front-end with acoustic modeling. Y j (t, f) SE front-end X(t, f) Feature extraction o t O Recognizer W My name is Model adaptation Acoustic model Lexicon Language model Multi-channel dereverberation Multi-channel noise reduction (Beamforming) Single-channel noise reduction

153 Review of traditional processing pipeline of beamforming for ASR [Pipeline: Y_j(t, f) → beamforming (delay-and-sum, weighted delay-and-sum, MVDR (Minimum Variance Distortionless Response)) → X(t, f) → feature extraction (filter banks, MFCCs) → o_t → acoustic model neural nets (DNN, LSTM, CNN) → p(s_t = k | o_t) → to decoder.] Limitations: not fine-tuned for ASR, i.e. optimized for signal-level criteria (e.g. SNR) rather than ASR criteria (e.g. WER); no learning capability, so it cannot leverage historical data (experience) for better performance.

154 Towards end-to-end acoustic modeling Y j (t, f) X(t, f) o t Acoustic p(s t = k o t ) Beamforming Feature model neural neural net Extraction net DNN LSTM CNN All processing steps are implemented as neural nets learn beamforming and feature extraction from data (may incorporate prior knowledge in model design). global optimization for ASR. Focus on neural nets implementation of beamforming learning based beamforming Filter banks MFCCs DNN LSTM CNN DNN LSTM CNN From ASR flow direction of gradient

155 Taxonomy of Beamforming Beamforming Generative: use a generative model to guide BF filter optimization. Discriminative: use ASR cost to guide filter learning or prediction. Traditional DS, MVDR, MWF, (Section 2.2) Generative Learning based Discriminative This section will cover the 5 blue leaf nodes in details. Max. likelihood beamforming (Section 3.1) BF filter learning BF filter prediction BF: beamforming With phase information (Section 3.2) No phase information (Section 3.3) Direct filter prediction (Section 3.4) Filter prediction with mask (Section 3.5)

156 Use of prior knowledge in learning based beamforming Classical array processing theory provides us valuable knowledge about how to design beamforming. Questions for model design: Shall we use our knowledge about classic beamforming in learning based methods? How much knowledge and how to incorporate it into the model design? Delay-and-sum, MVDR, etc Mask prediction by DNN Filter prediction by DNN, ML beamforming Multi-channel acoustic model More signal processing More deep learning

157 3.1 Generative Approach - Maximum Likelihood Filter Estimation

158 Generative approach Observed features depend on both the steering vector and the state sequence. The task is to infer the steering vector (the beamforming filter) and the state sequence from the observed features. In the training phase, we learn the acoustic model p(O | S; θ) from single-channel clean speech. O is the feature vector sequence, e.g. MFCCs or filterbanks. S is the state sequence. θ represents the acoustic model, e.g. an HMM/GMM model. In the test phase, we estimate S and w by maximizing the posterior probability p(w, S | O; θ). [Graphical model: w (steering vector) and S (state sequence) → O (observed features)]

159 Parameter Estimation (Seltzer 03/04) It is difficult to jointly estimate S and w from p(w, S | O; θ). We can estimate S and w alternately, i.e. fix w and estimate S, then fix S and estimate w, until convergence. Fix w to its current estimate ŵ and obtain Ŝ by maximizing p(S | ŵ, O; θ). If the word label is available (supervised estimation), there is no need to decode O. If the word label is not available (unsupervised estimation), this is equivalent to Viterbi decoding of features generated from signals beamformed with the filter weights ŵ. Fix S to its current estimate Ŝ and obtain ŵ by maximizing p(w | Ŝ, O; θ); this ML estimation is described on the next slide. [Graphical model: w (steering vector) and S (state sequence) → O (observed features)]

160 Maximum Likelihood Estimation (LIMABEAM) (Seltzer 03/04) Given Ŝ, the filter weights can be obtained with the maximum likelihood criterion, assuming a noninformative prior for w: ŵ = arg max_w log p(O | w, Ŝ; θ). Filter weights w can be in either the time domain or the frequency domain. The estimation problem can be solved iteratively with the expectation-maximization (EM) algorithm (see section 1.2 for the EM algorithm). In the E-step, update the state alignment and the Gaussian posterior probabilities using the current w. In the M-step, update the filter weights by gradient ascent using the current Gaussian posteriors. [Time-domain LIMABEAM represented as a computational graph: multi-channel speech signals → filter-and-sum in the time domain with tunable weights w, x(t) = Σ_j w_j(t) * y_j(t) → single-channel enhanced waveform → STFT → single-channel enhanced Fourier coefficients → MFCC feature extraction → negative log likelihood on the HMM/GMM given the HMM state sequence. Red arrows show the back-propagation of the gradient for tuning the filter weights.]
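As a concrete illustration of the filter-and-sum operation that LIMABEAM tunes, here is a minimal numpy sketch (not the original LIMABEAM implementation; the function name and array shapes are illustrative assumptions):

```python
import numpy as np

def filter_and_sum(y, w):
    """Time-domain filter-and-sum beamforming sketch.
    y: (J, num_samples) multi-channel waveform; w: (J, N) per-channel FIR filters.
    Returns the enhanced waveform x(t) = sum_j (w_j * y_j)(t)."""
    J = y.shape[0]
    # convolve each channel with its filter, then sum across channels
    return sum(np.convolve(y[j], w[j]) for j in range(J))
```

Delay-and-sum is the special case where each w_j is a single delayed, scaled impulse; LIMABEAM instead tunes all taps by back-propagating the HMM/GMM log-likelihood through this operation.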

161 Performance of LIMABEAM (Seltzers 03/04) CMU WSJ PDA corpus Array with 4 distant mics (see figure). Simultaneous close-talk recording. Room T60=0.27s. SNR=13dB (PDA-B subset). Speaker reads text from the test set of WSJ0. Variable position of PDA from utterance to utterance. ASR settings HMM/GMM acoustic model trained from 7000 clean sentences, 39D MFCC features (CMN). LIMABEAM settings First pass decoding on delay-and-sum (DS) features to obtain hypothesized text that is used to obtain HMM state alignment. LIMABEAM weights initialized as DS weights (20 taps in time domain for each channel). WER on PDA-B subset (Seltzer 04) Settings WER 1 distant mic 68.3% Delay-and-sum 58.9% LIMABEAM (Unsupervised) 39.8% Close talk 27.5%

162 Frequency Domain Implementation for CHiME4 (Xiao 16b) ML estimation is revised as follows: ŵ = arg max_w [ log p(O | w; θ) + (T/2) log|Σ_o| − (α/2) ||w − w_0||² ]. The HMM/GMM is replaced by a GMM (1024 Gaussians), so there is no need to find S, which avoids multi-pass decoding in unsupervised mode. The log determinant of the feature covariance matrix Σ_o is used to account for the feature-space change due to beamforming. L2 regularization is used to avoid overfitting; w_0 denotes the initial weights. [Computational graph: multi-channel speech signals → STFT → multi-channel Fourier coefficients → beamforming in the frequency domain with tunable weights w, X(t, f) = w_f^H y_{t,f} → single-channel enhanced Fourier coefficients → feature extraction (MFCC or log Mel filterbanks) → cost functions (blue boxes): negative log likelihood on the HMM/GMM or UBM/GMM, negative log determinant of the covariance, and the L2 regularization cost (α/2) ||w − w_0||².]

163 Frequency Domain Implementation for CHiME4 (Xiao 16b) Two ways of parameterizing the beamforming filter weights. Option 1: real + imaginary parts of the weights, w_j(f) = a_j(f) + i b_j(f); the number of free parameters is 2JF, where J is the number of channels and F the number of frequency bins. Option 2: TDOA + gain, w_f = [g_1(f) e^{i2πfΔτ_1}, …, g_J(f) e^{i2πfΔτ_J}]^T; the TDOA is assumed to be frequency-independent to reduce the number of free parameters. We may first estimate a frequency-independent gain (fewer parameters, more reliable estimation) and then use it to initialize the frequency-dependent gain (more parameters, with L2 regularization). Two feature spaces: MFCC features (39D) are used for GMM training (diagonal covariance matrices) and ML filter estimation; log Mel filterbank features (40D) are used for decoding with the DNN-based acoustic model.
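A small sketch of the Option 2 parameterization, assuming one gain per channel and frequency bin and one frequency-independent TDOA per channel (the sign convention of the exponent is an assumption):

```python
import numpy as np

def weights_from_tdoa_gain(gains, tdoas, freqs):
    """gains: (J, F) real gains; tdoas: (J,) TDOAs in seconds;
    freqs: (F,) bin centre frequencies in Hz.
    Returns complex weights w_j(f) = g_j(f) * exp(i * 2*pi*f*tdoa_j), shape (J, F)."""
    return gains * np.exp(1j * 2.0 * np.pi * np.outer(tdoas, freqs))
```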

164 Feature Extraction Steps Mel filterbank: implemented as a linear transform y_t = A x_t, where A is a 40x257 matrix, 40 is the number of Mel filterbanks, and 257 is the number of frequency bins. [Figure: the parameters of A plotted over Mel filterbank index and frequency bin.] Power spectrum: obtain the power spectrum from the complex spectrum. Forward pass: y_t = x_t ⊙ x_t*, where ⊙ is elementwise multiplication and x_t* is the complex conjugate of x_t. Backward pass: Δx_t = 2(Δy_t ⊙ x_t).
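A minimal numpy sketch of the forward and backward passes of these two fixed feature-extraction steps, mirroring the expressions on the slide (the function names are illustrative and not the SignalGraph API):

```python
import numpy as np

def power_spectrum_forward(x):
    """y_t = x_t (elementwise) conj(x_t) for a complex spectrum x."""
    return (x * np.conj(x)).real

def power_spectrum_backward(grad_y, x):
    """Backward pass as written on the slide: delta_x = 2 * (delta_y (elementwise) x)."""
    return 2.0 * grad_y * x

def mel_forward(x_pow, A):
    """Mel filterbank as a linear transform; x_pow: 257-dim power spectrum,
    A: 40x257 filterbank matrix."""
    return A @ x_pow

def mel_backward(grad_out, A):
    # gradient of a linear transform
    return A.T @ grad_out
```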

165 Feature Extraction Steps Delta features: forward pass Δx_t = Σ_{τ=1}^{2} τ(x_{t+τ} − x_{t−τ}) / (2 Σ_{τ=1}^{2} τ²), where Δx_t and x_t are the delta and original features, respectively. Implemented as an affine transform of feature trajectories: Δx = Ax, where x = [x_1, …, x_T]^T and A is a T×T sparse transform (see Xiao 16d). The backward pass is straightforward as it is a linear transform. Both delta and acceleration features are used. Mean normalization: forward pass y_t = x_t − μ_x, where μ_x is the mean of x_t over time; backward pass Δx_t = Δy_t − μ_{Δy}, where μ_{Δy} is the mean of Δy_t. Log compression: forward pass y = log x; backward pass Δx = Δy / x. A small constant is added to prevent numerically unstable gradients due to 1/x.
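The delta computation can be written as a sparse T×T matrix applied to each feature trajectory; the sketch below builds such a matrix, with boundary frames clamped (the edge handling is an assumption):

```python
import numpy as np

def delta_transform_matrix(T, win=2):
    """A such that delta = A @ x for a trajectory x = [x_1, ..., x_T] (one dimension)."""
    denom = 2.0 * sum(tau * tau for tau in range(1, win + 1))
    A = np.zeros((T, T))
    for t in range(T):
        for tau in range(1, win + 1):
            A[t, min(t + tau, T - 1)] += tau / denom   # +tau * x_{t+tau}
            A[t, max(t - tau, 0)]     -= tau / denom   # -tau * x_{t-tau}
    return A
```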

166 Feature Extraction Steps Global mean and variance normalization: implemented as a diagonal affine transform. Parameters are obtained from a portion of the training data using the initial beamforming filter weights and are fixed during training. Context splicing: forward pass y_t = [x_{t−L}^T, …, x_{t+L}^T]^T, where 2L+1 is the context size; backward pass: collect the gradient Δx_t from the corresponding dimensions of Δy_t of neighboring frames.

167 Results of Modified LIMABEAM on CHiME4 (Xiao 16b) Modified LIMABEAM is implemented in the SignalGraph toolkit, a Matlab-based deep learning toolkit for signal processing tasks, available online a few days after the tutorial. Modified LIMABEAM performs slightly worse (14.5%) than BeamformIt (13.6%), perhaps due to the local optimum problem of the EM algorithm. When initialized with MVDR weights, modified LIMABEAM reduces the WER from 9.5% to 9.2%. Analysis: the filter weights are optimized for the GMM instead of the DNN acoustic model, so performance is not optimal; first-pass decoding is not used for generating the state alignment, which may give less accurate Gaussian posteriors. WER on the CHiME4 6-ch track et05-real with a DNN sMBR acoustic model (6-ch pooled training) and trigram LM:
- Single channel (single distant microphone): 21.6%
- Delay-and-sum (BeamformIt; weighted delay-and-sum with TDOA tracking and channel gain estimation): 13.6%
- MVDR (using an LSTM-predicted time-frequency mask): 9.5%
- Modified LIMABEAM (TDOAs initialized as 0, gains initialized as 1): 14.5%
- Modified LIMABEAM (weights initialized by MVDR beamforming): 9.2%

168 Summary of Generative Approach Advantages of generative approaches: they rely only on a clean acoustic model for filter estimation, so they are robust to unseen noise types, and they do not need clean-noisy parallel speech for training. Limitations: filter estimation is time consuming, and the models used for filter estimation (GMM) and decoding (DNN) are mismatched. Ways for improvement: a better generative model for capturing the distribution of speech signals; better weight initialization (EM is sensitive to initialization); better and faster optimization methods; optimizing directly for the DNN used in the decoder; multi-condition environment adaptive training (similar to speaker adaptive training).

169 Summary of Generative Approach Delay-and-sum, MVDR: use signal processing theory to estimate the filters. Maximum likelihood beamforming: uses the beamforming equation, but relies on machine learning for filter estimation. Multi-channel acoustic model: does not even use the beamforming equation; a black-box approach. [Spectrum: more signal processing ↔ more machine learning]

170 3.2 Discriminative Approach - filter learning with phase information

171 Filter learning with phase information Both magnitude and phase information are available to the acoustic model. Use time-domain raw waveforms as inputs, or frequency-domain magnitude+phase or real+imaginary parts. Let the network learn a bank of spatial filters that are optimized for ASR. The learned spatial filters usually look at different directions of arrival (DOA) (Hoshen 15, Sainath 15). The information from all directions is fed to the rest of the acoustic model, and the model decides how to use it for the best ASR performance. [Diagram: array inputs → spatial filters 1…P → rest of acoustic model → speech class posteriors]

172 Time domain filter learning (Hoshen 15, Sainath 15) Filter-and-sum is approximated by multi-channel temporal convolution: the input is a J×M matrix of raw waveform samples, where J is the number of channels and M the number of samples. P filters w_1 to w_P, each with a weight matrix of size J×N (N < M), are convolved along the time axis, with P up to 256. The output of each filter is max-pooled, and the output of the convolutional layer (1×P) is used for acoustic modeling. This is equivalent to using a bank of P beamforming filters, each looking at an arrival angle and a frequency band. [Diagram: multi-channel speech waveform of a frame x ∈ R^{J×M} → multi-channel temporal convolution with filters w_p ∈ R^{J×N} → outputs y_p ∈ R^{M−N+1} → pooling and nonlinearity → z ∈ R^{1×P} → acoustic model network (convolution + LSTM + DNN) → speech state posterior → ASR cross entropy cost.]
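A toy numpy sketch of the unfactored multi-channel temporal convolution layer described above (the pooling and nonlinearity details are simplified assumptions):

```python
import numpy as np

def multichannel_conv_layer(x, W):
    """x: (J, M) raw-waveform frame; W: (P, J, N) bank of multi-channel filters.
    Returns z: (P,) per-filter outputs after filter-and-sum and max pooling over time."""
    P, J, N = W.shape
    z = np.empty(P)
    for p in range(P):
        # filter-and-sum across channels, 'valid' convolution -> length M - N + 1
        y_p = sum(np.convolve(x[j], W[p, j], mode='valid') for j in range(J))
        z[p] = np.max(np.abs(y_p))   # max pooling over time (rectification as a simplification)
    return z
```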

173 Learned beamforming filters (2ch) (Sainath 15) Selected unfactored filters Filters are 25ms long. There are 256 of them in the network. Filters exhibits strong frequency selectivity and also spatial selectivity

174 Time domain filter learning (Hoshen 15, Sainath 15) Advantages: Global optimization of beamforming, feature learning, and acoustic modeling. Limitations: Fixed looking directions. Good for 2-channel case as the beam is wide and few looking directions are enough. Large arrays have narrower beams and may require a large number of filters. Filters have both spatial and frequency selectivity need large number of filters to cover all direction/frequency combinations. Dependency on fixed array geometry need to retrain if geometry changes. Instead of using filterbanks, features are learned from multi-channel data cannot share the acoustic model with single channel data need to maintain multiple AMs

175 Factored time domain filter learning (Sainath 16) Simultaneous spatial and frequency selectivity requires a large number of filters to cover all spatial directions and frequency bands. Use factorized filters: a spatial filter layer (tconv1) followed by a temporal filter layer (tconv2). tconv1 contains very few short multi-channel filters (P up to 10, N = 80 taps, or 5 ms) and focuses on spatial filtering. The outputs of tconv1 are P waveforms y_1 to y_P, each beamformed with a predefined looking direction. tconv2 has a large number of single-channel filters (F = 128) with a long filter length (L = 400 taps, or 25 ms) and focuses on temporal filtering; it is shared by all P versions of the waveforms from tconv1. [Diagram: multi-channel speech waveform of a frame x ∈ R^{J×M} → tconv1: multi-channel temporal convolution with filters w_p ∈ R^{J×N} → y_p ∈ R^M → tconv2: single-channel temporal convolution with filters w ∈ R^{L×F} → z ∈ R^{(M−L+1)×F×P} → pooling and nonlinearity → o ∈ R^{F×P} → acoustic model network (convolution + LSTM + DNN) → speech state posterior → ASR cross entropy cost.]

176 Learned beamforming filters (2ch) Selected unfactored filters All 10 factored filters (tconv1) Filters are 25ms long. There are 256 of them in the network. Filters exhibits strong frequency selectivity and also spatial selectivity. Filters are 5ms long. Only 10 of them in the network. Less obvious frequency selectivity for some filters better factorization. However, all the filters still exhibit both kinds of selectivity the factorization is not perfect

177 ASR results (Hoshen 15, Sainath 15/16) For 1-ch case, waveform input (i.e. automatic feature learning) outperforms filterbanks. 2ch Unfactored filter learning outperforms 8ch delay-and-sum and MVDR beamformers. (MVDR performance heavily depends on statistics estimation) Factored filters outperforms unfactored ones significantly. Multi-task learning works well. Conclusion: filter learning significantly outperforms traditional beamforming methods. Experimental settings 2000 hours of simulated training data, 3 million sentences (voice queries) Reverberation simulation: 100 room configurations, T60 in [0.4s 0.9s]. Distance between source and array in [1m 4m]. Noise: SNR in [0dB 20dB], average = 12dB. Noise samples from Youtube and daily life recordings. 8 channel uniform linear array data simulated (2cm distance between 2 mics). 20 hours simulated test data (no overlap in T60 and SNR setings). Array geometry identical to training data. Methods Word Error Rates (%) Cross entropy Sequential 1-ch, filterbanks ch, waveform Delay-and-sum, 8ch, waveform MVDR, 8ch Unfactored filter learning (2ch) Factored filter learning Factored filter learning + MSE training (MTL)

178 Summary Learning fixed spatial filters Each filter looks at one direction, the filters together cover all possible directions. Each filter also do bandpass filtering, especially in unfactored filter learning. Use little knowledge of classic beamforming Only the multi-channel temporal convolution is motivated by filter-and-sum. Joint filter learning and feature learning more flexible model perhaps more potential gain. Drawbacks: Do not make use of theoretical knowledge of signal processing. Array geometry dependent network needs to be retrained for a new geometry. Feature learning Every instance of model uses different features hard to share models between corpus

179 Summary Delay-and-sum, MVDR: use signal processing theory to estimate the filters. Maximum likelihood beamforming: uses the beamforming equation, but relies on machine learning for filter estimation. Filter learning: does not even use the beamforming equation; a black-box approach. [Spectrum: more signal processing ↔ more deep learning]

180 3.3 Discriminative Approach - filter learning without phase information

181 Filter learning without phase information (Marino 11, Swietojanski 13, Liu 14, Swietojanski 14a) Concatenate the speech features (e.g. log Mel filterbank) of each channel at the input of the acoustic model: o_t = [y_{t,1}; y_{t,2}] → acoustic modeling network. With fully connected networks (Swietojanski 13, Liu 14) or with CNNs (Swietojanski 14a). The features are derived from power, so phase information is ignored; only the power differences between channels are used, which is not enough information for spatial filtering.

182 CNN-based multi-channel input (feature domain) Each channel is considered as a different feature map input to a CNN acoustic model. Conventional CNN: h_{t,f} = σ(Σ_j w_j ∗ y_{t,f,j} + b), i.e. each channel is processed with a different filter w_j and the results are summed across channels. This is similar to beamforming, but the filter is shared across time-frequency bins and the input does not include phase information. Channel-wise convolution (Swietojanski 14a): h_{t,f,j} = σ(w ∗ y_{t,f,j} + b), followed by max pooling across channels, max_j(h_{t,f,j}), i.e. each channel is processed with the same filter w and the most reliable channel is selected for each time-frequency bin.
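A small sketch of the channel-wise convolution with cross-channel max pooling, using a 2-D convolution over each channel's time-frequency map (shapes and function names are illustrative assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

def channel_wise_conv_maxpool(y, w, b):
    """y: (J, T, F) per-channel feature maps; w: (kT, kF) shared 2-D filter; b: scalar bias.
    Applies the same filter to every channel, then max-pools across channels."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    h = np.stack([sigmoid(convolve2d(y[j], w, mode='valid') + b)
                  for j in range(y.shape[0])])
    return h.max(axis=0)   # keep the most reliable channel per time-frequency bin
```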

183 Results for AMI corpus Results (WER) from (Swietojanski 14a):
- Single distant mic: DNN 53.1%, CNN 51.3%
- Multi-channel input (4ch): DNN 51.2%, CNN 50.4%
- Multi-channel input (4ch), channel-wise convolution (CNN only): %
- BeamformIt (8ch): DNN 49.5%, CNN 46.8%
Multi-channel input improves over single-channel input. Beamforming seems to perform better, possibly because it exploits phase differences across channels. Back-end configuration: 1 CNN layer followed by 5 fully connected layers; input features are 40 log Mel filterbanks + delta + delta-delta.

184 Summary Using multi-channel power features improves ASR performance. Phase is not used, so no spatial filtering is performed. A black-box approach: it is not clear how the model makes use of the multi-channel information. [Spectrum: delay-and-sum, MVDR → maximum likelihood beamforming → filter learning without phase information; more signal processing ↔ more deep learning]

185 3.4 Discriminative Approach - filter prediction by neural networks

186 Beamforming filter prediction (Xiao 16a/c, Li 16) Filter learning uses a bank of pre-defined spatial filters, each looking at a different spatial direction, so the filters are not customized to the test sentences. Filter prediction instead estimates one spatial filter for each test sentence. STFT-domain filter prediction (Xiao 16a/c, work started in JSALT 2015); time-domain filter prediction (Li 16). [Diagram: Y_j(t, f) → beamforming filter predictor network → weights W_1(f), …, W_J(f) → X(t, f) = Σ_j W_j(f) Y_j(t, f); the classic beamforming formula is used.]
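Applying a predicted filter is just the classic beamforming formula above; a one-line numpy sketch (following the slide's weight convention, without conjugation):

```python
import numpy as np

def apply_predicted_filter(Y, W):
    """Y: (J, T, F) multi-channel STFT; W: (J, F) predicted complex weights.
    Returns X with X(t, f) = sum_j W_j(f) * Y_j(t, f)."""
    return np.einsum('jf,jtf->tf', W, Y)
```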

187 Frequency Domain Filter Prediction (Xiao 16a/c) Average beamforming weights 4112D 588D 4112D Mean pooling Filter prediction network Extracting Generalized Cross Correlation Multi-channel speech signals of an utterance 8D GCC feature vectors (reshaped to 21x28) of 4 DOA angles (45, 145, 225, 315). 21D Real part 257D 28D Imaginary part 257D Real/imaginary beamforming weights of all frames (frame rate = 10Hz) Illustration generated from 8-mic circular array with 20cm diameter. Input to network is phasecarrying features Generalized cross correlation Spatial covariance matrix Output of network is filter weights. Predict real and imaginary parts separately. 8ch x 257 bins x 2 = 4112 real values. Mean pooling over a sentence for stable filter prediction. Can be removed if speaker is moving

188 Phase-carrying features (Xiao 16a/c) Phase differences between channels tell us the spatial location of target and interference. Generalized cross correlation (GCC, Knapp 76, Xiao 16a, see section 2.2.1): 28 microphone pairs from 8 microphones; 21 features per microphone pair for a 16 kHz sampling rate and 20 cm maximum microphone distance; total features per frame 21x28 = 588. GCC-PHAT(τ) = IFFT[ψ_{y_i y_j}(f)], with ψ_{y_i y_j}(f) = Y_i(f) Y_j*(f) / |Y_i(f) Y_j*(f)|. Normalized spatial covariance matrix (SCM) (Xiao 16c, see section 2.2.2): 28 complex values (upper triangle) from the SCM of every frequency bin, R_f = (1/T) Σ_t y_{t,f} y_{t,f}^H / (y_{t,f}^H y_{t,f}); the dimensionality is reduced by averaging neighboring bins. [Figures: 21D GCC feature vectors (reshaped to 21x28) for 4 DOA angles (45, 145, 225, 315); different DOA angles show different patterns, which is good for predicting the DOA and the filter weights. 8x8 SCMs of two time-frequency bins; power normalization reduces variability due to speech content and makes the phase information more obvious; the upper triangle (28D) is used as features.]
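A minimal numpy sketch of per-frame GCC-PHAT feature extraction for one microphone pair (frame length, hop, and the number of retained lags are illustrative assumptions; 2*n_lags+1 = 21 matches the setup above):

```python
import numpy as np

def gcc_phat_features(y_i, y_j, frame_len=400, hop=160, n_lags=10, eps=1e-12):
    """Returns an array of shape (num_frames, 2*n_lags + 1) of GCC-PHAT values."""
    feats = []
    for start in range(0, len(y_i) - frame_len + 1, hop):
        Yi = np.fft.rfft(y_i[start:start + frame_len])
        Yj = np.fft.rfft(y_j[start:start + frame_len])
        cross = Yi * np.conj(Yj)
        psi = cross / (np.abs(cross) + eps)       # phase transform (PHAT)
        cc = np.fft.irfft(psi)                    # cross-correlation over lags
        cc = np.concatenate([cc[-n_lags:], cc[:n_lags + 1]])  # lags -n_lags..+n_lags
        feats.append(cc)
    return np.array(feats)
```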

189 System Diagram (Xiao 16a/c) Cost function is cross entropy of ASR. Acoustic model and beamforming filter prediction networks are jointly optimized. Beamforming in frequency domain. Complex-valued beamforming weights ASR Cross Entropy Cost Acoustic Model DNN Feature Extraction Beamforming in Frequency Domain Speech state posterior Log Mel filterbank features Single-channel enhanced Fourier coefficients Multi-channel Fourier coefficients Filter prediction network Short-Time Fourier Transform (STFT) Multi-channel speech signals

190 Training Procedures (Xiao 16a/c) Difficult to train the global network due to deep structure Use existing beamforming methods to pre-train the filter predicting network. 1. Pre-training 2. Fine-tuning ASR Cross Entropy Cost Target beamforming weights, e.g. delay-and-sum weights Acoustic Model DNN Speech state posterior Log Mel filterbank features Mean square error Filter prediction network Complex-valued beamforming weights Feature Extraction Beamforming in Frequency Domain Single-channel enhanced Fourier coefficients Multi-channel Fourier coefficients Filter prediction network Short-Time Fourier Transform (STFT) Multi-channel speech signals of an utterance Multi-channel speech signals

191 Experimental settings on AMI meeting corpus 8-channel table-top circular array, 20cm diameter. Filter prediction network: 3x1024 DNN. 315 o 0 o 45 o Pre-trained by 100 hours of 8-ch simulated data (WSJCAM0 + reverb + noise) 270 o 90 o Fine tuning on on 75 hours of AMI data. Simulated data and real data share the same array geometry. 225 o 180 o 135 o Acoustic model: 6x2048 DNN. Trained by 75 hours of enhanced AMI data. A circular array with 20cm diameter and 8 microphones 10 hours eval data on AMI

192 Predicted filter's beam patterns (Xiao 16a) Beam patterns for a simulated sentence, DOA = 166 degrees (denoted by the red square); 8-channel circular microphone array, 20 cm diameter. [Panels: delay-and-sum with the true DOA; DNN pre-trained by delay-and-sum; DNN refined by the speech enhancement cost; DNN refined by the ASR cross entropy cost.] The pre-trained network predicts a filter similar to delay-and-sum (a good student learns well from its teacher), and the DOA is estimated correctly at all frequencies. The CE-refined network predicts the DOA differently from the pre-trained network.

193 Predicted filter s beam patterns (Xiao 16a) Beam patterns for real sentence (from AMI eval set), no ground truth DOA. All predicted filters predict to similar direction, about 110 degrees. Moderately different beam patterns between different stages of training. DNN (Pre-trained by delay-and sum) DNN (refined by speech enhancement cost) DNN (refined by ASR cross entropy cost)

194 Enhanced speech (Xiao 16a) 8000Hz 0Hz Close-talk mic (IHM) 1 st far-talk mic (SDM1) Beamformed by pretrained network Beamformed by enhancement refined network Spectrograms from AMI real eval set. Direct path signal is enhanced more than reflected signals improved direct-to-reverberant energy ratio. See frames near 140 for reduced reverberation. Beamformed by ASR cross entropy refined network Frame index

195 Results on the AMI corpus (Xiao 16a/c) Word Error Rates (%) on the AMI eval set, trigram LM. Spontaneous speech and non-native speakers lead to a high WER even for the close-talk microphone. The predictor is trained on simulated data and tested on real data; fine-tuning uses the ASR cost function.
- 1 ch (close-talk): 25.5%
- 1 ch (far-talk): 53.8%
- Delay-and-sum, 8ch (BeamformIt): 47.9%
- Filter predictor network, 8ch, pre-trained by delay-and-sum and simulated data: 47.2% (GCC features), 45.9% (SCM features)
- Filter predictor network, 8ch, refined by ASR cost on real data: 44.7% (GCC features), 44.1% (SCM features)

196 Time Domain Filter Prediction (Li 16) Li 16 applied the filter prediction approach to the time domain with a more advanced acoustic model setting. Filter prediction: LSTMs predict FIR filter weights for a two-channel array (… taps long for a 16 kHz sampling rate); the first LSTM layer is shared by the 2 channels, the second LSTM layer is split; online prediction, no mean pooling. Gated feedback: inputs to the LSTM include signal samples and feedback from the last hidden layer of the acoustic model. Multi-task learning: clean speech is also reconstructed from an early stage of the acoustic model. No pre-training; the ASR cost function is used all the way (2000 hours of training data).

197 Time Domain - Results (Li 16) Data and settings same as the filter learning experiments Observations Using multi-task learning improves performance significantly (1.0% absolute). Gated feedback improves moderately (0.2% absolute). Frequency domain filter prediction is 0.5% worse than time domain filter prediction. Perhaps because there are too many parameters to predict in freq. domain and mean pooling is not used. Methods Word Error Rate (%) Cross entropy Sequential Single channel Delay-andsum Unfactored filter learning Comparison with Filter Learning Similar performance between filter learning and filter prediction. Factored filter learning Filter prediction

198 Summary Network is able to learn the nonlinear mapping between input signals to filter weights (a highly nonlinear function) Waveform to time domain weights (Li'16) GCC or SCM features to frequency domain weights (Xiao'16a/b) Limitations: Array geometry dependent network needs retraining for new geometry. Training recipe to be added to SignalGraph in a few days after the tutorial Delay-and-sum, MVDR, etc More signal processing is used in filter prediction than in filter learning. Filter prediction Filter learning More signal processing More deep learning

199 3.5 Discriminative Approach - mask prediction for filter estimation

200 Mask estimation by neural networks Time frequency (TF) mask defines whether a TF bin in the spectrogram is dominated by speech or noise. The mask can be predicted by a neural network, and then used for speech enhancement. Mask is used as a filter and multiplied with noisy log power spectrum. In this section, we review the use of estimated mask for beamforming filter estimation. Before that, let s first review the use of mask estimation in single-channel speech enhancement. Mask estimation for single channel speech enhancement Mask m t Mask predictor network Extract log power spectrum Enhanced spectrogram Elementwise Multiplication Single-channel speech waveform

201 Single-channel joint optimization (Wang 16) Mask m t DNN-based mask estimator Elementwise Multiplication Enhanced log power spectrum Feature Extraction Acoustic Model Features Speech class posterior probabilities ASR based cost function Mask estimator can be pretrained by using ideal binary/ratio mask as target. The estimator can be fine tuned for ASR by using ASR cost function. Extract log power spectrum ASR cost function True label Single-channel speech waveform

202 Experiments on CHiME 2 Results from (Wang 16) System CE smbr Baseline (No SE front-end) 16.2 % 13.9 % Mask estimation using CE 14.8 % 13.4 % Mask estimation + retraining 15.5 % 13.9 % Joint training of mask estimation and acoustic model 14.0 % 12.1 % Large DNN-based acoustic model 15.2 % - Enhancement DNN - Predict mask (CE Objective function) - Features: Log power spectrum Acoustic model DNN - Log Mel Filterbanks - Trained on noisy speech with cross entropy (CE) or smbr objective function

203 Mask estimation for beamforming Use estimated mask to derive filter parameters. Compared to filter prediction network, mask prediction approach Uses more signal processing prior knowledge. Easier to achieve array geometry independency. Filter weights Filter weights Filter prediction network More specific role of neural net TF mask Classic beamforming estimation equation Mask prediction network Array signals Array signals

204 MVDR filter estimation using mask (Heymann 15, Hori 15, Heymann 16, Xiao 16b/17) The estimated masks can also be used to compute the spatial covariance matrices (SCM) of speech and noise (similar equations are used in section 2.2.2): R_f^speech = Σ_t M_S(t,f) y_{t,f} y_{t,f}^H / Σ_t M_S(t,f) and R_f^noise = Σ_t M_N(t,f) y_{t,f} y_{t,f}^H / Σ_t M_N(t,f), where M_S and M_N are the speech and noise masks respectively, and y_{t,f} is the N×1 vector of observed Fourier transform coefficients at time t and frequency f, with N the number of channels. Using a single mask value for all channels may not be optimal and may be improved in the future. Beamforming filters can be derived from the SCMs, e.g. w_f^MVDR = (R_f^noise)^{-1} R_f^speech u / trace[(R_f^noise)^{-1} R_f^speech], where u is a column vector that specifies the reference channel (the first channel is used as reference here); it is possible to select the reference based on SNR or TDOA information. It is also possible to use other beamformers such as the generalized eigenvalue beamformer (GEV, Heymann 15/16). [Diagram: noisy log power spectrum → neural network based mask predictor → estimated mask → used to derive beamforming filters.]
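A compact numpy sketch of the two equations above, computing mask-weighted spatial covariance matrices and MVDR weights per frequency bin (the reference-channel choice and the small diagonal loading are assumptions for numerical stability):

```python
import numpy as np

def mvdr_weights_from_masks(Y, speech_mask, noise_mask, ref_ch=0, eps=1e-10):
    """Y: (J, T, F) complex STFT; masks: (T, F) values in [0, 1].
    Returns MVDR weights of shape (F, J), one filter per frequency bin."""
    J, T, F = Y.shape
    W = np.zeros((F, J), dtype=complex)
    u = np.zeros(J)
    u[ref_ch] = 1.0                                    # reference-channel selector
    for f in range(F):
        Yf = Y[:, :, f]                                # (J, T)
        Rs = (speech_mask[:, f] * Yf) @ Yf.conj().T / (speech_mask[:, f].sum() + eps)
        Rn = (noise_mask[:, f] * Yf) @ Yf.conj().T / (noise_mask[:, f].sum() + eps)
        num = np.linalg.solve(Rn + eps * np.eye(J), Rs)   # Rn^{-1} Rs
        W[f] = (num / (np.trace(num) + eps)) @ u
    return W

# Beamforming then follows the next slide: X(t, f) = w_f^H y_{t,f}.
```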

205 LSTM-based mask estimation for beamforming (Xiao 16b/17) [Pipeline: apply STFT on all channels → extract log power spectra → LSTM mask estimator (processes channels separately) → speech and noise masks → pooling → estimate spatial covariance matrices R_f^speech and R_f^noise → obtain MVDR beamforming weights w_f^MVDR = (R_f^noise)^{-1} R_f^speech u / trace[(R_f^noise)^{-1} R_f^speech] → beamforming in the frequency domain, X(t, f) = w_f^H y_{t,f}.] This combines the advantages of neural networks (for mask prediction) and traditional signal processing (portability). Masks are derived from 1-ch signals: con, spatial information is not exploited for mask estimation; pro, it is array geometry independent.

206 Predicting speech and noise masks separately We can obtain the noise mask from the speech mask: m_{t,f}^n = 1 − m_{t,f}^s. We can also estimate the noise mask independently, using different projection transforms for the speech and noise masks (Xiao 16b/17). [Diagram: x_t → LSTM → h_t → two separate affine transform + sigmoid branches → m_t^s and m_t^n.]

207 Refining LSTM Mask Predictor by ASR Cost (Xiao 16b/17) [Pipeline: multi-channel speech signals of an utterance → apply STFT on all channels → extract log power spectra → LSTM mask estimator (processes channels separately) → speech and noise masks → pooling → estimate spatial covariance matrices → obtain MVDR beamforming weights → beamforming in the frequency domain → enhanced STFT coefficients → feature extraction → acoustic model → speech class posterior probabilities → ASR-based cost function (cross entropy against the true labels).] The mask predictor directly minimizes the ASR cost function.

208 Training Procedures (Xiao 16b,17) Use ideal binary mask to pre-train the mask estimator from simulated data. 1. Pre-training 2. Fine-tuning ASR Cross Entropy Cost Ideal binary mask of simulated data Acoustic Model DNN Speech state posterior Single-channel enhanced Fourier coefficients Mean square error Complex-valued beamforming weights Beamforming in Frequency Domain Mask prediction network MVDR filter estimation equation Mask prediction network Multi-channel Fourier coefficients Short-Time Fourier Transform (STFT) Single-channel speech signals of an utterance Multi-channel speech signals

209 Multi-pass mask estimation at run-time Mask estimation and beamforming can be performed iteratively in alternate order. Enhanced signal Noisy signal Mask estimation Beamforming Both mask and signal can be improved until convergence. 2-3 passes are enough. Estimated mask Speech Mask Noise Mask LSTM mask estimator + pooling (Process channels separately) Beamforming Obtain MVDR beamforming weights Estimate Spatial Covariance Matrix Speech Mask Noise Mask LSTM mask estimator Beamforming Obtain MVDR beamforming weights Estimate Spatial Covariance Matrix Speech Mask Noise Mask LSTM mask estimator Beamforming Obtain MVDR beamforming weights Estimate Spatial Covariance Matrix Final enhanced speech Extract log power spectrum Extract log power spectrum Extract log power spectrum Apply STFT on channels Multi-channel speech signals of an utterance Unfolded network of multi-pass mask estimation
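The multi-pass procedure can be sketched as a short loop; here `estimate_masks` and `mvdr_weights_from_masks` stand in for the LSTM mask predictor and the MVDR estimation of the previous slides (the helper names are assumptions):

```python
import numpy as np

def multi_pass_enhance(Y, estimate_masks, mvdr_weights_from_masks, num_passes=3):
    """Y: (J, T, F) complex multi-channel STFT. Alternates mask estimation and
    beamforming; 2-3 passes are typically enough."""
    X = Y[0]                                                 # start from a reference channel
    for _ in range(num_passes):
        speech_mask, noise_mask = estimate_masks(X)          # masks from the current signal
        W = mvdr_weights_from_masks(Y, speech_mask, noise_mask)   # (F, J) weights
        X = np.einsum('fj,jtf->tf', W.conj(), Y)             # re-beamform the observation
    return X
```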

210 Experimental settings on CHiME-4 corpus 6-channel rectangular array mounted on a notepad computer (see section 1.1). Channel 2 does not face the speaker, giving a low SNR. Near-field scenario, with different gains between channels. Mask prediction network (1-layer LSTM with 1024 memory cells): pre-trained on 100 hours of 1-ch simulated data (WSJ0 + reverb + noise), fine-tuned on 18 hours of CHiME-4 training data using the cross entropy cost. Acoustic model: 7x2048 DNN, trained on 15 hours of simulated + 3 hours of real data (CHiME-4 training data), sequentially trained. 1 hour of eval data. Real data are recorded in different environments: bus, cafeteria, pedestrian, and street.

211 Estimated masks [Figure panels: masks estimated by the LSTM pre-trained to predict the speech ideal binary mask; masks estimated by the LSTM refined to reduce the ASR cross entropy cost; masks estimated by the LSTM refined to reduce the ASR cross entropy cost, with speech and noise masks separately estimated.] Observation: noise masks generated by the LSTM trained with the ASR cost are sparser than those generated by the IBM-approximating LSTM, and split masks (separately estimated noise and speech masks) are sparser still. Possible explanation: it is good to use only confident TF bins for estimating the spatial covariance matrices.

212 Experiments on CHiME-4 (6 channel) (Xiao 16b) WER (%) on the real eval data:
- 1-channel track: 21.6
- Weighted delay-and-sum (BeamformIt, TDOA tracking + gain estimation): 13.6
- Traditional MVDR (TDOA tracking + gain and noise estimation): 12.0
- Frequency domain filter prediction (MSE training): 12.9
Further rows vary the mask-based MVDR settings (#ch for mask prediction: only the first channel vs. estimating masks for all 6 channels and then pooling them; ASR cost; split mask; pooling, including median pooling; #passes), illustrating the effect of the ASR cost function and the gains of separate noise and speech masks, mask pooling, and multi-pass processing. Observation: mask prediction + the MVDR formula produces a good gain. Filter prediction does not work well due to the large gain differences between channels and outperforms delay-and-sum only slightly.

213 Summary of Mask Estimation Network Advantages Single-channel mask estimation array geometry independent MVDR beamforming equation optimal noise reduction with no distortion on speech, if speech and noise statistics are accurately estimated. Combining of the strengths of neural nets and traditional multi-channel beamforming. Limitations: Not using inter-channel phase information sub-optimal performance. More limited use of neural networks less potential for improvement? Joint training recipe (Matlab) available in SignalGraph: Will add more comments, documentation, and bug fix

214 Summary of Integrated Approach Neural networks are good mappers Input to mask, input to filter weights, input to features, etc. Main challenge: how to incorporate human knowledge (signal processing) and deep learning to achieve: Optimal ASR performance, Interpretability, Generalization to unseen room acoustics and noises, Array geometry independency. Delay-and-sum, MVDR, etc Mask prediction + MVDR/GEV Filter prediction, Maximum Likelihood Filter learning with/without phase More signal processing More deep learning

215 Table of Beamforming Methods Methods Generative model? Fixed looking directions? Use phase Info? Geometry independent? Good for unseen room and noise? Joint training of SE and AM? Use noise information References Delay-and-sum No No MVDR/GEV/max No training SNR No Yes Yes Yes from data Yes Capon 69, Van Veen 88 Maximum likelihood Filter learning without phase Filter learning with phase Filter prediction Mask prediction Yes No No looking direction Yes No No Yes May be not Yes Maybe, depends on how well the model generalizes Optimized for GMM Yes, SE directly optimized for DNN AM Implicitly use noise through noisy inputs Explicitly use noise estimate Van Veen 88, Anguera 07 Seltzer 03, 04, Xiao 16b Marino 11, Swietojanski 13, Liu 14, Swietojanski 1 Hoshen 15, Sainath 15, 16 Time domain: Li 16, Freq. domain: Xiao 16a,16c Non-joint training: Heymann 15,16, Erdogan 16, Joint CE training: Xiao 16b,17 Red colored properties are considered as disadvantageous, while greens are considered advantageous. Implementation of MLBF, filter prediction, and mask prediction approaches are available in SignalGraph --- a Matlab-based toolkit for applying deep learning technologies to signal processing

216 References
(Anguera 07) Anguera, X., et al. Acoustic beamforming for speaker diarization of meetings, IEEE Trans. ASLP (2007).
(Capon 69) Capon, J. High-resolution frequency-wavenumber spectrum analysis, Proceedings of the IEEE (1969).
(Doclo 02) Doclo, S., et al. GSVD-based optimal filtering for single and multi-microphone speech enhancement, IEEE Trans. SP (2002).
(Erdogan 15) Erdogan, H., et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, Proc. ICASSP (2015).
(Heymann 15) Heymann, J., et al. BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge, Proc. ASRU (2015).
(Heymann 16) Heymann, J., et al. Neural network based spectral mask estimation for acoustic beamforming, Proc. ICASSP (2016).
(Higuchi 16) Higuchi, T., et al. Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise, Proc. ICASSP (2016).
(Hori 15) Hori, T., et al. The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition, Proc. ASRU (2015).
(Hoshen 15) Hoshen, Y., et al. Speech acoustic modeling from raw multichannel waveforms, Proc. ICASSP (2015).
(Knapp 76) Knapp, C.H., et al. The generalized correlation method for estimation of time delay, IEEE Trans. ASSP (1976).
(Li 16) Li, B., et al. Neural network adaptive beamforming for robust multichannel speech recognition, Proc. INTERSPEECH (2016).
(Liu 14) Liu, Y., et al. Using neural network front-ends on far field multiple microphones based speech recognition, Proc. ICASSP (2014).
(Marino 11) Marino, D., et al. An analysis of automatic speech recognition with multiple microphones, Proc. Interspeech (2011).
(Narayanan 13) Narayanan, A., et al. Ideal ratio mask estimation using deep neural networks for robust speech recognition, Proc. ICASSP (2013).

217 References
(Sainath 15) Sainath, T., et al. Speaker location and microphone spacing invariant acoustic modeling from raw multichannel waveforms, Proc. ASRU (2015).
(Sainath 16) Sainath, T., et al. Factored spatial and spectral multichannel raw waveform CLDNNs, Proc. Interspeech (2016).
(Seltzer 03) Seltzer, M. Microphone Array Processing for Robust Speech Recognition, PhD thesis (2003).
(Seltzer 04) Seltzer, M., et al. Likelihood-maximizing beamforming for robust hands-free speech recognition, IEEE Trans. ASP (2004).
(Swietojanski 13) Swietojanski, P., et al. Hybrid acoustic models for distant and multichannel large vocabulary speech recognition, Proc. ASRU (2013).
(Swietojanski 14a) Swietojanski, P., et al. Convolutional neural networks for distant speech recognition, IEEE Sig. Proc. Letters (2014).
(Swietojanski 14b) Swietojanski, P., et al. Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models, Proc. SLT (2014).
(Van Veen 88) Van Veen, B.D., et al. Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine (1988).
(Wang 16) Wang, Z.-Q., et al. A joint training framework for robust automatic speech recognition, IEEE/ACM Trans. ASLP (2016).
(Weninger 15) Weninger, F., et al. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, Proc. Latent Variable Analysis and Signal Separation (2015).
(Xiao 15) Xiao, X., et al. A learning-based approach to direction of arrival estimation in noisy and reverberant environments, Proc. ICASSP (2015).
(Xiao 16a) Xiao, X., et al. Deep beamforming networks for multi-channel speech recognition, Proc. ICASSP (2016).
(Xiao 16b) Xiao, X., et al. A study of learning based beamforming methods for speech recognition, Proc. CHiME-4 (2016).
(Xiao 16c) Xiao, X., et al. Beamforming networks using spatial covariance features for far-field speech recognition, Proc. APSIPA (2016).
(Xiao 16d) Xiao, X., et al. Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation, EURASIP Journal on Advances in Signal Processing (2016).
(Xiao 17) Xiao, X., et al. On the time-frequency mask estimation for MVDR beamforming with application in robust speech recognition, submitted to ICASSP (2017).
(Xu 15) Xu, Y., et al. A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Trans. ASLP (2015).


219 Recent outcomes

220 Recent activities on multi-microphone speech recognition ICSI meeting RATS CHiME-4 challenge April to August 2016 We will introduce latest activities

221 CHiME-4 challenge CHiME-4 revisits the datasets originally recorded for CHiME-3 and increases the level of difficulty by constraining the number of mics available for testing. Three tracks: 6ch (equivalent to the CHiME-3 challenge), 2ch, and 1ch. Based on the outcomes of the CHiME-3 challenge (the CHiME-3 2nd best system [Hori 15]), a state-of-the-art baseline recipe was created. It is in the official Kaldi repository (Kaldi github), so everyone can reproduce this baseline.

222 Baseline enhancement BeamformIt Developed for NIST RT project by X. Anguera (ICSI) Weighted delay and sum beamformer based on GCC Original 5 th channel signal BeamformIt

223 Baseline features, AM, and LM (Section 1) Features: so-called fMLLR features (MFCC → context expansion → LDA → STC → fMLLR → context expansion → o_t → DNN); this needs a GMM construction step. Acoustic model (nnet1 in Kaldi): noisy-data training using the 5th channel (not enhanced-data training), 7-layer 2048-neuron DNN with sequence discriminative training (sMBR). Language model: 3-gram LM (provided by WSJ0), 5-gram LM (SRILM) and RNNLM rescoring (Mikolov's RNNLM).

224 Baseline results (6ch track) [Chart: WER of the CHiME-3 baseline vs. the CHiME-4 baseline on Dev Real, Dev Simu, Test Real, and Test Simu.] Significant error reduction; everyone can start from this baseline.

225 Baseline results [Chart: WER by track (DNN+RNNLM) for the 1ch, 2ch, and 6ch tracks on Dev Real, Dev Simu, Test Real, and Test Simu.]

226 6ch WER (Test Real) Best system: 2.24% (cf: CHiME-3 best: 5.83) 8 among 15 systems outperform CHiME-3 best system CHiME-4 results

227 2ch WER Best system: 3.91% (6ch track best: 2.24%) Two systems outperform CHiME-3 6 channel best performance CHiME-4 results

228 1ch WER Best systems: ~9.2% (6ch track best: 2.24%, 2ch track best: 3.91%) Still large gap between single and multi channel systems CHiME-4 results

229 Successful front-ends Mask-based beamforming (Section … or Section 3.5, to obtain robust spatial correlation): spatial clustering, DNN/LSTM masking; the top 3 systems (and many other systems) use this technique. Variants of beamformers (Section 2.2.2): MVDR beamformer, max-SNR beamformer, Generalized Sidelobe Canceller (GSC), and system combination of various beamformer outputs.

230 Successful back-ends Acoustic modeling (Section 1.3) 6 channel training Use of enhanced data (mix it to 6 channel training or adaptation) Advanced acoustic model (but limited) CNN BLSTM Model adaptation (simple retraining of model parameters) Language modeling (Section 1.2) RNNLM -> LSTM New trend Joint training (Section 3)

231 Problem solved? Multi-channel, single device, limited vocabulary (tablet and smartphone scenarios): Yes. Single-channel scenario: No. Multiple devices, open vocabulary, very distant speech recognition: ????

232 Conclusion Combining SE and ASR techniques greatly improves performance in severe conditions ASR back-end technologies Feature extraction/transformation RNN/LSTM/TDNN/CNN based acoustic modeling Model adaptation, SE front-end technologies Microphone array, Neural network-based speech enhancement, Introduction of deep learning had a great impact on multimicrophone speech recognition Large performance improvement Reshuffling the importance of technologies Joint training can further change the paradigm Reaching sufficient performance under reasonable conditions There remains many challenges and opportunities for further improvement

233 Toward joint optimization? Separate optimization Joint optimization SE front-end ASR back-end SE front-end ASR back-end Both components are designed with different objective functions Potentially SE front-end can be made more robust to unseen acoustic conditions (noise types, different mic configurations) Not optimal for ASR Both components are optimized with the same objective functions Potentially more sensitive to mismatch between training and testing acoustic conditions Optimal for ASR Joint training is a recent active research topic Currently integrate front-end and acoustic model Combined with end-to-end approaches it could introduce higher level cues to the SE front-end (linguistic info )

234 Dealing with uncertainties Advanced GMM-based systems exploited the uncertainty of the SE front-end during decoding (Uncertainty decoding) Provided a way to interconnect speech enhancement front-end and ASR back-end optimized with different criteria Exploiting uncertainty within DNN-based ASR systems has not been sufficiently explored yet Joint training is one option Gaussian neural network? Are there other?

235 More severe constraints Limited number of microphones: the best performances are obtained when exploiting multiple microphones (REVERB challenge: 1ch 17.4%, 2ch 12.7%, 8ch 9.0%, lapel 8.3%, headset 5.9%), and a great gap remains for single-microphone performance; developing more powerful single-channel approaches remains an important research topic. Many systems assume batch processing or utterance-level batch processing; further research is needed for online and real-time processing.

236 More diverse acoustic conditions More challenging situations are waiting to be tackled Dynamic conditions Multiple speakers Moving speakers, Various conditions Variety of microphone types/numbers/configurations Variety of acoustic conditions, rooms, noise types, SNRs, More realistic conditions Spontaneous speech Unsegmented data Microphone failures, Asyncronizations New directions Distributed mic arrays, New technologies may be needed to tackle these issues New corpora are needed to evaluate these technologies

237 Larger multi-microphone speech corpora Some industrial players have access to large amounts of field data, but most publicly available DSR corpora are relatively small scale. This has some advantages: a lower barrier of entry to the field, faster experimental turnaround, and new applications start with a limited amount of available data. But: are the developed technologies still relevant when training data cover a large variety of conditions? Could the absence of large corpora hinder the development of data-demanding new technologies? There is a need to create larger publicly available multi-microphone corpora.

238 DSR data simulation A low-cost way to obtain large amounts of data covering many conditions, and the only solution to obtain noisy/clean parallel corpora. Distant microphone signals can be simulated as: microphone signal = clean speech * measured room impulse response + measured noise. Good simulation requires measuring the room impulse responses and the noise signals in the same rooms with the same microphone array. Still, some aspects are not modeled (e.g. head movements), it is difficult to measure room impulse responses in public spaces, and it is unclear how best to generate multiple microphone signals.
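A minimal numpy sketch of this simulation recipe: convolve clean speech with measured room impulse responses and add measured multi-channel noise scaled to a target SNR (the SNR definition used here is an assumption):

```python
import numpy as np

def simulate_distant_mics(clean, rirs, noise, snr_db):
    """clean: (num_samples,) clean speech; rirs: (J, rir_len) measured impulse
    responses; noise: (J, >= out_len) measured noise. Returns (J, out_len) signals."""
    J = rirs.shape[0]
    # reverberant speech: convolve the same clean signal with each channel's RIR
    reverberant = np.stack([np.convolve(clean, rirs[j]) for j in range(J)])
    noise = noise[:, :reverberant.shape[1]]
    speech_pow = np.mean(reverberant ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    # scale the noise so that the mixture reaches the requested SNR
    scale = np.sqrt(speech_pow / (noise_pow * 10.0 ** (snr_db / 10.0)))
    return reverberant + scale * noise
```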


More information

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features

Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Reformulating the HMM as a trajectory model by imposing explicit relationship between static and dynamic features Heiga ZEN (Byung Ha CHUN) Nagoya Inst. of Tech., Japan Overview. Research backgrounds 2.

More information

Augmented Statistical Models for Speech Recognition

Augmented Statistical Models for Speech Recognition Augmented Statistical Models for Speech Recognition Mark Gales & Martin Layton 31 August 2005 Trajectory Models For Speech Processing Workshop Overview Dependency Modelling in Speech Recognition: latent

More information

arxiv: v1 [cs.lg] 4 Aug 2016

arxiv: v1 [cs.lg] 4 Aug 2016 An improved uncertainty decoding scheme with weighted samples for DNN-HMM hybrid systems Christian Huemmer 1, Ramón Fernández Astudillo 2, and Walter Kellermann 1 1 Multimedia Communications and Signal

More information

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks

Modeling Time-Frequency Patterns with LSTM vs. Convolutional Architectures for LVCSR Tasks Modeling Time-Frequency Patterns with LSTM vs Convolutional Architectures for LVCSR Tasks Tara N Sainath, Bo Li Google, Inc New York, NY, USA {tsainath, boboli}@googlecom Abstract Various neural network

More information

Why DNN Works for Acoustic Modeling in Speech Recognition?

Why DNN Works for Acoustic Modeling in Speech Recognition? Why DNN Works for Acoustic Modeling in Speech Recognition? Prof. Hui Jiang Department of Computer Science and Engineering York University, Toronto, Ont. M3J 1P3, CANADA Joint work with Y. Bao, J. Pan,

More information

Temporal Modeling and Basic Speech Recognition

Temporal Modeling and Basic Speech Recognition UNIVERSITY ILLINOIS @ URBANA-CHAMPAIGN OF CS 498PS Audio Computing Lab Temporal Modeling and Basic Speech Recognition Paris Smaragdis paris@illinois.edu paris.cs.illinois.edu Today s lecture Recognizing

More information

Estimation of Cepstral Coefficients for Robust Speech Recognition

Estimation of Cepstral Coefficients for Robust Speech Recognition Estimation of Cepstral Coefficients for Robust Speech Recognition by Kevin M. Indrebo, B.S., M.S. A Dissertation submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment

More information

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks

Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks INTERSPEECH 2014 Boundary Contraction Training for Acoustic Models based on Discrete Deep Neural Networks Ryu Takeda, Naoyuki Kanda, and Nobuo Nukaga Central Research Laboratory, Hitachi Ltd., 1-280, Kokubunji-shi,

More information

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION

BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION BLSTM-HMM HYBRID SYSTEM COMBINED WITH SOUND ACTIVITY DETECTION NETWORK FOR POLYPHONIC SOUND EVENT DETECTION Tomoki Hayashi 1, Shinji Watanabe 2, Tomoki Toda 1, Takaaki Hori 2, Jonathan Le Roux 2, Kazuya

More information

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme

A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY. MengSun,HugoVanhamme A TWO-LAYER NON-NEGATIVE MATRIX FACTORIZATION MODEL FOR VOCABULARY DISCOVERY MengSun,HugoVanhamme Department of Electrical Engineering-ESAT, Katholieke Universiteit Leuven, Kasteelpark Arenberg 10, Bus

More information

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks

Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Interspeech 2018 2-6 September 2018, Hyderabad Global SNR Estimation of Speech Signals using Entropy and Uncertainty Estimates from Dropout Networks Rohith Aralikatti, Dilip Kumar Margam, Tanay Sharma,

More information

Deep Learning Recurrent Networks 2/28/2018

Deep Learning Recurrent Networks 2/28/2018 Deep Learning Recurrent Networks /8/8 Recap: Recurrent networks can be incredibly effective Story so far Y(t+) Stock vector X(t) X(t+) X(t+) X(t+) X(t+) X(t+5) X(t+) X(t+7) Iterated structures are good

More information

Making Machines Understand Us in Reverberant Rooms [Robustness against reverberation for automatic speech recognition]

Making Machines Understand Us in Reverberant Rooms [Robustness against reverberation for automatic speech recognition] Making Machines Understand Us in Reverberant Rooms [Robustness against reverberation for automatic speech recognition] Yoshioka, T., Sehr A., Delcroix M., Kinoshita K., Maas R., Nakatani T., Kellermann

More information

An exploration of dropout with LSTMs

An exploration of dropout with LSTMs An exploration of out with LSTMs Gaofeng Cheng 1,3, Vijayaditya Peddinti 4,5, Daniel Povey 4,5, Vimal Manohar 4,5, Sanjeev Khudanpur 4,5,Yonghong Yan 1,2,3 1 Key Laboratory of Speech Acoustics and Content

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 12: Acoustic Feature Extraction for ASR Instructor: Preethi Jyothi Feb 13, 2017 Speech Signal Analysis Generate discrete samples A frame Need to focus on short

More information

Hidden Markov Modelling

Hidden Markov Modelling Hidden Markov Modelling Introduction Problem formulation Forward-Backward algorithm Viterbi search Baum-Welch parameter estimation Other considerations Multiple observation sequences Phone-based models

More information

Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition

Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition Experiments with a Gaussian Merging-Splitting Algorithm for HMM Training for Speech Recognition ABSTRACT It is well known that the expectation-maximization (EM) algorithm, commonly used to estimate hidden

More information

Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition

Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition INTERSPEECH 017 August 0 4, 017, Stockholm, Sweden Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition Jaeyoung Kim 1, Mostafa El-Khamy 1, Jungwon Lee 1 1 Samsung Semiconductor,

More information

Gaussian Processes for Audio Feature Extraction

Gaussian Processes for Audio Feature Extraction Gaussian Processes for Audio Feature Extraction Dr. Richard E. Turner (ret26@cam.ac.uk) Computational and Biological Learning Lab Department of Engineering University of Cambridge Machine hearing pipeline

More information

Speech Signal Representations

Speech Signal Representations Speech Signal Representations Berlin Chen 2003 References: 1. X. Huang et. al., Spoken Language Processing, Chapters 5, 6 2. J. R. Deller et. al., Discrete-Time Processing of Speech Signals, Chapters 4-6

More information

AN APPROACH TO PREVENT ADAPTIVE BEAMFORMERS FROM CANCELLING THE DESIRED SIGNAL. Tofigh Naghibi and Beat Pfister

AN APPROACH TO PREVENT ADAPTIVE BEAMFORMERS FROM CANCELLING THE DESIRED SIGNAL. Tofigh Naghibi and Beat Pfister AN APPROACH TO PREVENT ADAPTIVE BEAMFORMERS FROM CANCELLING THE DESIRED SIGNAL Tofigh Naghibi and Beat Pfister Speech Processing Group, Computer Engineering and Networks Lab., ETH Zurich, Switzerland {naghibi,pfister}@tik.ee.ethz.ch

More information

Speech Enhancement with Applications in Speech Recognition

Speech Enhancement with Applications in Speech Recognition Speech Enhancement with Applications in Speech Recognition A First Year Report Submitted to the School of Computer Engineering of the Nanyang Technological University by Xiao Xiong for the Confirmation

More information

Robust Speaker Identification

Robust Speaker Identification Robust Speaker Identification by Smarajit Bose Interdisciplinary Statistical Research Unit Indian Statistical Institute, Kolkata Joint work with Amita Pal and Ayanendranath Basu Overview } } } } } } }

More information

Deep Learning for Automatic Speech Recognition Part I

Deep Learning for Automatic Speech Recognition Part I Deep Learning for Automatic Speech Recognition Part I Xiaodong Cui IBM T. J. Watson Research Center Yorktown Heights, NY 10598 Fall, 2018 Outline A brief history of automatic speech recognition Speech

More information

Hidden Markov Models and Gaussian Mixture Models

Hidden Markov Models and Gaussian Mixture Models Hidden Markov Models and Gaussian Mixture Models Hiroshi Shimodaira and Steve Renals Automatic Speech Recognition ASR Lectures 4&5 23&27 January 2014 ASR Lectures 4&5 Hidden Markov Models and Gaussian

More information

Deep NMF for Speech Separation

Deep NMF for Speech Separation MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Deep NMF for Speech Separation Le Roux, J.; Hershey, J.R.; Weninger, F.J. TR2015-029 April 2015 Abstract Non-negative matrix factorization

More information

Diagonal Priors for Full Covariance Speech Recognition

Diagonal Priors for Full Covariance Speech Recognition Diagonal Priors for Full Covariance Speech Recognition Peter Bell 1, Simon King 2 Centre for Speech Technology Research, University of Edinburgh Informatics Forum, 10 Crichton St, Edinburgh, EH8 9AB, UK

More information

Lecture 5: GMM Acoustic Modeling and Feature Extraction

Lecture 5: GMM Acoustic Modeling and Feature Extraction CS 224S / LINGUIST 285 Spoken Language Processing Andrew Maas Stanford University Spring 2017 Lecture 5: GMM Acoustic Modeling and Feature Extraction Original slides by Dan Jurafsky Outline for Today Acoustic

More information

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data

Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Covariance Matrix Enhancement Approach to Train Robust Gaussian Mixture Models of Speech Data Jan Vaněk, Lukáš Machlica, Josef V. Psutka, Josef Psutka University of West Bohemia in Pilsen, Univerzitní

More information

Full-covariance model compensation for

Full-covariance model compensation for compensation transms Presentation Toshiba, 12 Mar 2008 Outline compensation transms compensation transms Outline compensation transms compensation transms Noise model x clean speech; n additive ; h convolutional

More information

Acoustic MIMO Signal Processing

Acoustic MIMO Signal Processing Yiteng Huang Jacob Benesty Jingdong Chen Acoustic MIMO Signal Processing With 71 Figures Ö Springer Contents 1 Introduction 1 1.1 Acoustic MIMO Signal Processing 1 1.2 Organization of the Book 4 Part I

More information

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group

NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION. M. Schwab, P. Noll, and T. Sikora. Technical University Berlin, Germany Communication System Group NOISE ROBUST RELATIVE TRANSFER FUNCTION ESTIMATION M. Schwab, P. Noll, and T. Sikora Technical University Berlin, Germany Communication System Group Einsteinufer 17, 1557 Berlin (Germany) {schwab noll

More information

Sparse Models for Speech Recognition

Sparse Models for Speech Recognition Sparse Models for Speech Recognition Weibin Zhang and Pascale Fung Human Language Technology Center Hong Kong University of Science and Technology Outline Introduction to speech recognition Motivations

More information

Robust Speech Recognition in the Presence of Additive Noise. Svein Gunnar Storebakken Pettersen

Robust Speech Recognition in the Presence of Additive Noise. Svein Gunnar Storebakken Pettersen Robust Speech Recognition in the Presence of Additive Noise Svein Gunnar Storebakken Pettersen A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of PHILOSOPHIAE DOCTOR

More information

New Statistical Model for the Enhancement of Noisy Speech

New Statistical Model for the Enhancement of Noisy Speech New Statistical Model for the Enhancement of Noisy Speech Electrical Engineering Department Technion - Israel Institute of Technology February 22, 27 Outline Problem Formulation and Motivation 1 Problem

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017

Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 Adapting Wavenet for Speech Enhancement DARIO RETHAGE JULY 12, 2017 I am v Master Student v 6 months @ Music Technology Group, Universitat Pompeu Fabra v Deep learning for acoustic source separation v

More information

arxiv: v2 [cs.sd] 15 Nov 2017

arxiv: v2 [cs.sd] 15 Nov 2017 Rank-1 Constrained Multichannel Wiener Filter for Speech Recognition in Noisy Environments Ziteng Wang a,, Emmanuel Vincent b, Romain Serizel c, Yonghong Yan a arxiv:1707.00201v2 [cs.sd] 15 Nov 2017 a

More information

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors

Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Detection of Overlapping Acoustic Events Based on NMF with Shared Basis Vectors Kazumasa Yamamoto Department of Computer Science Chubu University Kasugai, Aichi, Japan Email: yamamoto@cs.chubu.ac.jp Chikara

More information

This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail.

This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail. Powered by TCPDF (www.tcpdf.org) This is an electronic reprint of the original article. This reprint may differ from the original in pagination and typographic detail. Author(s): Title: Heikki Kallasjoki,

More information

Lecture 9: Speech Recognition. Recognizing Speech

Lecture 9: Speech Recognition. Recognizing Speech EE E68: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis http://www.ee.columbia.edu/~dpwe/e68/

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Lecture 9: Speech Recognition

Lecture 9: Speech Recognition EE E682: Speech & Audio Processing & Recognition Lecture 9: Speech Recognition 1 2 3 4 Recognizing Speech Feature Calculation Sequence Recognition Hidden Markov Models Dan Ellis

More information

Speaker recognition by means of Deep Belief Networks

Speaker recognition by means of Deep Belief Networks Speaker recognition by means of Deep Belief Networks Vasileios Vasilakakis, Sandro Cumani, Pietro Laface, Politecnico di Torino, Italy {first.lastname}@polito.it 1. Abstract Most state of the art speaker

More information

End-to-end Automatic Speech Recognition

End-to-end Automatic Speech Recognition End-to-end Automatic Speech Recognition Markus Nussbaum-Thom IBM Thomas J. Watson Research Center Yorktown Heights, NY 10598, USA Markus Nussbaum-Thom. February 22, 2017 Nussbaum-Thom: IBM Thomas J. Watson

More information

Detection-Based Speech Recognition with Sparse Point Process Models

Detection-Based Speech Recognition with Sparse Point Process Models Detection-Based Speech Recognition with Sparse Point Process Models Aren Jansen Partha Niyogi Human Language Technology Center of Excellence Departments of Computer Science and Statistics ICASSP 2010 Dallas,

More information

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech

The Noisy Channel Model. CS 294-5: Statistical Natural Language Processing. Speech Recognition Architecture. Digitizing Speech CS 294-5: Statistical Natural Language Processing The Noisy Channel Model Speech Recognition II Lecture 21: 11/29/05 Search through space of all possible sentences. Pick the one that is most probable given

More information

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 1791 Joint Uncertainty Decoding for Noise Robust Subspace Gaussian Mixture Models Liang Lu, Student Member, IEEE,

More information

Model-Based Approaches to Robust Speech Recognition

Model-Based Approaches to Robust Speech Recognition Model-Based Approaches to Robust Speech Recognition Mark Gales with Hank Liao, Rogier van Dalen, Chris Longworth (work partly funded by Toshiba Research Europe Ltd) 11 June 2008 King s College London Seminar

More information

ALGONQUIN - Learning dynamic noise models from noisy speech for robust speech recognition

ALGONQUIN - Learning dynamic noise models from noisy speech for robust speech recognition ALGONQUIN - Learning dynamic noise models from noisy speech for robust speech recognition Brendan J. Freyl, Trausti T. Kristjanssonl, Li Deng 2, Alex Acero 2 1 Probabilistic and Statistical Inference Group,

More information

ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT. Ashutosh Pandey 1 and Deliang Wang 1,2. {pandey.99, wang.5664,

ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT. Ashutosh Pandey 1 and Deliang Wang 1,2. {pandey.99, wang.5664, ON ADVERSARIAL TRAINING AND LOSS FUNCTIONS FOR SPEECH ENHANCEMENT Ashutosh Pandey and Deliang Wang,2 Department of Computer Science and Engineering, The Ohio State University, USA 2 Center for Cognitive

More information

Improving Reverberant VTS for Hands-free Robust Speech Recognition

Improving Reverberant VTS for Hands-free Robust Speech Recognition Improving Reverberant VTS for Hands-free Robust Speech Recognition Y.-Q. Wang, M. J. F. Gales Cambridge University Engineering Department Trumpington St., Cambridge CB2 1PZ, U.K. {yw293, mjfg}@eng.cam.ac.uk

More information

Presented By: Omer Shmueli and Sivan Niv

Presented By: Omer Shmueli and Sivan Niv Deep Speaker: an End-to-End Neural Speaker Embedding System Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, Zhenyao Zhu Presented By: Omer Shmueli and Sivan

More information

Speech recognition. Lecture 14: Neural Networks. Andrew Senior December 12, Google NYC

Speech recognition. Lecture 14: Neural Networks. Andrew Senior December 12, Google NYC Andrew Senior 1 Speech recognition Lecture 14: Neural Networks Andrew Senior Google NYC December 12, 2013 Andrew Senior 2 1

More information

Seq2Tree: A Tree-Structured Extension of LSTM Network

Seq2Tree: A Tree-Structured Extension of LSTM Network Seq2Tree: A Tree-Structured Extension of LSTM Network Weicheng Ma Computer Science Department, Boston University 111 Cummington Mall, Boston, MA wcma@bu.edu Kai Cao Cambia Health Solutions kai.cao@cambiahealth.com

More information

Model-based unsupervised segmentation of birdcalls from field recordings

Model-based unsupervised segmentation of birdcalls from field recordings Model-based unsupervised segmentation of birdcalls from field recordings Anshul Thakur School of Computing and Electrical Engineering Indian Institute of Technology Mandi Himachal Pradesh, India Email:

More information

Model-Based Margin Estimation for Hidden Markov Model Learning and Generalization

Model-Based Margin Estimation for Hidden Markov Model Learning and Generalization 1 2 3 4 5 6 7 8 Model-Based Margin Estimation for Hidden Markov Model Learning and Generalization Sabato Marco Siniscalchi a,, Jinyu Li b, Chin-Hui Lee c a Faculty of Engineering and Architecture, Kore

More information

Hidden Markov Model and Speech Recognition

Hidden Markov Model and Speech Recognition 1 Dec,2006 Outline Introduction 1 Introduction 2 3 4 5 Introduction What is Speech Recognition? Understanding what is being said Mapping speech data to textual information Speech Recognition is indeed

More information

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition

Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Upper Bound Kullback-Leibler Divergence for Hidden Markov Models with Application as Discrimination Measure for Speech Recognition Jorge Silva and Shrikanth Narayanan Speech Analysis and Interpretation

More information

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino

Artificial Neural Networks D B M G. Data Base and Data Mining Group of Politecnico di Torino. Elena Baralis. Politecnico di Torino Artificial Neural Networks Data Base and Data Mining Group of Politecnico di Torino Elena Baralis Politecnico di Torino Artificial Neural Networks Inspired to the structure of the human brain Neurons as

More information

Fast speaker diarization based on binary keys. Xavier Anguera and Jean François Bonastre

Fast speaker diarization based on binary keys. Xavier Anguera and Jean François Bonastre Fast speaker diarization based on binary keys Xavier Anguera and Jean François Bonastre Outline Introduction Speaker diarization Binary speaker modeling Binary speaker diarization system Experiments Conclusions

More information

Deep clustering-based beamforming for separation with unknown number of sources

Deep clustering-based beamforming for separation with unknown number of sources INTERSPEECH 2017 August 20 24, 2017, Stockholm, Sweden Deep clustering-based beamorming or separation with unknown number o sources Takuya Higuchi, Keisuke Kinoshita, Marc Delcroix, Kateřina Žmolíková,

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Very Deep Convolutional Neural Networks for LVCSR

Very Deep Convolutional Neural Networks for LVCSR INTERSPEECH 2015 Very Deep Convolutional Neural Networks for LVCSR Mengxiao Bi, Yanmin Qian, Kai Yu Key Lab. of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering SpeechLab,

More information

A Low-Cost Robust Front-end for Embedded ASR System

A Low-Cost Robust Front-end for Embedded ASR System A Low-Cost Robust Front-end for Embedded ASR System Lihui Guo 1, Xin He 2, Yue Lu 1, and Yaxin Zhang 2 1 Department of Computer Science and Technology, East China Normal University, Shanghai 200062 2 Motorola

More information

Feature-space Speaker Adaptation for Probabilistic Linear Discriminant Analysis Acoustic Models

Feature-space Speaker Adaptation for Probabilistic Linear Discriminant Analysis Acoustic Models Feature-space Speaker Adaptation for Probabilistic Linear Discriminant Analysis Acoustic Models Liang Lu, Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu,

More information

Support Vector Machines using GMM Supervectors for Speaker Verification

Support Vector Machines using GMM Supervectors for Speaker Verification 1 Support Vector Machines using GMM Supervectors for Speaker Verification W. M. Campbell, D. E. Sturim, D. A. Reynolds MIT Lincoln Laboratory 244 Wood Street Lexington, MA 02420 Corresponding author e-mail:

More information

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi

Signal Modeling Techniques in Speech Recognition. Hassan A. Kingravi Signal Modeling Techniques in Speech Recognition Hassan A. Kingravi Outline Introduction Spectral Shaping Spectral Analysis Parameter Transforms Statistical Modeling Discussion Conclusions 1: Introduction

More information

Beyond Cross-entropy: Towards Better Frame-level Objective Functions For Deep Neural Network Training In Automatic Speech Recognition

Beyond Cross-entropy: Towards Better Frame-level Objective Functions For Deep Neural Network Training In Automatic Speech Recognition INTERSPEECH 2014 Beyond Cross-entropy: Towards Better Frame-level Objective Functions For Deep Neural Network Training In Automatic Speech Recognition Zhen Huang 1, Jinyu Li 2, Chao Weng 1, Chin-Hui Lee

More information

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics

Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Engineering Part IIB: Module 4F11 Speech and Language Processing Lectures 4/5 : Speech Recognition Basics Phil Woodland: pcw@eng.cam.ac.uk Lent 2013 Engineering Part IIB: Module 4F11 What is Speech Recognition?

More information

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals

MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING. Liang Lu and Steve Renals MULTI-FRAME FACTORISATION FOR LONG-SPAN ACOUSTIC MODELLING Liang Lu and Steve Renals Centre for Speech Technology Research, University of Edinburgh, Edinburgh, UK {liang.lu, s.renals}@ed.ac.uk ABSTRACT

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook

Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recurrent Neural Networks (Part - 2) Sumit Chopra Facebook Recap Standard RNNs Training: Backpropagation Through Time (BPTT) Application to sequence modeling Language modeling Applications: Automatic speech

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Covariance smoothing and consistent Wiener filtering for artifact reduction in audio source separation

Covariance smoothing and consistent Wiener filtering for artifact reduction in audio source separation Covariance smoothing and consistent Wiener filtering for artifact reduction in audio source separation Emmanuel Vincent METISS Team Inria Rennes - Bretagne Atlantique E. Vincent (Inria) Artifact reduction

More information

Convolutive Transfer Function Generalized Sidelobe Canceler

Convolutive Transfer Function Generalized Sidelobe Canceler IEEE TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING, VOL. XX, NO. Y, MONTH 8 Convolutive Transfer Function Generalized Sidelobe Canceler Ronen Talmon, Israel Cohen, Senior Member, IEEE, and Sharon

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information